Steven G. Johnson 2024-09-30 09:42:46 -04:00
parent 0a0e166616
commit 1898b763a6

@@ -1386,6 +1386,8 @@
"* **Training** data is what we use to form our fit/model — it goes into $A$ and $b$ in a least-square fit. This is usually *most* of the data\n",
"* **Test** data is a subset of the data used to *check* whether the fit is actually doing a good job on the underlying problem. Sometimes, this is further subdivided into \"validation\" data that is used *during* training while tuning hyperparameters like the degree of polynomial being fitted, versus \"test\" data that is used as a final check after everything is done.\n",
"\n",
"Similar concepts are known by many different names. The ability of a model to predict previously unseen test data is sometimes called [\"generalizability\"](https://en.wikipedia.org/wiki/Generalization_error). Splitting the data into training and test/validation data is also called [\"cross-validation\"](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).\n",
"\n",
"Here, let's consider the same problem as above: 50 data points from a degree-3 polynomial $1 + 2a + 3a^2 + 4a^3$ plus noise, but\n",
"* We'll use a random subset of **20% of the data for testing**, with the remaining 80% for training/fitting.\n",
"* For each degree $n$, we'll do a least-squares fit on the training data, and then evaluate the **root-mean-square error on the test data**."
@@ -1483,7 +1485,7 @@
"\n",
"* As we increase the degree (i.e. add more fit parameters), the **error on the training data decreases**. We can \"fit all the wiggles\" — if you have enough parameters [\"you can fit an elephant\"](https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant).\n",
"* However, as the degree increases beyond 3 (the \"ground truth\" underlying model), the **error on the training data stops decreasing** and soon **begins to increase**. By \"fitting the wiggles\" of the training data, we have begun to \"overfit\" the problem, and the fit actually gets *worse* in between the training points.\n",
"* The test error is rather fragile, and susceptible to large random fluctuations if we repeat our experiment. (Though these fluctuations get smaller if we generate more data.) **Precisely detecting when overfitting begins can be difficult**.\n",
"* The test error is rather fragile, and susceptible to large random fluctuations if we repeat our experiment. (Though these fluctuations get smaller if we generate more data. Many cross-validation methods proceed through multiple \"rounds\" in which different subsets are used as training and test data.) **Precisely detecting when overfitting begins can be difficult**.\n",
"* Overfitting is especially likely **when the number of parameters becomes comparable to the size of the training set**. Here, we see significant overfitting even when the number of parameters is 1/4 the size of the training set! (This is a huge challenge for neural-network models, which often have a vast number of parameters.)"
]
},