From 1898b763a636f9d9b89a31400019e79720941dc7 Mon Sep 17 00:00:00 2001
From: "Steven G. Johnson"
Date: Mon, 30 Sep 2024 09:42:46 -0400
Subject: [PATCH] link

---
 notes/Least-Square Fitting.ipynb | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/notes/Least-Square Fitting.ipynb b/notes/Least-Square Fitting.ipynb
index a100ea4..fa520eb 100644
--- a/notes/Least-Square Fitting.ipynb
+++ b/notes/Least-Square Fitting.ipynb
@@ -1386,6 +1386,8 @@
     "* **Training** data is what we use to form our fit/model — it goes into $A$ and $b$ in a least-square fit. This is usually *most* of the data\n",
     "* **Test** data is a subset of the data used to *check* whether the fit is actually doing a good job on the underlying problem. Sometimes, this is further subdivided into \"validation\" data that is used *during* training while tuning hyperparameters like the degree of polynomial being fitted, versus \"test\" data that is used as a final check after everything is done.\n",
     "\n",
+    "Similar concepts are known by many different names. The ability of a model to predict previously unseen test data is sometimes called [\"generalizability\"](https://en.wikipedia.org/wiki/Generalization_error). Splitting the data into training and test/validation data is also called [\"cross-validation\"](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).\n",
+    "\n",
     "Here, let's consider the same problem as above: 50 data points from a degree-3 polynomial $1 + 2a + 3a^2 + 4a^3$ plus noise, but\n",
     "* We'll use a random subset of **20% of the data for testing**, with the remaining 80% for training/fitting.\n",
     "* For each degree $n$, we'll do a least-squares fit on the training data, and then evaluate the **root-mean-square error on the test data**."
@@ -1483,7 +1485,7 @@
     "\n",
     "* As we increase the degree (i.e. add more fit parameters), the **error on the training data decreases**. We can \"fit all the wiggles\" — if you have enough parameters [\"you can fit an elephant\"](https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant).\n",
     "* However, as the degree increases beyond 3 (the \"ground truth\" underlying model), the **error on the test data stops decreasing** and soon **begins to increase**. By \"fitting the wiggles\" of the training data, we have begun to \"overfit\" the problem, and the fit actually gets *worse* in between the training points.\n",
-    "* The test error is rather fragile, and susceptible to large random fluctuations if we repeat our experiment. (Though these fluctuations get smaller if we generate more data.) **Precisely detecting when overfitting begins can be difficult**.\n",
+    "* The test error is rather fragile, and susceptible to large random fluctuations if we repeat our experiment. (Though these fluctuations get smaller if we generate more data. Many cross-validation methods proceed through multiple \"rounds\" in which different subsets are used as training and test data.) **Precisely detecting when overfitting begins can be difficult**.\n",
     "* Overfitting is especially likely **when the number of parameters becomes comparable to the size of the training set**. Here, we see significant overfitting even when the number of parameters is 1/4 the size of the training set! (This is a huge challenge for neural-network models, which often have a vast number of parameters.)"
    ]
   },
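
For reference, here is a minimal sketch of the train/test experiment the patched cell describes: fit polynomials of increasing degree to 50 noisy samples of 1 + 2a + 3a^2 + 4a^3, using a random 80/20 train/test split, and report the root-mean-square error on the held-out points. This is not the notebook's actual code; it assumes Julia (the notebook's language), and the sample interval, noise level, random seed, and degree range are illustrative choices, not values from the notes.

using Random

Random.seed!(1234)                        # assumed seed, for reproducibility

a = collect(range(-1, 1, length=50))      # 50 sample points (interval is an assumed choice)
b = 1 .+ 2 .* a .+ 3 .* a.^2 .+ 4 .* a.^3 .+ 0.1 .* randn(50)  # degree-3 polynomial plus noise

perm   = randperm(50)
itest  = perm[1:10]                       # random 20% of the data for testing
itrain = perm[11:end]                     # remaining 80% for training/fitting

# matrix with columns 1, a, a^2, ..., a^n (Vandermonde-style), as in the fits above
polymatrix(a, n) = [ai^j for ai in a, j in 0:n]

rms(x) = sqrt(sum(abs2, x) / length(x))

for n in 0:15
    c = polymatrix(a[itrain], n) \ b[itrain]                   # least-squares fit on training data only
    trainerr = rms(polymatrix(a[itrain], n) * c - b[itrain])   # RMS error on the training data
    testerr  = rms(polymatrix(a[itest],  n) * c - b[itest])    # RMS error on the held-out test data
    println("degree $n: train RMS = $trainerr, test RMS = $testerr")
end

With such a sketch, the training RMS keeps shrinking as the degree grows, while the test RMS typically levels off and then rises past degree 3, illustrating the overfitting behavior discussed in the second hunk.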