unifying notation, to let L denote loss (instead of e)

NT 2022-05-09 12:13:20 +02:00
parent 437c63be41
commit 0cf71a309b
5 changed files with 10 additions and 11 deletions


```diff
@@ -109,11 +109,11 @@ of the chain of function compositions of the $\mathcal P_i$ at some current state
 E.g., for two of them
 $$
-\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} }|_{\mathbf{u}^n}
+\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} } \Big|_{\mathbf{u}^n}
 =
-\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} }|_{\mathcal P_2(\mathbf{u}^n)}
+\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} } \big|_{\mathcal P_2(\mathbf{u}^n)}
 \ 
-\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} }|_{\mathbf{u}^n} \ ,
+\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} } \big|_{\mathbf{u}^n} \ ,
 $$
 which is just the vector valued version of the "classic" chain rule
```
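This identity is easy to sanity-check with auto-diff. A minimal sketch (not part of this commit; it assumes PyTorch, and `P1`, `P2`, `u_n` are made-up stand-ins for $\mathcal P_1$, $\mathcal P_2$, and $\mathbf{u}^n$):

```python
import torch
from torch.autograd.functional import jacobian

def P2(u):  # hypothetical operator P2: R^2 -> R^2
    return torch.sin(u) + u ** 2

def P1(u):  # hypothetical operator P1: R^2 -> R^2
    return torch.stack([u[0] * u[1], u.sum()])

u_n = torch.tensor([0.3, -1.2])

# d(P1 ∘ P2)/du at u^n, versus dP1/du at P2(u^n) times dP2/du at u^n
J_composed = jacobian(lambda u: P1(P2(u)), u_n)
J_chain = jacobian(P1, P2(u_n)) @ jacobian(P2, u_n)

print(torch.allclose(J_composed, J_chain))  # True, up to float tolerance
```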


```diff
@@ -17,7 +17,7 @@ $$ (learn-base)
 where $y^*$ denotes reference or "ground truth" solutions.
 $f^*(x)$ should be approximated with an NN representation $f(x;\theta)$. We typically determine $f$
-with the help of some variant of an error function $e(y,y^*)$, where $y=f(x;\theta)$ is the output
+with the help of some variant of a loss function $L(y,y^*)$, where $y=f(x;\theta)$ is the output
 of the NN.
 This gives a minimization problem to find $f(x;\theta)$ such that $e$ is minimized.
 In the simplest case, we can use an $L^2$ error, giving
@@ -28,10 +28,9 @@ $$ (learn-l2)
 We typically optimize, i.e. _train_,
 with a stochastic gradient descent (SGD) optimizer of choice, e.g. Adam {cite}`kingma2014adam`.
-We'll rely on auto-diff to compute the gradient of a scalar loss $L$ w.r.t. the weights, $\partial L / \partial \theta$,
-We will also assume that $e$ denotes a _scalar_ error function (also
-called cost, or objective function).
-It is crucial for the efficient calculation of gradients that this function is scalar.
+We'll rely on auto-diff to compute the gradient of a _scalar_ loss $L$ w.r.t. the weights, $\partial L / \partial \theta$.
+It is crucial for the calculation of gradients that this function is scalar,
+and the loss function is often also called "error", "cost", or "objective" function.
 <!-- general goal, minimize E for e(x,y) ... cf. eq. 8.1 from DLbook
 introduce scalar loss, always(!) scalar... (also called *cost* or *objective* function) -->
```
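To make the scalar requirement concrete, a small sketch (not from the commit; assuming PyTorch, with illustrative shapes and names) of computing $\partial L / \partial \theta$ for an $L^2$ loss:

```python
import torch

f = torch.nn.Linear(3, 1)          # stand-in for the NN f(x; theta)
x = torch.randn(16, 3)             # inputs x_i
y_star = torch.randn(16, 1)        # references y*_i

L = ((f(x) - y_star) ** 2).sum()   # scalar L2 loss L(y, y*)
grads = torch.autograd.grad(L, list(f.parameters()))  # dL/dtheta, one tensor per parameter
```

Because `L` is a scalar, `torch.autograd.grad` needs no extra arguments; for a non-scalar output one would have to supply `grad_outputs`, which is exactly why the text insists on a scalar loss.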


```diff
@@ -44,7 +44,7 @@ therefore help to _pin down_ the solution in certain places.
 Now our training objective becomes
 $$
-\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y_i \big)^2 + \alpha_1 R(x_i) ,
+\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y^*_i \big)^2 + \alpha_1 R(x_i) ,
 $$ (physloss-training)
 where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and
```
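A sketch of this combined objective (not part of the commit; assuming PyTorch, with a placeholder residual `R` applied to the prediction so that its gradient actually reaches $\theta$):

```python
import torch

def R(y):
    # hypothetical physics residual of the prediction, e.g. a discretized PDE constraint
    return (y ** 2).sum()

alpha0, alpha1 = 1.0, 0.1          # example weighting hyperparameters
f = torch.nn.Linear(3, 1)
x = torch.randn(16, 3)
y_star = torch.randn(16, 1)

y = f(x)
L = alpha0 * ((y - y_star) ** 2).sum() + alpha1 * R(y)
L.backward()                       # both terms contribute to dL/dtheta
```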

Binary file not shown.


```diff
@@ -21,12 +21,12 @@ but instead we obtain it via a minimization problem:
 by adjusting the weights $\theta$ of our NN representation of $f$ such that
 $$
-\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
+\text{arg min}_{\theta} \sum_i \Big(f(x_i ; \theta)-y^*_i \Big)^2 .
 $$ (supervised-training)
 This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
 our choice of $f$ and the hyperparameters for training. Note that above we've assumed
-the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
+the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$ in the loss $L$
 to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
 of a suitable metric is a topic we will get back to later on.
```
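A sketch of this more general form (not part of the commit; assuming PyTorch, with an $L^1$ metric standing in for $e$):

```python
import torch

def e(y, y_star):
    # example error metric: L1 distance; swap this for other choices of e
    return (y - y_star).abs().sum()

f = torch.nn.Linear(3, 1)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
x = torch.randn(16, 3)
y_star = torch.randn(16, 1)

for step in range(100):            # minimal loop for arg min_theta sum_i e(f(x_i; theta), y*_i)
    opt.zero_grad()
    L = e(f(x), y_star)            # scalar loss built from the metric e
    L.backward()
    opt.step()
```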