unifying notation, to let L denote loss (instead of e)

NT 2022-05-09 12:13:20 +02:00
parent 437c63be41
commit 0cf71a309b
5 changed files with 10 additions and 11 deletions

View File

@@ -109,11 +109,11 @@ of the chain of function compositions of the $\mathcal P_i$ at some current state
E.g., for two of them
$$
-\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} }|_{\mathbf{u}^n}
+\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} } \Big|_{\mathbf{u}^n}
=
-\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} }|_{\mathcal P_2(\mathbf{u}^n)}
+\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} } \big|_{\mathcal P_2(\mathbf{u}^n)}
\
-\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} }|_{\mathbf{u}^n} \ ,
+\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} } \big|_{\mathbf{u}^n} \ ,
$$
which is just the vector-valued version of the "classic" chain rule
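
A quick numerical check of this relation, as a minimal JAX sketch; the operators `P1` and `P2` below are hypothetical stand-ins for solver steps, not code from the changed files:

```python
import jax
import jax.numpy as jnp

def P2(u):                      # stand-in for a first solver step
    return jnp.sin(u) + 0.1 * u**2

def P1(u):                      # stand-in for a second operator
    return jnp.tanh(2.0 * u)

u_n = jnp.array([0.3, -0.7, 1.2])   # current state u^n

# Jacobian of the composition P1(P2(u)) at u^n ...
J_comp = jax.jacobian(lambda u: P1(P2(u)))(u_n)
# ... matches the product of the individual Jacobians, with
# P1's Jacobian evaluated at P2(u^n), as in the equation above:
J_prod = jax.jacobian(P1)(P2(u_n)) @ jax.jacobian(P2)(u_n)
assert jnp.allclose(J_comp, J_prod, atol=1e-6)
```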

View File

@@ -17,7 +17,7 @@ $$ (learn-base)
where $y^*$ denotes reference or "ground truth" solutions.
$f^*(x)$ should be approximated with an NN representation $f(x;\theta)$. We typically determine $f$
-with the help of some variant of an error function $e(y,y^*)$, where $y=f(x;\theta)$ is the output
+with the help of some variant of a loss function $L(y,y^*)$, where $y=f(x;\theta)$ is the output
of the NN.
This gives a minimization problem to find $f(x;\theta)$ such that $e$ is minimized.
In the simplest case, we can use an $L^2$ error, giving
@@ -28,10 +28,9 @@ $$ (learn-l2)
We typically optimize, i.e. _train_,
with a stochastic gradient descent (SGD) optimizer of choice, e.g. Adam {cite}`kingma2014adam`.
-We'll rely on auto-diff to compute the gradient of a scalar loss $L$ w.r.t. the weights, $\partial L / \partial \theta$,
-We will also assume that $e$ denotes a _scalar_ error function (also
-called cost, or objective function).
-It is crucial for the efficient calculation of gradients that this function is scalar.
+We'll rely on auto-diff to compute the gradient of a _scalar_ loss $L$ w.r.t. the weights, $\partial L / \partial \theta$.
+It is crucial for the calculation of gradients that this function is scalar,
+and the loss function is often also called "error", "cost", or "objective" function.
<!-- general goal, minimize E for e(x,y) ... cf. eq. 8.1 from DLbook
introduce scalar loss, always(!) scalar... (also called *cost* or *objective* function) -->
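
A minimal sketch of this setup; a hypothetical two-parameter linear model stands in for the NN $f(x;\theta)$, and a plain gradient-descent step stands in for an SGD variant like Adam:

```python
import jax
import jax.numpy as jnp

def f(x, theta):                 # hypothetical stand-in for the NN f(x; theta)
    return theta[0] * x + theta[1]

def L(theta, x, y_star):
    # scalar L^2 loss: the sum reduces all per-sample errors to one number,
    # which is what makes the reverse-mode gradient computation efficient
    return jnp.sum((f(x, theta) - y_star) ** 2)

x      = jnp.linspace(0.0, 1.0, 8)
y_star = 2.0 * x + 1.0           # illustrative ground-truth values
theta  = jnp.array([0.5, 0.0])

grad_L = jax.grad(L)(theta, x, y_star)   # dL/dtheta via auto-diff
theta  = theta - 1e-2 * grad_L           # one plain gradient-descent step
```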

View File

@@ -44,7 +44,7 @@ therefore help to _pin down_ the solution in certain places.
Now our training objective becomes
$$
-\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y_i \big)^2 + \alpha_1 R(x_i) ,
+\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y^*_i \big)^2 + \alpha_1 R(x_i) ,
$$ (physloss-training)
where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and
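
As a sketch of how the two terms combine: both `f` and the residual `R` below are hypothetical placeholders, and the `alpha_0`, `alpha_1` values are purely illustrative:

```python
import jax
import jax.numpy as jnp

def f(x, theta):                         # placeholder NN
    return theta[0] * x + theta[1]

def R(x, theta):
    # placeholder physics residual; a real R would evaluate the PDE
    # residual of f at the points x (here: penalize df/dx != 2)
    dfdx = jax.vmap(jax.grad(lambda xi: f(xi, theta)))(x)
    return jnp.sum((dfdx - 2.0) ** 2)

def objective(theta, x, y_star, alpha_0=1.0, alpha_1=0.1):
    supervised = jnp.sum((f(x, theta) - y_star) ** 2)
    return alpha_0 * supervised + alpha_1 * R(x, theta)

x      = jnp.linspace(0.0, 1.0, 8)
y_star = 2.0 * x + 1.0
theta  = jnp.array([0.5, 0.0])
grads  = jax.grad(objective)(theta, x, y_star)   # drives both terms at once
```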

Binary file not shown.

View File

@@ -21,12 +21,12 @@ but instead we obtain it via a minimization problem:
by adjusting the weights $\theta$ of our NN representation of $f$ such that
$$
-\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
+\text{arg min}_{\theta} \sum_i \Big(f(x_i ; \theta)-y^*_i \Big)^2 .
$$ (supervised-training)
This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
our choice of $f$ and the hyperparameters for training. Note that above we've assumed
-the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
+the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$ in the loss $L$
to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
of a suitable metric is a topic we will get back to later on.
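
The general form with an exchangeable metric $e$, sketched below; names are illustrative, and the $L^2$ choice recovers (supervised-training):

```python
import jax
import jax.numpy as jnp

def f(x, theta):                          # placeholder NN
    return theta[0] * x + theta[1]

def e_l2(y, y_star):                      # the L^2 metric from above
    return (y - y_star) ** 2

def e_l1(y, y_star):                      # an alternative metric, same pipeline
    return jnp.abs(y - y_star)

def L(theta, x, y_star, e=e_l2):
    return jnp.sum(e(f(x, theta), y_star))   # scalar loss for any metric e

x      = jnp.linspace(0.0, 1.0, 8)
y_star = 2.0 * x + 1.0                    # illustrative references
theta  = jnp.array([0.5, 0.0])
for _ in range(100):                      # plain gradient descent
    theta = theta - 1e-2 * jax.grad(L)(theta, x, y_star)
```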