unifying notation, to let L denote loss (instead of e)

parent 437c63be41
commit 0cf71a309b
@@ -109,11 +109,11 @@ of the chain of function compositions of the $\mathcal P_i$ at some current stat
 E.g., for two of them
 
 $$
-\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} }|_{\mathbf{u}^n}
+\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} } \Big|_{\mathbf{u}^n}
 =
-\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} }|_{\mathcal P_2(\mathbf{u}^n)}
+\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} } \big|_{\mathcal P_2(\mathbf{u}^n)}
 \
-\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} }|_{\mathbf{u}^n} \ ,
+\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} } \big|_{\mathbf{u}^n} \ ,
 $$
 
 which is just the vector valued version of the "classic" chain rule
@@ -17,7 +17,7 @@ $$ (learn-base)
 
 where $y^*$ denotes reference or "ground truth" solutions.
 $f^*(x)$ should be approximated with an NN representation $f(x;\theta)$. We typically determine $f$
-with the help of some variant of an error function $e(y,y^*)$, where $y=f(x;\theta)$ is the output
+with the help of some variant of a loss function $L(y,y^*)$, where $y=f(x;\theta)$ is the output
 of the NN.
 This gives a minimization problem to find $f(x;\theta)$ such that $e$ is minimized.
 In the simplest case, we can use an $L^2$ error, giving
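For concreteness, the simplest $L^2$ variant of such a loss is a one-liner; a sketch in `jax`, with the argument names `y` and `y_star` assumed:

```python
import jax.numpy as jnp

# L(y, y*) for the simplest case: a squared L2 difference between
# the NN output y = f(x; theta) and the reference y*.
def L(y, y_star):
    return jnp.sum((y - y_star) ** 2)
```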
@@ -28,10 +28,9 @@ $$ (learn-l2)
 
 We typically optimize, i.e. _train_,
 with a stochastic gradient descent (SGD) optimizer of choice, e.g. Adam {cite}`kingma2014adam`.
-We'll rely on auto-diff to compute the gradient of a scalar loss $L$ w.r.t. the weights, $\partial L / \partial \theta$,
-We will also assume that $e$ denotes a _scalar_ error function (also
-called cost, or objective function).
-It is crucial for the efficient calculation of gradients that this function is scalar.
+We'll rely on auto-diff to compute the gradient of a _scalar_ loss $L$ w.r.t. the weights, $\partial L / \partial \theta$.
+It is crucial for the calculation of gradients that this function is scalar,
+and the loss function is often also called "error", "cost", or "objective" function.
 
 <!-- general goal, minimize E for e(x,y) ... cf. eq. 8.1 from DLbook
 introduce scalar loss, always(!) scalar... (also called *cost* or *objective* function) -->
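The scalar requirement maps directly onto reverse-mode auto-diff APIs; here is a minimal `jax` sketch, where the toy network `f` and its parameters are assumptions purely for illustration:

```python
import jax
import jax.numpy as jnp

# Toy NN f(x; theta): a single dense layer, purely illustrative.
def f(x, theta):
    W, b = theta
    return jnp.tanh(W @ x + b)

def loss(theta, x, y_star):
    return jnp.sum((f(x, theta) - y_star) ** 2)  # scalar loss L

theta = (0.1 * jnp.ones((2, 3)), jnp.zeros(2))
x, y_star = jnp.array([1.0, 2.0, 3.0]), jnp.array([0.5, -0.5])

# jax.grad requires a scalar-valued function -- exactly the
# "scalar loss" condition above -- and returns dL/dtheta in a
# single reverse-mode pass.
grads = jax.grad(loss)(theta, x, y_star)
```

Note that `jax.grad` raises an error for non-scalar outputs, which is one concrete reason the loss has to be reduced to a single number before differentiation.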
@@ -44,7 +44,7 @@ therefore help to _pin down_ the solution in certain places.
 Now our training objective becomes
 
 $$
-\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y_i \big)^2 + \alpha_1 R(x_i) ,
+\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y^*_i \big)^2 + \alpha_1 R(x_i) ,
 $$ (physloss-training)
 
 where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and
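As a sketch of how this weighted objective looks in code (again `jax`; the network `f` and the residual `R` below are illustrative placeholders, not the book's actual implementations, and the real $R$ depends on the PDE at hand):

```python
import jax
import jax.numpy as jnp

# Placeholder network and physics residual, just to make the
# objective below runnable.
def f(x, theta):
    W, b = theta
    return jnp.tanh(W @ x + b)

def R(x):
    return jnp.sum(x ** 2)  # stand-in for a PDE residual R(x)

def objective(theta, xs, ys_star, alpha_0=1.0, alpha_1=0.1):
    preds = jax.vmap(lambda x: f(x, theta))(xs)
    supervised = jnp.sum((preds - ys_star) ** 2)  # alpha_0 term
    physics = jnp.sum(jax.vmap(R)(xs))            # alpha_1 term
    return alpha_0 * supervised + alpha_1 * physics
```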
Binary file not shown.
@@ -21,12 +21,12 @@ but instead we obtain it via a minimization problem:
 by adjusting the weights $\theta$ of our NN representation of $f$ such that
 
 $$
-\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
+\text{arg min}_{\theta} \sum_i \Big(f(x_i ; \theta)-y^*_i \Big)^2 .
 $$ (supervised-training)
 
 This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
 our choice of $f$ and the hyperparameters for training. Note that above we've assumed
-the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
+the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$ in the loss $L$
 to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
 of a suitable metric is a topic we will get back to later on.
 
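A compact end-to-end sketch of this minimization with plain gradient descent in `jax`; the network, the random data, and the learning rate are all made-up stand-ins:

```python
import jax
import jax.numpy as jnp

def f(x, theta):
    W, b = theta
    return jnp.tanh(W @ x + b)

def loss(theta, xs, ys_star):
    preds = jax.vmap(lambda x: f(x, theta))(xs)
    return jnp.sum((preds - ys_star) ** 2)  # eq. (supervised-training)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
theta = (0.1 * jax.random.normal(k1, (2, 3)), jnp.zeros(2))
xs = jax.random.normal(k2, (16, 3))       # inputs x_i
ys_star = jax.random.normal(k3, (16, 2))  # references y*_i

# Plain gradient descent on theta; in practice one would use
# mini-batches and an optimizer such as Adam (e.g. via optax).
lr = 1e-2
for _ in range(200):
    grads = jax.grad(loss)(theta, xs, ys_star)
    theta = jax.tree_util.tree_map(lambda p, g: p - lr * g, theta, grads)
```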