unifying notation, to let L denote loss (instead of e)

NT 2022-05-09 12:13:20 +02:00
parent 437c63be41
commit 0cf71a309b
5 changed files with 10 additions and 11 deletions


```diff
@@ -109,11 +109,11 @@ of the chain of function compositions of the $\mathcal P_i$ at some current state
 E.g., for two of them
 $$
-\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} }|_{\mathbf{u}^n}
+\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} } \Big|_{\mathbf{u}^n}
 =
-\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} }|_{\mathcal P_2(\mathbf{u}^n)}
+\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} } \big|_{\mathcal P_2(\mathbf{u}^n)}
 \ 
-\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} }|_{\mathbf{u}^n} \ ,
+\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} } \big|_{\mathbf{u}^n} \ ,
 $$
 which is just the vector valued version of the "classic" chain rule
```
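This identity is easy to sanity-check with auto-diff. A minimal sketch (not part of this commit; it assumes PyTorch, and `P1`, `P2`, `u_n` are made-up stand-ins for $\mathcal P_1$, $\mathcal P_2$, and $\mathbf{u}^n$):

```python
import torch
from torch.autograd.functional import jacobian

def P2(u):  # hypothetical operator P2: R^2 -> R^2
    return torch.sin(u) + u ** 2

def P1(u):  # hypothetical operator P1: R^2 -> R^2
    return torch.stack([u[0] * u[1], u.sum()])

u_n = torch.tensor([0.3, -1.2])

# d(P1 ∘ P2)/du at u^n, versus dP1/du at P2(u^n) times dP2/du at u^n
J_composed = jacobian(lambda u: P1(P2(u)), u_n)
J_chain = jacobian(P1, P2(u_n)) @ jacobian(P2, u_n)

print(torch.allclose(J_composed, J_chain))  # True, up to float tolerance
```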


```diff
@@ -17,7 +17,7 @@ $$ (learn-base)
 where $y^*$ denotes reference or "ground truth" solutions.
 $f^*(x)$ should be approximated with an NN representation $f(x;\theta)$. We typically determine $f$
-with the help of some variant of an error function $e(y,y^*)$, where $y=f(x;\theta)$ is the output
+with the help of some variant of a loss function $L(y,y^*)$, where $y=f(x;\theta)$ is the output
 of the NN.
 This gives a minimization problem to find $f(x;\theta)$ such that $e$ is minimized.
 In the simplest case, we can use an $L^2$ error, giving
@@ -28,10 +28,9 @@ $$ (learn-l2)
 We typically optimize, i.e. _train_,
 with a stochastic gradient descent (SGD) optimizer of choice, e.g. Adam {cite}`kingma2014adam`.
-We'll rely on auto-diff to compute the gradient of a scalar loss $L$ w.r.t. the weights, $\partial L / \partial \theta$,
-We will also assume that $e$ denotes a _scalar_ error function (also
-called cost, or objective function).
-It is crucial for the efficient calculation of gradients that this function is scalar.
+We'll rely on auto-diff to compute the gradient of a _scalar_ loss $L$ w.r.t. the weights, $\partial L / \partial \theta$.
+It is crucial for the calculation of gradients that this function is scalar,
+and the loss function is often also called "error", "cost", or "objective" function.
 <!-- general goal, minimize E for e(x,y) ... cf. eq. 8.1 from DLbook
 introduce scalar loss, always(!) scalar... (also called *cost* or *objective* function) -->
```
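To make the scalar requirement concrete, a small sketch (not from the commit; assuming PyTorch, with illustrative shapes and names) of computing $\partial L / \partial \theta$ for an $L^2$ loss:

```python
import torch

f = torch.nn.Linear(3, 1)          # stand-in for the NN f(x; theta)
x = torch.randn(16, 3)             # inputs x_i
y_star = torch.randn(16, 1)        # references y*_i

L = ((f(x) - y_star) ** 2).sum()   # scalar L2 loss L(y, y*)
grads = torch.autograd.grad(L, list(f.parameters()))  # dL/dtheta, one tensor per parameter
```

Because `L` is a scalar, `torch.autograd.grad` needs no extra arguments; for a non-scalar output one would have to supply `grad_outputs`, which is exactly why the text insists on a scalar loss.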


```diff
@@ -44,7 +44,7 @@ therefore help to _pin down_ the solution in certain places.
 Now our training objective becomes
 $$
-\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y_i \big)^2 + \alpha_1 R(x_i) ,
+\text{arg min}_{\theta} \ \alpha_0 \sum_i \big( f(x_i ; \theta)-y^*_i \big)^2 + \alpha_1 R(x_i) ,
 $$ (physloss-training)
 where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and
```
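A sketch of this combined objective (not part of the commit; assuming PyTorch, with a placeholder residual `R` applied to the prediction so that its gradient actually reaches $\theta$):

```python
import torch

def R(y):
    # hypothetical physics residual of the prediction, e.g. a discretized PDE constraint
    return (y ** 2).sum()

alpha0, alpha1 = 1.0, 0.1          # example weighting hyperparameters
f = torch.nn.Linear(3, 1)
x = torch.randn(16, 3)
y_star = torch.randn(16, 1)

y = f(x)
L = alpha0 * ((y - y_star) ** 2).sum() + alpha1 * R(y)
L.backward()                       # both terms contribute to dL/dtheta
```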

Binary file not shown.


```diff
@@ -21,12 +21,12 @@ but instead we obtain it via a minimization problem:
 by adjusting the weights $\theta$ of our NN representation of $f$ such that
 $$
-\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
+\text{arg min}_{\theta} \sum_i \Big(f(x_i ; \theta)-y^*_i \Big)^2 .
 $$ (supervised-training)
 This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
 our choice of $f$ and the hyperparameters for training. Note that above we've assumed
-the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
+the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$ in the loss $L$
 to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
 of a suitable metric is a topic we will get back to later on.
```
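A sketch of this more general form (not part of the commit; assuming PyTorch, with an $L^1$ metric standing in for $e$):

```python
import torch

def e(y, y_star):
    # example error metric: L1 distance; swap this for other choices of e
    return (y - y_star).abs().sum()

f = torch.nn.Linear(3, 1)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
x = torch.randn(16, 3)
y_star = torch.randn(16, 1)

for step in range(100):            # minimal loop for arg min_theta sum_i e(f(x_i; theta), y*_i)
    opt.zero_grad()
    L = e(f(x), y_star)            # scalar loss built from the metric e
    L.backward()
    opt.step()
```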