fixes Mario (arch) and equation cleanup
parent 4589cf2860
commit 27b3940d06
make-pdf.sh (19 changed lines)
@@ -17,26 +17,9 @@ ${PYT} json-cleanup-for-pdf.py

/Users/thuerey/Library/Python/3.9/bin/jupyter-book build . --builder pdflatex

#necessary fixes for jupyter-book 1.0.3
#open book.tex in text editor:
#problem 1: replace all
#begin{align} with begin{aligned}
#end{align} with end{aligned}

#problem 2:
#\begin{equation*}
#\begin{split}
#\begin{equation} <- aligned
#...
#\end{equation} <- aligned
#\end{split}
#\end{equation*}

# manual
#xelatex book
# manual?
#xelatex book

# unused fixup-latex.py
@@ -33,10 +33,10 @@
"To regress the velocity field, we define the _flow matching_ objective\n",
"\n",
"$$\n",
"\\begin{align}\n",
"\\begin{aligned}\n",
" \\mathcal{L}_\\mathrm{FM}(\\theta) = \\mathbb{E}_{t\\sim \\mathcal{U}(0,1), x \\sim p_t(x)} ~||~ v_\\theta(x,t) - u_t(x)\n",
" ~||^2.\n",
"\\end{align}\n",
"\\end{aligned}\n",
"$$ (eq-flow-matching)\n",
"\n",
"In order to evaluate this loss, we need to sample from the probability distribution $p_t(x)$ and we need to know the velocity $u_t(x)$.\n",
@@ -50,9 +50,9 @@
"Flow networks can then be trained with the _conditional flow matching_ loss\n",
"\n",
"$$\n",
"\\begin{align}\n",
"\\begin{aligned}\n",
" \\mathcal{L}_\\mathrm{CFM}(\\phi) = \\mathbb{E}_{q(z,t),p_t(x|z)} ~||~ v_\\phi(x,t)-u_t(x|z) ~||^2.\n",
"\\end{align}\n",
"\\end{aligned}\n",
"$$ (conditional-flow-matching)\n",
"\n",
"This version is tractable and can be used for actual training runs, in contrast to the unconditional objective from equation {eq}`eq-flow-matching`.\n",
@@ -68,10 +68,10 @@
"Plugging it into the equations, we consider the coupling $q(z) = p_1(x)$ together with conditional probability and generating velocity\n",
"\n",
"$$\n",
"\\begin{align}\n",
"\\begin{aligned}\n",
" p_t(x|x_1) &= \\mathcal{N}( t x_1, (1-(1-\\sigma_\\mathrm{min})t)I) \\\\\n",
" u_t(x|x_1) &= \\frac{x_1-(1-\\sigma_\\mathrm{min})x}{1-(1-\\sigma_\\mathrm{min})t} ~.\n",
"\\end{align}\n",
"\\end{aligned}\n",
"$$ (conditional_ot_prob)\n",
"\n",
"Conditioned on $x_1$, this coupling transports a point $x_0 \\sim \\mathcal{N}(0,I)$ from the sampling distribution to the posterior distribution on the linear trajectory $t x_1$ ending in $x_1$. At the same time it decreases the standard deviation from $1$ to a smoothing constant $\\sigma_\\mathrm{min}$. In this case, the transport path coincides with the optimal transport between two Gaussian distributions.\n",
@@ -280,35 +280,7 @@
"Epoch 8/50, Loss: 2.0625\n",
"Epoch 9/50, Loss: 2.0656\n",
"Epoch 10/50, Loss: 2.0527\n",
"Epoch 11/50, Loss: 2.0776\n",
"Epoch 12/50, Loss: 2.0550\n",
"Epoch 13/50, Loss: 2.0521\n",
"Epoch 14/50, Loss: 2.0700\n",
"Epoch 15/50, Loss: 2.0398\n",
"Epoch 16/50, Loss: 2.0406\n",
"Epoch 17/50, Loss: 2.0620\n",
"Epoch 18/50, Loss: 2.0879\n",
"Epoch 19/50, Loss: 2.0883\n",
"Epoch 20/50, Loss: 2.0175\n",
"Epoch 21/50, Loss: 2.0247\n",
"Epoch 22/50, Loss: 2.0447\n",
"Epoch 23/50, Loss: 2.0342\n",
"Epoch 24/50, Loss: 2.0436\n",
"Epoch 25/50, Loss: 2.0334\n",
"Epoch 26/50, Loss: 2.0474\n",
"Epoch 27/50, Loss: 2.0547\n",
"Epoch 28/50, Loss: 2.0368\n",
"Epoch 29/50, Loss: 2.0394\n",
"Epoch 30/50, Loss: 2.0229\n",
"Epoch 31/50, Loss: 2.0394\n",
"Epoch 32/50, Loss: 2.0581\n",
"Epoch 33/50, Loss: 1.9796\n",
"Epoch 34/50, Loss: 2.0157\n",
"Epoch 35/50, Loss: 2.0325\n",
"Epoch 36/50, Loss: 2.0319\n",
"Epoch 37/50, Loss: 2.0511\n",
"Epoch 38/50, Loss: 2.0593\n",
"Epoch 39/50, Loss: 1.9919\n",
"...\n",
"Epoch 40/50, Loss: 2.0522\n",
"Epoch 41/50, Loss: 2.0709\n",
"Epoch 42/50, Loss: 2.0460\n",
@@ -38,10 +38,7 @@ $
where $\beta_r \in (0,1)$, and $Z^0 \equiv Z(t)$. Any ${Z}^r$ can be sampled directly via:

$$
\begin{equation}
{Z}^r = \sqrt{\bar{\alpha}_r} {Z}^0 + \sqrt{1-\bar{\alpha}_r} {\epsilon},
\label{eq:noise}
\end{equation}
$$

with $\alpha_r := 1 - \beta_r$, $\bar{\alpha}_r := \prod_{s=1}^r \alpha_s$ and ${\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
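As a small, hedged illustration (the linear $\beta$ schedule and the helper name are assumptions, not code from this chapter), the closed-form jump to ${Z}^r$ can be written as:

```python
import torch

betas = torch.linspace(1e-4, 2e-2, 1000)        # assumed linear beta schedule, R = 1000
alphas = 1.0 - betas                            # alpha_r = 1 - beta_r
alpha_bar = torch.cumprod(alphas, dim=0)        # abar_r = prod_s alpha_s

def sample_zr(z0, r):
    """Sample Z^r directly from Z^0: Z^r = sqrt(abar_r) Z^0 + sqrt(1 - abar_r) eps."""
    eps = torch.randn_like(z0)                  # eps ~ N(0, I)
    zr = alpha_bar[r].sqrt() * z0 + (1.0 - alpha_bar[r]).sqrt() * eps
    return zr, eps
```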
@@ -52,12 +49,9 @@ $
where the mean and variance are parameterized as:

$$
\begin{equation}
{\mu}_\theta^r = \frac{1}{\sqrt{\alpha_r}} \left( {Z}^r - \frac{\beta_r}{\sqrt{1-\bar{\alpha}_r}} {\epsilon}_\theta^r \right),
\qquad
{\Sigma}_\theta^r = \exp\left( \mathbf{v}_\theta^r \log \beta_r + (1-\mathbf{v}_\theta^r)\log \tilde{\beta}_r \right),
\label{eq:mu_param}
\end{equation}
$$

with $\tilde{\beta}_r := (1 - \bar{\alpha}_{r-1}) / (1 - \bar{\alpha}_r) \beta_r$. Here, ${\epsilon}_\theta^r \in \mathbb{R}^{|\mathcal{V}| \times F}$ predicts the noise ${\epsilon}$ in equation~(\ref{eq:noise}), and $\mathbf{v}_\theta^r \in \mathbb{R}^{|\mathcal{V}| \times F}$ interpolates between the two bounds of the process' entropy, $\beta_r$ and $\tilde{\beta}_r$.
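A hedged sketch of how these moments could be assembled from the network outputs (here `eps_pred` and `v_pred` stand in for ${\epsilon}_\theta^r$ and $\mathbf{v}_\theta^r$; the 0-based indexing and the reuse of the `betas`/`alpha_bar` schedule from the previous sketch are assumptions):

```python
import torch

def reverse_moments(zr, eps_pred, v_pred, r, betas, alpha_bar):
    """Mean mu and variance Sigma of the reverse step, following the parameterization above (r >= 1)."""
    beta_r = betas[r]
    beta_tilde = (1.0 - alpha_bar[r - 1]) / (1.0 - alpha_bar[r]) * beta_r
    mu = (zr - beta_r / (1.0 - alpha_bar[r]).sqrt() * eps_pred) / (1.0 - beta_r).sqrt()
    sigma = torch.exp(v_pred * torch.log(beta_r) + (1.0 - v_pred) * torch.log(beta_tilde))
    return mu, sigma
```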
@@ -65,9 +59,7 @@ with $\tilde{\beta}_r := (1 - \bar{\alpha}_{r-1}) / (1 - \bar{\alpha}_r) \beta_r
DGNs predict ${\epsilon}_\theta^r$ and $\mathbf{v}_\theta^r$ using a regular message-passing-based GNN {cite}`sanchez2020learning`. This takes ${Z}^{r-1}$ as input, and it is conditioned on graph ${\mathcal{G}}$, its node and edge features, and the diffusion step $r$:

$$
\begin{equation}
[{\epsilon}_\theta^r, \mathbf{v}_\theta^r] \leftarrow \text{{DGN}}_\theta({Z}^{r-1}, {\mathcal{G}}, {V}_c, {E}_c, r).
\end{equation}
$$

The _DGN_ network is trained using the loss function in equation~(\ref{eq:loss}). The full denoising process requires $R$ evaluations of the DGN to transition from ${Z}^R$ to ${Z}^0$.
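The full reverse process then boils down to a loop over the $R$ diffusion steps. The sketch below is purely illustrative: `dgn` is a stand-in for the trained network, the step-indexing convention is simplified, and it reuses the `reverse_moments` helper from the previous sketch.

```python
import torch

def denoise(dgn, zR, graph, Vc, Ec, betas, alpha_bar):
    """Run the reverse process from Z^R down to Z^0, one DGN call per step."""
    z = zR
    for r in range(len(betas) - 1, 0, -1):                     # r = R-1, ..., 1
        eps_pred, v_pred = dgn(z, graph, Vc, Ec, r)            # network outputs for the current state
        mu, sigma = reverse_moments(z, eps_pred, v_pred, r, betas, alpha_bar)
        z = mu + sigma.sqrt() * torch.randn_like(z)            # sample the next, less noisy state
    return z
```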
@@ -75,14 +67,12 @@ The _DGN_ network is trained using the loss function in equation~(\ref{eq:loss})
DGN follows the widely used encoder-processor-decoder GNN architecture. In addition to the node and edge encoders, our encoder includes a diffusion-step encoder, which generates a vector ${r}_\text{emb} \in \mathbb{R}^{F_\text{emb}}$ that embeds the diffusion step $r$. The node encoder processes the conditional node features ${v}_i^c$, alongside ${r}_\text{emb}$. Specifically, the diffusion-step encoder and the node encoder operate as follows:

$$
\begin{equation}
{r}_\text{emb} \leftarrow
\phi \circ {\small Linear} \circ {\small SinEmb} (r),
\quad
{v}_i \leftarrow {\small Linear} \left( \left[ \phi \circ {\small Linear} ({v}_i^c) \ | \ {r}_\text{emb}
\right] \right),
\quad \forall i \in \mathcal{V},
\end{equation}
$$

where $\phi$ denotes the activation function and ${\small SinEmb}$ is the sinusoidal embedding function. The edge encoder applies a linear layer to the conditional edge features ${e}_{ij}^c$.
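A self-contained sketch of such a diffusion-step and node encoder is given below; the layer sizes, the GELU activation, and the frequency scaling of the sinusoidal embedding are assumptions for illustration, not necessarily the choices of the actual DGN implementation.

```python
import math
import torch
import torch.nn as nn

def sin_emb(r, dim):
    """Sinusoidal embedding of the diffusion step r, returning a [1, dim] vector."""
    r = torch.as_tensor(r, dtype=torch.float32).reshape(1, 1)
    freqs = torch.exp(torch.arange(dim // 2) * (-math.log(10000.0) / (dim // 2)))
    angles = r * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class NodeEncoder(nn.Module):
    """r_emb <- phi(Linear(SinEmb(r)));  v_i <- Linear([phi(Linear(v_i^c)) | r_emb])."""
    def __init__(self, in_dim, emb_dim, out_dim):
        super().__init__()
        self.phi = nn.GELU()
        self.step_lin = nn.Linear(emb_dim, emb_dim)
        self.node_lin = nn.Linear(in_dim, out_dim)
        self.merge_lin = nn.Linear(out_dim + emb_dim, out_dim)

    def forward(self, v_c, r):
        r_emb = self.phi(self.step_lin(sin_emb(r, self.step_lin.in_features)))
        v = self.phi(self.node_lin(v_c))                      # encode conditional node features
        r_rep = r_emb.expand(v.shape[0], -1)                  # same step embedding for every node
        return self.merge_lin(torch.cat([v, r_rep], dim=-1))
```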
@@ -110,9 +100,7 @@ In this configuration, the VGAE captures high-frequency information (e.g., spati
For the VGAE, an encoder-decoder architecture is used with an additional condition encoder to handle conditioning inputs (Figure~\ref{fig:diagram}a). The condition encoder processes ${V}_c$ and ${E}_c$, encoding these into latent node features ${V}^\ell_c$ and edge features ${E}^\ell_c$ across $L$ graphs $\{{\mathcal{G}}^\ell := ({\mathcal{V}}^\ell, {\mathcal{E}}^\ell) \mid 1 \leq \ell \leq L\}$, where ${\mathcal{G}}^1 \equiv {\mathcal{G}}$ and the size of the graphs decreases progressively, i.e., $|{\mathcal{V}}^1| > |{\mathcal{V}}^2| > \dots > |{\mathcal{V}}^L|$. This transformation begins by linearly projecting ${V}_c$ and ${E}_c$ to an $F_\text{ae}$-dimensional space and applying two message-passing layers to yield ${V}^1_c$ and ${E}^1_c$. Then, $L-1$ encoding blocks are applied sequentially:

$$
\begin{equation}
\left[{V}^{\ell+1}_c, {E}^{\ell+1}_c \right] \leftarrow {\small MP} \circ {\small MP} \circ {\small GraphPool} \left({V}^\ell_c, {E}^\ell_c \right), \quad \text{for} \ \ell = 1, 2, \dots, L-1,
\end{equation}
$$

where _MP_ denotes a message-passing layer and _GraphPool_ denotes a graph-pooling layer (see the diagram on Figure~\ref{fig:vgae}a).
@@ -120,17 +108,13 @@ where _MP_ denotes a message-passing layer and _GraphPool_ denotes a graph-pooli
The encoder produces two $F_L$-dimensional vectors for each node $i \in {\mathcal{V}}^L$, the mean ${\mu}_i$ and standard deviation ${\sigma}_i$ that parametrize a Gaussian distribution over the latent space. It takes as input a state ${Z}(t)$, which is linearly projected to an $F_\text{ae}$-dimensional vector space and then passed through $L-1$ sequential down-sampling blocks (message passing + graph pooling), each conditioned on the outputs of the condition encoder:

$$
\begin{equation}
{V} \leftarrow {\small GraphPool} \circ {\small MP} \circ {\small MP} \left( {V} + {\small Linear}\left({V}^\ell_c \right), {\small Linear}\left({E}^\ell_c \right) \right), \ \text{for} \ \ell = 1, 2, \dots, L-1;
\end{equation}
$$

and a bottleneck block:

$$
\begin{equation}
{V} \leftarrow {\small MP} \circ {\small MP} \left( {V} + {\small Linear}\left({V}^L_c \right), {\small Linear}\left({E}^L_c \right) \right).
\end{equation}
$$

The output features are passed through a node-wise MLP that returns ${\mu}_i$ and ${\sigma}_i$ for each node $i \in {\mathcal{V}}^L$. The latent variables are then computed as ${\zeta}_i = {\small BatchNorm}({\mu}_i + {\sigma}_i {\epsilon}_i)$, where ${\epsilon}_i \sim \mathcal{N}(0, {I})$. Finally, the decoder mirrors the encoder, employing a symmetric architecture (replacing graph pooling by graph unpooling layers) to upsample the latent features back to the original graph ${\mathcal{G}}$ (Figure~\ref{fig:vgae}c). Its blocks are also conditioned on the outputs of the condition encoder. The message passing and the graph pooling and unpooling layers in the VGAE are the same as in the (L)DGN.
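The reparameterization at the end of the encoder is simple to write down; the following node-wise head is a hedged sketch (predicting $\log\sigma_i$ for numerical convenience is an assumption, not necessarily how the reference implementation does it):

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Node-wise MLP producing (mu_i, sigma_i) and the latent zeta_i = BatchNorm(mu_i + sigma_i * eps_i)."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, in_dim), nn.GELU(),
                                 nn.Linear(in_dim, 2 * latent_dim))
        self.bn = nn.BatchNorm1d(latent_dim)

    def forward(self, v):
        mu, log_sigma = self.mlp(v).chunk(2, dim=-1)          # per-node mean and log std
        sigma = torch.exp(log_sigma)
        zeta = self.bn(mu + sigma * torch.randn_like(sigma))  # reparameterization trick + batch norm
        return zeta, mu, sigma
```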
@@ -30,7 +30,7 @@ The first case is a somewhat special one: without any information about spatial

If you decide to use a **neural fields** approach where the network receives the position as input, this has the same effect: the NN will not have any direct means of querying neighbors via architectural tricks ("inductive biases"). In this case, the building blocks below won't be applicable, and it's worth considering whether you can introduce more structure via a discretization.

Note that _physics-informed neural networks_ (PINNs) also fall into this category. We'll go into more detail here later on ({doc}`diffphys`), but generally it's advisable to consider switching to an approach that employs prior knowledge in the form or a discretization. This usually substantially improves inference accuracy and improves convergence. That PINNs can't solve real-world problems despite many years of research points to the fundamental problems of this approach.
Note that _physics-informed neural networks_ (PINNs) also fall into this category. We'll go into more detail here later on ({doc}`diffphys`), but generally it's advisable to consider switching to an approach that employs prior knowledge in the form of a discretization. This usually substantially improves inference accuracy and improves convergence. That PINNs can't solve real-world problems despite many years of research points to the fundamental problems of this approach.

Focusing on dense layers still leaves a few choices concerning the number of layers, their size, and activations. The other three cases have the same choices, and these hyperparameters of the architectures are typically determined over the course of training runs. General recommendations are that _ReLU_ and smoother variants like _GELU_ are good choices, and that the number of layers should scale together with their size.
Next, we'll focus on the remaining three cases with spatial information, as differences can have a profound impact here. So, below we target cases where we have a "computational domain" specifying the region of interest in which the samples are located.
@@ -39,7 +39,7 @@ Next, we'll focus on the remaining three cases with spatial information in the f

Probably the most important aspect of different architectures then is the question of their _receptive field_: this means for any single sample in our domain, which neighborhood of other sample points can influence the solution at this point. This is similar to classic considerations for PDE solving, where denoting a PDE as _hyperbolic_ indicates its local, wave-like behavior in contrast to an _elliptic_ one with global behavior. Certain NN architectures such as the classic convolutional neural networks (CNNs) support only local influences and receptive fields, while hierarchies with pooling expand these receptive fields to effectively global ones. Interesting variants here are spectral architectures like FNOs, which provide global receptive fields at the expense of other aspects. In addition, Transformers (with attention mechanisms) provide a more complicated but scalable alternative.

Thus, a fundamental distinction can be made in terms of local vs global architectures, and for the latter, how they realize the global receptive field. The following table provides a first overview, and below we'll discuss the pros and cons of each variant.
Thus, a fundamental distinction can be made in terms of spatially local vs global architectures, and for the latter, how they realize the global receptive field. The following table provides a first overview, and below we'll discuss the pros and cons of each variant.

| | Grid | Unstructured | Points | Non-spatial |
|-------------|-----------------|-------------------|-------------------|-------------|
@@ -168,7 +168,7 @@ In a Transformer architecture, the attention output is used as component of a bu

This Transformer architecture was shown to scale extremely well to networks with huge numbers of parameters, one of the key advantages of Transformers. Note that a large part of the weights typically ends up in the matrices of the attention, and not just in the dense layers. At the same time, attention offers a powerful way for working with global dependencies in inputs. This comes at the cost of a more complicated architecture. An inherent problem of the self-attention mechanism above is that it's quadratic in the number of tokens $N$. This naturally puts a limit on the size and resolution of inputs. Under the hood, it's also surprisingly simple: the attention algorithm computes an $N \times N$ matrix, which is not too far from applying a simple dense layer (this would likewise come with an $N \times N$ weight matrix) to resolve global influences.

This bottleneck can be addressed with _linear attention_: it changes the algorithm above to multiply Q and (K^T V) instead, applying a non-linearity (e.g., an exponential) to both parts beforehand. This avoids the $N \times N$ matrix and scales linearly in $N$. However, this improvement comes at the cost of a more approximate attention vector.
This bottleneck can be addressed with _linear attention_: it changes the algorithm above to multiply $Q$ and ($K^T V$) instead, applying a non-linearity (e.g., an exponential) to both parts beforehand. This avoids the $N \times N$ matrix and scales linearly in $N$. However, this improvement comes at the cost of a more approximate attention vector.
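To make the difference concrete, here is a minimal single-head sketch of both variants for inputs $Q, K, V$ of shape $[N, d]$; the exponential feature map and the per-query normalization are common choices for illustration, not details given in the text above:

```python
import torch

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention; materializes an N x N score matrix."""
    scores = Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5      # [N, N]
    return torch.softmax(scores, dim=-1) @ V

def linear_attention(Q, K, V, feature_map=torch.exp):
    """Linear attention: non-linearity on Q and K, then multiply Q with (K^T V); no N x N matrix."""
    Qp, Kp = feature_map(Q), feature_map(K)
    KV = Kp.transpose(-2, -1) @ V                              # [d, d]
    norm = Qp @ Kp.sum(dim=-2, keepdim=True).transpose(-2, -1) # [N, 1] per-query normalizer
    return (Qp @ KV) / norm
```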

An interesting aspect of Transformer architectures is also that they've been applied to structured as well as unstructured inputs. I.e., they've been used for graphs and points as well as grid-based data. In all cases the differences primarily lie in how inputs are mapped to the tokens. The attention is typically still "dense" in the token space. This is a clear limitation: for problems with a known spatial structure, discarding this information will inevitably need to be compensated for, e.g., with a larger weight count or lower inference accuracy. Nonetheless, Transformers are an extremely active field within DL, and clearly a potential contender for future NN algorithms.