pbdl-book/others-timeseries.md

Model Reduction and Time Series
=======================

An inherent challenge for many practical PDE solvers is the large dimensionality of the resulting problems.
Our model $\mathcal{P}$ is typically discretized with $\mathcal{O}(n^3)$ samples for a 3 dimensional 
problem (with $n$ denoting the number of samples along one axis), 
and for time-dependent phenomena we additionally have a discretization along
time. The latter typically scales in accordance to the spatial dimensions, giving an
overall number of samples on the order of $\mathcal{O}(n^4)$. Not surprisingly, 
the workload in these situations quickly explodes for larger $n$ (and for all practical high-fidelity applications we want $n$ to be as large as possible).

One popular way to reduce the complexity is to map a spatial state of our system $\mathbf{s_t} \in \mathbb{R}^{n^3}$
into a much lower dimensional state $\mathbf{c_t} \in \mathbb{R}^{m}$, with $m \ll n^3$. Within this latent space,
we estimate the evolution of our system by inferring a new state $\mathbf{c_{t+1}}$, which we then decode to obtain $\mathbf{s_{t+1}}$. In order for this to work, it's crucial that we can choose $m$ large enough that it captures all important structures in our solution manifold, and that the time prediction of $\mathbf{c_{t+1}}$ can be computed efficiently, such that we obtain a gain in performance despite the additional encoding and decoding steps. In practice, the explosion in terms of unknowns for regular simulations (the $\mathcal{O}(n^3)$ above) coupled with a super-linear complexity for computing a new state $\mathbf{s_t}$ directly makes this approach very expensive, while working with the latent space points $\mathbf{c}$ very quickly pays off for small $m$.

However, it's crucial that encoder and decoder do a good job at reducing the dimensionality of the problem. This is a very good task for DL approaches. Furthermore, we then need a time evolution of the latent space states $\mathbf{c}$, and for most practical model equations, we cannot find closed form solutions to evolve $\mathbf{c}$. Hence, this likewise poses a very good problem for DL. To summarize, we're facing two challenges: learning a good spatial encoding and decoding, together with learning an accurate time evolution.
Below, we will describe an approach to solve this problem following Wiewel et al.
{cite}`wiewel2019lss` & {cite}`wiewel2020lsssubdiv`, which in turn employs 
the encoder/decoder of Kim et al. {cite}`bkim2019deep`.


```{figure} resources/others-timeseries-lsp-overview.jpg
---
name: timeseries-lsp-overview
---
For time series predictions with ROMs, we encode the state of our system with an encoder $f_e$, predict 
the time evolution with $f_t$, and then decode the full spatial information with a decoder $f_d$.
```


## Reduced Order Models 

Reducing the dimension and complexity of computational models, often called _reduced order modeling_ (ROM) or _model reduction_, is a classic topic in the computational field. Traditional techniques often employ techniques such as principal component analysis to arrive at a basis for a chosen space of solution. However, being linear by construction, these approaches have inherent limitations when representing complex, non-linear solution manifolds. In practice, all "interesting" solutions are highly non-linear, and hence DL has received a substantial amount of interest as a way to learn non-linear representations. Due to the non-linearity, DL representations can potentially yield a high accuracy with fewer degrees of freedom in the reduced model compared to classic approaches.

The canonical NN for reduced models is an _autoencoder_. This denotes a network whose sole task is to reconstruct a given input $x$ while passing it through a bottleneck that is typically located in or near the middle of the stack of layers of the NN. The data in the bottleneck then represents the compressed, latent space representation $\mathbf{c}$, the part of the network leading up to it the encoder $f_e$, and the part after the bottleneck the decoder $f_d$. In combination, the learning task can be written as

$$
\text{arg min}_{\theta_e,\theta_d} | f_d( f_e(\mathbf{s};\theta_e) ;\theta_d) - \mathbf{s} |_2^2
$$

with the encoder
$f_e: \mathbb{R}^{n^3} \rightarrow \mathbb{R}^{m}$ with weights $\theta_e$,
and the decoder 
$f_d: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n^3}$ with weights $\theta_d$. For this
learning objective we do not require any other data than the $\mathbf{s}$, as these represent
inputs as well as the reference outputs.

Autoencoder networks are typically realized as stacks of convolutional layers.
While the details of these layers can be chosen flexibly, a key property of all
autoencoder architectures is that no connection between encoder and decoder part may 
exist. Hence, the network has to be separable for encoder and decoder.
This is natural, as any connections (or information) shared between encoder and decoder
would prevent using the encoder or decoder in a standalone manner. E.g., the decoder has to be able to decode a full state $\mathbf{s}$ purely from a latent space point $\mathbf{c}$.

### Autoencoder variants

One popular variant of autoencoders is worth a mention here: the so-called _varational autoencoders_, or VAEs. These
autoencoders follow the structure above, but additionally employ a loss term to shape the latent space of $\mathbf{c}$.
Typically we use a normal distribution as target, which makes the latent space 
an $m$ dimensional unit cube, i.e., each dimension should have a zero mean and unit standard deviation.
This approach is especially useful if the decoder should be used as a generative model. E.g., we can then produce
$\mathbf{c}$ samples directly, and decode them to obtain full states. 
While this is very useful to, e.g., obtain generative models for faces or other types of natural images, it is less 
crucial in a simulation setting. Here we rather want to obtain a latent space that facilitates the temporal prediction,
rather than being able to easily produce samples from it.


## Time series

The goal of the temporal prediction is to compute a latent space state at time $t+1$ given one or more previous
latent space states.
The most straight-forward way to formulate the corresponding minimization problem is

$$
\text{arg min}_{\theta_p} | f_p( \mathbf{c}_{t};\theta_p) - \mathbf{c}_{t+1} |_2^2
$$

where the prediction network is denoted by $f_p$ to distinguish it from encoder and decoder, above.
This already implies that we're facing a recurrent task: any $ith$ step is
the result of $i$ evaluations of $f_p$, i.e. $\mathbf{c}_{t+i} = f_p^{(i)}( \mathbf{c}_{t};\theta_p)$.
As there is an inherent per-evaluation error, it is typically important to train this process
for more than a single step, such that the $f_p$ network "sees" the drift it produces in terms
of the latent space states over time.

```{admonition} Koopman operators
:class: tip

In classical dynamical systems literature, a data-driven prediction of future states
is typically formulated in terms of the so-called _Koopman operator_, which usually takes
the form of a matrix, i.e. uses a linear approach.

Traditional works have focused on obtaining good _Koopman operators_ that yield
a high accuracy in combination with a basis to span the space of solutions. In the approach
outlined above the $f_p$ network can be seen as a non-linear Koopman operator.
```
In order for this approach to work, we either need an appropriate history of previous 
states to uniquely identify the right next state, or our network has to internally
store the previous history of states it has seen.

For the former variant, the prediction network $f_p$ receives more than 
a single $\mathbf{c}_{t}$. For the latter variant, we can turn to algorithms
from the subfield of _recurrent neural networks_ (RNNs). A variety of architectures 
have been proposed to encode and store temporal states of a sytem, the most
popular ones being 
_long short-term memory_ (LSTM) network,
_gated recurrent units_ (GRUs), or
lately attenion-based _transformer_ networks.
No matter which variant is used, these approaches always work with fully-connected layers
as the latent space vectors do not exhibit any spatial structure, but typically represent 
a seemingly random collection of values.
Due to the fully-connected layers, the prediction networks quickly grow in terms
of their parameter count, and thus require relatively a small latent-space dimension $m$.
Luckily, this is in line with our main goals, as outlined at the top.

## End-to-end training

In the formulation above we have clearly split the en- / decoding and the time prediction parts.
However, in practice an _end-to-end_ training of all networks involved in a certain task
is usually preferable, as the networks can adjust their behavior in accordance with the other
components involved in the task.

For the time prediction, we can formulate the objective in terms of $\mathbf{s}$, and use en- and decoder in the
time prediction to compute the loss:

$$
\text{arg min}_{\theta_e,\theta_p,\theta_d} | f_d( f_p( f_e( \mathbf{s}_{t} ;\theta_e)  ;\theta_p) ;\theta_d) - \mathbf{s}_{t+1} |_2^2
$$

Ideally, this step is furthermore unrolled over time to stabilize the evolution over time.
The resulting training will be significantly more expensive, as more weights need to be trained at once,
and a much larger number of intermediate states needs to be processed. However, the increased 
cost typically pays off with a reduced overall inference error.


```{figure} resources/others-timeseries-lss-subdiv-prediction.jpg
---
height: 300px
name: timeseries-lss-subdiv-prediction
---
Several time frames of an example prediction from {cite}`wiewel2020lsssubdiv`, which additionally couples the
learned time evolution with an numerically solved advection step. 
The learned prediction is shown at the top, the reference simulation at the bottom.
```


## Source code

In order to make practical experiments in this area of deep learning, we can 
recommend this
[latent space simulation code](https://github.com/wiewel/LatentSpaceSubdivision),
which realizes an end-to-end training for encoding and prediction.
Alternatively, this
[learned model reduction code](https://github.com/byungsook/deep-fluids) focuses on the
encoding and decoding aspects.

Both are available as open source and use a combination of TensorFlow and mantaflow
as DL and fluid simulation frameworks.
cleanup, added other topics chapter 2021-04-14 13:08:51 +02:00			`Model Reduction and Time Series`
			`=======================`

first version of timeseries 2021-04-17 07:05:42 +02:00			`An inherent challenge for many practical PDE solvers is the large dimensionality of the resulting problems.`
integrated RL, spell check 2021-04-15 10:20:17 +02:00			`Our model $\mathcal{P}$ is typically discretized with $\mathcal{O}(n^3)$ samples for a 3 dimensional`
			`problem (with $n$ denoting the number of samples along one axis),`
			`and for time-dependent phenomena we additionally have a discretization along`
			`time. The latter typically scales in accordance to the spatial dimensions, giving an`
			`overall number of samples on the order of $\mathcal{O}(n^4)$. Not surprisingly,`
first version of timeseries 2021-04-17 07:05:42 +02:00			`the workload in these situations quickly explodes for larger $n$ (and for all practical high-fidelity applications we want $n$ to be as large as possible).`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
			`One popular way to reduce the complexity is to map a spatial state of our system $\mathbf{s_t} \in \mathbb{R}^{n^3}$`
			`into a much lower dimensional state $\mathbf{c_t} \in \mathbb{R}^{m}$, with $m \ll n^3$. Within this latent space,`
first version of timeseries 2021-04-17 07:05:42 +02:00			we estimate the evolution of our system by inferring a new state $\mathbf{c_{t+1}}$, which we then decode to obtain $\mathbf{s_{t+1}}$. In order for this to work, it's crucial that we can choose $m$ large enough that it captures all important structures in our solution manifold, and that the time prediction of $\mathbf{c_{t+1}}$ can be computed efficiently, such that we obtain a gain in performance despite the additional encoding and decoding steps. In practice, the explosion in terms of unknowns for regular simulations (the $\mathcal{O}(n^3)$ above) coupled with a super-linear complexity for computing a new state $\mathbf{s_t}$ directly makes this approach very expensive, while working with the latent space points $\mathbf{c}$ very quickly pays off for small $m$.
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			However, it's crucial that encoder and decoder do a good job at reducing the dimensionality of the problem. This is a very good task for DL approaches. Furthermore, we then need a time evolution of the latent space states $\mathbf{c}$, and for most practical model equations, we cannot find closed form solutions to evolve $\mathbf{c}$. Hence, this likewise poses a very good problem for DL. To summarize, we're facing two challenges: learning a good spatial encoding and decoding, together with learning an accurate time evolution.
integrated RL, spell check 2021-04-15 10:20:17 +02:00			`Below, we will describe an approach to solve this problem following Wiewel et al.`
			{cite}`wiewel2019lss` & {cite}`wiewel2020lsssubdiv`, which in turn employs
			the encoder/decoder of Kim et al. {cite}`bkim2019deep`.


picture update 2021-04-18 06:52:29 +02:00			```{figure} resources/others-timeseries-lsp-overview.jpg
integrated RL, spell check 2021-04-15 10:20:17 +02:00			`---`
			`name: timeseries-lsp-overview`
			`---`
			`For time series predictions with ROMs, we encode the state of our system with an encoder $f_e$, predict`
			`the time evolution with $f_t$, and then decode the full spatial information with a decoder $f_d$.`
			```


			`## Reduced Order Models`

first version of timeseries 2021-04-17 07:05:42 +02:00			Reducing the dimension and complexity of computational models, often called _reduced order modeling_ (ROM) or _model reduction_, is a classic topic in the computational field. Traditional techniques often employ techniques such as principal component analysis to arrive at a basis for a chosen space of solution. However, being linear by construction, these approaches have inherent limitations when representing complex, non-linear solution manifolds. In practice, all "interesting" solutions are highly non-linear, and hence DL has received a substantial amount of interest as a way to learn non-linear representations. Due to the non-linearity, DL representations can potentially yield a high accuracy with fewer degrees of freedom in the reduced model compared to classic approaches.
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			The canonical NN for reduced models is an _autoencoder_. This denotes a network whose sole task is to reconstruct a given input $x$ while passing it through a bottleneck that is typically located in or near the middle of the stack of layers of the NN. The data in the bottleneck then represents the compressed, latent space representation $\mathbf{c}$, the part of the network leading up to it the encoder $f_e$, and the part after the bottleneck the decoder $f_d$. In combination, the learning task can be written as
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`$$`
			`\text{arg min}_{\theta_e,\theta_d} \| f_d( f_e(\mathbf{s};\theta_e) ;\theta_d) - \mathbf{s} \|_2^2`
			`$$`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`with the encoder`
			`$f_e: \mathbb{R}^{n^3} \rightarrow \mathbb{R}^{m}$ with weights $\theta_e$,`
			`and the decoder`
			`$f_d: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n^3}$ with weights $\theta_d$. For this`
			`learning objective we do not require any other data than the $\mathbf{s}$, as these represent`
			`inputs as well as the reference outputs.`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`Autoencoder networks are typically realized as stacks of convolutional layers.`
			`While the details of these layers can be chosen flexibly, a key property of all`
			`autoencoder architectures is that no connection between encoder and decoder part may`
			`exist. Hence, the network has to be separable for encoder and decoder.`
			`This is natural, as any connections (or information) shared between encoder and decoder`
			`would prevent using the encoder or decoder in a standalone manner. E.g., the decoder has to be able to decode a full state $\mathbf{s}$ purely from a latent space point $\mathbf{c}$.`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`### Autoencoder variants`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`One popular variant of autoencoders is worth a mention here: the so-called _varational autoencoders_, or VAEs. These`
			`autoencoders follow the structure above, but additionally employ a loss term to shape the latent space of $\mathbf{c}$.`
			`Typically we use a normal distribution as target, which makes the latent space`
			`an $m$ dimensional unit cube, i.e., each dimension should have a zero mean and unit standard deviation.`
			`This approach is especially useful if the decoder should be used as a generative model. E.g., we can then produce`
			`$\mathbf{c}$ samples directly, and decode them to obtain full states.`
			`While this is very useful to, e.g., obtain generative models for faces or other types of natural images, it is less`
			`crucial in a simulation setting. Here we rather want to obtain a latent space that facilitates the temporal prediction,`
			`rather than being able to easily produce samples from it.`
integrated RL, spell check 2021-04-15 10:20:17 +02:00

first version of timeseries 2021-04-17 07:05:42 +02:00			`## Time series`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`The goal of the temporal prediction is to compute a latent space state at time $t+1$ given one or more previous`
			`latent space states.`
			`The most straight-forward way to formulate the corresponding minimization problem is`
integrated RL, spell check 2021-04-15 10:20:17 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`$$`
			`\text{arg min}_{\theta_p} \| f_p( \mathbf{c}_{t};\theta_p) - \mathbf{c}_{t+1} \|_2^2`
			`$$`
cleanup, added other topics chapter 2021-04-14 13:08:51 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			`where the prediction network is denoted by $f_p$ to distinguish it from encoder and decoder, above.`
			`This already implies that we're facing a recurrent task: any $ith$ step is`
			`the result of $i$ evaluations of $f_p$, i.e. $\mathbf{c}_{t+i} = f_p^{(i)}( \mathbf{c}_{t};\theta_p)$.`
			`As there is an inherent per-evaluation error, it is typically important to train this process`
			`for more than a single step, such that the $f_p$ network "sees" the drift it produces in terms`
			`of the latent space states over time.`
cleanup, added other topics chapter 2021-04-14 13:08:51 +02:00
first version of timeseries 2021-04-17 07:05:42 +02:00			```{admonition} Koopman operators
			`:class: tip`

			`In classical dynamical systems literature, a data-driven prediction of future states`
			`is typically formulated in terms of the so-called _Koopman operator_, which usually takes`
			`the form of a matrix, i.e. uses a linear approach.`

			`Traditional works have focused on obtaining good _Koopman operators_ that yield`
			`a high accuracy in combination with a basis to span the space of solutions. In the approach`
			`outlined above the $f_p$ network can be seen as a non-linear Koopman operator.`
			```
			`In order for this approach to work, we either need an appropriate history of previous`
			`states to uniquely identify the right next state, or our network has to internally`
			`store the previous history of states it has seen.`

			`For the former variant, the prediction network $f_p$ receives more than`
			`a single $\mathbf{c}_{t}$. For the latter variant, we can turn to algorithms`
			`from the subfield of _recurrent neural networks_ (RNNs). A variety of architectures`
			`have been proposed to encode and store temporal states of a sytem, the most`
			`popular ones being`
			`_long short-term memory_ (LSTM) network,`
			`_gated recurrent units_ (GRUs), or`
			`lately attenion-based _transformer_ networks.`
			`No matter which variant is used, these approaches always work with fully-connected layers`
			`as the latent space vectors do not exhibit any spatial structure, but typically represent`
			`a seemingly random collection of values.`
			`Due to the fully-connected layers, the prediction networks quickly grow in terms`
			`of their parameter count, and thus require relatively a small latent-space dimension $m$.`
			`Luckily, this is in line with our main goals, as outlined at the top.`

			`## End-to-end training`

			`In the formulation above we have clearly split the en- / decoding and the time prediction parts.`
			`However, in practice an _end-to-end_ training of all networks involved in a certain task`
			`is usually preferable, as the networks can adjust their behavior in accordance with the other`
			`components involved in the task.`

			`For the time prediction, we can formulate the objective in terms of $\mathbf{s}$, and use en- and decoder in the`
			`time prediction to compute the loss:`

			`$$`
			`\text{arg min}_{\theta_e,\theta_p,\theta_d} \| f_d( f_p( f_e( \mathbf{s}_{t} ;\theta_e) ;\theta_p) ;\theta_d) - \mathbf{s}_{t+1} \|_2^2`
			`$$`

			`Ideally, this step is furthermore unrolled over time to stabilize the evolution over time.`
			`The resulting training will be significantly more expensive, as more weights need to be trained at once,`
			`and a much larger number of intermediate states needs to be processed. However, the increased`
			`cost typically pays off with a reduced overall inference error.`

added timeseries example img 2021-04-17 08:25:10 +02:00
picture update 2021-04-18 06:52:29 +02:00			```{figure} resources/others-timeseries-lss-subdiv-prediction.jpg
added timeseries example img 2021-04-17 08:25:10 +02:00			`---`
			`height: 300px`
			`name: timeseries-lss-subdiv-prediction`
			`---`
			Several time frames of an example prediction from {cite}`wiewel2020lsssubdiv`, which additionally couples the
			`learned time evolution with an numerically solved advection step.`
			`The learned prediction is shown at the top, the reference simulation at the bottom.`
			```


first version of timeseries 2021-04-17 07:05:42 +02:00			`## Source code`

			`In order to make practical experiments in this area of deep learning, we can`
			`recommend this`
			`[latent space simulation code](https://github.com/wiewel/LatentSpaceSubdivision),`
			`which realizes an end-to-end training for encoding and prediction.`
			`Alternatively, this`
			`[learned model reduction code](https://github.com/byungsook/deep-fluids) focuses on the`
			`encoding and decoding aspects.`

			`Both are available as open source and use a combination of TensorFlow and mantaflow`
			`as DL and fluid simulation frameworks.`