diff --git a/probmodels-graph.md b/probmodels-graph.md
index 97508f2..a1bfa6e 100644
--- a/probmodels-graph.md
+++ b/probmodels-graph.md
@@ -49,11 +49,11 @@ $
 where the mean and variance are parameterized as:

 $$
-\begin{equation}
+\begin{aligned}
 {\mu}_\theta^r = \frac{1}{\sqrt{\alpha_r}} \left( {Z}^r - \frac{\beta_r}{\sqrt{1-\bar{\alpha}_r}} {\epsilon}_\theta^r \right),
 \qquad
 {\Sigma}_\theta^r = \exp\left( \mathbf{v}_\theta^r \log \beta_r + (1-\mathbf{v}_\theta^r)\log \tilde{\beta}_r \right),
-\end{equation}
+\end{aligned}
 $$

 with $\tilde{\beta}_r := (1 - \bar{\alpha}_{r-1}) / (1 - \bar{\alpha}_r) \beta_r$. Here, ${\epsilon}_\theta^r \in \mathbb{R}^{|\mathcal{V}| \times F}$ predicts the noise ${\epsilon}$ in equation (1), and $\mathbf{v}_\theta^r \in \mathbb{R}^{|\mathcal{V}| \times F}$ interpolates between the two bounds of the process's entropy, $\beta_r$ and $\tilde{\beta}_r$.
@@ -61,9 +61,9 @@ with $\tilde{\beta}_r := (1 - \bar{\alpha}_{r-1}) / (1 - \bar{\alpha}_r) \beta_r
 DGNs predict ${\epsilon}_\theta^r$ and $\mathbf{v}_\theta^r$ using a regular message-passing-based GNN {cite}`sanchez2020learning`. This takes ${Z}^{r-1}$ as input, and it is conditioned on the graph ${\mathcal{G}}$, its node and edge features, and the diffusion step $r$:

 $$
-\begin{equation}
+\begin{aligned}
 [{\epsilon}_\theta^r, \mathbf{v}_\theta^r] \leftarrow \text{{DGN}}_\theta({Z}^{r-1}, {\mathcal{G}}, {V}_c, {E}_c, r).
-\end{equation}
+\end{aligned}
 $$

 The _DGN_ network is trained using the hybrid loss function proposed in *"Improved Denoising Diffusion Probabilistic Models"* by Nichol and Dhariwal. The full denoising process requires $R$ evaluations of the DGN to transition from ${Z}^R$ to ${Z}^0$.
@@ -71,14 +71,14 @@ The _DGN_ network is trained using the hybrid loss function proposed in *"Improv
 DGN follows the widely used encoder-processor-decoder GNN architecture. In addition to the node and edge encoders, the encoder includes a diffusion-step encoder, which generates a vector ${r}_\text{emb} \in \mathbb{R}^{F_\text{emb}}$ that embeds the diffusion step $r$. The node encoder processes the conditional node features ${v}_i^c$ alongside ${r}_\text{emb}$. Specifically, the diffusion-step encoder and the node encoder operate as follows:

 $$
-\begin{equation}
+\begin{aligned}
 {r}_\text{emb} \leftarrow \phi \circ {\small \text{Linear}} \circ {\small \text{SinEmb}} (r),
 \quad
 {v}_i \leftarrow {\small \text{Linear}} \left( \left[ \phi \circ {\small \text{Linear}} ({v}_i^c) \ | \ {r}_\text{emb} \right] \right),
 \quad \forall i \in \mathcal{V},
-\end{equation}
+\end{aligned}
 $$

 where $\phi$ denotes the activation function and ${\small \text{SinEmb}}$ is the sinusoidal embedding function. The edge encoder applies a linear layer to the conditional edge features ${e}_{ij}^c$.
@@ -123,9 +123,9 @@ In this configuration, the VGAE captures high-frequency information (e.g., spati
 For the VGAE, an encoder-decoder architecture is used with an additional condition encoder to handle conditioning inputs. The condition encoder processes ${V}_c$ and ${E}_c$, encoding these into latent node features ${V}^\ell_c$ and edge features ${E}^\ell_c$ across $L$ graphs $\{{\mathcal{G}}^\ell := ({\mathcal{V}}^\ell, {\mathcal{E}}^\ell) \mid 1 \leq \ell \leq L\}$, where ${\mathcal{G}}^1 \equiv {\mathcal{G}}$ and the size of the graphs decreases progressively, i.e., $|{\mathcal{V}}^1| > |{\mathcal{V}}^2| > \dots > |{\mathcal{V}}^L|$. This transformation begins by linearly projecting ${V}_c$ and ${E}_c$ to an $F_\text{ae}$-dimensional space and applying two message-passing layers to yield ${V}^1_c$ and ${E}^1_c$. Then, $L-1$ encoding blocks are applied sequentially:

 $$
-\begin{equation}
+\begin{aligned}
 \left[{V}^{\ell+1}_c, {E}^{\ell+1}_c \right] \leftarrow {\small MP} \circ {\small MP} \circ {\small \text{GraphPool}} \left({V}^\ell_c, {E}^\ell_c \right), \quad \text{for} \ \ell = 1, 2, \dots, L-1,
-\end{equation}
+\end{aligned}
 $$

 where _MP_ denotes a message-passing layer and _GraphPool_ denotes a graph-pooling layer.
@@ -133,17 +133,17 @@ where _MP_ denotes a message-passing layer and _GraphPool_ denotes a graph-pooli
 The encoder produces two $F_L$-dimensional vectors for each node $i \in {\mathcal{V}}^L$: the mean ${\mu}_i$ and the standard deviation ${\sigma}_i$, which parametrize a Gaussian distribution over the latent space. It takes as input a state ${Z}(t)$, which is linearly projected to an $F_\text{ae}$-dimensional vector space and then passed through $L-1$ sequential down-sampling blocks (message passing + graph pooling), each conditioned on the outputs of the condition encoder:

 $$
-\begin{equation}
+\begin{aligned}
 {V} \leftarrow {\small \text{GraphPool}} \circ {\small MP} \circ {\small MP} \left( {V} + {\small \text{Linear}}\left({V}^\ell_c \right), {\small \text{Linear}}\left({E}^\ell_c \right) \right), \ \text{for} \ \ell = 1, 2, \dots, L-1;
-\end{equation}
+\end{aligned}
 $$

 and a bottleneck block:

 $$
-\begin{equation}
+\begin{aligned}
 {V} \leftarrow {\small MP} \circ {\small MP} \left( {V} + {\small \text{Linear}}\left({V}^L_c \right), {\small \text{Linear}}\left({E}^L_c \right) \right).
-\end{equation}
+\end{aligned}
 $$

 The output features are passed through a node-wise MLP that returns ${\mu}_i$ and ${\sigma}_i$ for each node $i \in {\mathcal{V}}^L$. The latent variables are then computed as ${\zeta}_i = {\small \text{BatchNorm}}({\mu}_i + {\sigma}_i {\epsilon}_i)$, where ${\epsilon}_i \sim \mathcal{N}(0, {I})$. Finally, the decoder mirrors the encoder, employing a symmetric architecture (replacing graph-pooling layers with graph-unpooling layers) to upsample the latent features back to the original graph ${\mathcal{G}}$. Its blocks are also conditioned on the outputs of the condition encoder. The message-passing, graph-pooling, and graph-unpooling layers in the VGAE are the same as those in the (L)DGN.
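As a reading aid for the equations touched by the first two hunks (not part of the patch itself), below is a minimal PyTorch-style sketch of a single reverse-diffusion step: the DGN output $[{\epsilon}_\theta^r, \mathbf{v}_\theta^r]$ is turned into ${\mu}_\theta^r$ and ${\Sigma}_\theta^r$, and the next state is sampled. The function name `denoise_step`, the `dgn` callable and its argument order, and the `betas`/`alphas_cumprod` tensors are hypothetical placeholders; only the mean/variance formulas are taken from the document.

```python
import torch

def denoise_step(z, r, dgn, graph, v_c, e_c, betas, alphas_cumprod):
    """One reverse-diffusion step on the node features (r counts down from R to 1).

    betas and alphas_cumprod are assumed to be 1-D tensors of length R holding
    beta_r and bar{alpha}_r; z holds the current noisy node features (|V| x F).
    """
    # DGN_theta conditioned on the current state, the graph, its node/edge features,
    # and the diffusion step r; assumed to return [eps_theta^r, v_theta^r], each |V| x F.
    eps_theta, v_theta = dgn(z, graph, v_c, e_c, r)

    beta_r = betas[r - 1]
    abar_r = alphas_cumprod[r - 1]
    alpha_r = 1.0 - beta_r

    # mu_theta^r = (Z^r - beta_r / sqrt(1 - abar_r) * eps_theta^r) / sqrt(alpha_r)
    mu = (z - beta_r / torch.sqrt(1.0 - abar_r) * eps_theta) / torch.sqrt(alpha_r)
    if r == 1:
        return mu  # final step: return the mean without adding noise

    # tilde{beta}_r = (1 - abar_{r-1}) / (1 - abar_r) * beta_r
    beta_tilde = (1.0 - alphas_cumprod[r - 2]) / (1.0 - abar_r) * beta_r
    # Sigma_theta^r = exp(v * log beta_r + (1 - v) * log tilde{beta}_r), a diagonal variance
    var = torch.exp(v_theta * torch.log(beta_r) + (1.0 - v_theta) * torch.log(beta_tilde))
    return mu + torch.sqrt(var) * torch.randn_like(z)
```

Under these assumptions, a full sampler would simply loop `for r in range(R, 0, -1): z = denoise_step(z, r, ...)`, matching the statement that the denoising process requires $R$ DGN evaluations to transition from ${Z}^R$ to ${Z}^0$.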