diff --git a/resources/overview-arch-tblock.jpg b/resources/overview-arch-tblock.jpg
new file mode 100644
index 0000000..80f4084
Binary files /dev/null and b/resources/overview-arch-tblock.jpg differ
diff --git a/resources/pbdl-figures.key b/resources/pbdl-figures.key
index 3a1bdc2..a23cacf 100755
Binary files a/resources/pbdl-figures.key and b/resources/pbdl-figures.key differ
diff --git a/supervised-arch.md b/supervised-arch.md
index fa2914c..7cbc7f9 100644
--- a/supervised-arch.md
+++ b/supervised-arch.md
@@ -164,7 +164,16 @@ A newer and exciting development in the deep learning field are attention mechan
 
 Transformers generally work in two steps: the input is encoded into _tokens_ with an encoder-decoder network. This step can take many forms, and usually serves primarily to reduce the number of inputs, e.g., to work with pieces of an image rather than individual pixels. The attention mechanism then computes a weighting for a collection of incoming tokens. This is a floating point number for each token, traditionally interpreted as indicating which parts of the input are important, and which aren't. In modern architectures, the floating point weighting of the attention is directly used to modify the input. In _self-attention_, the weighting is computed from each input token towards all other input tokens. This is a mechanism to handle **global dependencies**, and hence directly fits into the discussion above. In practice, the attention is computed via three matrices: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. For $N$ tokens, the outer product $Q K^T$ produces an $N \times N$ matrix, which is run through a Softmax layer and then multiplied with $V$ (containing a linear projection of the input tokens) to produce the attention output vector.
 
-In a Transformer architecture, the attention output is used as component of a building block: the attention is calculated and used as a residual (added to the input), stabilized with a layer normalization, and then processed in a two-layer _feed forward_ network. The latter is simply a combination of two dense layers with an activation in between. This Transformer block is applied multiple times before the final output is decoded into the original space.
+In a Transformer architecture, the attention output is used as a component of a building block: the attention is calculated and used as a residual (added to the input), stabilized with layer normalization, and then processed in a two-layer _feed forward_ network (FFN). The latter is simply a combination of two dense layers with an activation in between. This _Transformer block_, summarized visually below, is applied multiple times before the final output is decoded into the original space.
+
+```{figure} resources/overview-arch-tblock.jpg
+---
+height: 150px
+name: overview-arch-transformer-block
+---
+Visual summary of a single Transformer block. A full network repeats this structure several times to infer the result.
+```
+
 
 This Transformer architecture was shown to scale extremely well to networks with huge numbers of parameters, one of the key advantages of Transformers. Note that a large part of the weights typically ends up in the matrices of the attention, and not just in the dense layers. At the same time, attention offers a powerful way of working with global dependencies in the inputs. This comes at the cost of a more complicated architecture. An inherent problem of the self-attention mechanism above is that it's quadratic in the number of tokens $N$. This naturally puts a limit on the size and resolution of inputs. Under the hood, it's also surprisingly simple: the attention algorithm computes an $N \times N$ matrix, which is not too far from applying a simple dense layer (this would likewise come with an $N \times N$ weight matrix) to resolve global influences.
 
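
As an aside on the hunk above: the self-attention and Transformer-block steps it describes can be written out in a few lines. The following is only a minimal NumPy sketch of those steps, not code from this patch or from the book; the function names, the sizes `N`, `d`, `d_ff`, the ReLU activation, the $1/\sqrt{d}$ scaling inside the Softmax, and the second residual connection around the FFN are all assumptions made for illustration.

```python
# Minimal sketch (not from the patch or the book): single-head self-attention
# and one Transformer block as described above, in plain NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # x holds N tokens of dimension d; Q, K, V are linear projections of the tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Q K^T yields the N x N attention matrix; the 1/sqrt(d) scaling is standard
    # practice and an assumption here (the text above omits it).
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return a @ v  # weighted combination of the value projections

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization without learned scale/shift parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, p):
    # Attention used as a residual (added to the input), stabilized by layer normalization.
    h = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Two-layer feed forward network: dense -> activation -> dense (ReLU assumed).
    ffn = np.maximum(0.0, h @ p["W1"]) @ p["W2"]
    # The second residual + normalization around the FFN follows common practice.
    return layer_norm(h + ffn)

# Usage with assumed sizes: N=16 tokens, feature size d=32, FFN width d_ff=64.
rng = np.random.default_rng(0)
N, d, d_ff = 16, 32, 64
p = {"Wq": 0.1 * rng.standard_normal((d, d)),
     "Wk": 0.1 * rng.standard_normal((d, d)),
     "Wv": 0.1 * rng.standard_normal((d, d)),
     "W1": 0.1 * rng.standard_normal((d, d_ff)),
     "W2": 0.1 * rng.standard_normal((d_ff, d))}
out = transformer_block(rng.standard_normal((N, d)), p)
print(out.shape)  # (16, 32) -- same token count and feature size as the input
```

The quadratic cost in the number of tokens discussed above shows up directly as the $N \times N$ matrix `a` inside `self_attention`.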