⚠️ Cross- or Self-?
We will talk about attention and explain the difference between these two paradigms.

Self-Attention vs Cross-Attention
In modern computer vision and transformer models, attention mechanisms are the 🦸 superheroes 🦸 behind understanding context. Let's break down the difference between self-attention and cross-attention!
Self-Attention
Who looks at whom: all tokens look at each other within the same set.
Input: a single set of vectors (words, image patches, or video frames). The model figures out how elements relate within the same sequence.
Example: You have the phrase "The cat eats fish." Self-attention lets “eats” look at “cat” and “fish” to understand the meaning.
Formula:
\[ Q, K, V \gets \text{same set of tokens} \] \[ \text{Attention} = \text{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V \]
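To make the formula concrete, here is a minimal PyTorch sketch of single-head self-attention. The projection matrices and the toy dimensions (4 tokens, model size 8) are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) -- a single set of tokens; Q, K, V all come from x
    q = x @ w_q                                  # queries (seq_len, d)
    k = x @ w_k                                  # keys    (seq_len, d)
    v = x @ w_v                                  # values  (seq_len, d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5    # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)          # every token attends to every token in the same set
    return weights @ v                           # (seq_len, d)

# toy example: 4 tokens ("The", "cat", "eats", "fish"), d_model = 8
torch.manual_seed(0)
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8])
```

The key point: the same tensor `x` produces the queries, keys, and values, so the attention matrix relates the sequence to itself.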
Cross-Attention
Who looks at whom: tokens from one set (queries) look at tokens from another set (keys/values).
Input: two different sets of vectors (e.g., text and image). This lets the model mix information across modalities.
Example: You have the text "What is the cat doing?" and an image of a cat eating fish. Cross-attention allows the text tokens to peek at visual features from the image and ground the question in what the picture shows.
Formula:
\[ Q \gets \text{tokens from set A (text)}, \] \[ K, V \gets \text{tokens from set B (image)}, \] \[ \text{Attention} = \text{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V \]
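And the same sketch adapted to cross-attention: queries come from the text tokens, while keys and values come from the image patches. Again, the projection matrices and the sizes (5 text tokens, 16 image patches) are made-up assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_attention(text, image, w_q, w_k, w_v):
    # text:  (n_text, d_model) -- set A supplies the queries
    # image: (n_img,  d_model) -- set B supplies the keys and values
    q = text @ w_q                               # (n_text, d)
    k = image @ w_k                              # (n_img,  d)
    v = image @ w_v                              # (n_img,  d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5    # (n_text, n_img)
    weights = F.softmax(scores, dim=-1)          # each text token attends over the image patches
    return weights @ v                           # (n_text, d): text tokens enriched with visual info

# toy example: 5 text tokens attending over 16 image patches, d_model = 8
torch.manual_seed(0)
text  = torch.randn(5, 8)
image = torch.randn(16, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = cross_attention(text, image, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])
```

Note that the output keeps the shape of the query set (the text), but its content is now a weighted mix of the image values.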
📝 In short: self-attention understands relationships within one set, while cross-attention merges information between sets. These mechanisms are crucial for multimodal AI such as CLIP, where text meets images, and for video transformers, where frames need to “talk” to each other.