⚠️ Cross- or Self-?
We will talk about attention and explain the difference between these two paradigms.

Self-Attention vs Cross-Attention
In modern computer vision and transformer models, attention mechanisms are the 🦸 superheroes 🦸 behind understanding context. Let's break down the difference between self-attention and cross-attention!
Self-Attention
Who looks at whom: all tokens look at each other within the same set.
Input: a single set of vectors (words, image patches, or video frames). The model figures out how elements relate within the same sequence.
Example: You have the phrase "The cat eats fish." Self-attention lets “eats” look at “cat” and “fish” to understand the meaning.
Formula:
\[ Q, K, V \gets \text{same set of tokens} \] \[ \text{Attention} = \text{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V \]
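To make the formula concrete, here is a minimal PyTorch sketch of single-head self-attention. The projection matrices and the toy dimensions (4 tokens, model size 8) are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) -- a single set of tokens; Q, K, V all come from x
    q = x @ w_q                                  # queries (seq_len, d)
    k = x @ w_k                                  # keys    (seq_len, d)
    v = x @ w_v                                  # values  (seq_len, d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5    # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)          # every token attends to every token in the same set
    return weights @ v                           # (seq_len, d)

# toy example: 4 tokens ("The", "cat", "eats", "fish"), d_model = 8
torch.manual_seed(0)
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8])
```

The key point: the same tensor `x` produces the queries, keys, and values, so the attention matrix relates the sequence to itself.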
Cross-Attention
Who looks at whom: tokens from one set (queries) look at tokens from another set (keys/values).
Input: two different sets of vectors (e.g., text and image). This lets the model mix information across modalities.
Example: You have the text "What is the cat doing?" and an image of a cat eating fish. Cross-attention allows the text tokens to peek at visual features from the image and ground the question in what the picture shows.
Formula:
\[ Q \gets \text{tokens from set A (text)}, \] \[ K, V \gets \text{tokens from set B (image)}, \] \[ \text{Attention} = \text{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V \]
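And the same sketch adapted to cross-attention: queries come from the text tokens, while keys and values come from the image patches. Again, the projection matrices and the sizes (5 text tokens, 16 image patches) are made-up assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_attention(text, image, w_q, w_k, w_v):
    # text:  (n_text, d_model) -- set A supplies the queries
    # image: (n_img,  d_model) -- set B supplies the keys and values
    q = text @ w_q                               # (n_text, d)
    k = image @ w_k                              # (n_img,  d)
    v = image @ w_v                              # (n_img,  d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5    # (n_text, n_img)
    weights = F.softmax(scores, dim=-1)          # each text token attends over the image patches
    return weights @ v                           # (n_text, d): text tokens enriched with visual info

# toy example: 5 text tokens attending over 16 image patches, d_model = 8
torch.manual_seed(0)
text  = torch.randn(5, 8)
image = torch.randn(16, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = cross_attention(text, image, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])
```

Note that the output keeps the shape of the query set (the text), but its content is now a weighted mix of the image values.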
📝 In short: self-attention understands relationships within one set, while cross-attention merges information between sets. These mechanisms are crucial for multimodal AI such as CLIP, where text meets images, and for video transformers, where frames need to “talk” to each other.