⚠️ Cross- or Self-?
We'll talk about attention and explain the difference between two paradigms:
Self-Attention vs Cross-Attention
In modern computer vision and transformer models, attention mechanisms are the 🦸 superheroes 🦸 behind understanding context. Let's break down the difference between self-attention and cross-attention!
Self-Attention
Who looks at whom: all tokens look at each other within the same set.
Input: a single set of vectors (words, image patches, or video frames). The model figures out how elements relate within the same sequence.
Example: You have the phrase "The cat eats fish." Self-attention lets "eats" look at "cat" and "fish" to understand the meaning.
Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where Q, K, and V are all projections of the same input sequence X.
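The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration (no masking, no multi-head split); the dimensions and random weights are made up for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Q, K, V are all projections of the SAME sequence X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (seq, seq): every token scores every token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # context-mixed vector per token

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # e.g. 4 tokens: "The cat eats fish"
X = rng.normal(size=(seq_len, d_model))      # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per input token
```

Note the output has the same length as the input: every token gets a new vector that blends information from all the others.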
Cross-Attention
Who looks at whom: tokens from one set (queries) look at tokens from another set (keys/values).
Input: two different sets of vectors (e.g., text and image). This lets the model mix information across modalities.
Example: You have the text "What is the cat doing?" and an image of a cat eating fish. Cross-attention lets the text tokens peek at the visual features of the image to ground the answer.
Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, with the twist that Q is projected from one set (e.g., text tokens) while K and V are projected from the other (e.g., image patches).
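The only change from self-attention is where Q and K/V come from. Here's a hedged NumPy sketch of the text-to-image case; the token/patch counts and random features are invented for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_text, X_image, Wq, Wk, Wv):
    # Queries come from the text; keys and values come from the image
    Q = X_text @ Wq
    K, V = X_image @ Wk, X_image @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n_text, n_patches)
    weights = softmax(scores, axis=-1)       # each text token spreads attention over patches
    return weights @ V                       # one output vector per TEXT token

rng = np.random.default_rng(1)
n_text, n_patches, d = 5, 9, 8
X_text = rng.normal(size=(n_text, d))        # e.g. tokens of "What is the cat doing?"
X_image = rng.normal(size=(n_patches, d))    # e.g. patch features of the cat photo
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(X_text, X_image, Wq, Wk, Wv)
print(out.shape)  # (5, 8): output length follows the query set, not the image
```

The design point: the output sequence length always matches the query side, so cross-attention lets text "absorb" visual information without changing its own shape.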
📝 In short: self-attention understands relationships within one set, while cross-attention merges information between sets. These mechanisms are crucial for multimodal AI like CLIP, where text meets images, or video transformers where frames need to “talk” to each other.