Attention Mechanism

"Which Inputs Matter?" @ 130-160 Hz โ€ข The Transformer Foundation
Configuration: 150 Hz
🔍 Query: "it"
Sentence: "The cat sat on the mat because it was tired"
[Interactive: click a token to set it as the query position.]
Attention(Q, K, V) = softmax(QKᵀ / √d) V
💡 What Attention Does

For each query position, attention computes how much weight to place on every other position. Here "it" should attend strongly to "cat" (its antecedent) and only weakly to "mat" and "sat".
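A minimal sketch of this computation in Python/NumPy, assuming made-up token embeddings and projection matrices (the names X, Wq, Wk, Wv and the scaled_dot_product_attention helper are illustrative, not part of the demo):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ/√d)·V, for one or more query rows."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key positions
    return weights @ V, weights                      # weighted sum of values, plus the weights

# Toy setup matching the demo's scale: 8 tokens, d = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                          # one (made-up) embedding per token
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

out, weights = scaled_dot_product_attention(Q[7:8], K, V)   # query position 7 = "it"
print(weights.round(2))                              # how much "it" attends to each token
```

With trained weight matrices instead of random ones, the printed row would concentrate most of its mass on "cat".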

Attention Computation
[Interactive panel: shows the Q (query), K (key), and V (value) vectors for the chosen position, the attention scores (Q·Kᵀ / √d → softmax), and the output Σ(attention × value). Select a query position and press Compute.]
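Written as explicit loops, the panel's three stages look roughly like this (the q, keys, and values numbers below are placeholders for illustration, not the demo's actual vectors):

```python
import numpy as np

d = 4
q      = np.array([0.2, -0.1, 0.5, 0.3])            # Q (query) for the selected position
keys   = np.array([[0.1, 0.4, -0.2, 0.0],           # K (keys), one row per token
                   [0.6, 0.1,  0.3, 0.2],
                   [0.0, 0.2,  0.1, -0.3]])
values = np.array([[1.0, 0.0,  0.0, 0.0],           # V (values), one row per token
                   [0.0, 1.0,  0.0, 0.0],
                   [0.0, 0.0,  1.0, 0.0]])

# 1. Attention scores: q·kᵢ / √d for every key
scores = np.array([q @ k / np.sqrt(d) for k in keys])

# 2. Softmax turns the scores into weights that sum to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 3. Output = Σ(attention × value)
output = sum(w * v for w, v in zip(weights, values))

print("scores :", scores.round(3))
print("weights:", weights.round(3))
print("output :", output.round(3))
```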
Attention State
[Interactive panel: attention weights (after softmax) and performance counters (phase, total cycles, elapsed time, sequence length = 8).]
⚠ Scaling Warning
This demo (seq=8, d=4): ~2,400 cycles
GPT-2 attention (seq=1024): ~50M cycles
Full transformer layer: ~500M cycles
Time @ 150 Hz: ~38 days/layer
🔑 Why This Matters

Attention scales as O(n²) with sequence length. Double the sequence length → 4× the compute. This is why long context windows are so expensive.
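A back-of-the-envelope sketch of that quadratic growth (the 2·n²·d multiply-accumulate count and the one-operation-per-cycle assumption are simplifications, so the figures will not match the table's cycle counts exactly):

```python
def attention_ops(n, d):
    """Rough multiply-accumulate count: n²·d for Q·Kᵀ plus n²·d for the weighted sum of V."""
    return 2 * n * n * d

for n, d in [(8, 4), (16, 4), (1024, 64)]:            # doubling n quadruples the n²·d term
    ops = attention_ops(n, d)
    secs = ops / 150                                   # one operation per cycle at 150 Hz
    print(f"seq={n:5d}  ops={ops:>12,}  time@150Hz={secs:>10,.0f} s  (~{secs/86400:.1f} days)")
```

Going from seq=8 to seq=16 quadruples the operation count, which is exactly the O(n²) behaviour described above.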

🧠 [Live counter: operations your brain has performed since arriving.]