Configuration
Clock speed: 150 Hz
Query: "it"
"The cat sat on the mat because it was tired"
Tokens (seq length 8): The | cat | sat | on | the | mat | because | it
Click a token to set it as the query position.
Attention(Q, K, V) = softmax(QKᵀ / √d) V
What Attention Does
For each query position, attention computes how much to attend to every other position. Here, "it" should attend strongly to "cat" (its antecedent) and only weakly to "mat" and "sat".
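A minimal NumPy sketch of the formula above (not the demo's internal code): Q, K, V here are random 8×4 stand-ins for the learned projections, and the names are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                       # outputs, attention map

# Toy stand-in for the demo: 8 tokens, d = 4, random vectors instead of
# learned projections (values are for illustration only).
tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it"]
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((len(tokens), 4)) for _ in range(3))

outputs, weights = attention(Q, K, V)
print(np.round(weights[tokens.index("it")], 3))  # how much "it" attends to each token
```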
Attention Computation
Q (Query), K (Keys), V (Values): per-token vectors, populated when you press Compute
Attention Scores: Q·Kᵀ / √d → softmax
Output = Σ(attention × value)
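The panels above step through a single query position. Here is a small sketch of that per-position flow, again with random placeholder vectors rather than the demo's actual values:

```python
import numpy as np

# One query position at a time, mirroring the panels above. The vectors are
# random placeholders, like the demo's display before you press Compute.
rng = np.random.default_rng(0)
seq_len, d = 8, 4
Q = rng.standard_normal((seq_len, d))     # Q (Query)
K = rng.standard_normal((seq_len, d))     # K (Keys)
V = rng.standard_normal((seq_len, d))     # V (Values)

q = Q[7]                                  # query position: "it" (token index 7)
scores = K @ q / np.sqrt(d)               # Q·Kᵀ / √d → one score per key
weights = np.exp(scores - scores.max())   # numerically stable softmax numerator
weights /= weights.sum()                  # attention weights, sum to 1
output = weights @ V                      # Output = Σ(attention × value)

print(np.round(weights, 3))               # 8 weights, one per token
print(np.round(output, 3))                # d-dimensional output for "it"
```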
Status: READY. Select a query position and press Compute.
Attention State
Attention Weights (after softmax)
Performance
Phase: Ready
Total Cycles: 0
Time: 0.00s
Seq Length: 8
Scaling Warning
This demo (seq=8, d=4): ~2,400 cycles
GPT-2 attention (seq=1024): ~50M cycles
Full transformer layer: ~500M cycles
Time @ 150 Hz: ~38 days/layer
Why This Matters
Attention scales as O(n²) with sequence length: double the sequence and you need 4× the compute. This is why long context windows are so expensive.
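To see roughly where numbers like those above come from, assume attention costs on the order of n²·d multiply-accumulates and divide by the 150 Hz clock. The sketch below uses that assumption; the constants are illustrative and will not exactly reproduce the demo's cycle counts, but the quadratic growth is the point.

```python
# Back-of-the-envelope scaling, assuming attention costs roughly n²·d
# multiply-accumulates at one cycle each. This illustrates the quadratic
# trend; it is not the demo's exact cycle counter.
CLOCK_HZ = 150

def rough_attention_cycles(n: int, d: int) -> int:
    return n * n * d          # Q·Kᵀ alone is n² dot products of length d

for n, d in [(8, 4), (256, 64), (1024, 64)]:
    cycles = rough_attention_cycles(n, d)
    seconds = cycles / CLOCK_HZ
    print(f"seq={n:5d}, d={d:3d}: {cycles:>12,} cycles ≈ {seconds:>10,.0f} s at {CLOCK_HZ} Hz")
```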