Attention Mechanism

"Which Inputs Matter?" @ 130-160 Hz โ€ข The Transformer Foundation
Configuration: 150 Hz
🔍 Query: "it"
Sentence: "The cat sat on the mat because it was tired"
[Interactive: click a token to set it as the query position.]
Attention(Q, K, V) = softmax(QKᵀ / √d) V
💡 What Attention Does

For each query position, attention computes how much weight to place on every other position. Here "it" should attend strongly to "cat" (its antecedent) and only weakly to "mat" and "sat".
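A minimal sketch of this computation in Python/NumPy, assuming made-up token embeddings and projection matrices (the names X, Wq, Wk, Wv and the scaled_dot_product_attention helper are illustrative, not part of the demo):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ/√d)·V, for one or more query rows."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key positions
    return weights @ V, weights                      # weighted sum of values, plus the weights

# Toy setup matching the demo's scale: 8 tokens, d = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                          # one (made-up) embedding per token
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

out, weights = scaled_dot_product_attention(Q[7:8], K, V)   # query position 7 = "it"
print(weights.round(2))                              # how much "it" attends to each token
```

With trained weight matrices instead of random ones, the printed row would concentrate most of its mass on "cat".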

Attention Computation
[Interactive panel: shows the Q (query), K (key), and V (value) vectors for the chosen position, the attention scores (Q·Kᵀ / √d → softmax), and the output Σ(attention × value). Select a query position and press Compute.]
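Written as explicit loops, the panel's three stages look roughly like this (the q, keys, and values numbers below are placeholders for illustration, not the demo's actual vectors):

```python
import numpy as np

d = 4
q      = np.array([0.2, -0.1, 0.5, 0.3])            # Q (query) for the selected position
keys   = np.array([[0.1, 0.4, -0.2, 0.0],           # K (keys), one row per token
                   [0.6, 0.1,  0.3, 0.2],
                   [0.0, 0.2,  0.1, -0.3]])
values = np.array([[1.0, 0.0,  0.0, 0.0],           # V (values), one row per token
                   [0.0, 1.0,  0.0, 0.0],
                   [0.0, 0.0,  1.0, 0.0]])

# 1. Attention scores: q·kᵢ / √d for every key
scores = np.array([q @ k / np.sqrt(d) for k in keys])

# 2. Softmax turns the scores into weights that sum to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 3. Output = Σ(attention × value)
output = sum(w * v for w, v in zip(weights, values))

print("scores :", scores.round(3))
print("weights:", weights.round(3))
print("output :", output.round(3))
```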
Attention State
[Interactive panel: attention weights (after softmax) and performance counters (phase, total cycles, elapsed time, sequence length = 8).]
⚠ Scaling Warning
This demo (seq=8, d=4): ~2,400 cycles
GPT-2 attention (seq=1024): ~50M cycles
Full transformer layer: ~500M cycles
Time @ 150 Hz: ~38 days/layer
🔑 Why This Matters

Attention scales as O(n²) with sequence length. Double the sequence length → 4× the compute. This is why long context windows are so expensive.
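A back-of-the-envelope sketch of that quadratic growth (the 2·n²·d multiply-accumulate count and the one-operation-per-cycle assumption are simplifications, so the figures will not match the table's cycle counts exactly):

```python
def attention_ops(n, d):
    """Rough multiply-accumulate count: n²·d for Q·Kᵀ plus n²·d for the weighted sum of V."""
    return 2 * n * n * d

for n, d in [(8, 4), (16, 4), (1024, 64)]:            # doubling n quadruples the n²·d term
    ops = attention_ops(n, d)
    secs = ops / 150                                   # one operation per cycle at 150 Hz
    print(f"seq={n:5d}  ops={ops:>12,}  time@150Hz={secs:>10,.0f} s  (~{secs/86400:.1f} days)")
```

Going from seq=8 to seq=16 quadruples the operation count, which is exactly the O(n²) behaviour described above.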

🧠 [Live counter: operations your brain has performed since arriving.]