Neural Network Evolution

Cycle-Accurate Simulations @ 130-160 Hz
What if you ran a neural network on a 1950s-era processor?

This interactive demonstration suite explores the fundamental operations of neural computation—from Rosenblatt's 1958 perceptron to the 2017 transformer—constrained to 130-160 Hz clock speeds. Each demo provides cycle-accurate simulations showing exactly how many operations each architecture requires, and why modern AI needed modern hardware.

Why slow it down? At 10¹⁵ operations per second, modern AI is incomprehensible: a black box executing billions of calculations faster than thought. But at 150 Hz, we can watch it think. Each multiply-accumulate becomes visible. Each attention weight computation can be traced. The architecture reveals itself through slowness. These demos make the invisible visible by returning to speeds where human cognition can follow along.
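To get a feel for that timescale, here is a small sketch (Python is a choice of convenience here, not necessarily the demos' implementation language) that converts the cycle counts quoted in the demo list below into wall-clock time at a 150 Hz clock:

```python
# Wall-clock time of one forward pass at the simulated clock speed.
# Cycle counts are the estimates quoted in the demo list below.

CLOCK_HZ = 150

demos = {
    "single perceptron": 63,
    "multi-layer perceptron": 927,   # upper end of the 279-927 range
    "recurrent network": 3_960,
    "attention mechanism": 2_400,
    "transformer block": 6_300,
}

for name, cycles in demos.items():
    print(f"{name}: {cycles} cycles ≈ {cycles / CLOCK_HZ:.1f} s at {CLOCK_HZ} Hz")
# e.g. a single transformer block takes about 42 seconds: slow enough to watch.
```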

The historical insight: GPT-4 is fundamentally perceptrons, billions of them arranged with attention. The atomic operation (weighted sum + nonlinearity) is identical to 1958. What changed is scale: from a handful of operations per second to 10¹⁵ ops/sec. By stripping away that scale, we can see that the "magic" of modern AI is not algorithmic complexity; it is the same simple operations, repeated at incomprehensible speed and parallelism.
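That atomic operation fits in a few lines. Below is a minimal floating-point sketch of a single 1958-style neuron; the names and values are illustrative, and the demos themselves use Q8.8 fixed-point rather than floats:

```python
# Minimal sketch of the atomic operation: weighted sum + nonlinearity.
# Illustrative only; not the demos' actual (fixed-point) implementation.

def perceptron(inputs, weights, bias):
    """One neuron: multiply-accumulate over the inputs, then a threshold."""
    total = bias
    for x, w in zip(inputs, weights):
        total += x * w              # one multiply-accumulate per input
    return 1 if total >= 0 else 0   # Rosenblatt-style step activation

# Example: a 3-input neuron making a single binary decision.
print(perceptron([1.0, 0.5, -0.25], [0.4, -0.6, 0.9], bias=0.1))  # prints 0 here
```

Everything that follows, from the MLP to the transformer block, is this operation repeated and rearranged.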

Historical Timeline

1958
The Perceptron
Frank Rosenblatt at Cornell Aeronautical Laboratory introduces the perceptron, the first machine capable of learning by example; it was simulated on an IBM 704 and later built as the custom Mark I Perceptron hardware. The New York Times reported Navy expectations that it would "walk, talk, see, write, reproduce itself and be conscious."
[1] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
1986
Backpropagation
Rumelhart, Hinton, and Williams publish the backpropagation algorithm for training multi-layer networks in Nature, enabling deep neural networks to learn internal representations. This paper revived neural network research after the "AI winter."
[2] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. doi:10.1038/323533a0
1990
Recurrent Neural Networks
Jeffrey Elman introduces the Simple Recurrent Network (Elman Network) in "Finding Structure in Time," enabling neural networks to process sequential data by maintaining a hidden state that carries information through time.
[3] Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
1997
Long Short-Term Memory (LSTM)
Hochreiter and Schmidhuber solve the vanishing gradient problem with LSTM, introducing gated memory cells that can learn dependencies over 1000+ time steps. LSTM became the dominant architecture for sequence modeling until transformers.
[4] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735
2007
NVIDIA CUDA
NVIDIA releases CUDA (Compute Unified Device Architecture), transforming GPUs from graphics processors into general-purpose parallel computing engines. This enabled the deep learning revolution by providing the computational substrate neural networks required.
[5] NVIDIA Corporation. (2007). CUDA: Compute Unified Device Architecture. Initial SDK released February 15, 2007. developer.nvidia.com/cuda-zone
2012
AlexNet & The Deep Learning Revolution
Krizhevsky, Sutskever, and Hinton win ImageNet with AlexNet, reducing error rates by 10+ percentage points. Trained on two NVIDIA GTX 580 GPUs using CUDA, it demonstrated that deep neural networks could decisively outperform hand-engineered approaches given sufficient compute.
[6] Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25. NeurIPS
2014
Attention Mechanism
Bahdanau, Cho, and Bengio introduce the attention mechanism for neural machine translation, allowing models to dynamically focus on relevant parts of the input rather than compressing everything into a fixed-length vector.
[7] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. arxiv.org/abs/1409.0473
2017
The Transformer
Vaswani et al. introduce the Transformer architecture in "Attention Is All You Need," replacing recurrence with self-attention. This enabled massive parallelization and became the foundation for GPT, BERT, and virtually all modern large language models.
[8] Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arxiv.org/abs/1706.03762

💡 The Research Question

Individual neurons fire at no more than ~1000 Hz, less than 10× faster than our 150 Hz simulation. Yet the brain performs tasks no transformer can match. The difference isn't raw speed; it's 86 billion parallel units with 100 trillion connections, operating at ~20 watts. What's the minimum parallelism needed for "intelligent" behavior?
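One way to frame the question is a back-of-envelope count. The sketch below uses the connection count quoted above plus an assumed average firing rate of ~10 Hz; that rate, and the one-operation-per-synaptic-event simplification, are assumptions rather than figures from the demos:

```python
# Back-of-envelope parallelism estimate.
# ASSUMPTIONS: ~10 Hz average firing rate, one "operation" per synaptic event.

SYNAPSES = 100e12          # ~100 trillion connections (quoted above)
AVG_FIRING_RATE_HZ = 10    # assumed average; peak firing is ~1000 Hz
SIM_CLOCK_HZ = 150         # the demo suite's clock

brain_ops_per_sec = SYNAPSES * AVG_FIRING_RATE_HZ   # ~1e15 synaptic events/s
units_needed = brain_ops_per_sec / SIM_CLOCK_HZ      # 150 Hz units in parallel

print(f"{brain_ops_per_sec:.1e} ops/s -> {units_needed:.1e} parallel 150 Hz units")
# ~1.0e+15 ops/s -> ~6.7e+12 units: the gap is parallelism, not clock speed.
```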

Interactive Demonstrations

01 ~63 cycles
Single Perceptron
The atomic unit of neural computation (1958)
Q8.8 Fixed-Point (sketched after this list) · Step-by-Step Execution · Binary Visualization · 4 Preset Scenarios
02 279-927 cycles
Multi-Layer Perceptron
Stacking neurons for nonlinear power (1986)
Adjustable Width · Layer-by-Layer View · ReLU Activation · Scaling Comparison
03 Static Analysis
Scaling Analysis
Why modern AI needs modern compute
Width Scaling Charts · Depth Scaling Charts · Architecture Timeline · GPT-2 Projections
04 ~3,960 cycles
Recurrent Neural Network
Sequential memory through time (1990)
Hidden State Memory · Sequence Processing · Unrolled Visualization · Timestep History
05 ~2,400 cycles
Attention Mechanism
"Which inputs matter?" (2014)
Q·K·V Computation · Softmax Visualization · Attention Matrix · O(n²) Scaling Demo
06 ~6,300 cycles
Transformer Block
The complete architecture (2017)
LayerNorm + Attention · Feed-Forward Network · Residual Connections · Phase-by-Phase View
07 Comparison
Parallel vs Sequential
Why GPUs changed everything
Side-by-Side Race · Adjustable Core Count · GPU Utilization Viz · Speedup Metrics
08 Explorer
Quantization Explorer
Bit precision vs performance tradeoffs
32/16/8/4/2-bit · Error Visualization · Number Line Display · Accuracy Analysis
09 Comparison
Brain vs Silicon
What would it take to match 86 billion neurons?
Race Simulation · Efficiency Calculator · Power Comparison · Scale Visualization
10 Deep Dive
Brain-Scale Calculator
Build your own brain-scale AI system
Hardware Calculator · GPU Model Comparison · Timeline Projections · Research Insights
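Demos 01 and 08 both revolve around fixed-point arithmetic. As a rough illustration of what Q8.8 means, here is a minimal sketch assuming the standard Q8.8 convention (16-bit values, 8 fractional bits); the demos' exact rounding, overflow, and saturation behavior is not reproduced here:

```python
# Q8.8 fixed-point sketch: real_value = stored_int / 256.
# Illustrative only; rounding/saturation rules of the demos are not shown.

SCALE = 1 << 8  # 2^8 = 256

def to_q88(x: float) -> int:
    return int(round(x * SCALE))

def from_q88(q: int) -> float:
    return q / SCALE

def q88_mul(a: int, b: int) -> int:
    # The product of two Q8.8 values carries 16 fractional bits;
    # shifting right by 8 brings it back to Q8.8.
    return (a * b) >> 8

# One multiply-accumulate, the operation each simulated cycle makes visible:
acc = to_q88(0.10)                              # bias
acc += q88_mul(to_q88(0.50), to_q88(-0.75))     # x * w
print(from_q88(acc))   # -0.2734375, vs. the exact -0.275 (Q8.8 resolution is 1/256)
```

The quantization explorer (demo 08) asks the same question with fewer bits: how coarse can this grid get before accuracy collapses?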

References

[1] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408. doi:10.1037/h0042519
[2] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. doi:10.1038/323533a0
[3] Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211. doi:10.1207/s15516709cog1402_1
[4] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735
[5] NVIDIA Corporation. (2007). CUDA: Compute Unified Device Architecture. Initial SDK released February 15, 2007. developer.nvidia.com/cuda-zone
[6] Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
[7] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. arxiv.org/abs/1409.0473
[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. arxiv.org/abs/1706.03762