1958
The Perceptron
Frank Rosenblatt at Cornell Aeronautical Laboratory demonstrates the perceptron, the first machine capable of learning by example. It was first simulated on an IBM 704 and later built as custom hardware, the Mark I Perceptron. The New York Times reported expectations that it would "walk, talk, see, write, reproduce itself and be conscious of its existence."
[1] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
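The learning rule itself is a few lines. A minimal NumPy sketch of Rosenblatt-style error-driven updates on toy linearly separable data (the data and hyperparameters here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable task: label 1 iff x0 + x1 > 1
X = rng.uniform(0, 1, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(50):
    errors = 0
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)       # threshold unit: fire or not
        update = lr * (target - pred)    # nonzero only on a mistake
        w += update * xi                 # nudge weights toward the target
        b += update
        errors += int(update != 0)
    if errors == 0:                      # converged: every point classified
        break
```

On separable data like this the loop provably terminates with a separating hyperplane; on non-separable data it never settles, which is the single-layer limitation later entries address.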
1986
Backpropagation
Rumelhart, Hinton, and Williams publish the backpropagation algorithm for training multi-layer networks in Nature, showing that gradient descent through hidden layers lets networks learn useful internal representations. The paper helped revive neural network research after the first "AI winter."
[2] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. doi:10.1038/323533a0
1990
Recurrent Neural Networks
Jeffrey Elman introduces the Simple Recurrent Network (Elman Network) in "Finding Structure in Time," enabling neural networks to process sequential data by maintaining a hidden state that carries information through time.
[3] Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
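The core of the architecture is a single recurrence. A minimal sketch (dimensions and initialization are illustrative):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous
    # state, so information can persist across time steps.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W_xh = rng.normal(0, 0.1, (d_in, d_h))
W_hh = rng.normal(0, 0.1, (d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                          # the "context" starts empty
for x_t in rng.normal(size=(10, d_in)):    # a 10-step input sequence
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
```

Because gradients must flow through `W_hh` once per time step, repeated multiplication shrinks or explodes them; this is the vanishing gradient problem that motivates the next entry.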
1997
Long Short-Term Memory (LSTM)
Hochreiter and Schmidhuber address the vanishing gradient problem with the Long Short-Term Memory (LSTM) architecture, introducing gated memory cells that can learn dependencies spanning 1,000+ time steps. LSTM remained the dominant architecture for sequence modeling until the transformer.
[4] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735
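A hedged NumPy sketch of one LSTM step in the now-standard gated form (the 1997 paper's original formulation differs slightly; the forget gate was added in later work):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    d = h_prev.size
    z = np.concatenate([x_t, h_prev]) @ W + b   # all four gates at once
    f = sigmoid(z[:d])          # forget gate: what to keep in the cell
    i = sigmoid(z[d:2*d])       # input gate: what to write
    o = sigmoid(z[2*d:3*d])     # output gate: what to expose
    g = np.tanh(z[3*d:])        # candidate cell contents
    c = f * c_prev + i * g      # additive update: gradients flow through c
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.normal(0, 0.1, (d_in + d_h, 4 * d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The additive cell update `c = f * c_prev + i * g` is the trick: when `f` is near 1, the gradient passes through nearly unchanged instead of being repeatedly squashed.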
2007
NVIDIA CUDA
NVIDIA releases CUDA (Compute Unified Device Architecture), transforming GPUs from graphics processors into general-purpose parallel computing engines. This enabled the deep learning revolution by providing the computational substrate neural networks required.
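What "general-purpose" means in practice: you write a scalar kernel and the GPU runs it across thousands of threads. A hedged sketch using the numba package's CUDA bindings rather than raw C (an NVIDIA GPU and `numba` are assumptions, not something the entry mentions):

```python
import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:          # guard: the grid may overshoot the array
        out[i] = a[i] + b[i]  # one element per thread, all in parallel

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.full(n, 2.0, dtype=np.float32)
out = np.empty_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vec_add[blocks, threads](a, b, out)   # launch ~1M threads
```

The same pattern (many small threads over one big array) maps directly onto the matrix multiplies that dominate neural network training.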
2012
AlexNet & The Deep Learning Revolution
Krizhevsky, Sutskever, and Hinton win the ImageNet challenge with AlexNet, cutting the top-5 error rate by more than 10 percentage points over the runner-up. Trained on two NVIDIA GTX 580 GPUs using CUDA, this demonstrated that deep neural networks, given sufficient compute, could decisively outperform hand-engineered computer vision approaches.
[6] Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
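For context on what the network stacks, here is the core operation, a 2-D convolution (valid cross-correlation), in plain NumPy; AlexNet composes dozens of these with ReLUs and pooling, but the primitive itself is small:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output pixel is a local
    # weighted sum, so the same weights are reused at every position.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

img = np.zeros((8, 8)); img[:, 4:] = 1.0     # image with a vertical edge
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
edges = conv2d(img, sobel_x)                 # strong response at the edge
```

AlexNet's contribution was not this operation but the demonstration that stacking it deeply, on GPU-scale compute, works.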
2014
Attention Mechanism
Bahdanau, Cho, and Bengio introduce the attention mechanism for neural machine translation, allowing models to dynamically focus on relevant parts of the input rather than compressing everything into a fixed-length vector.
[7] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. arxiv.org/abs/1409.0473
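A hedged NumPy sketch of the additive ("Bahdanau") scoring in the paper's spirit; the matrix names follow common convention rather than the paper's exact notation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s, H, W_s, W_h, v):
    # Score each encoder state against the decoder state s, then take
    # an attention-weighted average instead of one fixed-length vector.
    scores = np.tanh(s @ W_s + H @ W_h) @ v   # one score per input position
    alpha = softmax(scores)                   # attention weights, sum to 1
    return alpha @ H, alpha                   # context vector and weights

rng = np.random.default_rng(0)
T, d = 6, 8                     # 6 encoder states of width 8
H = rng.normal(size=(T, d))     # encoder hidden states
s = rng.normal(size=d)          # current decoder state
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
context, alpha = additive_attention(s, H, W_s, W_h, v)
```

The decoder recomputes `alpha` at every output step, which is what "dynamically focus" means above.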
2017
The Transformer
Vaswani et al. introduce the Transformer architecture in "Attention Is All You Need," replacing recurrence with self-attention. This enabled massive parallelization and became the foundation for GPT, BERT, and virtually all modern large language models.
[8] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. arxiv.org/abs/1706.03762
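The paper's central primitive, scaled dot-product self-attention, fits in a few lines of NumPy (single head, no masking or positional encoding, so a sketch rather than the full architecture):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # softmax(Q K^T / sqrt(d_k)) V: every position attends to every
    # other position in one batched matrix product, with no recurrence.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (T, T) pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
T, d = 5, 16                               # sequence length 5, width 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)     # (T, d)
```

Unlike the recurrent step above, nothing here depends on the previous time step, so the whole sequence is processed in parallel; that is the property that made GPU-scale training practical.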
💡 The Research Question
The human brain operates at no more than ~1,000 Hz per neuron, only about 10× faster than our simulation. Yet it performs tasks no transformer can match. The difference isn't raw speed; it's massive parallelism: roughly 86 billion units with ~100 trillion connections, running on about 20 watts. What's the minimum parallelism needed for "intelligent" behavior?
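The numbers in the question support a quick back-of-envelope check (all figures are the rough estimates quoted above, not measurements):

```python
neurons  = 86e9     # rough estimates from the paragraph above
synapses = 100e12
rate_hz  = 1e3      # upper-bound firing rate; typical rates are far lower
power_w  = 20.0

fan_out     = synapses / neurons    # ~1,160 connections per neuron
events_s    = synapses * rate_hz    # ~1e17 synaptic events/s, upper bound
j_per_event = power_w / events_s    # ~2e-16 joules per event

print(f"{fan_out:.0f} connections/neuron")
print(f"{events_s:.1e} events/s peak, {j_per_event:.1e} J/event")
```

Even at the generous 1 kHz bound, that is on the order of 10^-16 joules per synaptic event, several orders of magnitude below what current digital hardware spends per multiply-accumulate.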