1958
The Perceptron
Frank Rosenblatt at Cornell Aeronautical Laboratory demonstrates the perceptron, the first machine capable of learning by example. It was first simulated on an IBM 704 and later built as custom hardware, the Mark I Perceptron. The New York Times reported expectations that it would "walk, talk, see, write, reproduce itself and be conscious of its existence."
[1] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
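The learning rule itself is a few lines. A minimal NumPy sketch of Rosenblatt-style error-driven updates on toy linearly separable data (the data and hyperparameters here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable task: label 1 iff x0 + x1 > 1
X = rng.uniform(0, 1, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(50):
    errors = 0
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)       # threshold unit: fire or not
        update = lr * (target - pred)    # nonzero only on a mistake
        w += update * xi                 # nudge weights toward the target
        b += update
        errors += int(update != 0)
    if errors == 0:                      # converged: every point classified
        break
```

On separable data like this the loop provably terminates with a separating hyperplane; on non-separable data it never settles, which is the single-layer limitation later entries address.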
1986
Backpropagation
Rumelhart, Hinton, and Williams publish the backpropagation algorithm for training multi-layer networks in Nature, showing that gradient descent through hidden layers lets networks learn useful internal representations. The paper helped revive neural network research after the first "AI winter."
[2] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. doi:10.1038/323533a0
1990
Recurrent Neural Networks
Jeffrey Elman introduces the Simple Recurrent Network (Elman Network) in "Finding Structure in Time," enabling neural networks to process sequential data by maintaining a hidden state that carries information through time.
[3] Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
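The core of the architecture is a single recurrence. A minimal sketch (dimensions and initialization are illustrative):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous
    # state, so information can persist across time steps.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W_xh = rng.normal(0, 0.1, (d_in, d_h))
W_hh = rng.normal(0, 0.1, (d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                          # the "context" starts empty
for x_t in rng.normal(size=(10, d_in)):    # a 10-step input sequence
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
```

Because gradients must flow through `W_hh` once per time step, repeated multiplication shrinks or explodes them; this is the vanishing gradient problem that motivates the next entry.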
1997
Long Short-Term Memory (LSTM)
Hochreiter and Schmidhuber address the vanishing gradient problem with the Long Short-Term Memory (LSTM) architecture, introducing gated memory cells that can learn dependencies spanning 1,000+ time steps. LSTM remained the dominant architecture for sequence modeling until the transformer.
[4] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735
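A hedged NumPy sketch of one LSTM step in the now-standard gated form (the 1997 paper's original formulation differs slightly; the forget gate was added in later work):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    d = h_prev.size
    z = np.concatenate([x_t, h_prev]) @ W + b   # all four gates at once
    f = sigmoid(z[:d])          # forget gate: what to keep in the cell
    i = sigmoid(z[d:2*d])       # input gate: what to write
    o = sigmoid(z[2*d:3*d])     # output gate: what to expose
    g = np.tanh(z[3*d:])        # candidate cell contents
    c = f * c_prev + i * g      # additive update: gradients flow through c
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.normal(0, 0.1, (d_in + d_h, 4 * d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The additive cell update `c = f * c_prev + i * g` is the trick: when `f` is near 1, the gradient passes through nearly unchanged instead of being repeatedly squashed.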
2007
NVIDIA CUDA
NVIDIA releases CUDA (Compute Unified Device Architecture), transforming GPUs from graphics processors into general-purpose parallel computing engines. This enabled the deep learning revolution by providing the computational substrate neural networks required.
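What "general-purpose" means in practice: you write a scalar kernel and the GPU runs it across thousands of threads. A hedged sketch using the numba package's CUDA bindings rather than raw C (an NVIDIA GPU and `numba` are assumptions, not something the entry mentions):

```python
import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:          # guard: the grid may overshoot the array
        out[i] = a[i] + b[i]  # one element per thread, all in parallel

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.full(n, 2.0, dtype=np.float32)
out = np.empty_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vec_add[blocks, threads](a, b, out)   # launch ~1M threads
```

The same pattern (many small threads over one big array) maps directly onto the matrix multiplies that dominate neural network training.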
2012
AlexNet & The Deep Learning Revolution
Krizhevsky, Sutskever, and Hinton win the ImageNet challenge with AlexNet, cutting the top-5 error rate by more than 10 percentage points over the runner-up. Trained on two NVIDIA GTX 580 GPUs using CUDA, this demonstrated that deep neural networks, given sufficient compute, could decisively outperform hand-engineered computer vision approaches.
[6] Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
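For context on what the network stacks, here is the core operation, a 2-D convolution (valid cross-correlation), in plain NumPy; AlexNet composes dozens of these with ReLUs and pooling, but the primitive itself is small:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output pixel is a local
    # weighted sum, so the same weights are reused at every position.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

img = np.zeros((8, 8)); img[:, 4:] = 1.0     # image with a vertical edge
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
edges = conv2d(img, sobel_x)                 # strong response at the edge
```

AlexNet's contribution was not this operation but the demonstration that stacking it deeply, on GPU-scale compute, works.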
2014
Attention Mechanism
Bahdanau, Cho, and Bengio introduce the attention mechanism for neural machine translation, allowing models to dynamically focus on relevant parts of the input rather than compressing everything into a fixed-length vector.
[7] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. arxiv.org/abs/1409.0473
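A hedged NumPy sketch of the additive ("Bahdanau") scoring in the paper's spirit; the matrix names follow common convention rather than the paper's exact notation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s, H, W_s, W_h, v):
    # Score each encoder state against the decoder state s, then take
    # an attention-weighted average instead of one fixed-length vector.
    scores = np.tanh(s @ W_s + H @ W_h) @ v   # one score per input position
    alpha = softmax(scores)                   # attention weights, sum to 1
    return alpha @ H, alpha                   # context vector and weights

rng = np.random.default_rng(0)
T, d = 6, 8                     # 6 encoder states of width 8
H = rng.normal(size=(T, d))     # encoder hidden states
s = rng.normal(size=d)          # current decoder state
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
context, alpha = additive_attention(s, H, W_s, W_h, v)
```

The decoder recomputes `alpha` at every output step, which is what "dynamically focus" means above.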
2017
The Transformer
Vaswani et al. introduce the Transformer architecture in "Attention Is All You Need," replacing recurrence with self-attention. This enabled massive parallelization and became the foundation for GPT, BERT, and virtually all modern large language models.
[8] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. arxiv.org/abs/1706.03762
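The paper's central primitive, scaled dot-product self-attention, fits in a few lines of NumPy (single head, no masking or positional encoding, so a sketch rather than the full architecture):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # softmax(Q K^T / sqrt(d_k)) V: every position attends to every
    # other position in one batched matrix product, with no recurrence.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (T, T) pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
T, d = 5, 16                               # sequence length 5, width 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)     # (T, d)
```

Unlike the recurrent step above, nothing here depends on the previous time step, so the whole sequence is processed in parallel; that is the property that made GPU-scale training practical.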
💡 The Research Question
The human brain operates at no more than ~1,000 Hz per neuron, only about 10× faster than our simulation. Yet it performs tasks no transformer can match. The difference isn't raw speed; it's massive parallelism: roughly 86 billion units with ~100 trillion connections, running on about 20 watts. What's the minimum parallelism needed for "intelligent" behavior?
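The numbers in the question support a quick back-of-envelope check (all figures are the rough estimates quoted above, not measurements):

```python
neurons  = 86e9     # rough estimates from the paragraph above
synapses = 100e12
rate_hz  = 1e3      # upper-bound firing rate; typical rates are far lower
power_w  = 20.0

fan_out     = synapses / neurons    # ~1,160 connections per neuron
events_s    = synapses * rate_hz    # ~1e17 synaptic events/s, upper bound
j_per_event = power_w / events_s    # ~2e-16 joules per event

print(f"{fan_out:.0f} connections/neuron")
print(f"{events_s:.1e} events/s peak, {j_per_event:.1e} J/event")
```

Even at the generous 1 kHz bound, that is on the order of 10^-16 joules per synaptic event, several orders of magnitude below what current digital hardware spends per multiply-accumulate.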