Before 2017, the best language models processed text one word at a time, keeping a running memory of what came before. This paper proposed throwing out that sequential approach entirely: every word in a sentence looks at every other word simultaneously, deciding how much attention to pay to each. The result was faster to train, more accurate, and became the foundation for GPT, BERT, and essentially every major language model since. It is hard to overstate how completely this paper restructured the field. If you have used a language model in the last five years, you have used a descendant of this paper.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and…
“The architecture that rewrote the decade — every major language model traces its lineage here.”
Building any sequence model where parallelism matters, or where long-range dependencies are critical — translation, summarization, generation, or any NLP task. The architecture is now so universal that the better question is when not to use it.
You need to process very long sequences on memory-constrained hardware: self-attention's O(n²) cost in sequence length is a hard wall without sparse-attention variants like Longformer, memory-efficient kernels like FlashAttention, or attention-free alternatives like Mamba. Also skip it if you need strict causal guarantees with formal proofs.
LSTM and GRU models dominated before this. The Transformer outperformed them in quality and training speed while eliminating sequential bottlenecks that prevented parallelism.
What It Claims
As told for the curious
Key Ideas
6 contributions · The core concepts in plain terms
Read Next
Papers and articles that extend or critique these ideas
Related Work
Papers and codebases in the same intellectual neighbourhood
Terminology
Definitions in context for the key terms used above
attention
concept · A mechanism that computes a weighted sum of values, where the weights reflect the compatibility of a query with a set of keys.
label smoothing
concept · Replacing one-hot targets with a softened distribution (probability mass ε spread over non-target classes) to prevent overconfidence and improve calibration.
multi-head attention
concept · Running h independent attention heads in parallel over subspaces of the representation, then concatenating the results, giving the model multiple 'perspectives' on the input.
positional encoding
concept · A signal added to embeddings to inject token-position information into an otherwise order-invariant architecture.
self-attention
concept · Attention where queries, keys, and values all come from the same input sequence, so every position attends to every other.
transformer
concept · A neural network architecture built entirely from attention mechanisms, with no recurrence or convolution.
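The glossary entries above fit together in a few lines of code. Below is a minimal NumPy sketch of scaled dot-product self-attention, the paper's Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, where queries, keys, and values all come from the same sequence. The shapes and random matrices are illustrative stand-ins for learned projections, not the paper's actual weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (n, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
    Returns (n, d_k): each row is a weighted sum of all value vectors,
    with weights given by query-key compatibility.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n): every position vs. every other
    weights = softmax(scores, axis=-1)  # each row is a distribution summing to 1
    return weights @ V

# Toy example: 5 tokens, 16-dim embeddings, 8-dim heads (hypothetical sizes).
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Multi-head attention, in this sketch, would simply run several such heads with their own Wq/Wk/Wv and concatenate the outputs; the O(n²) cost noted earlier comes from the (n, n) score matrix.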