
Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

2017 · NeurIPS 2017 · 95,000 citations · 15 pages · arXiv:1706.03762
Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and…

Reading Posture
From the Field
The architecture that rewrote the decade — every major language model traces its lineage here.
Verdict: Reach for it
Reach for it when

Building any sequence model where parallelism matters, or where long-range dependencies are critical — translation, summarization, generation, or any NLP task. The architecture is now so universal that the better question is when not to use it.

Look elsewhere when

You need to process very long sequences on memory-constrained hardware: the O(n²) cost of full attention is a hard wall without sparse variants like Longformer, non-attention alternatives like Mamba, or memory-efficient kernels like FlashAttention (which reduces memory traffic, not asymptotic compute). Also skip it if you need strict causal guarantees with formal proofs.
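
A back-of-the-envelope sketch of that quadratic wall, assuming fp32 attention scores and 8 heads per layer (illustrative values, not a benchmark):

# Memory for the raw n x n attention score matrices, per layer
heads, bytes_per_score = 8, 4            # assumed: 8 heads, fp32 scores
for n in (512, 4_096, 65_536):           # sequence lengths
    size = heads * n * n * bytes_per_score
    print(f"n={n:>6}: {size / 2**30:6.2f} GiB of scores per layer")
# prints roughly 0.01 GiB, 0.50 GiB, and 128.00 GiB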

In context

LSTM and GRU models dominated before this. The Transformer outperformed them in quality and training speed while eliminating sequential bottlenecks that prevented parallelism.

Complexity: ●● Medium
Read time: ~45 minutes

What It Claims

As told for the curious

Before 2017, the best language models processed text one word at a time, carrying a running memory of everything read so far. This paper proposed throwing out that sequential approach entirely: every word in a sentence looks at every other word simultaneously, deciding how much attention to pay to each. The result was faster to train, more accurate, and became the foundation for GPT, BERT, and essentially every major language model since. It is hard to overstate how completely this paper restructured the field. If you have used a language model in the last five years, you have used a descendant of this paper.

Key Ideas

6 contributions · The core concepts in plain terms, defined under Terminology below



Terminology

Definitions in context for the key terms used in this dispatch

attention

concept

A mechanism that computes a weighted sum of values, where weights reflect the compatibility of a query with a set of keys.
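
A minimal NumPy sketch of the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k))·V; the array shapes are illustrative:

import numpy as np

def attention(Q, K, V):
    # Compatibility of each query with each key, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over the keys
    return weights @ V                           # weighted sum of values

Each row of weights sums to 1: it says how strongly that query position draws on each value.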

label smoothing

concept

Replacing one-hot targets with a soft distribution (1 − ε on the true class, with ε spread over the remaining classes) to prevent overconfidence and improve calibration.
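
A small sketch with the paper's ε = 0.1; spreading ε uniformly over the non-target classes is one common formulation, assumed here:

import numpy as np

def smooth_targets(labels, num_classes, eps=0.1):
    # 1 - eps on the true class, eps split evenly across the rest
    t = np.full((len(labels), num_classes), eps / (num_classes - 1))
    t[np.arange(len(labels)), labels] = 1.0 - eps
    return t

print(smooth_targets(np.array([2]), num_classes=4))
# [[0.0333... 0.0333... 0.9 0.0333...]] -- each row still sums to 1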

multi-head attention

concept

Running h independent attention heads in parallel over subspaces, then concatenating — giving the model multiple 'perspectives' on the input.
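
A sketch of the split-project-concatenate pattern, reusing the attention function from the sketch above; the random matrices are stand-ins for learned projection weights:

import numpy as np

def multi_head(X, h=8, d_model=512):
    rng = np.random.default_rng(0)
    d_k = d_model // h                      # each head works in a subspace
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))   # (n, d_k) per head
    Wo = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo            # back to (n, d_model)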

positional encoding

concept

A signal added to embeddings to inject token position information into an otherwise order-invariant architecture.
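
The paper's sinusoidal variant, as a sketch (assumes an even d_model):

import numpy as np

def positional_encoding(n_positions, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); cos for odd dimensions
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to the token embeddings before the first layer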

self-attention

concept

Attention where queries, keys, and values all come from the same input sequence — every position attends to every other.
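
In terms of the attention sketch above, self-attention just derives Q, K, and V from the same sequence; random projections again stand in for learned ones:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))     # one sequence: 10 tokens, d_model = 64
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)   # each position attends to all 10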

transformer

concept

A neural network architecture built entirely from attention mechanisms, without recurrence or convolution.
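
A bare-bones encoder layer in that spirit, again reusing the attention sketch: self-attention, then a position-wise feed-forward network, each wrapped in a residual connection and layer norm. A structural sketch only; it omits dropout, masking, and real initialization:

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(X, d_model=512, d_ff=2048):
    rng = np.random.default_rng(0)
    # Sub-layer 1: self-attention + residual + layer norm
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    X = layer_norm(X + attention(X @ Wq, X @ Wk, X @ Wv))
    # Sub-layer 2: position-wise feed-forward (ReLU) + residual + layer norm
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))
    return layer_norm(X + np.maximum(0.0, X @ W1) @ W2)

The paper stacks six such layers in the encoder and six (with an extra cross-attention sub-layer) in the decoder.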
