“All you need is attention” (Translated to 8th grade literacy by ChatGPT; Prompted by JinJa Birkenbeuel)
“All You Need Is Attention.” Image created with MidJourney from a prompt written by JinJa Birkenbeuel
Original paper: https://arxiv.org/pdf/1706.03762.pdf
I’ve been trying to understand Google’s “transformer,” which is apparently the technology foundation of ChatGPT. I had heard about a paper called “Attention Is All You Need” in various chat servers. I found it, tried to read it, and no way, I had no idea what any of it meant. So I decided to plug it into ChatGPT and ask it to break the research paper down to an 8th grade literacy level. It’s still tough, but way more manageable. Once again, Google has provided the foundation of so much opportunity for others. The company has created success and revenue for people, startups, educators, students, grandmas and business owners, including even giant brands that wouldn’t have been able to sell so many burgers or cars without Google. *I did prompt ChatGPT to bring it down to a 6th grade level, and I was stunned at the degradation from 8th to 6th.
Have a look at this “translation” to help you appreciate the evolution of our new AI era. We are in it. Right now.
And after you read this, do you feel Google should have kept this open source, or kept it closed and perhaps licensed it out for others to build on? Share your thoughts in the comments. I’m so curious!
Abstract
Most popular sequence transformation models use complicated recurrent or convolutional neural networks with an encoder and decoder. The best models also link the encoder and decoder using an attention mechanism. We introduce a new, simpler network structure called the Transformer, which relies only on attention mechanisms and eliminates the need for recurrence and convolutions. Tests on two machine translation tasks demonstrate that these models have better quality, are easier to parallelize, and need significantly less time for training. Our model achieves a 28.4 BLEU score on the WMT 2014 English-to-German translation task, improving the existing best results, including ensembles, by over 2 BLEU points. On the WMT 2014 English-to-French translation task, our model sets a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, which is a much shorter training time compared to the best models from the literature. We also prove that the Transformer can be effectively applied to other tasks, such as English constituency parsing, with both large and limited training data.
1 Introduction
Recurrent neural networks, like long short-term memory (LSTM) and gated recurrent neural networks, are really good at solving problems that involve sequences, like language translation. However, these networks have a limitation because they process information step-by-step, making it harder to speed up the process by working on multiple things at once.
Attention mechanisms have been added to these networks to help them understand relationships between different parts of a sequence without worrying about how far apart they are. But, most of the time, attention mechanisms are still used with a recurrent network.
In this work, the authors propose a new model called the Transformer. This model doesn't use recurrence (step-by-step processing) and instead relies entirely on attention mechanisms to understand relationships between different parts of the input and output. The Transformer can work on many things at once, making it faster and more effective at tasks like translation.
2 Background
Other models have tried to reduce the need for step-by-step processing by using convolutional neural networks, which can process information in parallel (all at once). However, these models still struggle to understand relationships between parts of a sequence that are far apart from each other.
The Transformer solves this problem by using self-attention, which is a way for the model to look at different parts of a single sequence and figure out their relationships. Self-attention has been successful in various tasks like reading comprehension, summarizing text, and understanding sentences.
3 Model Architecture
Many good neural network models for translating sequences have an encoder-decoder structure. The encoder takes an input sequence (a series of symbols) and turns it into a continuous representation. The decoder then generates an output sequence of symbols one by one, using the previously generated symbols as extra input.
The Transformer follows this overall structure but uses self-attention and other techniques to improve its speed and effectiveness.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is made up of six identical layers stacked on top of each other. Each layer has two parts. The first part is a multi-head self-attention mechanism, which helps the model focus on different parts of the input. The second part is a simple network that connects different positions. The authors use a special technique called residual connections, followed by layer normalization, to help connect the layers. This means that the output of each part is adjusted and combined with the input to that part. All parts of the model create outputs of a specific size (512 dimensions).
Decoder: The decoder also has six identical layers. Besides the two parts found in each encoder layer, the decoder adds a third part that focuses on the encoder's output. Just like the encoder, the authors use residual connections and layer normalization for the decoder. They also make some changes to the self-attention part in the decoder to make sure that each position in the output can only depend on known outputs at earlier positions.
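To make the “residual connection followed by layer normalization” pattern used in both the encoder and decoder layers concrete, here is a minimal NumPy sketch. It is my own illustration, not the authors’ code, and it leaves out the learned scale and bias that a full layer normalization has.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # "Residual connection followed by layer normalization":
    # add the sub-layer's output back to its input, then normalize the result.
    return layer_norm(x + sublayer(x))

# Example: a sequence of 10 positions, each a 512-dimensional vector.
x = np.random.randn(10, 512)
out = sublayer_connection(x, lambda t: t * 0.5)  # stand-in for attention or feed-forward
print(out.shape)  # (10, 512)
```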
3.2 Attention
Attention is a way for the model to decide which parts of the input are important. It works by matching a query (a question) against a set of key-value pairs and producing an output. The query, keys, values, and output are all lists of numbers (vectors). The output is a weighted mix of the values, where the weight given to each value depends on how well the query matches that value's key.
3.2.1 Scaled Dot-Product Attention
Our attention method is called "Scaled Dot-Product Attention." The input has queries and keys of a certain size (dimension dk) and values of another size (dimension dv). We calculate the dot products of the query with all keys, divide each by the square root of dk, and use a softmax function to find the weights on the values.
In real situations, we work on a group of queries at the same time, organizing them into a matrix Q. The keys and values are also organized into matrices K and V. We calculate the matrix of outputs like this:
Attention(Q, K, V) = softmax(Q Kᵀ / √dk) · V
There are two common attention functions: additive attention and dot-product (multiplicative) attention. Our method is like dot-product attention but with a scaling factor. Additive attention uses a simple network to calculate compatibility. Dot-product attention is faster and more efficient because it uses matrix multiplication.
Additive attention works better than dot-product attention without scaling for larger values of dk. We think that's because the dot products become too large, causing issues with the softmax function. To fix this, we scale the dot products by the square root of dk.
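Here is a small NumPy sketch of that formula. It is my own illustration rather than the paper's code: three queries attend over four key-value pairs of size 64.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (num_queries, dk), K: (num_keys, dk), V: (num_keys, dv)
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)        # how well each query matches each key, scaled by sqrt(dk)
    weights = softmax(scores, axis=-1)    # turn the scores into weights that sum to 1
    return weights @ V                    # weighted mix of the values

Q = np.random.randn(3, 64)
K = np.random.randn(4, 64)
V = np.random.randn(4, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 64)
```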
3.2.2 Multi-Head Attention
Instead of using a single attention function with dmodel-dimensional keys, values, and queries, we project the queries, keys, and values multiple times (h times) with different learned projections. We perform the attention function in parallel on each of these projections, creating dv-dimensional output values. These are combined and projected again to get the final values.
Multi-head attention lets the model pay attention to different types of information from different positions. A single attention head wouldn't allow this.
In our work, we use 8 parallel attention layers or heads. For each, we use dk = dv = dmodel/h = 64. Because each head has a smaller size, the total calculation cost is similar to a single-head attention with full size.
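A rough NumPy sketch of the idea follows. It is my own illustration, reusing the scaled_dot_product_attention function from the sketch above; slicing one big projection matrix into 8 column blocks is equivalent to giving each head its own smaller learned projection.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    # W_q, W_k, W_v, W_o are learned matrices, each of shape (d_model, d_model).
    d_model = Q.shape[-1]
    d_head = d_model // h                   # 8 heads of size 64 when d_model is 512
    heads = []
    for i in range(h):
        cols = slice(i * d_head, (i + 1) * d_head)
        q_i = Q @ W_q[:, cols]              # this head's 64-dimensional queries
        k_i = K @ W_k[:, cols]
        v_i = V @ W_v[:, cols]
        heads.append(scaled_dot_product_attention(q_i, k_i, v_i))
    concat = np.concatenate(heads, axis=-1) # back to (sequence length, d_model)
    return concat @ W_o                     # final learned projection

# Usage with random weights: 10 positions, d_model = 512, 8 heads of size 64 each.
d_model = 512
x = np.random.randn(10, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o).shape)  # (10, 512)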
3.2.3 Attention Uses in Our Model
Our model, called the Transformer, uses multi-head attention in three ways:
"Encoder-decoder attention" layers have queries from the decoder's previous layer and memory keys and values from the encoder's output. This allows the decoder to pay attention to all positions in the input sequence. This is similar to other models like those in sequence-to-sequence tasks.
The encoder has self-attention layers. In these layers, keys, values, and queries all come from the same place: the previous encoder layer's output. Each position in the encoder can pay attention to all positions in the layer before it.
The decoder also has self-attention layers. These let each position in the decoder pay attention to all positions up to and including itself. To keep the model's predictions in order, we prevent information from flowing leftward in the decoder. We do this by masking out some values in the softmax input.
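One common way to build that mask is sketched below; this is my own small NumPy illustration, not the paper's code. The mask is added to the attention scores right before the softmax in the decoder's self-attention.

```python
import numpy as np

def causal_mask(seq_len):
    # Positions a query is NOT allowed to look at (anything to its right)
    # get minus infinity, so the softmax gives them zero weight.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```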
3.3 Position-wise Feed-Forward Networks
Each layer in our encoder and decoder has a feed-forward network applied to each position separately. This network has two linear transformations with a ReLU activation in between.
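A minimal NumPy sketch of that two-step network (my own illustration; the 2048-dimensional hidden size is the value used for the base model in the paper):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Applied to every position independently: expand 512 -> 2048, ReLU, then back to 512.
    hidden = np.maximum(0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2               # second linear transformation

d_model, d_ff = 512, 2048
x = np.random.randn(10, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```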
3.4 Embeddings and Softmax
Like other models, we use learned embeddings to change input and output tokens into vectors of a certain size. We also use a learned linear transformation and softmax function to change the decoder output into probabilities for the next token. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax transformation.
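A tiny NumPy sketch of what sharing one weight matrix looks like. This is my own illustration with a made-up vocabulary of 1,000 tokens; the paper's shared byte-pair vocabulary is about 37,000 tokens, and the original paper also scales the embeddings by the square root of the model size.

```python
import numpy as np

vocab_size, d_model = 1000, 512   # toy vocabulary for illustration
embedding = np.random.randn(vocab_size, d_model) * 0.02   # the single shared weight matrix

def embed(token_ids):
    # Look up each token's 512-dimensional vector, scaled by sqrt(d_model) as in the paper.
    return embedding[token_ids] * np.sqrt(d_model)

def next_token_probabilities(decoder_output):
    # The same matrix, transposed, turns the decoder output into one score per
    # vocabulary token, and a softmax turns the scores into probabilities.
    logits = decoder_output @ embedding.T
    e = np.exp(logits - logits.max())
    return e / e.sum()

vectors = embed(np.array([3, 17, 256]))
probs = next_token_probabilities(np.random.randn(d_model))
print(vectors.shape, probs.shape, round(probs.sum(), 3))  # (3, 512) (1000,) 1.0
```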
3.5 Positional Encoding
Our model doesn't use recurrence or convolution, so we need to add information about the position of tokens in the sequence. We do this by adding "positional encodings" to the input embeddings in the encoder and decoder stacks. We use sine and cosine functions of different frequencies for our positional encodings.
We also tried using learned positional embeddings and found that the results were nearly the same. We chose the sinusoidal version because it might let the model work with sequence lengths longer than those seen during training.
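A short NumPy sketch of those sine-and-cosine encodings, following the formula in the original paper (my own illustration, not the authors' code): even dimensions get a sine wave and odd dimensions a cosine wave, each at a different frequency, so every position gets a unique pattern.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))   # a different frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions: cosine
    return pe

pe = positional_encoding(50)
print(pe.shape)  # (50, 512) -- added to the input embeddings, position by position
```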
4 Why Self-Attention
We use self-attention layers in our model and compare them to recurrent and convolutional layers often used in other models. We consider three important factors:
Total computational complexity per layer.
Amount of computation that can be parallelized (measured by the minimum number of sequential operations).
Path length between long-range dependencies in the network (shorter paths make it easier to learn long-range dependencies).
As shown in our comparison, self-attention layers connect all positions with a constant number of sequential operations, while recurrent layers need a number of sequential steps that grows with the length of the sequence. Self-attention layers are also faster when the sequence length is smaller than the representation size, which is often the case for the sentence representations used in machine translation.
Convolutional layers don't connect all pairs of input and output positions in just one layer. They need multiple layers to do so, which increases the path length between positions. They are also generally more expensive than recurrent layers.
Self-attention can also make models more understandable because they can show the relationships between different parts of a sentence.
5 Training
5.1 Training Data and Batching
We trained our model on standard English-German and English-French datasets. We used byte-pair encoding for English-German and a word-piece vocabulary for English-French. We batched sentence pairs by approximate sequence length, with each batch containing about 25,000 source tokens and 25,000 target tokens.
5.2 Hardware and Schedule
We trained our models on a machine with 8 NVIDIA P100 GPUs. The base models took about 0.4 seconds per training step and were trained for 100,000 steps or 12 hours. The big models took 1.0 second per step and were trained for 300,000 steps (3.5 days).
5.3 Optimizer
We used the Adam optimizer and changed the learning rate throughout training according to a formula. The learning rate increased linearly for the first few steps and then decreased proportionally to the inverse square root of the step number.
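The simplified text above does not give the exact formula, but in the original paper the schedule rises linearly for a warm-up of 4,000 steps and then decays with the inverse square root of the step number. The sketch below is my own illustration of that schedule.

```python
def learning_rate(step, d_model=512, warmup_steps=4000):
    # Grows linearly during warm-up, then shrinks with the inverse square root of the step.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 100000):
    print(s, round(learning_rate(s), 6))
```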
5.4 Regularization
We used two main forms of regularization during training: residual dropout, applied to the output of each sub-layer and to the sums of the embeddings and positional encodings, and label smoothing, which made the model less certain of its predictions but improved accuracy and BLEU score.
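As an illustration of the label smoothing idea, here is my own sketch of one common way to build the "softened" target distribution, not code from the paper; the 0.1 value matches the smoothing used in the original paper.

```python
import numpy as np

def smooth_labels(target_index, vocab_size, eps=0.1):
    # Instead of putting 100% of the probability on the correct token, keep
    # (1 - eps) there and spread eps evenly over every other token.
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target_index] = 1.0 - eps
    return dist

print(smooth_labels(target_index=2, vocab_size=5))
# [0.025 0.025 0.9   0.025 0.025]
```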
6 Results
6.1 Machine Translation
Our big transformer model achieved a new state-of-the-art BLEU score of 28.4 on English-to-German translation, outperforming previous models. The base model also surpassed previous models at a fraction of the training cost. On English-to-French translation, our big model achieved a BLEU score of 41.0, outperforming previous models at less than 1/4 the training cost.
6.2 Model Variations
We tested different variations of our base model on English-to-German translation. Changing the number of attention heads and their dimensions affected the translation quality. Too few or too many heads resulted in lower quality translations.
[My prompt didn’t even include this graphic; I just copied and pasted the image’s text into ChatGPT and asked: “Is there a way to explain this at an 8th grade literacy level? At least summarize this:”]
Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
| Variant | N | dmodel | dff | h | dk | dv | Pdrop | εls | train steps | PPL (dev) | BLEU (dev) | params ×10^6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65 |
| (A) | | | | 1 | 512 | 512 | | | | 5.29 | 24.9 | |
| | | | | 4 | 128 | 128 | | | | 5.00 | 25.5 | |
| | | | | 16 | 32 | 32 | | | | 4.91 | 25.8 | |
| | | | | 32 | 16 | 16 | | | | 5.01 | 25.4 | |
| (B) | | | | | 16 | | | | | 5.16 | 25.1 | 58 |
| | | | | | 32 | | | | | 5.01 | 25.4 | 60 |
| (C) | 2 | | | | | | | | | 6.11 | 23.7 | 36 |
| | 4 | | | | | | | | | 5.19 | 25.3 | 50 |
| | 8 | | | | | | | | | 4.88 | 25.5 | 80 |
| | | 256 | | | 32 | 32 | | | | 5.75 | 24.5 | 28 |
| | | 1024 | | | 128 | 128 | | | | 4.66 | 26.0 | 168 |
| | | | 1024 | | | | | | | 5.12 | 25.4 | 53 |
| | | | 4096 | | | | | | | 4.75 | 26.2 | 90 |
| (D) | | | | | | | 0.0 | | | 5.77 | 24.6 | |
| | | | | | | | 0.2 | | | 4.95 | 25.5 | |
| | | | | | | | | 0.0 | | 4.67 | 25.3 | |
| | | | | | | | | 0.2 | | 5.47 | 25.7 | |
| (E) | positional embedding instead of sinusoids | | | | | | | | | 4.92 | 25.7 | |
| big | 6 | 1024 | 4096 | 16 | | | 0.3 | | 300K | 4.33 | 26.4 | 213 |
Table 4: The Transformer generalizes well to English constituency parsing (results are on Section 23 of the WSJ).
| Parser | Training | WSJ 23 F1 |
|---|---|---|
| Vinyals & Kaiser et al. (2014) [37] | WSJ only, discriminative | 88.3 |
| Petrov et al. (2006) [29] | WSJ only, discriminative | 90.4 |
| Zhu et al. (2013) [40] | WSJ only, discriminative | 90.4 |
| Dyer et al. (2016) [8] | WSJ only, discriminative | 91.7 |
| Transformer (4 layers) | WSJ only, discriminative | 91.3 |
| Zhu et al. (2013) [40] | semi-supervised | 91.3 |
| Huang & Harper (2009) [14] | semi-supervised | 91.3 |
| McClosky et al. (2006) [26] | semi-supervised | 92.1 |
| Vinyals & Kaiser et al. (2014) [37] | semi-supervised | 92.1 |
| Transformer (4 layers) | semi-supervised | 92.7 |
| Luong et al. (2015) [23] | multi-task | 93.0 |
| Dyer et al. (2016) [8] | generative | 93.3 |
In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
ChatGPT: Table 3 shows different variations of the Transformer architecture and their performance on English-to-German translation. The results show that the base model performs well, but changing certain factors like the number of attention heads and their dimensions can affect translation quality. Bigger models generally perform better, and using dropout helps avoid over-fitting. Replacing sinusoidal positional encoding with learned positional embeddings results in similar performance.
Table 4 shows that the Transformer model generalizes well to English constituency parsing, achieving high F1 scores compared to other models. The performance improves further when using semi-supervised learning or multi-task learning approaches.
7 Conclusion
Researchers created a new model called the Transformer to help with tasks like translating languages. This model is special because it only uses attention, which is different from other models that use recurrent layers. The Transformer can learn much faster than other models for translating languages.
They tested the Transformer on translating English to German and English to French and got the best results so far. They also tried it on another task, called English Constituency Parsing, and it did very well, even without changing it specifically for that task.
The researchers are excited about using the Transformer for more tasks in the future, like working with images, audio, and video. They also want to make the model better at creating new things that don't follow a specific order. The code they used is available for others to use and learn from.
The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
Acknowledgements
We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016.
[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009.
[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016.
[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.
[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.
[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006.
[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
[37] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015.
[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.
[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013.
Authors
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com