In 1950, Alan Turing envisioned an intelligent machine whose abilities would include conversing indistinguishably from a human. The Turing test was born to capture that vision. Fast forward to today: some experts believe the Turing test has been passed by ChatGPT and other LLMs. What happened?

This blog post tells a story from the early days of NLP to ChatGPT, from my perspective as a beneficiary of those technologies and now a student studying those very constructs. It first introduces the RNN and its LSTM variant. Then it recounts the invention of pretraining and unsupervised sequence learning, revolving around a researcher I consider my childhood idol. Next, it covers the explosion of the Transformer and its variants, which captured the whole world’s attention. Finally, it ends with some concluding remarks.

The Birth of RNN

After the “unwelcome” birth of the feed-forward neural network (FFNN), whose roots go back to the perceptron of the late 1950s and whose multi-layer training only became practical in the 1980s, the recurrent neural network (RNN) was born in the 1980s as a complement. Theoretically, it nicely generalizes the flow of information inside the network to allow cycles (to a certain extent!). It also relaxes the i.i.d. restriction imposed on the FFNN’s inputs¹, so it can be used in practice for sequential data, especially text in our own languages, in a more natural way. Previously, a piece of text was treated only as a bag of words, which worked to a limited extent as a way to obtain features from text but surely falls short of the Turing test, mostly because it ignores the order of tokens. Even with static word embeddings and CNNs, there is something unsatisfying about the inability to capture dependencies across a text. The RNN is arguably the first satisfactory model to produce the so-called contextualized (or dynamic) embeddings of text sequences.
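
As a rough sketch of what “contextualized” means here (my own toy example, not a historical formulation): a vanilla RNN keeps a hidden state that is updated token by token, so the vector at position t depends on every token that came before it.

```python
import numpy as np

def rnn_forward(embeddings, W_x, W_h, b):
    """Return one hidden state per token; states[t] 'summarizes' tokens 0..t."""
    h = np.zeros(W_h.shape[0])          # initial state
    states = []
    for x in embeddings:                # sequential: no parallelism across tokens
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

# toy dimensions: 4-dim token embeddings, 3-dim hidden state, random weights
rng = np.random.default_rng(0)
tokens = [rng.normal(size=4) for _ in range(5)]
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
print(rnn_forward(tokens, W_x, W_h, b)[-1])  # state after reading the whole sequence
```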

However, the vanilla RNN was still too naive a language model to approximate the logic of language. Therefore, in 1997, Sepp Hochreiter (TUM, Germany) and Jürgen Schmidhuber (IDSIA, Switzerland) published a paper called “Long Short-Term Memory” (LSTM) in the journal Neural Computation [1]. And it was a big success! Why? Recall that an RNN goes through each token and tries its best to keep a single vector as the “state” of everything that came before. The idea is very nice, but vanishing/exploding gradients make vanilla RNNs hard to train. The LSTM is a clever instantiation of the general (not vanilla) RNN that splits the state into two parts, a long-term and a short-term memory, and maintains them with three gates. The idea was so far ahead of its time that the architecture stayed on top of the field for two decades, notably powering Google Translate’s neural machine translation system. As late as 2018, a strong language model from the Allen Institute for AI and UW called ELMo (short for “Embeddings from Language Models”) was still built on LSTMs [2].
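
In the now-standard formulation (the forget gate was actually added shortly after the 1997 paper), one LSTM step looks roughly like the toy sketch below: the cell state c plays the role of long-term memory, the hidden state h the short-term memory, and three gates decide what to forget, what to write, and what to expose.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One simplified LSTM step; W, U, b stack the parameters of four linear maps."""
    z = W @ x + U @ h + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c + i * g                              # update long-term memory
    h = o * np.tanh(c)                             # expose short-term memory
    return h, c

# toy example: 4-dim input, 3-dim state, random weights
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W, U, b = rng.normal(size=(4 * d_h, d_in)), rng.normal(size=(4 * d_h, d_h)), np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W, U, b)
print(h, c)
```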

On a side note, tokenization was probably already an extensively studied problem by this time. Personally, I think it is still a very tricky problem that requires a lot of ad hoc processing. (I may publish a separate blog post about tokenization later.)
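
As a toy illustration of the trickiness (my own example, not taken from any particular tokenizer):

```python
# Punctuation, clitics, hyphens, and abbreviations all demand special handling.
text = "Don't over-tokenize U.S.-based e-mails!"

naive = text.split()                      # whitespace only
print(naive)   # ["Don't", 'over-tokenize', 'U.S.-based', 'e-mails!']

# Even a smarter rule-based split raises questions: is "Don't" one token,
# two ("Do", "n't"), or three ("Don", "'", "t")? Modern LLMs sidestep some of
# this with learned subword vocabularies (BPE and friends), but the edge cases
# tend to move around rather than disappear.
```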

In hindsight, data-driven learning is a perfect approach for languages (and many other problems). Every language evolves every day, affected by and mixed with other languages (e.g., code switching, word borrowing, cultural influences). Any attempt to logically capture a language in its entirety would fail miserably. The only way to learn a language (for both humans and computers) is to observe it in the wild, a lot, and potentially to generate it. That is why we have the successes of modern NLP described in the next section.

Pretraining and Unsupervised Learning for Languages

My selfie with Quoc Le at Fulbright, 2023. I finally met my childhood idol.

I still remember my days in secondary school (2012-2014) when I was using Google Translate. It was a nifty application, but the translations were so bad that they were often used as jokes. Back then, I felt the translations were done word by word through a dictionary. But around 2015-16, it suddenly became much better and translated whole sentences much more smoothly. My friends and I delightedly used it to translate long English sentences to Vietnamese and vice versa.

Also around that time, I saw Le Viet Quoc on Vietnamese television, touted as one of the world’s top inventors for some impactful work. Eight years later, I now know what contributions Quoc Le and his team at Google Brain have brought to the world. In 2014, Quoc first worked with Tomas Mikolov (the inventor of word2vec, one of the earliest and most popular word embedding methods [3]) to introduce paragraph vectors (often called doc2vec), a nice way to vectorize sentences and documents [4]. It probably improved the Google Search engine quite a bit, because it reduces the problem of measuring the similarity between a webpage and a search query to the cosine similarity between their vector representations.
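
As a concrete illustration of that reduction, here is a small sketch of cosine similarity between a query vector and a document vector; the 4-dimensional vectors are made up, standing in for whatever paragraph-vector representations a real system would produce:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query_vec = np.array([0.1, 0.8, -0.3, 0.5])   # made-up vector for a search query
doc_vec   = np.array([0.2, 0.7, -0.1, 0.4])   # made-up vector for a webpage
print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 => similar meaning
```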

But that’s not all. In 2015, Andrew Dai and Quoc Le published the paper “Semi-supervised Sequence Learning” at NIPS (now NeurIPS), which, arguably for the first time in modern NLP, spelled out the two notions of pretraining and unsupervised training on raw text [5]. They wrote:

(Abstract) We present two approaches to use unlabeled data to improve Sequence Learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a language model in NLP. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a “pretraining” algorithm for a later supervised sequence learning algorithm. In other words, the parameters obtained from the pretraining step can then be used as a starting point for other supervised training models.

This was revolutionary. Although their wording focuses on parameter initialization, the idea of pretraining essentially changed how we think about training a neural network today. You don’t have to train it in one go. Like a kid, a neural network can first learn about the world in general. Then, if you want the kid to do well on a specific task, let it bring its existing world knowledge into the classroom!

Another quote:

(In Introduction section) Another important result from our experiments is that it is possible to use unlabeled data from related tasks to improve the generalization of a subsequent supervised model. […] This evidence supports the thesis that it is possible to use unsupervised learning with more unlabeled data to improve supervised learning.

Based on this claim, these two researchers were arguably among the first to realize that you can crawl a lot of text from the internet and feed it to a reasonable language model (at the time, an LSTM) to substantially improve its linguistic ability. Today, this finding is considered historic because it lets us put the huge amount of text on the internet to work in training our language models.
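
To make the recipe concrete, below is a minimal PyTorch sketch (my own, with made-up dimensions and random tensors standing in for a real corpus and labels) of the two phases described in the paper: pretrain an LSTM as a language model on raw text, then reuse its weights as the starting point for a supervised classifier.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, num_classes = 10_000, 128, 256, 2

embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

# --- Phase 1: unsupervised pretraining (language modeling on raw text) ---
lm_head = nn.Linear(hidden_dim, vocab_size)
tokens = torch.randint(0, vocab_size, (8, 20))          # a fake batch of raw text
hidden, _ = lstm(embed(tokens[:, :-1]))                 # predict token t+1 from tokens 0..t
lm_loss = nn.functional.cross_entropy(
    lm_head(hidden).reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
lm_loss.backward()                                      # ...optimize over lots of unlabeled text

# --- Phase 2: supervised fine-tuning (e.g., sentiment classification) ---
# The same `embed` and `lstm` weights are kept; only a small head is new.
clf_head = nn.Linear(hidden_dim, num_classes)
labels = torch.randint(0, num_classes, (8,))            # fake labels
hidden, _ = lstm(embed(tokens))
clf_loss = nn.functional.cross_entropy(clf_head(hidden[:, -1, :]), labels)
clf_loss.backward()                                     # ...fine-tune on the labeled data
```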

The Transformer Race

In an AMA session at my undergraduate university, Quoc Le said:

(Paraphrased) The fact that we don’t know what the language models are doing is actually good to me. Our brains are smart, and we also don’t understand how they work. So why would we want to limit our neural network technology to the realm of explainability?

He was almost certainly referring to the explosion of Transformer models: outstandingly powerful yet hard-to-interpret neural networks. After Dai and Le’s 2015 paper on pretraining and semi-supervised sequence learning, people probably felt they needed a more expressive model to store the general knowledge acquired from pretraining on huge data. And that is exactly what a team of eight scientists at Google delivered in the famous “Attention Is All You Need” paper [6].

The paper introduces the Transformer, a model that is not unrolled along a sequence like an RNN but instead processes a whole sequence of tokens at once. This first point alone is impactful because it allows training to be parallelized across token positions, which is not possible with RNNs, including the LSTM. Inside the Transformer, the main building block is attention. Attention itself predates the Transformer: it had already been used on top of RNN/LSTM encoder-decoder models, for example in machine translation. The Google Brain team’s contribution was to build the whole architecture out of self-attention and cross-attention in their multi-head form, together with a sinusoidal encoding of token positions.
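
The core operation is compact enough to write down. Below is a minimal sketch of scaled dot-product attention as defined in [6] (a single head, no masking, toy dimensions, random weights purely for illustration):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V : a weighted mix of the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))                 # one "sentence" of 5 token vectors
W_q, W_k, W_v = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
out = attention(X @ W_q, X @ W_k, X @ W_v)              # self-attention: Q, K, V all come from X
print(out.shape)                                        # (5, 8): every token attends to every other token
```

Because every token attends to all others through a single matrix multiplication, the whole sequence is processed in one shot, which is exactly the parallel-training advantage over RNNs mentioned above.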

As of today, the Transformer paper has over 85,000 citations. To be honest, I found the paper particularly hard to understand on its own, because it offers very little explanation and contextualization. Last year, two researchers from DeepMind cleared up much of my confusion with their expository paper “Formal Algorithms for Transformers” [7].

But perhaps it was hard only for me. It was definitely much easier for the field’s experts to read, understand, and improve the Transformer. Just look at how “aggressive” they have been in leveraging the Transformer architecture for NLP tasks. Below is what I call the “Transformer race”, in which each new Transformer variant partially or totally outperformed the existing models:

  • In 2018, OpenAI (4 researchers) released GPT (short for “Generative Pre-Training”, the grandfather of ChatGPT) [8].
  • In late 2018, Google AI Language (4 researchers) released BERT (short for “Bidirectional Encoder Representations from Transformers”, a word play to match ELMo!) [9]. Its improvement came from bidirectional encoding, trained with a masked-token objective (contrasted with GPT’s left-to-right objective in the sketch after this list).
  • In 2019, Facebook and UW released RoBERTa (“Robustly optimized BERT approach”), which essentially trains the BERT architecture more carefully: longer, on more data, and with better-tuned hyperparameters [10].
  • In 2020, Facebook released BART (I am not sure what the name stands for; presumably a play on BERT) with a “denoising” training objective: corrupt the input text and train the model to reconstruct it [11].
  • Also in 2020, UW, Princeton, the Allen Institute, and Facebook released SpanBERT, whose improvement came from a new training objective: masking and predicting contiguous spans of text [12].
  • Still in 2020 (!!), Google released T5 (short for “Text-to-Text Transfer Transformer”). Published in the prestigious Journal of Machine Learning Research, the paper casts all NLP tasks as text-to-text problems(!), which enables an extensive empirical study of the many design choices in Transformer training and eventually yields a highly optimized T5 model [13].
  • In 2022, ChatGPT was released, initially powered by OpenAI’s GPT-3.5 (with GPT-4 following in 2023). It convinced many people that the Turing test has been passed and that AI really is a big deal for humanity.
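
To make the two dominant pretraining objectives in this race concrete, here is a toy, purely illustrative contrast between GPT-style causal language modeling and BERT-style masked language modeling. The token list and [MASK] placeholders below are my own made-up example, not taken from either paper.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-style (causal) language modeling: predict each token from its LEFT context only.
causal_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(causal_examples[2])   # (['the', 'cat', 'sat'], 'on')

# BERT-style masked language modeling: hide some tokens and predict them from
# context on BOTH sides -- this is the "bidirectional" part of the name.
masked_input = ["the", "cat", "[MASK]", "on", "the", "[MASK]"]
masked_targets = {2: "sat", 5: "mat"}
print(masked_input, masked_targets)
```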

There are many more papers along the way that I could not include here, or have not even noticed. Some models became multilingual (like mT5 [14]); some apply the same ideas to other data types, such as source code [15], images, and more.

It is also worth mentioning the datasets (e.g., C4), benchmarks (e.g., GLUE and SuperGLUE), evaluation metrics (e.g., BLEU), optimizers (e.g., Adam), and hardware (e.g., the TPU) invented along the way, all of which are indispensable to the development of current LLMs.

Concluding Remarks

These innovations are not just distant technical advancements. With the release of ChatGPT, my generation is the first in the history of Vietnamese education able to “generate” the essays for five mandatory political courses in 15 minutes. We could be called the ChatGPT generation for that, and for the many other ways LLMs have altered our lives. Many jobs have been lost, new jobs are emerging (hello, prompt engineers!), many texts are now produced without a human author, and much more money is being invested in this flourishing research and engineering field. And guess what? It is probably just the beginning.

References

[1] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 9 (December 1, 1997): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

[2] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. “Deep Contextualized Word Representations.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–37. New Orleans, Louisiana: Association for Computational Linguistics, 2018. https://doi.org/10.18653/v1/N18-1202.

[3] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” ArXiv:1301.3781 [Cs], September 6, 2013. http://arxiv.org/abs/1301.3781.

[4] Le, Quoc V., and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” ArXiv:1405.4053 [Cs], May 22, 2014. http://arxiv.org/abs/1405.4053.

[5] Dai, Andrew M, and Quoc V Le. “Semi-Supervised Sequence Learning.” In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc., 2015. https://papers.nips.cc/paper_files/paper/2015/hash/7137debd45ae4d0ab9aa953017286b20-Abstract.html.

[6] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 2017. https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

[7] Phuong, Mary, and Marcus Hutter. “Formal Algorithms for Transformers.” arXiv, July 19, 2022. https://doi.org/10.48550/arXiv.2207.09238.

[8] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving Language Understanding by Generative Pre-Training.” OpenAI technical report, 2018.

[9] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv, May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

[10] Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv, July 26, 2019. https://doi.org/10.48550/arXiv.1907.11692.

[11] Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. “BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–80. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-main.703.

[12] Joshi, Mandar, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. “SpanBERT: Improving Pre-Training by Representing and Predicting Spans.” Transactions of the Association for Computational Linguistics 8 (January 1, 2020): 64–77. https://doi.org/10.1162/tacl_a_00300.

[13] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” Journal of Machine Learning Research 21, no. 140 (2020): 1–67.

[14] Xue, Linting, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. “MT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 483–98. Online: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.naacl-main.41.

[15] Feng, Zhangyin, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, et al. “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.” ArXiv:2002.08155 [Cs], September 18, 2020. http://arxiv.org/abs/2002.08155.

Footnotes

  1. Though we would later learn, with the Transformer, that a feed-forward architecture does not actually need the i.i.d. restriction to handle sequences.