Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 21

Proceedings of the 2020 Federated Conference on Computer Science and Information Systems

Overview of the Transformer-based Models for NLP Tasks


DOI: http://dx.doi.org/10.15439/2020F20

Citation: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 21, pages 179–183 (2020)


Abstract. In 2017, Vaswani et al. proposed a new neural network architecture named the Transformer. This architecture quickly revolutionized the natural language processing world: models built on it, such as GPT and BERT, outperformed the previous state-of-the-art networks by such a wide margin that virtually all recent cutting-edge models rely on Transformer-based architectures. In this paper, we provide an overview and explanation of these models. We cover auto-regressive models such as GPT, GPT-2 and XLNet, as well as auto-encoding architectures such as BERT and numerous post-BERT models like RoBERTa, ALBERT and ERNIE 1.0/2.0.
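The core building block shared by all the models surveyed above is the scaled dot-product attention of Vaswani et al., Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of that mechanism is given below for illustration; the function and variable names are our own, not from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017):
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of each query to each key, scaled to keep gradients stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Toy example: 2 queries attending over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one 4-dimensional output vector per query
```

In the full Transformer this operation is applied in parallel over several learned projections of the input ("multi-head" attention), but the arithmetic above is the essential step.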


  1. F. J. Och and H. Ney, “The Alignment Template Approach to Statistical Machine Translation,” Computational Linguistics, vol. 30, pp. 417–449, Dec. 2004.
  2. A. M. Rush, S. Chopra, and J. Weston, “A Neural Attention Model for Abstractive Sentence Summarization,” arXiv:1509.00685 [cs], Sept. 2015.
  3. L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [cs], Aug. 2017.
  4. S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, pp. 1735–1780, Nov. 1997.
  5. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” arXiv:1406.1078 [cs, stat], Sept. 2014.
  6. K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A Search Space Odyssey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2222–2232, Oct. 2017.
  7. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017.
  8. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018.
  9. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], May 2019.
  10. Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang, “ERNIE 2.0: A Continual Pre-training Framework for Language Understanding,” arXiv:1907.12412 [cs], Nov. 2019.
  11. T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Sept. 2013.
  12. J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Doha, Qatar), pp. 1532–1543, Association for Computational Linguistics, 2014.
  13. R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” arXiv:1508.07909 [cs], June 2016.
  14. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019.
  15. T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing,” arXiv:1808.06226 [cs], Aug. 2018.
  16. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv:1907.11692 [cs], July 2019.
  17. T. H. Trinh and Q. V. Le, “A Simple Method for Commonsense Reasoning,” arXiv:1806.02847 [cs], Sept. 2019.
  18. Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books,” arXiv:1506.06724 [cs], June 2015.
  19. R. Parker, D. Graff, and J. Kong, “English Gigaword,” Linguistic Data Consortium, Jan. 2011.
  20. J. Callan, M. Hoy, C. Yoo, and L. Zhao, “The ClueWeb09 Dataset - Dataset Information and Sample Files,” Jan. 2009.
  21. A. Gokaslan and V. Cohen, “OpenWebText Corpus,” Jan. 2019.
  22. R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi, “Defending Against Neural Fake News,” arXiv:1905.12616 [cs], Oct. 2019.
  23. A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” arXiv:1804.07461 [cs], Feb. 2019.
  24. A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” arXiv:1905.00537 [cs], July 2019.
  25. P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” arXiv:1606.05250 [cs], Oct. 2016.
  26. P. Rajpurkar, R. Jia, and P. Liang, “Know What You Don’t Know: Unanswerable Questions for SQuAD,” arXiv:1806.03822 [cs], June 2018.
  27. G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “RACE: Large-scale ReAding Comprehension Dataset From Examinations,” arXiv:1704.04683 [cs], Dec. 2017.
  28. Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” arXiv:1906.08237 [cs], June 2019.
  29. Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,” arXiv:1901.02860 [cs, stat], June 2019.
  30. C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: A Joint Model for Video and Language Representation Learning,” arXiv:1904.01766 [cs], Sept. 2019.
  31. A. Wang and K. Cho, “BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model,” arXiv:1902.04094 [cs], Apr. 2019.
  32. A. Baevski, S. Edunov, Y. Liu, L. Zettlemoyer, and M. Auli, “Cloze-driven Pretraining of Self-attention Networks,” arXiv:1903.07785 [cs], Mar. 2019.
  33. V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv:1910.01108 [cs], Oct. 2019.
  34. Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,” arXiv:1909.11942 [cs], Oct. 2019.
  35. X. Liu, P. He, W. Chen, and J. Gao, “Multi-Task Deep Neural Networks for Natural Language Understanding,” arXiv:1901.11504 [cs], May 2019.
  36. W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim, “BERTje: A Dutch BERT Model,” arXiv:1912.09582 [cs], Dec. 2019.
  37. M. Polignano, P. Basile, and M. de Gemmis, “ALBERTO: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets,” 2019.
  38. L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, E. V. de la Clergerie, D. Seddah, and B. Sagot, “CamemBERT: a Tasty French Language Model,” arXiv:1911.03894 [cs], May 2020.
  39. H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, and D. Schwab, “FlauBERT: Unsupervised Language Model Pre-training for French,” arXiv:1912.05372 [cs], Mar. 2020.
  40. G. Lample and A. Conneau, “Cross-lingual Language Model Pretraining,” arXiv:1901.07291 [cs], Jan. 2019.
  41. N. Kitaev, L. Kaiser, and A. Levskaya, “Reformer: The Efficient Transformer,” arXiv:2001.04451 [cs, stat], Jan. 2020.
  42. D. R. So, C. Liang, and Q. V. Le, “The Evolved Transformer,” arXiv:1901.11117 [cs, stat], May 2019.