
Proceedings of the 17th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 30

Temporal Language Modeling for Short Text Document Classification with Transformers


DOI: http://dx.doi.org/10.15439/2022F174

Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 121–128 (2022)


Abstract. Language models are typically trained solely on text data, without utilizing document timestamps, which are available in most internet corpora. In this paper, we examine the impact of incorporating timestamps into a transformer language model, in terms of both a downstream classification task and masked language modeling, on two short-text corpora. We examine different timestamp components: day of the month, month, year, and weekday. We test different methods of incorporating the date into the model: prefixing date components to the text input and adding trained date embeddings. Our study shows that such a temporal language model performs better than a regular language model for documents both from the training-data time span and from an unseen time span. This holds true for both classification and language modeling. Prefixing date components to the text performs no worse than training special date component embeddings.
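As a concrete illustration of the first method, the sketch below shows how a document's timestamp could be decomposed into weekday, day of the month, month, and year and prepended to the raw text before tokenization. The component labels, their order, and the separator are illustrative assumptions, not the exact format used in the paper.

```python
from datetime import date


def prefix_date_components(text: str, timestamp: date) -> str:
    """Prepend date components (weekday, day, month, year) to the raw text.

    The component names, order, and separator here are assumptions made
    for illustration; the paper only states that date components are
    prefixed to the text input.
    """
    components = (
        f"weekday: {timestamp.strftime('%A')}",
        f"day: {timestamp.day}",
        f"month: {timestamp.strftime('%B')}",
        f"year: {timestamp.year}",
    )
    return " ".join(components) + " " + text


# Example: the augmented string is tokenized and fed to the transformer
# exactly like any other input sequence.
print(prefix_date_components("flight delayed again", date(2009, 6, 7)))
# -> "weekday: Sunday day: 7 month: June year: 2009 flight delayed again"
```

Because the date information is carried entirely in the input string, this variant needs no architectural changes; the alternative studied in the paper, trained date embeddings, instead adds learned vectors for the date components to the model's input representation.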
