
Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Punctuation Prediction for Polish Texts using Transformers

DOI: http://dx.doi.org/10.15439/2023F1633

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1251–1254


Abstract. Speech recognition systems typically output text without punctuation, yet punctuation is crucial for the comprehension of written text. Punctuation prediction models are developed to tackle this problem. This paper describes a solution for PolEval 2022 Task 1: Punctuation Prediction for Polish Texts, which achieves a Weighted F1 score of 71.44. The method uses a single HerBERT model fine-tuned on the competition data and an external dataset.
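The paper itself contains no code, but the task it addresses is commonly cast as per-word tagging: each word receives a label for the punctuation mark that should follow it, and the predicted labels are re-inserted to restore the punctuated text. A minimal sketch of that restoration step is shown below; the label set and helper name are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of punctuation restoration as per-word tagging.
# LABELS is an assumed mark inventory, typical of PolEval-style setups.
LABELS = ["", ".", ",", "?", "!", ":", ";", "-"]

def restore_punctuation(words, labels):
    """Append each predicted mark to its word, then fix capitalization
    at sentence starts (after ., ? or !)."""
    text = " ".join(word + mark for word, mark in zip(words, labels))
    chars = list(text)
    capitalize = True  # first letter of the text starts a sentence
    for i, ch in enumerate(chars):
        if capitalize and ch.isalpha():
            chars[i] = ch.upper()
            capitalize = False
        if ch in ".?!":
            capitalize = True
    return "".join(chars)

# Example: unpunctuated ASR-style output plus one label per word.
words = ["to", "jest", "test", "czy", "to", "działa"]
labels = ["", "", ".", "", "", "?"]
print(restore_punctuation(words, labels))
# -> To jest test. Czy to działa?
```

In a full system the `labels` sequence would come from a token-classification head on top of the fine-tuned HerBERT encoder; this sketch only covers the deterministic decoding step.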

References

  1. R. Mroczkowski, P. Rybak, A. Wróblewska, and I. Gawlik, “HerBERT: Efficiently pretrained transformer-based language model for Polish,” in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, (Kiyv, Ukraine), pp. 1–10, Association for Computational Linguistics, Apr. 2021.
  2. A. Mikołajczyk, A. Wawrzynski, P. Pezik, M. Adamczyk, A. Kaczmarek, and W. Janowski, “Poleval 2021 task 1: Punctuation restoration from read text,” Proceedings of the PolEval 2021 Workshop, p. 21.
  3. K. Wróbel, “Punctuation restoration with transformers,” Proceedings of the PolEval 2021 Workshop, pp. 33–37.
  4. N. Ropiak, M. Pogoda, J. Radom, K. Gawron, M. Swędrowski, and B. Bojanowski, “Comparison of translation and classification approaches for punctuation recovery,” Proceedings of the PolEval 2021 Workshop, pp. 39–46.
  5. M. Marcińczuk, “Punctuation restoration with ensemble of neural network classifier and pre-trained transformers,” Proceedings of the PolEval 2021 Workshop, pp. 47–53.
  6. T. Ziętkiewicz, “Punctuation restoration from read text with transformer-based tagger,” Proceedings of the PolEval 2021 Workshop, pp. 55–60.
  7. M. Attia, M. Al-Badrashiny, and M. Diab, “GWU-HASP: Hybrid Arabic spelling and punctuation corrector,” in Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pp. 148–154, 2014.
  8. X. Che, C. Wang, H. Yang, and C. Meinel, “Punctuation prediction for unsegmented transcript based on word vector,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 654–658, 2016.
  9. M. Sunkara, S. Ronanki, K. Dixit, S. Bodapati, and K. Kirchhoff, “Robust prediction of punctuation and truecasing for medical ASR,” arXiv preprint https://arxiv.org/abs/2007.02025, 2020.
  10. D. Tuggener and A. Aghaebrahimian, “The sentence end and punctuation prediction in NLG text (SEPP-NLG) shared task 2021,” in Swiss Text Analytics Conference – SwissText 2021, Online, 14–16 June 2021, CEUR Workshop Proceedings, 2021.
  11. O. Guhr, A.-K. Schumann, F. Bahrmann, and H. J. Böhme, “FullStop: Multilingual deep models for punctuation prediction,” June 2021.
  12. P. Pęzik, G. Krawentek, S. Karasińska, P. Wilk, P. Rybińska, A. Cichosz, A. Peljak-Łapińska, M. Deckert, and M. Adamczyk, “DiaBiz,” 2022. CLARIN-PL digital repository.
  13. P. Pęzik, “Spokes – a search and exploration service for conversational corpus data,” pp. 99–109, Selected papers from the CLARIN 2014 Conference, 2014.
  14. S. Karasińska, S. Cichosz, and P. Pęzik, “Evaluating punctuation prediction in conversational language,” Forthcoming.
  15. P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” in Proceedings of Machine Translation Summit X: Papers, (Phuket, Thailand), pp. 79–86, Sept. 13-15 2005.
  16. F. Graliński, A. Wróblewska, T. Stanisławek, K. Grabowski, and T. Górecki, “GEval: Tool for debugging NLP datasets and models,” in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, (Florence, Italy), pp. 254–262, Association for Computational Linguistics, Aug. 2019.
  17. F. Graliński, R. Jaworski, Ł. Borchmann, and P. Wierzchoń, “Gonito.net – open platform for research competition, cooperation and reproducibility,” in Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language (A. Branco, N. Calzolari, and K. Choukri, eds.), pp. 13–20, 2016.
  18. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
  19. S. Dadas, M. Perełkiewicz, and R. Poświata, “Pre-training Polish transformer-based language models at scale,” in Artificial Intelligence and Soft Computing, pp. 301–314, Springer International Publishing, 2020.
  20. A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” CoRR, vol. abs/1911.02116, 2019.