Punctuation Prediction for Polish Texts Using Transformers

Speech recognition systems typically output text lacking punctuation. However, punctuation is crucial for the comprehension of written text. To tackle this problem, punctuation prediction models are developed. This paper describes a solution to PolEval 2022 Task 1: Punctuation Prediction for Polish Texts, which achieves a Weighted F1 score of 71.44. The method uses a single HerBERT model fine-tuned on the competition data and an external dataset.


I. INTRODUCTION
Automatic Speech Recognition (ASR) systems produce speech transcripts, which typically do not contain punctuation. This may negatively impact the overall clarity of the transcribed text. Punctuation is important for several reasons:
• Punctuation reduces ambiguity in communication. The sentences "Let's eat, children" and "Let's eat children" have completely different meanings, yet they differ only by a comma.
• Punctuation helps clarify the intended meaning of a text by providing cues to its structure. Punctuation marks such as commas, periods, question marks, and exclamation marks indicate pauses, sentence endings, and changes in tone or intent.
• Punctuation conveys the tone and emotion behind the text. E.g., an exclamation mark may indicate excitement, and a question mark may denote uncertainty.
• Punctuation enhances the readability of the written words. Breaking down complex sentences into smaller parts with commas, colons, and semicolons creates pauses, which aids in understanding the text.
Many post-processing steps may be taken to address this problem, as well as the related lack of capitalization. One such task is Punctuation Restoration, defined as the act of reinstating the original punctuation found in read speech transcripts.
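The restoration task described above is commonly cast as per-token classification: each word in the unpunctuated input receives a label naming the punctuation mark (if any) that should follow it. A minimal sketch of deriving such labels from a punctuated reference text (the label names here are illustrative assumptions, not the competition's official scheme):

```python
# Hypothetical sketch: punctuation restoration framed as per-token
# classification. Each word is paired with a label for the punctuation
# mark (if any) that follows it in the reference text.
def make_labels(punctuated: str) -> list[tuple[str, str]]:
    """Derive (word, label) pairs from a punctuated reference text."""
    marks = {",": "COMMA", ".": "FULLSTOP", "?": "QUESTION", "!": "EXCLAMATION"}
    pairs = []
    for token in punctuated.split():
        if token[-1] in marks:
            pairs.append((token[:-1].lower(), marks[token[-1]]))
        else:
            pairs.append((token.lower(), "NONE"))
    return pairs

print(make_labels("Let's eat, children"))
# → [("let's", 'NONE'), ('eat', 'COMMA'), ('children', 'NONE')]
```

A sequence model then predicts one such label per input word, and the predictions are merged back into the raw transcript.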
This work describes the solution to PolEval 2022 Task 1: Punctuation Prediction from conversational language. The solution is based on the HerBERT model [1] fine-tuned on the competition data and an external dataset.

II. RELATED WORK
In the previous PolEval edition, a task similar to Punctuation Prediction was assigned, namely PolEval 2021 Task: Punctuation restoration from read text [2]. The challenge introduced WikiPunct, a new text-and-audio corpus comprising 39 hours of audio and approximately 38,000 text transcripts. Four submissions [3], [4], [5], [6] applied transformer-based methods for token classification, of which two utilized ensembles. Additionally, one author explored the integration of a bi-LSTM layer on top of the transformer, along with vectors obtained from a wav2vec model.
For other languages, the authors of [7] developed a method based on Support Vector Machines with Conditional Random Field (CRF) classifiers, using part-of-speech (POS) and morphological data for Arabic texts. The authors of [8] used Deep Neural Networks and Convolutional Neural Networks for English texts, and the authors of [9] used transformers for English medical texts.
Recently, the Sentence End and Punctuation Prediction shared task covering many languages was launched [10]. All of the teams explored neural network models, particularly transformers. The winning team described their solution in [11].

III. COMPETITION DESCRIPTION
Three datasets are provided in the competition: train, dev, and test. For each dataset, input audio WAV files with text transcribed by an ASR system are delivered. The input text is segmented so that a single space separates each word. Each word is prepended with its start and end timestamps in milliseconds.
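Such timestamped input can be read with a few lines of code. The sketch below assumes each token is serialized as a "<start_ms> <end_ms> <word>" triple separated by single spaces; the exact on-disk layout is an assumption for illustration, not taken from the task specification.

```python
# Hedged sketch of reading the competition input, assuming every word is
# preceded by its start and end timestamps in milliseconds, all fields
# separated by single spaces (serialization details are an assumption).
def parse_transcript(line: str) -> list[dict]:
    fields = line.split()
    assert len(fields) % 3 == 0, "expected (start, end, word) triples"
    words = []
    for i in range(0, len(fields), 3):
        start_ms, end_ms, word = fields[i:i + 3]
        words.append({"start_ms": int(start_ms), "end_ms": int(end_ms), "word": word})
    return words

sample = "0 350 dzień 360 800 dobry"
print(parse_transcript(sample))
# → [{'start_ms': 0, 'end_ms': 350, 'word': 'dzień'}, {'start_ms': 360, 'end_ms': 800, 'word': 'dobry'}]
```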
The missing punctuation symbols are listed in Table I. The dataset is split into three subsets as described in Table III. The annotation scheme is not publicly available during the competition and will be described in [14].
A sample from the training dataset is shown in the subsection below.

B. Utilized Data
In our final solution, we did not use any audio data. Additionally, we decided not to include the start and stop timestamps, as we did not observe any significant score improvement after conducting multiple experiments. Throughout the training process, we experimented with four different data sources, among them:
• europarl-v7.pl-en.pl [15]
Regrettably, the europarl-v7.pl-en.pl dataset did not lead to a score improvement. Therefore, it was not utilized in our final solution.

C. Metric
The challenge metric is the Weighted F1 score. The evaluation script is implemented in the GEval evaluation tool [16]. The challenge was hosted on the Gonito platform [17]. The final evaluation is done on the test-B dataset across all domains. The metric definition is described in detail in the PolEval 2021 Task 1 summary paper [2].
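For intuition, the Weighted F1 score averages per-class F1 values with weights proportional to each class's support in the gold labels, so frequent labels such as NONE and COMMA dominate the score. A minimal from-scratch illustration (label names are assumptions):

```python
# Minimal, from-scratch illustration of the weighted F1 metric: per-class
# F1 scores averaged with weights proportional to each class's support
# in the gold labels.
from collections import Counter

def weighted_f1(gold: list[str], pred: list[str]) -> float:
    support = Counter(gold)
    total = 0.0
    for label, count in support.items():
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += count * f1
    return total / len(gold)

gold = ["NONE", "COMMA", "NONE", "FULLSTOP", "NONE", "COMMA"]
pred = ["NONE", "COMMA", "COMMA", "FULLSTOP", "NONE", "NONE"]
print(round(100 * weighted_f1(gold, pred), 2))  # → 66.67
```

The official scoring, of course, is performed by GEval; this sketch only mirrors the averaging scheme.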

IV. METHOD
Our method was based on the FullStop: Multilingual Deep Models for Punctuation Prediction [11] library. We slightly modified the library to work on a different set of punctuation marks than it was intended for. The final model was based on a single HerBERT [1], a neural model of the transformer architecture [18] trained on a corpus of Polish texts. The model was fine-tuned on the data described in Section III-B with the aforementioned text preprocessing steps. We used the scripts available at https://github.com/oliverguhr/fullstop-deep-punctuation-prediction/blob/main/other_languages/readme.md. The Polish RoBERTa [19] model was evaluated as well, but not used in the final solution due to worse results. Both evaluations are available in Tables V and VI. We also conducted experiments with XLM-RoBERTa [20], but again did not achieve better results.
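Whichever backbone is chosen, the final decoding step is the same: one predicted label per input word is merged back into the transcript to produce punctuated text. A minimal sketch of this step (the label set and mark mapping are illustrative assumptions, not the library's exact interface):

```python
# Hypothetical sketch of the decoding step shared by token-classification
# punctuation models: given one predicted label per input word, rebuild
# the punctuated text. Label names here are illustrative assumptions.
MARKS = {"NONE": "", "COMMA": ",", "FULLSTOP": ".", "QUESTION": "?"}

def apply_labels(words: list[str], labels: list[str]) -> str:
    assert len(words) == len(labels), "one label per word expected"
    return " ".join(word + MARKS[label] for word, label in zip(words, labels))

print(apply_labels(["jak", "się", "masz"], ["NONE", "NONE", "QUESTION"]))
# → jak się masz?
```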

V. RESULTS
The final model achieved a third-place score of 71.44 in the competition's Weighted F1 category. While it falls behind the first-place score of 83.30 and the second-place score, it still surpasses the baseline score of 35.30. Frequent punctuation symbols such as full stops and commas (occurring more than ten times per 1000 words) consistently scored between 70 and 80 in F1. However, the F1 scores varied greatly for less frequent symbols, with scores of 16.67, 100.00, and 43.72.
The subsections below illustrate some correct and incorrect predictions from the test-B dataset.

The competition dataset is based on three resources summarized in Table II.

Proceedings of the 18th Conference on Computer Science and Intelligence Systems, pp. 1251-1254. DOI: 10.15439/2023F1633. ISSN 2300-5963, ACSIS, Vol. 35. IEEE Catalog Number: CFP2385N-ART. ©2023, PTI. Thematic track: Challenges for Natural Language Processing.

TABLE II: THE FULL COMPETITION DATASET (TRAIN, DEV, TEST) STATISTICS.

TABLE III: COMPETITION DATASET STATISTICS SPLIT INTO TRAIN, DEV, TEST.

Table IV presents the statistics for the training datasets used and the competition's final test data, test-B. Some punctuation marks are more popular than others, which is consistent across all the datasets. There are, however, some differences between the training and testing datasets: e.g., the Fullstop character is more common in the test-B dataset than in the train dataset (104.022 vs. 78.338). The same holds for the Comma (133.303 vs. 112.923). The PolEval 2022 dataset exhibits much more significant differences than the PolEval 2021 dataset. This is particularly evident in the Mean Words per Sample metric, as well as in most punctuation characters. While some characters such as Fullstop, Comma, and Ellipsis are more prevalent in the PolEval 2022 dataset, the Hyphen is less frequent, and the Exclamation mark remains relatively unchanged. Below are samples of golden truths from each dataset, with the last two examples shortened.

TABLE IV: DATASET STATISTICS. THE NUMBER OF PUNCTUATION SYMBOLS IS NORMALIZED PER 1000 WORDS.
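The per-1000-words normalization used in Table IV is simple arithmetic: the raw count of a punctuation mark divided by the total word count, scaled by 1000. The counts below are made-up illustrative values, not the actual dataset figures.

```python
# Sketch of the per-1000-words normalization used in Table IV:
# raw count of a punctuation mark, divided by total words, times 1000.
def per_1000_words(mark_count: int, total_words: int) -> float:
    return 1000 * mark_count / total_words

# e.g. 7834 full stops in a hypothetical 100 000-word corpus
print(round(per_1000_words(7834, 100_000), 3))  # → 78.34
```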