Abbreviation Disambiguation in Polish Press News Using Encoder-Decoder Models

The disambiguation of abbreviations and acronyms is a longstanding problem in Natural Language Processing (NLP) that has garnered significant attention from researchers. Previous approaches have employed statistical methods, semantic similarity metrics, and machine learning algorithms. Various languages and document types have been explored, with English being the most commonly studied language. Recent advances have been driven by the application of pre-trained transformer models. Standardization and addressing the challenges of multilingual and multi-document type disambiguation remain ongoing goals in the field of NLP. This paper presents an in-depth exploration of abbreviation disambiguation using state-of-the-art neural Encoder-Decoder models, specifically the ByT5 and plT5 architectures. Advanced synthetic data generation techniques are introduced and their effect on model performance is analysed. The methods are evaluated in the context of the PolEval abbreviation disambiguation competition, where the authors achieve top ranking.


I. INTRODUCTION
THE problem of disambiguation of both acronyms and abbreviations has been the subject of interest for many researchers in the field of Natural Language Processing (NLP) for many years. Even before the era of widely used machine learning algorithms and text recognition using Deep Neural Networks (DNN), methods based solely on statistics were used. An example of such work is the paper [1] from 2004, which used a semantic similarity metric. The author determines the adequacy of abbreviation expansion candidates based on the similarity between the context of the target abbreviation and that of its expansion candidate.
The motivation for recognizing abbreviations often stemmed from the need to understand passages in documents such as legal provisions or medical notes. However, due to the richness, diversity, and uniqueness of languages, it is difficult to generalize a solution for expanding abbreviations or acronyms. Articles [2], [3] examine Jewish Law documents written in Hebrew, while [4], [5] present research on clinical papers using methods such as Support Vector Machines (SVM). Scientific papers are usually dedicated to a single language for which the datasets were prepared, and the most widely used language in datasets is English. An example of work on a different language is the analysis of Chinese in [6], where the authors present an unconventional method based on Integer Linear Programming (ILP) and decode abbreviations from generated candidates. Meanwhile, [7] analyze the Russian language by comparing methods such as SVM, Random Forest (RF), and Gradient Boosting (GB). Research has also been conducted on the Polish language, for example in [8], which utilized the bidirectional long short-term memory (LSTM) neural network architecture and compared two methods: automatically selecting all words in a text and clustering abbreviation occurrences.
Another aspect worth noting is the diversity of approaches to solving the disambiguation problem. As it turns out, supervised methods such as Convolutional Neural Networks (CNN), which are primarily used for image analysis, can also be applied to this purpose in NLP, as evidenced by [9]. A different example is the combination of pre-trained transformer-based models such as RoBERTa and SciBERT into a new model named hdBERT, as presented in [10]. In this study, the authors also compared many state-of-the-art non-deep and deep learning methods up to 2017.
In 2020, the article [11] presented Google's T5 model as the Unified Text-to-Text Transformer. Since then, models with this architecture have found applications not only in text translation but also in expanding abbreviations, as shown in [12]. In another article [13], expansions of acronyms were presented using pre-trained language models such as BERT and T5 for datasets consisting of four categories: Legal English, Scientific English, French, and Spanish.
The authors of [14] built an end-to-end acronym expander system named AcX and compared various existing methods such as Cosine Similarity (Cossim), RF, Logistic Regression (LR), and SVM. Based on this, they also prepared a benchmark on various types of datasets, including biomedical documents, scientific papers, and Wikipedia.
Based on the above considerations, research on abbreviation disambiguation over the years can be divided along three main dimensions:
• the method used, which is almost always related to various machine learning algorithms
• the language of the documents
• the type of documents analyzed
Former articles mainly focused on one type of document, one language, and one or two methods. However, today we increasingly encounter works related to multilingual models such as mT5 (multilingual T5), trained on 101 languages. These models undoubtedly require more memory (the T5-base model has 220M parameters, the mT5-base model 580M). The area of abbreviation recognition is moving towards standardization and handling multiple languages and document types at once, but achieving satisfactory results still poses a challenge for the field of NLP. Currently, promising models for this task appear to be encoder-decoder models like T5, pre-trained on a multi-task mixture of unsupervised and supervised tasks.
This article presents an attempt to standardize and formalize different aspects of abbreviation disambiguation: methods, challenges, and limitations. State-of-the-art methods are evaluated on the PolEval 2022/23 competition, specifically in Task 2: Abbreviation disambiguation. Additional dataset augmentation techniques are described, such as dictionary lookup and algorithmic generation of arbitrary abbreviations. A unified training framework for abbreviation disambiguation is provided in a public online code repository, together with the additionally created datasets. By combining the above methods, the authors achieve the number one ranking in the PolEval contest.

II. DATA

A. Training, validation and test datasets
The training, validation, and test datasets have been provided by the organizers of PolEval. As part of the PolEval competition, a training set called train and a validation set called dev-0 with expected output were created, along with two test sets, test-A and test-B, whose expected output was withheld. The collection and preparation of the datasets were carried out by:
• Michał Marcińczuk (Wrocław University of Science and Technology)
• Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences / Sages)
1) Assumptions: During the preparation, the authors of the reference corpus based their work on three assumptions:
• focus on abbreviations of common words or phrases ending with a dot (excluding initials, acronyms, and proper names)
• the context and common knowledge should be sufficient to expand the abbreviation (excluding incomplete or confusing examples)
• the base forms should follow the guidelines of phrase lemmatization from PolEval 2019 Task 2, with some exceptions, such as abbreviations joined with other abbreviations or phrases
Table I presents several examples of abbreviation disambiguation found in these datasets.
2) Input and expected output: The input data in the in.tsv file consists of two columns, as shown in Table II: the first column contains the phrase to be analyzed, and the second contains the context in which the phrase appears. The occurrence of the phrase is marked in the context by the keyword <mask>. The output file expected.tsv is also composed of two columns: the first contains the inflected form, and the second the base form. If a masked token in the input is not an abbreviation, then both columns in the expected file repeat the masked token.
3) Data processing: The corpus was created based on the following four steps:
1. The datasets were built based on a collection of press news.
2. Regular expressions were used to collect potential abbreviations.
3. Each candidate was represented as a matched phrase and the context with several words before and after the match.
4. Candidates were selected and manually annotated.
4) Dataset cleanup: During the review, the authors removed any examples where the text was corrupted, making it difficult to analyze.
5) Challenges: The task poses several challenges that need to be addressed in order to solve it.
a) Challenge - ambiguous forms: Looking at one-word abbreviations, there is a large pool of words to which these abbreviations can be expanded. It can be observed in Table III that practically every word can be shortened to a single letter, and based on the context, one can infer the intended word. However, the challenge lies in correctly identifying this word. Figure 1 shows that there are multiple potential forms for each abbreviation, and the distribution is not such that one form constitutes 90% of the expansions. What makes this task interesting is that there are many expansions for each abbreviation. In each case there is a dominant word; for example, for the abbreviation w., the most frequent expansion is wieku, which accounts for 44.0% of expansions, while the remaining cases, below 1.6% each, sum up to 21.0%.
b) Challenge - unbounded space of abbreviations: The second element that affects the complexity of this task is the practically unbounded set of phrases subject to analysis. The task needs to be approached creatively, because many cases not present in the training set may appear in the test set and should be handled correctly.
Based on Table IV, it can be assumed that about half of the phrases are non-abbreviated elements, while the other half are actual abbreviations that require expansion.
c) Challenge - mixed abbreviations and non-abbreviations: Another challenge is distinguishing whether a given phrase is an abbreviation or not. There are also cases where a phrase can be both an abbreviation and a regular word that does not require expansion, which introduces an additional level of complexity. Furthermore, certain phrases are a combination of abbreviations and non-abbreviations, marked as others in Table V, for example, replacing mln. ton. with miliona ton.
6) Special cases: There are several specific cases that have appeared to some extent in the corpus and go beyond the previous assumptions.
a) Special case - ambiguous forms: There are instances where certain abbreviations can be expanded into multiple words, as in the example presented in Table VI, where m. can be expanded into both miejscowości and miasta. Wherever this ambiguity can be resolved through context, such as certain signals indicating that it refers to one specific form, only one form is expected in the output. However, in cases where the ambiguity cannot be resolved, both forms should appear in the output, and they will be compared in this way.
b) Special case - non-abbreviation: In situations shown in Table VII, where a given phrase is not an abbreviation, the output is expected to repeat the phrase, which will be recognized as a non-abbreviation. Therefore, in this case the second part of the answer is not a base form.
Both output columns, the inflected and the base form, should then repeat the phrase exactly as it appears in the input, including the dot.


TABLE VII
AN EXAMPLE OF A CASE WITH A NON-ABBREVIATION
in.tsv: Goi. | expected.tsv: Goi. Goi.
c) Special case - abbreviations joined with non-abbreviations: When there is a combination of an abbreviation and a non-abbreviated element, the abbreviated fragment should be expanded, while the non-abbreviated one should be preserved in the same form as in the input phrase, as in w sprawie T. In the example shown in Table VIII, these are initials, which we also do not want to expand.

d) Special case - lemmatization of joined abbreviations: In yet another case, there may be a combination of abbreviations that affects lemmatization. In the example shown in Table IX, two abbreviations appear consecutively, and each should be expanded separately: the one-element abbreviation mm. is expanded to millimeters (an annotation error; it should be in the singular: millimeter), and the two-element abbreviation Temp. maks. is expanded to Maximum temperature, and the expansions are then joined together.

B. Dictionary-based additional data
From dictionaries, we extracted abbreviations with their expanded forms, e.g., bdb. -> bardzo dobry. Then, from the Polish corpus CC100 [15], we extracted text fragments containing inflected expanded forms of the abbreviations and replaced them with the abbreviations. An example is shown in Table X; a sketch of the replacement step follows.
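The replacement step can be sketched in Python as follows. This is a minimal illustration: the mapping, the helper name make_sample, and the example sentence are hypothetical, and the actual pipeline derived the inflected forms with Morfeusz rather than listing them by hand.

```python
import re

# Hypothetical mapping (in the real pipeline this is derived from the
# dictionaries and Morfeusz): inflected expansion -> (abbreviation, base form)
INFLECTED_TO_ABBR = {
    "bardzo dobrym": ("bdb.", "bardzo dobry"),
    "bardzo dobra": ("bdb.", "bardzo dobry"),
}

def make_sample(line: str):
    """Replace the first known inflected expansion in a corpus line with
    <mask> and emit a PolEval-style (phrase, context, outputs) sample."""
    for inflected, (abbr, base) in INFLECTED_TO_ABBR.items():
        pattern = re.compile(r"\b" + re.escape(inflected) + r"\b")
        if pattern.search(line):
            context = pattern.sub("<mask>", line, count=1)
            # The abbreviation stands in for the expansion; the model must
            # recover both the inflected and the base form.
            return {"phrase": abbr, "context": context,
                    "inflected": inflected, "base": base}
    return None

sample = make_sample("Film był bardzo dobrym debiutem reżysera.")
```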
Only samples with unique inflected forms were taken into consideration, giving 1982 new data points. The process also creates incorrect samples, e.g., najlepszym being abbreviated to db. This dataset lacks non-abbreviation examples.
1) Morfeusz: Morfeusz [16] is a morphological analysis tool for the Polish language. With its help, all abbreviations with their expanded forms were filtered out of the dictionary. The pairs were selected on the basis of the morphosyntactic tags brev:pun or brev:npun. The brev feature indicates the base form of an abbreviation expansion, while pun and npun denote the presence or absence of a dot after the abbreviation. Table XI shows that there are more abbreviations without a dot at the end. Given the problem posed in the task, pun abbreviations may be more useful, since they can occur anywhere in a sentence, whereas npun abbreviations appear only at the end of a sentence, to be followed by a dot.
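The filtering can be sketched with the morfeusz2 Python bindings; the exact layout of the tuples returned by analyse() should be treated as an assumption of this sketch.

```python
import morfeusz2  # Python bindings for the Morfeusz 2 analyser

morf = morfeusz2.Morfeusz()

def abbreviation_interpretations(token: str):
    """Keep only interpretations tagged as abbreviations:
    brev:pun (a dot follows) or brev:npun (no dot follows)."""
    results = []
    # analyse() is assumed to yield (start, end, (orth, lemma, tag, ...)).
    for _start, _end, interp in morf.analyse(token):
        orth, lemma, tag = interp[0], interp[1], interp[2]
        if tag.startswith("brev:"):
            results.append((orth, lemma, tag))
    return results

# e.g. abbreviation_interpretations("prof") might yield
# [("prof", "profesor", "brev:pun")]
```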
2) Wiktionary: From the free, multilingual dictionary Wiktionary, 554 different abbreviations were extracted, with their meaning or meanings serving as expansions of the abbreviation. An example of multiple meanings in this dataset is wyd., which can mean wydanie, wydawca, wydawnictwo, or wydawniczy. As can be seen in Table XII, most of the abbreviations found have only one meaning, while in some cases there are as many as 14 meanings for n. or 19 for a.
In addition to the meanings, examples were also extracted from the Wiktionary, serving as context; however, this information was not utilized in solving the task.
3) SJP: More abbreviations than in the previous dictionaries, a total of 1199, were found in the Polish language dictionary SJP. Just like in Wiktionary, some abbreviations had multiple meanings; for example, woj. could stand for województwo, wojewoda, wojewódzki, wojenny, or wojskowy. The abbreviations were selected by reviewing all the words and phrases in the dictionary and filtering out those ending with a period. Table XIII shows that no SJP abbreviation has as many meanings as sometimes found in Wiktionary; the number 5 can be considered the maximum number of meanings almost always observed in SJP and Wiktionary.

C. Synthetic additional data
Collecting a sufficiently large and diverse dataset with accurately annotated abbreviations can be a challenging and time-consuming task. In this section, we describe the methodology used to generate synthetic data.
1) Data collection and preprocessing: The source corpus was the Polish Wikipedia.A sliding window is used to randomly select a context of 140 to 200 characters.This context length is representative of the PolEval abbreviation disambiguation dataset.Within each such context, a continuous span of words is randomly selected.These words are then processed with algorithmic abbreviation.
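Context selection can be sketched as follows; this is a simplified version, and boundary handling in the actual pipeline may differ.

```python
import random

def sample_context(text: str, min_len: int = 140, max_len: int = 200) -> str:
    """Randomly cut a 140-200 character window out of a Wikipedia passage,
    the length range representative of the PolEval dataset."""
    length = random.randint(min_len, max_len)
    if len(text) <= length:
        return text
    start = random.randrange(len(text) - length)
    window = text[start:start + length]
    # Trim the ragged edges so the context starts and ends on whole words.
    return window[window.find(" ") + 1 : window.rfind(" ")]
```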
2) Algorithmic abbreviation: Each word in the selected span is abbreviated with one of the following strategies (see the sketch after this list):
• Choose 1 to 4 of the first characters.
• Choose the first and one middle character.
• abbr_first_mid_last: Choose the first, one middle, and the last character, e.g., profesor → pfr.
The algorithm applies a random strategy to each word in the span. It must be noted that the generated abbreviations are not guaranteed to be grammatically correct.
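A sketch of the three strategies is given below. Only the name abbr_first_mid_last appears in the paper; the other two function names are illustrative.

```python
import random

def abbr_prefix(word: str) -> str:  # illustrative name
    """Choose 1 to 4 of the first characters."""
    return word[: random.randint(1, min(4, len(word)))]

def abbr_first_mid(word: str) -> str:  # illustrative name
    """Choose the first and one middle character."""
    return word[0] + word[random.randrange(1, len(word) - 1)]

def abbr_first_mid_last(word: str) -> str:
    """Choose the first, one middle and the last character: profesor -> pfr."""
    return word[0] + word[random.randrange(1, len(word) - 1)] + word[-1]

STRATEGIES = [abbr_prefix, abbr_first_mid, abbr_first_mid_last]

def abbreviate_span(words: list[str]) -> list[str]:
    # A random strategy is applied to each word; very short words are kept,
    # since the middle-character strategies need at least three characters.
    return [random.choice(STRATEGIES)(w) if len(w) >= 3 else w for w in words]
```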
3) Base form prediction: The base forms of all words from the span, before abbreviation, are generated with the spaCy pl_core_news_lg model [17]. Each context containing an abbreviated span is used as a dataset sample. Table XIV shows one generated example. The process repeats until the end of the Wikipedia corpus is reached.
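A minimal sketch of the lemmatization step with spaCy; the sample construction around it is simplified.

```python
import spacy

# pl_core_news_lg is the large Polish spaCy pipeline used in the paper.
nlp = spacy.load("pl_core_news_lg")

def span_base_forms(span_text: str) -> str:
    """Lemmatize the original (pre-abbreviation) words of the span."""
    return " ".join(token.lemma_ for token in nlp(span_text))

# e.g. span_base_forms("czerwonych samochodów") would be expected to
# produce "czerwony samochód"
```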
4) Considerations: This synthetic dataset applies to the broader task of corrupted text restoration, of which abbreviation disambiguation can be seen as a sub-task. In the context of abbreviation disambiguation, it is a low-quality dataset. This disadvantage is countered by its vast size of 14 million examples, which is 3375 times larger than the PolEval training dataset.
The exact version of the created dataset cannot be deterministically reproduced because of the multi-process random number generation used during processing. A snapshot of the dataset is provided online.

III. EVALUATION
The process of abbreviation disambiguation, i.e., replacing abbreviations with their appropriate expansions in text, can be divided into two stages.
The first stage is to find the abbreviation in a dictionary and replace it with the appropriate word or phrase, that is, its base form.
The second stage is to transform the result from the first stage into its correctly inflected grammatical form based on the context in which the abbreviation appears in the text.
It is worth noting that sometimes there is more than one meaning for a given abbreviation, so in the first stage it is also important to take the context into account in order to better predict which meaning is intended in a specified example.
Two metrics were used to objectively evaluate the replacement of abbreviations with their appropriate expansions in text:
• Af - the accuracy of the provided expanded forms of abbreviations
• Ab - the accuracy of the provided base forms of abbreviations
The matching check for both metrics was case-insensitive.
Based on the above metrics, the ultimate formula was defined to determine the final score:

Score = (Af + 3 · Ab) / 4

Therefore, the task of finding the appropriate base form is three times more important than the task of finding the appropriate expansion.
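In code, the final score can be expressed as follows; this is a direct transcription of the weighting above, and the official PolEval evaluation script may differ in details such as rounding.

```python
def final_score(a_f: float, a_b: float) -> float:
    """Combine expansion accuracy (Af) and base-form accuracy (Ab);
    the base form carries three times the weight of the expansion."""
    return (a_f + 3.0 * a_b) / 4.0

# e.g. final_score(0.90, 0.92) == 0.915
```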

IV. METHODS
The solutions are based on a sequence-to-sequence model using the T5 [18] architecture. Krzysztof Wróbel's submission used the ByT5 [19] model, while Jakub Karbowski used the plT5 [20] model.
Both submissions used a similar workflow. The input to the transformer encoder is the context with the abbreviation. The transformer decoder generates both the base and inflected forms. Multiple methods of encoding the input and output of the model were used; they are described in detail in their corresponding sections.
In order to improve the results, majority voting over multiple models has been applied: each model's prediction contributes one vote, and the outcome with the most votes is selected as the final prediction. For this task, majority voting has been applied separately to the inflected and base forms.
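A minimal sketch of this voting scheme; the model outputs here are fabricated for illustration.

```python
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Return the most frequent prediction; ties fall back to the first
    answer reported by Counter.most_common."""
    return Counter(predictions).most_common(1)[0][0]

# Voting is applied separately to the inflected and the base forms.
outputs = [("wieku", "wiek"), ("wieku", "wiek"), ("wieczoru", "wieczór")]
inflected = majority_vote([o[0] for o in outputs])  # "wieku"
base = majority_vote([o[1] for o in outputs])       # "wiek"
```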

A. PolEval submissions
Initial experiments were carried out with limited time because of the competition deadlines. They formed the basis for further post-competition research. First, the exact methods used to produce the competition submissions are described.
1) Krzysztof Wróbel submissions: Proper validation is very important in every competition. The original validation (dev-0) dataset has only 300 samples, which is insufficient for tracking scores with a precision of 0.1 percentage points. Therefore, 1000 samples from the training data were moved to the validation set.
Initial experiments using Adafactor as an optimizer showed that the plT5 models performed slightly worse than the ByT5 models.
The training dataset was augmented by extracting abbreviations from dictionaries and inserting them into sentences sourced from a corpus.
The final submission was created using majority voting on 3 models:
• one trained on the training data and dictionary-based additional data, using the development data for selecting the best model
• two trained on the training data, development data, and dictionary-based additional data with two different seeds
The training parameters were as follows:
• model: byt5-base
Table XV presents the results of Krzysztof Wróbel's submissions to the PolEval competition, including the different models and their corresponding scores on the test-A and test-B datasets. The second model, numbered 5, was trained on the training data along with additional dictionary-based data. This model performed better by 0.5 percentage points than the model trained only on the training data.
The next two models, numbered 8 and 9, were trained on the training data, development data, and dictionary-based additional data, using a different random seed for each. Model 8 achieved a score of 92.18 on the test-A dataset and 91.69 on the test-B dataset, while model 9 achieved 92.14 on test-A and 91.65 on test-B. The final model, numbered 11, is the result of majority voting on the three models 5, 8, and 9. It achieved the highest scores among all the submissions, with 92.76 on the test-A dataset and 92.01 on the test-B dataset.
2) Jakub Karbowski submissions: Input data was encoded in a similar way to Krzysztof Wróbel's submission. The only difference is that the markers are not added as special tokens and are tokenized as raw text by the model's tokenizer, and they are named <mask> instead of <abbrev>.
The output format was the same as in Krzysztof Wróbel's submission, except that the model's output does not differ between abbreviations and non-abbreviations. The output format is: inflected form; base form, e.g., były; być.
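The encoding can be sketched as follows; the closing </mask> marker is an assumption of this sketch, as the paper only names the opening marker.

```python
def encode_input(context: str, phrase: str) -> str:
    """Wrap the target phrase in markers inside the context; the markers
    are tokenized as raw text, not as special tokens."""
    return context.replace("<mask>", f"<mask>{phrase}</mask>", 1)

def encode_target(inflected: str, base: str) -> str:
    # The same format is used for abbreviations and non-abbreviations.
    return f"{inflected}; {base}"

src = encode_input("To <mask> pierwsze zawody.", "były")
tgt = encode_target("były", "być")  # "były; być"
```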
Although ByT5 was considered because of its high performance on noisy data, plT5 was chosen because of the limited training hardware available to the author of the submission. Training was performed on a single GTX 1080 GPU with 8 GB of VRAM within a single day. Training ByT5 on this hardware would not be feasible.
First, pre-training was carried out on the Wikipedia dataset with synthetic abbreviations.
The pre-trained model was then fine-tuned on the PolEval train dataset.
Training parameters:
• model: plt5-base (wiki pre-trained)
• batch size: 8
• gradient accumulation: 32
• epochs: 223
• learning rate: 0.000015
• scheduler: linear with 10% warmup
• optimizer: AdamW
• weight decay: 0.0001
The per-device batch size could be increased because of the decrease in sequence length compared to the pre-training dataset. The score of the final submission with pre-training was 91.75% on test-A and 91.27% on test-B.
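As an illustration, these hyperparameters map naturally onto the Hugging Face Trainer API. This is an assumed mapping, not the authors' actual training code, and the output path is arbitrary.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="plt5-poleval",      # output path assumed
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,
    num_train_epochs=223,
    learning_rate=1.5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    optim="adamw_torch",
    weight_decay=0.0001,
)
```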

B. Post-competition experiments
After the announcement of the competition results, the top two contestants combined their work to evaluate the performance of their methods with respect to:
• model architectures
• used datasets and their combinations
• a broad range of hyperparameters
• original optimizations and solutions
1) Setup: A unified codebase for training was created. It combines all of the methods and datasets used:
• ByT5 and plT5 models
• PolEval, dictionary-based, and synthetic Wikipedia datasets
• majority voting
It also contains the hyperparameters and sweep configurations used during experimentation.
2) Configurations: Eight different configurations were chosen for the final assessment. All combinations of the following options were used:
• Base model: ByT5 or plT5
• Pre-training dataset: none or Wikipedia
• Fine-tuning dataset: PolEval train, or PolEval train with additional dictionary-based data
3) Pre-training: As pre-training on the large synthetic Wikipedia dataset was computationally expensive, sweeps on this dataset were not conducted. Instead, results from fine-tuning runs and manual experimentation provided the hyperparameters for pre-training.
4) Fine-tuning: For each configuration, a hyperparameter sweep was conducted. The sweeps considered the learning rate, weight decay, number of epochs, and optimizer (AdamW or Adafactor); a sketch of such a sweep configuration is shown below. To provide a fair comparison between the two model architectures, each sweep was given 24 hours of computational time on an A100 GPU. The four sweeps with ByT5 managed to perform 16 training runs each, while the plT5 sweeps performed 80 runs each.
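A sweep over these dimensions could look as follows, assuming a Weights & Biases setup; the tool, value ranges, metric name, and project name are all assumptions, as the paper only lists the swept dimensions.

```python
import wandb  # assuming Weights & Biases; the actual sweep tool is not named

sweep_config = {
    "method": "random",
    "metric": {"name": "dev_score", "goal": "maximize"},  # metric name assumed
    "parameters": {  # value ranges are illustrative
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-5, "max": 1e-3},
        "weight_decay": {"values": [0.0, 1e-4, 1e-2]},
        "epochs": {"values": [30, 60, 120]},
        "optimizer": {"values": ["adamw", "adafactor"]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="poleval-abbrev")  # project assumed
```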
5) Voting: Experiments involving majority voting were conducted to evaluate the performance of the best plT5 and ByT5 models. For each architecture, a set of 10 models was trained using identical parameters but different random seeds.

VI. RESULTS
Table XVI shows scores for experiments using plT5 and ByT5 models trained on different datasets. The highest scores are obtained by pre-training on the synthetic data and then fine-tuning on the train data with the dictionary-based data. The models are shared on Hugging Face. ByT5 consistently achieves higher scores than plT5. The results on the dev dataset do not correlate with the test data due to the small size of the dev dataset.
Using the synthetic Wikipedia dataset for pre-training improves the performance of both models. For plT5, the test-B score improves by around 1%, both on the pure PolEval train dataset and with the additional dictionary data. For ByT5, the improvement is under 1%. Using additional dictionary data improves the scores by around 0.5%. Table XVII provides the test-A and test-B scores for different combinations of plT5 and ByT5 models. The row labeled 1 represents the score obtained when only the first model is used. Subsequent rows, labeled 1-2, 1-3, and so on, indicate the scores obtained when additional models are included in the majority voting. The maximum improvement observed through this process is approximately 0.5 percentage points. Table XVIII presents the best results and scores achieved by different submissions in the PolEval competition for the two test metrics, test-A and test-B. Krzysztof Wróbel emerged as the highest scorer, surpassing Jakub Karbowski by 0.74 percentage points on the test-B metric. This success can be attributed to the use of the ByT5 model, majority voting, and a larger validation dataset. Incorporating the pre-training step and the AdamW optimizer, as introduced in Jakub Karbowski's solution, has the potential to yield scores higher by more than 1 percentage point.

A. Error analysis
The error analysis of 50 randomly selected errors made by the ByT5 model in the wiki-train-dict variant revealed that for half of them the model's prediction was in fact correct, indicating that the dataset annotation needs to be improved.
More technical issues apply to about 1.5% of the examples. Approximately 0.76% of the examples in the dataset are annotated with multiple possible answers separated by a semicolon, such as przeciw; przeciwko; these cases were not properly taken into account during evaluation. About 0.72% of the examples consist of multi-word abbreviations with tokens separated by more than one space.

VII. CONCLUSIONS
In this paper, we addressed the problem of abbreviation disambiguation in Polish press news using encoder-decoder models.The task involved replacing abbreviations with their appropriate expansions in text, taking into account the context.
Our experiments included submissions to the PolEval competition and post-competition research.In the PolEval competition, we achieved first and second place rankings.In the post-competition experiments, we conducted evaluations using different configurations, including pre-training on synthetic Wikipedia data and fine-tuning on additional data, which achieved a new state-of-the-art on the PolEval competition.
In conclusion, our study contributes valuable insights into the abbreviation disambiguation task in Polish press news. We emphasize the importance of proper validation, the trade-off between optimizer choice and memory usage, the importance of pre-training, and the effectiveness of majority voting as a simple technique for improving results. Further research can build upon these findings to explore more advanced architectures, optimizations, and techniques for even better performance in Polish abbreviation disambiguation tasks.
Our approach can be easily applied to other languages and various types of texts.

VIII. APPENDIX
Table XIX presents errors of the ByT5 model in the wiki-train-dict variant.

IX. ACKNOWLEDGMENT
The competition submissions made by the team of Krzysztof Wróbel and Paweł Lewkowicz were completely independent of Jakub Karbowski's submissions.However, after the competition, the authors decided to collaborate and write a joint article due to the similarities in the methods used.
We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2023/016304. The research has been supported by a grant from the Faculty of Management and Social Communication under the Strategic Programme Excellence Initiative at Jagiellonian University.


