PolEval 2022/23 Challenge Tasks and Results

This paper summarizes the 2022/2023 edition of PolEval — an evaluation campaign for natural language processing tools for Polish. We describe the tasks organized in this edition, which are: Punctuation prediction from conversational language, Abbreviation disambiguation, and Passage Retrieval. We also discuss the datasets prepared for each of the tasks and the evaluation metrics chosen to rank the submissions, and sum up the approaches chosen by the participants to tackle the tasks.


I. INTRODUCTION
PolEval [14] is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by the organizers, using available data, and are evaluated according to pre-established procedures.
The 2022/2023 edition of PolEval was the sixth event in a series of challenges organized since 2017. During this edition, three tasks were proposed: 1) Punctuation prediction from conversational language, 2) Abbreviation disambiguation, 3) Passage Retrieval. The participants of this edition were very active, as we received more than 400 submissions from 23 teams. The submissions were made through our evaluation platform, which was introduced last year.
In the following part of the paper, we describe each of the tasks in detail, present the datasets created for the particular challenges, discuss the evaluation metrics, and give an overview of the submissions made by the participants.

II. TASK 1: PUNCTUATION PREDICTION FROM CONVERSATIONAL LANGUAGE

A. Problem statement
Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do not contain any punctuation or capitalization. In longer stretches of automatically recognized speech, the lack of punctuation affects the general clarity of the output text [24]. The primary purpose of punctuation restoration (PR), punctuation prediction (PP), and capitalization restoration (CR) as a distinct natural language processing (NLP) task is to improve the legibility of ASR-generated text and possibly other types of texts without punctuation. For the purposes of this task, we define PR as the restoration of originally available punctuation in read speech transcripts (which was the goal of a separate task in the PolEval 2021 competition) [10] and PP as the prediction of possible punctuation in transcripts of spoken/conversational language. Aside from their intrinsic value, PR, PP, and CR may improve the performance of other NLP tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing, or spoken dialog segmentation [5], [12].
One of the challenges in developing PP models for conversational language is the limited availability of consistently annotated datasets. The very nature of naturally occurring spoken language makes it difficult to identify exact phrase and sentence boundaries [21], [23], which means that dedicated guidelines are required to train and evaluate punctuation models.
The goal of the present task is to provide a solution for predicting punctuation in the test set collated for this task.

B. Task description
The workflow of this task is illustrated in Figure 1 below. Given raw ASR output, the task is to predict punctuation in annotated ASR transcripts of conversational speech.

C. Dataset
The test set consisted of time-aligned ASR dialogue transcriptions from three sources: 1) CBIZ, a subset of DiaBiz [17], a corpus of phone-based customer support line dialogs, 2) VC, a subset of transcribed video-communicator recordings, which are included in the SpokesBiz corpus, 3) SPOKES, a subset of the SpokesMix corpus [16]. Table I below summarizes the size of the three subsets in terms of dialogs, words, and duration of recordings. The full dataset has been split into three subsets, as summarized in Table II. The punctuation annotation guidelines were developed in the CLARIN-BIZ project by Karasińska et al. [20].
Participants were encouraged to use both text-based and speech-derived features to identify punctuation symbols (e.g. using a multimodal framework [22]) or to predict casing along with punctuation [15]. We allowed using the punctuation dataset available at http://2021.poleval.pl/tasks/task1 [10]. The punctuation marks evaluated as part of the task are listed in Table III.
2) Transcriptions and metadata: The datasets are encoded in the TSV format.
Field descriptions:
• column 1: name of the audio file
• column 2: unique segment id
• column 3: segment text, where each word is separated by a single space
The segment text (column 3) format is:
• single word: text:word start timestamp in ms-word end timestamp in ms
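A minimal parsing sketch for this format is shown below. It assumes each row contains exactly the three columns described above, that tokens follow the word:start_ms-end_ms pattern, and that words themselves contain no colons; the function name and file path are illustrative.

```python
import csv

def parse_segments(path):
    """Parse the task's TSV transcripts: audio file name, segment id,
    and segment text in which every token is 'word:start_ms-end_ms'."""
    segments = []
    with open(path, encoding="utf-8") as f:
        for audio_file, segment_id, text in csv.reader(f, delimiter="\t"):
            tokens = []
            for item in text.split(" "):
                word, _, span = item.rpartition(":")
                start_ms, _, end_ms = span.partition("-")
                tokens.append({"word": word,
                               "start_ms": int(start_ms),
                               "end_ms": int(end_ms)})
            segments.append({"audio": audio_file,
                             "segment_id": segment_id,
                             "tokens": tokens})
    return segments
```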
D. Evaluation
1) Metrics: The final results were evaluated in terms of precision, recall, and F1 scores for predicting each punctuation mark separately. Submissions were compared with respect to the weighted average of F1 scores for each punctuation sign. The method of evaluation was similar to the one used in the PolEval 2021 task "Punctuation restoration from read text" [10].
2) Per-document score: precision, recall, and F1 were first computed for each punctuation sign within every document.
3) Global score per punctuation sign p: the per-document counts were aggregated over the whole test set, e.g. the global recall for a sign is

$$R_p = \frac{\sum_{d \in \mathrm{Documents}} TP_d}{\sum_{d \in \mathrm{Documents}} \left( TP_d + FN_d \right)}$$

The final scoring metric was calculated as the weighted average of the global scores.
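A minimal sketch of this metric is given below. It assumes per-sign true-positive, false-positive, and false-negative counts already aggregated over all documents; weighting each sign by its support is an assumption made for illustration, since the exact weights are not reproduced here.

```python
def weighted_f1(counts, weights=None):
    """counts: {sign: {"tp": int, "fp": int, "fn": int}} aggregated over all documents.
    weights: {sign: float}; if None, each sign is weighted by its support (tp + fn)."""
    f1, support = {}, {}
    for sign, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1[sign] = (2 * precision * recall / (precision + recall)
                    if precision + recall else 0.0)
        support[sign] = c["tp"] + c["fn"]
    if weights is None:
        total = sum(support.values())
        weights = {s: support[s] / total for s in support}
    return sum(weights[s] * f1[s] for s in f1)
```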

E. Results
The winning solution submitted for Task 1 by Oskar Bujacz achieved a weighted F-measure of 83.3 (see Table VII). The author used a token classifier based on the largest variant of the HerBERT model [11] with customized output postprocessing rules.
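As an illustration of the general token-classification approach (a sketch, not the winning author's actual code), the snippet below loads a HerBERT checkpoint with a token-classification head and assigns one punctuation label to each word. The label set is illustrative, and the classification head is randomly initialised, so it would have to be fine-tuned on the task data before producing meaningful predictions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Labels: punctuation mark inserted after a word, or "" for none (illustrative set).
LABELS = ["", ".", ",", "?", "!", ":", "-"]

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "allegro/herbert-large-cased", num_labels=len(LABELS)
)  # head is newly initialised and must be fine-tuned

def predict_punctuation(words):
    """Append one predicted punctuation label to each input word (sketch)."""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]
    pred = logits.argmax(-1).tolist()
    out = []
    for i, word in enumerate(words):
        # use the prediction of the first sub-token of each word
        sub = enc.word_ids().index(i)
        out.append(word + LABELS[pred[sub]])
    return " ".join(out)
```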

III. TASK 2: ABBREVIATION DISAMBIGUATION

A. Problem statement
Abbreviations are often overlooked in many NLP pipelines. However, they remain an important problem to tackle, especially in applications such as machine translation, named entity recognition, or text-to-speech systems.
There are at least two practical challenges in processing abbreviations. The first is the ability to find the full, expanded dictionary form of an abbreviation. In many cases, this may be done by a simple dictionary lookup, but:
• the use of abbreviations is often unconventional and there is no complete list of all possible abbreviation uses,
• many abbreviations are ambiguous, that is, the same abbreviation may have more than one meaning, translating to possibly different expanded forms.
As in many other NLP tasks, the disambiguation of abbreviations needs to include context and additional language knowledge to be feasible.
The second challenge, which is specific to languages with rich morphology, such as Polish, is the necessity to produce the expanded form of an abbreviation in the correct grammatical form, in agreement with the rest of the sentence.
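To make both challenges concrete, the toy lookup below (the entries are illustrative and not taken from the task data) shows how a single abbreviation can map to several candidate expansions, each of which must additionally be inflected to fit its sentence.

```python
# Toy dictionary; entries are illustrative, not taken from the task data.
EXPANSIONS = {
    "dr": ["doktor"],
    "p.": ["pan", "pani"],  # ambiguous: resolving it requires context (e.g. gender)
}

# Even after choosing the right sense, the expansion must be inflected to fit
# the sentence, e.g. "rozmowa z dr Kowalskim" -> "rozmowa z doktorem Kowalskim".
INFLECTED = {
    ("doktor", "instrumental"): "doktorem",
    ("pan", "instrumental"): "panem",
    ("pani", "instrumental"): "panią",
}

def candidate_expansions(abbrev):
    """Return all dictionary senses of an abbreviation (empty list if unknown)."""
    return EXPANSIONS.get(abbrev.lower(), [])
```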

B. Task description
The task aimed to propose a method of disambiguating Polish abbreviations. The method should recognize whether a given phrase is an abbreviation and, if so, produce its expanded form, both the base and the inflected one.

C. Dataset
1) Training data: In this task, a (relatively small) training dataset was provided (see the example in Figure 2), which included:
• the abbreviation,
• the expanded form of the abbreviation,
• the base form of the abbreviation,
• the context of the abbreviation, with a placeholder marking the position where the abbreviation appeared.
The participants were encouraged to collect and use additional training and dictionary data and to publish it after the competition.
2) Test data: The test data consisted of only the abbreviation and its context. The systems were expected to provide the expanded and base forms of the abbreviation.

D. Evaluation
We calculated two measures of accuracy for each provided submission:
• Af: the accuracy of the provided expanded forms of abbreviations (case-insensitive string match),
• Ab: the accuracy of the provided base forms of abbreviations (case-insensitive string match).
Based on these measures, the final score was calculated as a weighted average of Af and Ab.
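A simple sketch of this scoring scheme is shown below. Since the weights of the average are not reproduced here, equal weights are assumed purely for illustration; the function names are hypothetical.

```python
def accuracy(preds, golds):
    """Case-insensitive exact-match accuracy over parallel lists of strings."""
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)

def final_score(pred_expanded, gold_expanded, pred_base, gold_base,
                w_expanded=0.5, w_base=0.5):
    """Weighted average of Af and Ab; the 0.5/0.5 weights are an assumption."""
    a_f = accuracy(pred_expanded, gold_expanded)   # Af
    a_b = accuracy(pred_base, gold_base)           # Ab
    return w_expanded * a_f + w_base * a_b
```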

E. Results
We received five submissions (see Table VIII). The final ranking was calculated based on the weighted accuracy on the Test-B dataset. The scores ranged from 19.09 to 92.01, with Krzysztof Wróbel obtaining the highest score of 92.01.
Krzysztof Wróbel utilized an ensemble of three models, each based on the byt5-base model, trained on different seeds and employing majority voting. The training of these models incorporated both the train and dev datasets, as well as a small dataset automatically generated from abbreviations sourced from various dictionaries such as Morfeusz [9], sjp.pl, and Wiktionary.
Jakub Karbowski (2nd place submission) trained a sequence-to-sequence model based on the plt5-base model. The input to the model was the context with a masked abbreviation, and the target consisted of the base form and the inflected form of the expanded abbreviation. The initial training was performed on a synthetic dataset generated from the Polish Wikipedia. The dataset was created by randomly selecting contexts of varying lengths and shortening consecutive words using one of several strategies, such as using the first few letters, the first and last letters, or the first, middle, and last letters. The base form was generated using spaCy. Then, the model was fine-tuned on the PolEval dataset.
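A minimal sketch of the sequence-to-sequence approach used by the top submissions is shown below. The prompt template and output format are illustrative assumptions, not the participants' actual ones, and the model would have to be fine-tuned on abbreviation data before producing useful expansions.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Generic encoder-decoder setup similar in spirit to the plt5-base submissions.
tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("allegro/plt5-base")

def expand_abbreviation(context_with_mask, abbreviation):
    """Generate an expansion of an abbreviation in context (illustrative prompt)."""
    prompt = f"rozwiń skrót: {abbreviation} kontekst: {context_with_mask}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```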

IV. TASK 3: PASSAGE RETRIEVAL

A. Problem statement
Question answering systems use retrieval components to find passages containing correct answers. Traditionally, lexical methods, such as TF-IDF or BM25 [18], have been used to power retrieval systems. They are fast, interpretable, and do not require training (and therefore a training set). However, they can only return a document if it contains a keyword present in the query. In addition, their understanding of text is limited because they ignore word order.
Recently, neural retrieval systems (e.g. Dense Passage Retrieval [8]) have surpassed these traditional methods by fine-tuning pre-trained language models on a large number of (query, document) pairs. They solve the aforementioned problems of lexical methods, but at the cost of requiring labelled training sets and of poor generalisation to other domains. As a result, in a zero-shot setup (i.e. with no training set), lexical methods are still competitive with or even better than neural models.
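As a minimal illustration of the dense (bi-encoder) retrieval idea, the sketch below embeds the query and the passages with a pre-trained multilingual sentence encoder and ranks passages by cosine similarity. The checkpoint name is only an example of a publicly available model, not one used by the organizers or participants.

```python
from sentence_transformers import SentenceTransformer, util

# Example multilingual bi-encoder; any similar checkpoint could be substituted.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def dense_retrieve(query, passages, top_k=10):
    """Embed the query and all passages, return the top_k most similar passages."""
    passage_emb = encoder.encode(passages, convert_to_tensor=True,
                                 normalize_embeddings=True)
    query_emb = encoder.encode(query, convert_to_tensor=True,
                               normalize_embeddings=True)
    hits = util.semantic_search(query_emb, passage_emb, top_k=top_k)[0]
    return [(passages[h["corpus_id"]], h["score"]) for h in hits]
```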

B. Task description
The aim of the passage retrieval task was to develop a system for cross-domain question-answering retrieval. For each test question, the system should retrieve an ordered list of the ten most relevant passages (i.e. containing the answer) from the given corpus. The system is evaluated on the basis of its performance on test examples from three different domains, namely trivia, law, and customer support.

C. Dataset
1) Training set: The training set consisted of 5,000 trivia questions from the PolQA dataset [19]. Each question was accompanied by up to five passages from Polish Wikipedia containing the answer to the question. In total, the training set consisted of 16,389 question-passage pairs. In addition, we provided a Wikipedia corpus of 7,097,322 passages. The raw Wikipedia dump was parsed with WikiExtractor and split into passages at the end of paragraphs or when a passage was longer than 500 characters.
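The sketch below shows one plausible reading of this splitting rule (accumulate paragraphs and close a passage once it exceeds 500 characters); the exact procedure used to build the corpus may differ.

```python
def split_into_passages(article_text, max_chars=500):
    """Split an article into passages at paragraph ends, closing a passage
    once it exceeds max_chars (an approximation of the described rule)."""
    passages, current = [], ""
    for paragraph in article_text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        current = f"{current} {paragraph}".strip()
        if len(current) > max_chars:
            passages.append(current)
            current = ""
    if current:
        passages.append(current)
    return passages
```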
2) Test sets: The systems were evaluated on three test sets with questions from different domains. The first dataset consisted of 1,291 trivia questions similar to those in the training set.
The second dataset consisted of 900 questions and 921 passages related to Allegro, a large Polish e-commerce platform. The dataset was created based on help articles and lists of frequently asked questions available on the Allegro website. Each question-passage pair was manually checked and edited where necessary.
The third dataset contained over 700 legal questions. It was created by randomly selecting a passage and manually writing a question for it. We also provided a corpus of approximately 26,000 passages extracted from over a thousand acts of law published between 1993 and 2004.

D. Evaluation
The submitted systems were evaluated using Normalised Discounted Cumulative Gain for the top 10 most relevant passages (NDCG@10) [7], where the score of each relevant passage depends on its position in descending order:

$$\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},$$

where $rel_i$ is the relevance of the $i$-th passage and $\mathrm{IDCG@10}$ is the DCG@10 computed on $REL_p$, the list of relevant passages ordered by their relevance.
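A minimal sketch of the metric, following the standard NDCG definition, is given below; the function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for an ordered list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_10(ranked_relevances, all_relevances):
    """NDCG@10: DCG of the returned ranking (top 10) normalised by the DCG
    of the ideal ordering of the relevant passages."""
    ideal = sorted(all_relevances, reverse=True)[:10]
    denom = dcg(ideal)
    return dcg(ranked_relevances[:10]) / denom if denom else 0.0
```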

E. Results
Seven teams submitted a final solution to the task (see Table IX). All systems followed a similar architecture. First, a retriever was used to find the top N most relevant passages, and then a ranker scored these candidates to select the final 10 most relevant ones. Below are brief descriptions of the submitted systems, starting with the highest scoring ones.
Jakub Pokrywka implemented a retriever using the BM25 algorithm with text stemming based on Polimorf. To improve the ranking of answers, separate rankers were used for each domain. For the Allegro and legal domains, an ensemble of mt5-3B and mt5-13B models was used, considering a pool of 1,500 candidates. For the trivia domain, Jakub Pokrywka also used the mt5-3B model, but it was supplemented by custom-trained cross-encoder models, mDeBERTa and mmarco-mMiniLMv2-L12-H384-v1. For the trivia domain, the system considered 3,000 candidate passages for ranking.
Marek Kozlowski used a system consisting of three retrievers: a lexical retriever (BM25) and two neural retrievers based on roberta-base and roberta-large [3]. The BM25 retriever used the ElasticSearch engine with the Morfologik analyser for lemmatisation. For the neural encoders, fine-tuning the RoBERTa models involved the MultipleNegativesRankingLoss loss function, large batch sizes, and training data consisting of a mixture of the PolEval training set and the translated MSMARCO [13] dataset. After retrieval, a re-ranking step was performed, with the mt5-13B model yielding the best results.
Konrad Wojtasik used an ensemble of several retrieval algorithms, starting with the BM25 algorithm, followed by various multilingual retrievers such as mContriever [6], mDPR [1], and LaBSE [4]. To further reduce the number of passages for reranking, he trained the plT5-large model [2] on the translated MSMARCO dataset. The final ranking was performed with mT5-13B on about 350 candidate passages from different sources.
Norbert Ropiak used both lexical (BM25) and neural (mContriever) retrievers and combined the results of both for further processing. He used the ms-marco-MiniLM-L-12-v2 and mDeBERTa cross-encoders for ranking.
Anna Pacanowska's solution was a combination of several models. First, BM25 was used on lemmatised text to retrieve 1,000 candidate passages. Various statistics were calculated on these candidates, such as BM25 on unlemmatised data or on bigrams. The retrieved passages were then translated into English using OPUS-MT, which allowed the English MiniLM-L6 cross-encoder to be used to calculate various scores, including those on raw question/passage pairs and on pairs with answers generated using GPT-3. Finally, logistic regression was used to combine all the results into a final score.
Maciej Kazuła used the BM25 passage retrieval algorithm together with a word inflection dictionary to normalise the text. He fine-tuned the MiniLM-L6 cross-encoder for the ranking process. The cross-encoder was trained on the MSMARCO dataset translated into Polish. A new tokeniser was created on the PolEval dataset, as well as on the translated MSMARCO data, in order to better represent Polish word forms.
Daniel Karaś used two retrievers: a lexical search using BM25 and a neural search using a slightly fine-tuned MiniLM-v6 model. Both retrievers were used to find approximately 1,000 candidates per question, except for the Allegro domain, where all passages were selected. In a second step, all candidate passages were fed into an mBERT model, which was used without any additional training.

F. Summary
All submitted systems used the BM25 algorithm as a retriever, but differed in the way they normalised the text. Many lemmatised the passages, while others favoured stemming or using a dictionary of different word forms. In addition, some teams also used neural retrievers and combined the candidates from the two approaches.
Given a pool of retrieved candidate passages, the systems used different methods to sort them and select the most relevant ones. The most popular were cross-encoders, either trained on multilingual data or fine-tuned by the contestants on the Polish examples. Most teams ensembled several models to achieve better performance.
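As an illustration of this shared retrieve-then-rerank pattern, the sketch below combines a BM25 retriever with a multilingual cross-encoder reranker. It is a generic example rather than any particular submission: whitespace tokenisation stands in for the lemmatisation or stemming the teams used, and the mmarco-mMiniLMv2-L12-H384-v1 checkpoint is just one of the publicly available multilingual cross-encoders mentioned above.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def retrieve_and_rerank(question, passages, top_n=100, top_k=10):
    """Retrieve top_n candidates with BM25, then rerank them with a cross-encoder."""
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    candidate_ids = sorted(range(len(passages)),
                           key=lambda i: scores[i], reverse=True)[:top_n]

    reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")
    ce_scores = reranker.predict([[question, passages[i]] for i in candidate_ids])
    reranked = sorted(zip(candidate_ids, ce_scores), key=lambda x: x[1], reverse=True)
    return [passages[i] for i, _ in reranked[:top_k]]
```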
Three teams used external datasets to train their models. In all cases, they automatically translated the MSMARCO dataset into Polish.
Although the goal of the task was to create a system for cross-domain passage retrieval, participants were allowed to submit different systems for different domains. Three participants chose this approach, including the author of the winning system.
Regarding the results, it can be observed that the performance of the systems depended heavily on the ranker. The first three systems, which achieved results in the range of 67-69 NDCG points, used the very large mt5-13B model as the reranker. The fourth system, which achieved 63 points, used MiniLM-L12 and mDeBERTa. The last three systems, scoring 51-54 points, used only MiniLM-L6 or multilingual BERT (with the exception of Anna Pacanowska's system, which also utilized a custom model). It seems that the retriever did not play an important role in the task, since the best system used only the BM25 model. It is also interesting to observe that none of the systems used a learning-to-rank approach. One deficiency of the evaluation is that it does not take into account the computational cost of the approaches, which might be considered in future editions of this task.

V. CONCLUSIONS AND FUTURE PLANS
As each year we observe a growing interest in the PolEval challenge (the number of submissions and participating teams keeps growing), we plan to continue our efforts to identify new tasks that are current and interesting in the area of NLP research on the Polish language. The next editions will be particularly interesting, considering the current developments in the area of generative AI and language models.
We also plan to organize the datasets created for all the editions of the challenge in a repository to facilitate their distribution and encourage other researchers to use them for their work.

Fig. 2. Examples from the training dataset