BIGOS - Benchmark Intended Grouping of Open Speech Corpora for Polish Automatic Speech Recognition

This paper presents BIGOS (Benchmark Intended Grouping of Open Speech corpora), a new corpus designed for benchmarking Polish Automatic Speech Recognition (ASR) systems. This initial version of the benchmark leverages 1,900 audio recordings from 71 distinct speakers, sourced from 10 publicly available speech corpora. Three proprietary ASR systems and five open-source ASR systems were evaluated on a diverse set of recordings and the corresponding original transcriptions. Interestingly, it was found that the performance of the latest open-source models is on par with that of more established commercial services. Furthermore, a significant influence of model size on system accuracy was observed, as well as a decrease in accuracy in scenarios involving highly specialized or spontaneous speech. The challenges of using public datasets for ASR evaluation purposes and the limitations of this inaugural benchmark are critically discussed, along with recommendations for future research. The BIGOS corpus and associated tools that facilitate replication and customization of the benchmark are made publicly available.


I. INTRODUCTION
AUTOMATIC Speech Recognition (ASR) is used in various applications and usage scenarios. Given that multiple aspects affect the difficulty of ASR tasks (vocabulary, acoustic conditions, speech type, etc.), the quality of target systems heavily depends on the effectiveness of the evaluation process. Benchmarking and evaluation ultimately aim to validate the system's ability to adapt to novel and unseen data [1]. To achieve this, multiple evaluation methods, datasets, and metrics are needed. The most commonly used metric for ASR evaluation is the Word Error Rate (WER), which quantifies word-level insertions, deletions, and substitutions between system and reference transcriptions. WER has known limitations [2,3]. When it is computed on a narrow set of evaluation data, the assessment of the capabilities of the models, particularly their generalization to unseen data, may be unreliable.
Unlike English [4,5,6], German [7], and, recently, Hungarian [8], the Polish language lacks a common public-domain reference dataset for ASR benchmarking. Consequently, the results of Polish speech recognition studies are generally not directly comparable. Although transcribed recordings are available, it is often impractical to find and use all available public-domain datasets.
This study introduces BIGOS, a resource intended to enable systematic benchmarking and tracking of Polish ASR systems over time across a diverse range of publicly available corpora. The primary purpose of BIGOS is to alleviate the painstaking effort required to discover and compile speech corpora from multiple sources. To ensure that the original licenses are respected by BIGOS users, the corpus is distributed on the Hugging Face platform 1, which allows gated access. Alternatively, scripts for self-curation and customization of the dataset are also provided 2. The first iteration of the benchmark presented in this work uses 1,900 utterances sourced from 10 corpora and covers 3 commercial and 5 freely available ASR systems.
The remainder of this paper is structured as follows: Section 2 reviews the relevant literature, and Section 3 outlines the construction of the BIGOS benchmark and dataset, detailing the source speech corpora, corpus statistics, and ASR systems evaluated. Section 4 presents an exemplary application of BIGOS for the evaluation of ASR systems, Section 5 describes the limitations, and Section 6 concludes the paper by outlining directions for future research.

II. RELATED WORK

A. ASR evaluation datasets
Prominent English-only datasets for ASR research and evaluation include the Wall Street Journal, SwitchBoard, Fisher, VoxForge, CHiME, LibriSpeech, TED-LIUM, Common Voice, and Earnings corpora. The Wall Street Journal corpus covers news broadcast recordings, while SwitchBoard and Fisher include spontaneous telephone conversations. LibriSpeech [9] and MLS [10] feature narrated audiobooks, while VoxForge includes narrated Wikipedia articles. The TED-LIUM corpus [11] contains oratory educational talks, while the CHiME [12] dataset represents recordings of noisy real-world environments. Earnings-21 and Earnings-22 contain conversational speech from earnings call recordings [4,5]. The most voluminous dataset in terms of both the duration of speech content and language coverage is MLS (Multilingual LibriSpeech), which contains 41,000 hours of material [10] for 8 languages. The Mozilla Common Voice dataset covers speech for more than 55 languages and boasts the largest number of contributing speakers, with over 10,000 as of March 2023 [13]. Both Common Voice and MLS include Polish language data. All of the aforementioned datasets offer a diverse range of speech sources, speaker demographics, and speech types, providing researchers with valuable resources to investigate various aspects of ASR and to train new systems.

B. ASR benchmarks
The idea of using available speech datasets to benchmark the quality of ASR systems was first implemented nearly a decade ago. Gaida et al. [14] were the first to conduct a comprehensive evaluation of several open-source speech recognition tools. Dernoncourt developed a framework to evaluate seven ASR systems on two different collections and provided scripts to format Common Voice and LibriSpeech 3 (https://github.com/Franck-Dernoncourt). Moore et al. [15] introduced a meta-dataset containing reference text, hypotheses from two separate ASR systems, the Word Error Rate (WER), and annotations about speech intelligibility. Ulasik created the multilingual CEASR dataset for English and German [7], based on reference transcriptions from popular public-domain datasets and transcripts from four undisclosed ASR systems. Siegert et al. [16] performed a longitudinal study and found no significant changes in WER for 4 commercial systems over 8 months. Aksenova et al. [1] conducted a comprehensive survey of existing ASR benchmarking methodologies and proposed a systematic benchmarking framework for the most common use cases. Xu et al. [17] compared 4 commercial ASR services with respect to robustness to acoustic background noise. Varod et al. highlighted that ASR performance is language- and system-specific and that low-resource languages such as Hebrew can achieve performance comparable to high-resource languages such as German [18]. The ASR4REAL benchmark [19] revealed significant accuracy variations depending on the accent and socioeconomic status of the speaker. Papadopoulou evaluated four commercial ASR systems in the context of translation post-editing effort [20]. The challenges associated with the recognition of spontaneous and accented speech were further analyzed in the benchmarks organized by the Rev and Google companies [4,5,21]. Pasandi et al. highlighted that conversational speech is the most challenging and environmentally relevant type of data for speech recognition. Pires et al. constructed the Portuguese Evaluation Benchmark [22] using the Mozilla Common Voice and VoxForge datasets and five commercial ASR engines. Mihajlik et al. conducted an evaluation of open-source Hungarian ASR systems using a comprehensive linguistic dataset [8]. Extending the studies by Ulasik et al. for English and German, Wirth et al. [3] questioned the prevailing statistical ASR evaluation paradigm by performing a manual recognition error assessment. Of paramount importance, the study identified that 18% of the ASR errors originated from flawed ground-truth transcriptions and another 18% from flawed or ambiguous audio within publicly accessible datasets.

C. Polish ASR benchmarks
The first evaluation of commercial ASR systems for the Polish language was carried out in 2018 [23]. The first open benchmark for ASR systems was organized by Korzinek [24]. In 2019, Unai et al. [25] evaluated a self-developed Polish ASR system using 223 hours of speech collected from six datasets, including the Clarin-PL Studio Corpus (EMU) [26], the PELCRA family of corpora [27,28], the Polish Senate recordings corpus [29], the Simple4All Tundra Corpus, and the test set of the PolEval 2019 competition [24]. The most extensive benchmark to date is DiaBiz, performed using a set of 400 dialogues across eight domains and three commercial ASR systems [30,31].

III. BIGOS CORPUS DESIGN AND CURATION
As indicated by the Polish ASR Speech Data catalog 4, as of March 2023 approximately 5,300 hours of speech in 51 datasets are available for Polish ASR development. Roughly 1,000 hours of transcribed speech, spread across 13 datasets, are freely accessible under permissive licenses, facilitating the curation of the new evaluation dataset detailed in the following sections.

A. BIGOS corpus overview
Table III-A summarizes the properties of the BIGOS dataset.

B. Sourcing and pre-analysis
The Polish ASR Speech Data Catalog was used to identify suitable datasets for inclusion in the benchmark. The following mandatory criteria were considered:
• The dataset must be downloadable.
• The license must allow for free, noncommercial use.
• Transcriptions must be available and align with the recordings.
• The sampling rate of audio recordings must be at least 8 kHz.
• Audio must be encoded with at least 16 bits per sample.
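As a sketch, the inclusion criteria above could be applied programmatically to catalog entries. The field names below are hypothetical illustrations, not the actual schema of the Polish ASR Speech Data Catalog.

```python
def meets_criteria(ds: dict) -> bool:
    """Apply the benchmark's mandatory inclusion filters to one catalog entry.

    Every key used here is an assumed, illustrative field name.
    """
    return (
        ds.get("downloadable", False)
        and ds.get("free_noncommercial_license", False)
        and ds.get("transcriptions_aligned", False)
        and ds.get("sampling_rate_hz", 0) >= 8000   # at least 8 kHz
        and ds.get("bits_per_sample", 0) >= 16      # at least 16-bit audio
    )
```

Filtering a catalog then reduces to a single comprehension, e.g. `[d for d in catalog if meets_criteria(d)]`.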
The following is an overview of the 10 datasets that meet these criteria and were chosen as sources for the BIGOS dataset.

• The Common Voice dataset (mozilla-common-voice-19), developed by Mozilla, is an open-source multilingual resource [13]. The project aims to democratize voice technology by providing a wide-ranging, freely available dataset that covers many languages and accents. Contributors from around the globe donate their voices, reading out predefined sentences or validating the accuracy of other contributions.
• The Jerzy Sas PWR datasets (Politechnika Wrocławska) (pwr-viu-unk, pwr-shortwords-unk, pwr-maleset-unk).
According to the documentation available online 5, speech was collected using a variety of microphones in relatively noise-free acoustic conditions. Three datasets are available: short words, very important utterances (VIU), and the male AM set.
• The M-AILABS Speech corpus (mailabs-19), similar to the MLS corpus, was created from LibriVox audiobooks. The corpus covers nine languages and was created by the European company M-AILABS with the mission of "enabling (European) companies to take advantage of AI & ML without having to give up control or know-how" 6. The M-AILABS Speech Dataset is provided free of charge and is intended as training data for speech recognition and speech synthesis. The data comprise nearly a thousand hours of audio across all languages, including 53.5 hours for Polish.
• The AZON Read and Spontaneous Speech Corpora 7 (pwr-azon-spont-20, pwr-azon-read-20) are a collection of recordings of academic staff, mainly in the physical chemistry domain. The corpus is divided into two parts: supervised, in which the speaker reads a provided text, and unsupervised spontaneous recordings, such as live-recorded interviews and conference presentations by scientific staff. The dataset contains recordings of 27 and 23 speakers, totaling 5 and 2 hours of transcribed speech, respectively. The AZON database is available under a CC-BY-SA license.

Two additional corpora, the Spelling and Numbers Voice database (SNUV) from the University of Łódź's PELCRA group and the CLARIN Cyfry corpus, initially met the necessary requirements for this study. However, their unique transcription conventions led to high error rates during initial tests. For example, the word "pstrąg" in the SNUV corpus is transcribed as "py sy ty ry ą gy", whereas the conventional normalization employed by most ASR systems is "p s t r ą g". In the case of the Cyfry corpus, only numeric expressions are transcribed, so high error rates are produced for correctly recognized non-numeric expressions. These corpora will therefore be included in the next iteration of the benchmark, following a thorough manual re-transcription process to mitigate these issues.

C. Curation and selection
Necessary preprocessing parameters were consolidated into dataset-specific configuration files, including download links, metadata fields to be extracted, etc. Subsequently, the text and audio data were extracted and encoded in a unified format. Dataset-specific transcription norms are preserved, including punctuation and casing. To balance the evaluation dataset and to facilitate the comparison of Word Error Rate (WER) scores across multiple datasets, 200 samples were randomly selected from each corpus. The only exception is 'pwr-azon-spont-20', which contains significantly longer recordings and utterances; therefore, only 100 samples were selected. The first version of the BIGOS corpus thus contains 1,900 of the 115,915 recordings available in the 10 datasets (1.64% of the total available transcribed speech). Table II provides detailed information on the composition of the BIGOS 1.0 corpus.
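The per-corpus sampling step described above can be sketched as follows. The utterance schema (a dict with a "corpus" key) and the fixed random seed are illustrative assumptions, not the actual BIGOS curation scripts.

```python
import random

def sample_per_corpus(utterances, default_k=200, overrides=None, seed=42):
    """Randomly draw a fixed number of utterances from each source corpus.

    `overrides` caps specific corpora, e.g. {"pwr-azon-spont-20": 100}
    for the corpus with significantly longer recordings.
    """
    overrides = overrides or {}
    rng = random.Random(seed)  # a fixed seed keeps the selection reproducible

    # Group utterances by their source corpus.
    by_corpus = {}
    for utt in utterances:
        by_corpus.setdefault(utt["corpus"], []).append(utt)

    selected = []
    for name, items in by_corpus.items():
        k = min(overrides.get(name, default_k), len(items))
        selected.extend(rng.sample(items, k))
    return selected
```
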

D. Preprocessing and format standardization
The following curation methods were applied to the baseline version of the BIGOS dataset:
• validation of audio file availability and validity,
• unification of audio format to WAV, 16 bits per sample, 16 kHz sampling rate,
• normalization of audio amplitude to -3 dBFS,
• unification of text encoding to UTF-8,
• removal of redundant characters,
• extraction and unification of metadata.
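The amplitude-normalization step can be illustrated with a minimal pure-Python sketch. This is not the actual curation code, and resampling plus WAV re-encoding (typically done with tools such as sox or ffmpeg) are omitted.

```python
def normalize_peak(samples, target_dbfs=-3.0):
    """Scale a waveform (floats in [-1, 1]) so its peak sits at `target_dbfs`,
    where 0 dBFS corresponds to full scale 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    # -3 dBFS corresponds to a linear peak of 10^(-3/20) ~= 0.708
    scale = 10.0 ** (target_dbfs / 20.0) / peak
    return [s * scale for s in samples]
```
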

E. Validation and ASR transcripts generation
Upon completing the preprocessing of the entire dataset, the numbers of recordings, transcriptions, and metadata records in the compiled dataset were checked for consistency. If the validation was successful, ASR hypotheses were generated: transcriptions from the locally hosted Whisper models were produced directly, while transcriptions from cloud services (Google, Azure, and Whisper) were obtained via the respective APIs. Table III presents the structure of the resulting BIGOS utterance data.
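The consistency check described above amounts to verifying that the recording, transcription, and metadata inventories agree. A minimal sketch, with hypothetical utterance-ID lists as inputs:

```python
def check_consistency(audio_ids, transcript_ids, metadata_ids):
    """Return the set of utterance IDs present in all three inventories,
    raising if any view of the compiled dataset is missing entries."""
    audio, text, meta = set(audio_ids), set(transcript_ids), set(metadata_ids)
    if not (audio == text == meta):
        # Report IDs that are not shared by all three inventories.
        missing = (audio | text | meta) - (audio & text & meta)
        raise ValueError(f"inconsistent dataset, mismatched IDs: {sorted(missing)}")
    return audio
```
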

IV. ASR EVALUATION SETUP

A. Evaluated ASR systems
Below is an overview of the ASR systems evaluated in the first iteration of the BIGOS benchmark.
• Google Cloud Speech-to-Text 8 supports more than 125 languages and variants. The "default" model from May 2023 was used for this benchmark.
• Microsoft's Azure Speech Service 9 as of May 2023 supports more than 100 languages and variants. The "default" model from May 2023 was used for this benchmark.
• Whisper is an ASR system developed by OpenAI. It is trained on a large amount of weakly supervised multilingual and multitask data collected from the Internet [32]. The web-hosted model available via API and the locally hosted models, both from May 2023, were used for this benchmark 10.

B. Metrics
ASR system predictions were evaluated against the target transcriptions using three industry-standard metrics:
• Sentence Error Rate (SER) is the proportion of sentences that are not perfectly recognized, i.e., sentences that contain at least one error.
• Word Error Rate (WER) is defined as the minimum number of operations (substitutions, insertions, and deletions) required to transform the system output into the reference transcript, divided by the total number of words in the reference.
• Character Error Rate (CER) metric calculates the minimum number of character-level operations (substitutions, insertions, and deletions) needed to change the system's output into the reference transcript, divided by the total number of characters in the reference.
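All three metrics can be derived from a single Levenshtein edit-distance routine, applied at the word level for WER and the character level for CER. The following is a minimal reference implementation for illustration, not the evaluation tooling used in the benchmark:

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    turning the hypothesis sequence into the reference sequence."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining hypothesis tokens
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining reference tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[m][n]

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance over reference length."""
    words = ref.split()
    return edit_distance(words, hyp.split()) / len(words)

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def ser(pairs):
    """Sentence Error Rate: fraction of (reference, hypothesis) pairs
    with at least one error."""
    return sum(r != h for r, h in pairs) / len(pairs)
```

For example, the hypothesis "ala ma psa" against the reference "ala ma kota" contains one word substitution out of three reference words, giving a WER of 1/3.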

V. BENCHMARK RESULTS
This section provides an overview and analysis of the results obtained.

A. Quality per system and model type
The performance of the various systems was evaluated using average SER, WER, and CER values obtained from the ten test datasets available in BIGOS. The "large" model of the Whisper system achieved the highest accuracy, outperforming all other systems on every metric. The "medium" model of the Whisper system came second, and the "cloud" variant of the same system came third. The Google and Azure services followed, with the remaining Whisper models trailing behind.
Interestingly, the two most accurate systems are both freely available. Despite using the same "large-v2" model, the cloud-based variant was outperformed by the locally hosted "large" variant and, even more surprisingly, by the "medium" variant, which should in theory be less capable. On average, free systems outperformed well-established paid services.
To understand why this is the case, a more detailed and manual examination of the evaluation results is required.However, it is crucial to note that lower scores in this evaluation do not necessarily indicate inferior performance in real-world scenarios.
One hypothesis is that commercial systems, despite their ability to handle advanced normalization conventions, might actually perform worse when evaluated on publicly available datasets that use written forms of numerals (e.g., "one", "six o'clock") instead of numeric forms (e.g., "1", "6:00").This paradox suggests that the use of automated evaluation metrics and publicly available datasets used "as-is" (without transcription unification) may not fully represent real-world performance and capabilities.
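A minimal example of the kind of transcription unification at stake: simple lowercasing and punctuation stripping removes some surface mismatches, but reconciling numeral conventions (written versus numeric forms) would require a dedicated inverse text normalizer, which this sketch deliberately omits.

```python
import string

def normalize(text: str) -> str:
    """Minimal unification before scoring: lowercase and strip ASCII punctuation.
    Note: this does NOT expand numerals, so "6:00" and its written form
    would still mismatch after normalization."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()
```

With this step, a hypothesis differing from the reference only in casing and final punctuation no longer incurs an error, e.g. `normalize("Ala ma kota.") == normalize("ala ma kota")`.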
Table IV presents the average SER, WER, and CER scores for the Azure, Google, and Whisper systems.

B. Quality per dataset
The best overall performance was observed for the PWR corpora, which contain recordings from a single speaker in relatively noise-free acoustic conditions. This limited variability led to perfect performance for the Whisper Cloud and Azure systems on PWR VIU and the best average WER for the male set. Interestingly, for single-word utterances, the limited context led the Google and local Whisper systems to recognize foreign-language words instead of Polish ones. For example, the word 'zapisz' was recognized as the Russian word 'Запись', the word 'zakończ' as the English words 'the coins', and the words 'małe litery' and 'duże litery' as the Italian phrases 'ma vedi tere' and 'due lettere', respectively. The PWR male set dataset had the second-best performance. A median WER of 6% suggests that modern Polish ASR systems handle short utterances from contemporary literature quite effectively. Slightly worse performance (average WER above 10% for all systems) was observed for the MLS, M-AI Labs, and Common Voice datasets. Given the widespread use and accessibility of the MLS and Common Voice datasets within the global ASR community, it is likely that these datasets were used during training, allowing all systems to handle in-domain recordings and transcriptions efficiently. This hypothesis is supported by the performance of the Whisper family on the MLS corpus; however, Google's performance on the Common Voice dataset was nearly twice as poor as that of the other systems. Given that Whisper is trained mostly on publicly available data, while commercial systems leverage proprietary datasets, the impact of leakage between training and evaluation data is more significant in the case of Whisper.
Performance for the CLARIN mobile dataset was slightly inferior, possibly due to longer utterances and the use of commercial default models, which are not optimized to handle speech recorded with an 8 kHz sampling frequency.
As expected, performance declined for the AZON read and spontaneous corpora, which contain scientific vocabulary from the chemistry field. However, the Google and local Whisper systems handled both types of AZON corpora proficiently, even though the recordings contain fillers and hesitations.
Table V presents the median WER per dataset sourced in BIGOS for the Azure, Google, Whisper Cloud, and Whisper Large systems.

VI. LIMITATIONS
The initial version of the benchmark comes with several limitations. First, the quantity and specificity of the datasets, along with the metadata about speakers and acoustic conditions, are limited. To examine ASR performance for particular sociodemographic groups, such as non-native Polish speakers, or for specific types of speech, such as whispered speech, dedicated datasets [33] should be used. Second, the unification of normalization relies solely on automatic methods and does not involve manual re-transcription. Lastly, the initial evaluation uses a limited number of test recordings, systems, and models, which constrains the precision and breadth of the benchmark.

VII. CONCLUSION AND FUTURE WORK
This work addresses the lack of a publicly available ASR evaluation suite for Polish by providing BIGOS, the Benchmark Intended Grouping of Open Speech corpora. BIGOS, as its name suggests, was compiled from 10 existing publicly accessible Polish speech corpora. A test sample comprising 1,900 recordings from 71 distinct speakers was used to gauge the performance of 3 commercial ASR systems against 5 freely available ones. Based on automatic evaluation metrics, it was found that Whisper Cloud consistently outperforms the more established services from Google and Azure on this test set, which represents publicly available speech datasets for Polish. Interestingly, the largest and second-largest locally hosted Whisper models exhibit superior performance compared to the paid, cloud-hosted version. The BIGOS corpus 11 and tools 12 for corpus curation and ASR evaluation are available to the community, allowing reproduction and extension of this benchmark.
As indicated in the Limitations and Related Work sections, there are many interesting research directions to explore. The primary objective of the next BIGOS iteration is to include a subset of manually verified reference transcriptions. Comparison of error rates, calculated using original and manually verified transcriptions, will reveal the evaluation bias resulting from differences in normalization standards in various public-domain corpora. Furthermore, the reliability and informativeness of the evaluation could be significantly improved if the evaluation results were manually annotated, similar to the German study [3], which revealed that evaluation errors may be caused by the poor quality of the evaluation data and that not all errors are of equal importance. Lastly, it will be interesting to measure the robustness of the systems using larger samples, new data sources, and automatically perturbed recordings.

Table IV: AVERAGE SER, WER AND CER OF EVALUATED POLISH ASR SYSTEMS

Table V: AVERAGE WER PER DATASET FOR SELECTED SYSTEMS