Center for Artificial Intelligence Challenge on Conversational AI Correctness

This paper describes a challenge on Conversational AI correctness whose goal is to develop Natural Language Understanding models that are robust against speech recognition errors. The data for the competition consist of natural language utterances along with semantic frames that represent the commands targeted at a virtual assistant. The specification of the task is given along with the data preparation procedure and the evaluation rules. The baseline models for the task are discussed and the results of the competition are reported.


I. INTRODUCTION
Regardless of the near-human accuracy of Automatic Speech Recognition (ASR) in general-purpose transcription tasks, speech recognition errors can significantly deteriorate the performance of a Natural Language Understanding (NLU) model that follows the speech-to-text module in a virtual assistant. The problem is even more apparent when an ASR system from an external vendor is used as an integral part of a conversational system without any further adaptation. The goal of this competition is to develop Natural Language Understanding models that are robust to speech recognition errors.
The approach used to prepare data for the challenge is meant to promote models robust to various types of errors in the input, making it impossible to solve the task by simply learning a shallow mapping from incorrectly recognized words to the correct ones. It reflects real-world scenarios where the NLU system is presented with inputs that exhibit various disturbances due to changes in the ASR model, acoustic conditions, speaker variation, and other causes.

II. RELATED WORK
The robustness of Natural Language Understanding models to various types of errors is the subject of several publications. Some authors proposed using word confusion networks to improve models' robustness to ASR errors [1], [2], [3], [4]. Reference [5] developed a learning criterion that prefers NLU models robust to ASR errors by adding a loss term that measures the distance between the prediction distributions obtained from transcriptions and from ASR hypotheses. Reference [6] studied the performance of intent classification and slot labeling models with respect to several kinds of perturbations, such as substituting abbreviations and synonyms, changing casing and punctuation, paraphrasing, and introducing misspellings and morphological variants. Speech characteristics are among the three aspects of robustness investigated by [7] in the assessment of task-oriented dialog systems. Reference [8] investigated data-efficient techniques that apply to a wide range of natural language understanding models used in large-scale production environments to make them robust against speech recognition errors, using domain classification as an example. The authors compared the effectiveness of several such techniques in terms of time-varying usage patterns and the distribution of ASR errors.

This research was partially funded by the "CAIMAC: Conversational AI Multilingual Augmentation and Compression" project, a cooperation between Adam Mickiewicz University and Samsung Electronics.
Several benchmarks exist to evaluate NLU models regarding their robustness to ASR errors. RADDLE [9], a benchmark for evaluating the performance of dialog models, prefers models robust to language variations, speech errors, unseen entities, and out-of-domain utterances. ASR-GLUE [10] is a benchmark consisting of 6 different NLU tasks, for which the input data were recorded by six different speakers at three different noise levels.
Mitigating the impact of ASR errors on downstream tasks has been the subject of several contests. In [11], the authors proposed a challenge for improving the recognition rate of an ASR system on the basis of incorrect ASR hypotheses paired with reference texts. Post-editing of ASR output was also the objective of the shared task held by [12]. Speech-aware dialogue state tracking was the topic of a recent competition conducted by [13].
The data preparation procedure outlined in Section III involves combining a TTS model and an ASR system. Augmentation of speech corpora with the use of synthesized speech was investigated by [14] and [15]. Reference [13] used synthesized inputs along with spoken utterances in their challenge.
Holding competitions as a method for finding promising solutions to scientific problems has a long history in computer science, particularly in natural language processing [16], [17]. This contest is organized under the 1st Symposium on Challenges for Natural Language Processing (CNLPS), a part of the 18th Conference on Computer Science and Intelligence Systems (FedCSIS 2023). Over the years, the FedCSIS conference series has hosted a wide range of data mining competitions covering topics such as identifying key risk factors for the Polish State Fire Service [18], network device workload prediction [19], and predicting the costs of forwarding contracts [20]. In the process of running our CNLPS challenge, we followed the best practices set out by the organizers of the FedCSIS data mining competitions.

III. DATA
The data for the task are derived from the Leyzer dataset [21]. The samples consist of user utterances and the semantic representations of the commands targeted at a virtual assistant (VA). A fraction of the utterances in the training set is contaminated with speech recognition errors; however, we left most of the utterances intact to make the task more challenging. The erroneous samples were obtained from user utterances using a TTS model followed by an ASR system.

A. Preparation of Base Text Corpus
We used the second version of the Leyzer corpus, which contains more utterance variations than the version described in the original paper. The second version of the corpus introduced two additional sub-intent differentiation levels called naturalness level (or simply level) and verb pattern. Although we did not explicitly use this information in this contest, it allowed us to create a more varied corpus for the task. Leyzer consists of 20 domains across three languages: English, Spanish, and Polish, with 186 intents and a wide range of samples per intent. The domains can be grouped into several topics that can be found in the most popular VAs:
• Communication, with the Email, Facebook, Phone, Slack, and Twitter domains, which all relate to communication and the transfer of ideas,
• Internet, with Web Search and Wikipedia, which groups domains related to the search for information on the web; these domains therefore contain many open-title queries,
• Media and Entertainment, with the Spotify, YouTube, and Instagram domains, which relate to multimedia content with named entities connected to artists or titles,
• Devices, with the Air Conditioner and Speaker domains, which represent simple physical devices that can be controlled by voice,
• Self-management, with Calendar and Contacts, which consist of actions that involve time planning and people,
• Other, uncategorized domains (Fitbit, Google Drive, News, Translate, Weather, Yelp), which represent functions and language not shared by the other categories. In this sense, the remaining domains can be understood as intentionally not matching the other domains.
Using scripts provided in the Leyzer repository, we generated the text corpus from JSGF grammars. The corpus was divided into train, valid, test-A, and test-B parts using the splitting script provided in the Leyzer repository. First, we separated test-B from the rest of the corpus. For test-B, a minimum of 1 test case and up to 20% of the total available sentences for each intent, level, and verb pattern triplet were selected, and the remaining test cases were left in the development corpus. From the development part of the corpus, we then separated test-A using the same procedure as for test-B, extracting a minimum of 1 and up to 20% of test cases for each intent, level, and verb pattern triplet. The remaining corpus was divided into train and valid subsets. The valid subset consists of 20% of randomly selected test cases, without ensuring that it contains at least 1 test case for each intent, level, and verb pattern triplet.
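The per-triplet holdout step can be sketched as follows. This is an illustrative reconstruction rather than the actual Leyzer splitting script, and the sample field names (intent, level, verb_pattern) are assumptions:

```python
import random
from collections import defaultdict

def split_off(samples, fraction=0.2, seed=0):
    """Hold out at least 1 and up to `fraction` of the samples for every
    (intent, level, verb pattern) triplet; return (held_out, remainder)."""
    groups = defaultdict(list)
    for sample in samples:
        key = (sample["intent"], sample["level"], sample["verb_pattern"])
        groups[key].append(sample)
    rng = random.Random(seed)
    held_out, remainder = [], []
    for group in groups.values():
        rng.shuffle(group)
        k = max(1, int(fraction * len(group)))  # minimum of 1 per triplet
        held_out.extend(group[:k])
        remainder.extend(group[k:])
    return held_out, remainder
```

Applied twice, this yields test-B and then test-A; the final valid subset is a plain 20% random sample of what remains, without the per-triplet minimum.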

B. Augmenting Corpus with Back-transcription
Back-transcription is a technique that can be used to produce speech transcripts from text-only data. Textual data are fed to a TTS engine to produce a speech signal, which in turn is fed to an ASR system, producing an augmented text. Depending on the performance of both models and on differences in text normalization performed on the input text, as well as inside these models, the resulting text can be identical to the input or may contain differences introduced in either processing stage. The technique has been used to develop post-processing [22] and error correction [23] models for ASR systems.
We use back-transcription to simulate the behavior of a virtual assistant user. The user speaks to the system, and their speech is converted into text by an ASR model, which is subsequently processed by an NLU model (see Fig. 1). NLU text prompts from the Leyzer corpus are synthesized using a TTS engine. The resulting sound signal is used as input to an ASR model, producing back a text with an augmented NLU prompt. The procedure is illustrated in Fig. 2. To perform Text-To-Speech synthesis, we used the FastSpeech 2 model [24] for English, the VITS model [25] for Polish, and Tacotron 2 [26] for Spanish, all from the Coqui TTS library [27]. Speech recognition was performed using the Whisper model [28] for all three languages.
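The back-transcription loop itself is straightforward. The sketch below shows its overall shape with the TTS and ASR engines injected as callables; the function name and interfaces are our own illustration, not code from the challenge, with the engines (e.g. Coqui TTS and Whisper) hidden behind thin wrappers:

```python
import tempfile

def back_transcribe(texts, synthesize, recognize):
    """Feed each text through TTS and then ASR, returning (original,
    augmented) pairs. `synthesize(text, wav_path)` and `recognize(wav_path)`
    are thin wrappers around the actual TTS and ASR engines."""
    pairs = []
    for text in texts:
        with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
            synthesize(text, wav.name)                 # text -> speech signal
            pairs.append((text, recognize(wav.name)))  # speech -> text
    return pairs
```

Because the two engines are injected, the same loop works unchanged for each language-specific TTS model used in the challenge.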

C. CAICCAIC Dataset
The training data are located in the train directory of the contest's repository. The train directory contains two files:
• in.tsv with four columns: 1) sample identifier, e.g. 306; 2) language code, e.g. en-US; 3) data split type, e.g. train; 4) utterance, e.g. adjust the temperature to 82 degrees fahrenheit on my reception room thermostat;
• expected.tsv with three columns representing the domain, the intent, and the slot values, e.g. {"device_name": "reception room", "value": "82 degrees fahrenheit"}.
For experimentation, we provide the validation dataset in the dev-A directory of the contest's repository. It was created using the same pipeline as the train dataset. The test data are located in the test-A and test-B directories and contain only input values, while the expected values, hidden from contestants, are used by the evaluation platform to score submissions.
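A minimal loader for these files might look as follows. The column names, and in particular the assumption that expected.tsv carries the domain, the intent, and a JSON object of slot values, are illustrative inferences from the task description, not the official data-loading code:

```python
import csv
import io
import json

IN_COLUMNS = ["id", "language", "split", "utterance"]
EXPECTED_COLUMNS = ["domain", "intent", "slots"]  # assumed column layout

def parse_split(in_text, expected_text=None):
    """Parse the contents of in.tsv (and, when available, expected.tsv)
    into a list of sample dictionaries."""
    rows = csv.reader(io.StringIO(in_text), delimiter="\t",
                      quoting=csv.QUOTE_NONE)
    samples = [dict(zip(IN_COLUMNS, row)) for row in rows]
    if expected_text is not None:
        expected = csv.reader(io.StringIO(expected_text), delimiter="\t",
                              quoting=csv.QUOTE_NONE)
        for sample, row in zip(samples, expected):
            sample.update(zip(EXPECTED_COLUMNS, row))
            sample["slots"] = json.loads(sample["slots"])  # slot values as JSON
    return samples
```

For the test directories, calling `parse_split` without the second argument yields input-only samples, mirroring the hidden expected values.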

IV. BASELINE MODELS
We use XLM-RoBERTa Base [29] as the baseline model for intent detection and slot filling. The XLM-RoBERTa model, also known as XLM-R, is a transformer-based multilingual model trained with a masked language modeling (MLM) objective using only monolingual data. During training, streams of text from each language are sampled, and the model is trained to predict the masked tokens in the input. Subword tokenization is applied directly to raw text data using SentencePiece [30] with a unigram language model. The model does not use language embeddings, which allows it to handle code-switching better. It uses a large vocabulary of 250K tokens with a full softmax.
XLM-R was pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. This large-scale training led to significant performance gains on various cross-lingual transfer tasks. The model significantly outperforms multilingual BERT (mBERT) on various cross-lingual benchmarks.
Our baseline models were trained independently on the entire training set and optimized on the evaluation set. All baseline models have 12 layers, 768 hidden units, and 12 attention heads, totaling 270M parameters and a size of 1.1 GB.
We use the leyzer-fedcsis dataset from the Hugging Face Hub in the baseline training process. Each language-specific portion is processed individually, retaining only the utterance and intent columns. The processed datasets are then merged and split into training, validation, and testing sets. The model is defined for a sequence classification task using the AutoModelForSequenceClassification class, with the number of labels corresponding to the unique intents in the training dataset. Training hyperparameters were set to a learning rate of 2 × 10⁻⁵, a training batch size of 16, a weight decay of 0.01, and 10 training epochs. Evaluations are performed after each epoch.
Finally, performance metrics such as accuracy and F1 score are computed to assess the model's effectiveness on its classification task. The evaluation results of the final epoch checkpoint on the test set are presented in Table II in the "official baseline" row. All baseline intent models achieved above 90% accuracy, with the Spanish, Polish, and all-language models achieving above 95%. We analyzed misclassification errors and found that most of them could be resolved if a model resisted token distortion and could separate syntactically similar classes.
The error analysis of the intent recognition models for English, Spanish, and Polish reveals similarities and differences across the models. The Spotify domain tends to be the most problematic for all three languages, suggesting that these models may struggle with understanding and predicting intents related to music streaming or the specific language used in this domain. The Slack and Console domains also prove problematic for the English and Polish models, while for the Spanish model the recognition of the Airconditioner and Email domains was the most challenging. Regarding specific intents, the English model has the most trouble with ConsoleEdit and AddAlbumToPlaylist, the Spanish model struggles with PlayAlbumOfTypeByArtist and TurnOn, and the Polish model with SetPurposeOnChannel and PlayAlbumOfTypeByArtist. These intents may be harder to recognize due to their semantic complexity, similarity to other intents, or underrepresentation in the training data. All models are available on the Hugging Face platform with details of how each model was trained and how to execute it:
• intent: en-US, es-ES, pl-PL, and all, which was trained and evaluated on all three languages together,
• slot: en-US, es-ES, pl-PL.
V. EVALUATION
The submissions were scored using Exact Match Accuracy (EMA), i.e., the percentage of utterance-level predictions in which the domain, the intent, and all the slots are correct. Besides EMA scores, we also report the following auxiliary metrics:
• domain accuracy, i.e., the percentage of utterances with the correct domain prediction;
• intent accuracy, i.e., the percentage of utterances with the correct intent prediction;
• slot word recognition rate (WRR), i.e., the word recognition rate calculated on slot annotations, which is the percentage of correctly annotated slot values.
All scores were calculated using the GEval [32] library, which was also made available to participants for offline use.
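The official scores were computed with GEval; the sketch below is an illustrative reimplementation of EMA and the two accuracy metrics, not the GEval code itself (slot WRR is omitted because it requires token-level alignment of slot annotations):

```python
def exact_match(ref, hyp):
    """An utterance counts as an exact match only if the domain, the intent,
    and the complete set of slot values are all predicted correctly."""
    return (ref["domain"] == hyp["domain"]
            and ref["intent"] == hyp["intent"]
            and ref["slots"] == hyp["slots"])

def score(refs, hyps):
    """Compute EMA together with domain and intent accuracy."""
    n = len(refs)
    return {
        "EMA": sum(exact_match(r, h) for r, h in zip(refs, hyps)) / n,
        "domain_accuracy": sum(r["domain"] == h["domain"]
                               for r, h in zip(refs, hyps)) / n,
        "intent_accuracy": sum(r["intent"] == h["intent"]
                               for r, h in zip(refs, hyps)) / n,
    }
```

Note that EMA is the strictest of the three: a single wrong slot value makes the whole utterance count as incorrect even when the domain and intent are right.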

VI. RESULTS
We received 28 submissions from 9 teams. Table II presents the final ranking with cumulative metrics for all languages. Notably, most submissions are based on pre-trained Transformer models [33] adapted to the task, with the Flan-T5 model [34] being the preferred choice. However, the winning solution [35] used the mBART model [36] as its basis to train a joint, text-to-text model of domain, intent, and slots. This model achieved an Exact Match Accuracy of 0.754 across all the samples, with the top results attained for Polish and Spanish NLU commands (0.799 and 0.884 EMA, respectively). It demonstrated outstanding performance in slot recognition, with a slot WRR of 0.872 (0.067 better than the second-best solution). Although the winning solution performed well overall, its domain and intent accuracy remained within the range of the XLM-RoBERTa baseline models. This observation is intriguing and could be a valuable starting point for future research on developing joint models of domains, intents, and slots.
To gain more insight into the differences between the winning model and the baseline, we performed an analysis using the GEval tool [32]. GEval's "most worsening feature" function was used to analyze cases that are problematic for one of the models while the other behaves correctly. The function calculates the difference in a chosen metric between the two compared models on cases containing a specific feature. The results are reported for cases for which the difference is statistically significant. Table III shows the features that had the most negative impact on the winning results compared to the baseline submission. It appears that numbers in their written form in English input are problematic for the mBART model. Also, it is not surprising to see that English inputs, in general, are easier for the baseline solution compared to the winning one, considering the overall results presented in Table II. Additionally, the mBART model has problems with one of the image-finding intents, which is consistent with the problematic word "images" in input sentences. Conversely, Table IV presents features that were problematic for the baseline model while being easier for the winning submission. The most problematic features are connected with the Email domain. It appears that the baseline model has problems with identifying all kinds of slots in commands used for sending emails. These observations should prompt the authors of the winning submission, and anyone else who wants to improve on these results, to take a closer look into the specific causes of these particular types of errors and work towards addressing them.
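The idea behind the "most worsening feature" analysis can be approximated as follows. This is a simplified sketch of the technique, not GEval's implementation (GEval additionally filters features by statistical significance, which is omitted here), and all names are illustrative:

```python
from collections import defaultdict

def most_worsening_features(samples, score_a, score_b, extract_features):
    """Rank features by the mean drop of model A's per-sample score relative
    to model B on samples exhibiting the feature. `score_a` and `score_b`
    map a sample to 0/1 correctness; `extract_features` yields features
    such as input tokens or the gold intent."""
    deltas = defaultdict(list)
    for sample in samples:
        diff = score_a(sample) - score_b(sample)
        for feature in extract_features(sample):
            deltas[feature].append(diff)
    # most worsening (largest negative mean difference) first
    return sorted(((f, sum(d) / len(d)) for f, d in deltas.items()),
                  key=lambda item: item[1])
```

A feature with a strongly negative mean difference, such as the token "images" in the example above, marks inputs where model A systematically fails while model B succeeds.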
MAREK KUBIS ET AL.: CENTER FOR ARTIFICIAL INTELLIGENCE CHALLENGE ON CONVERSATIONAL AI CORRECTNESS

TABLE I. UTTERANCE LENGTH DISTRIBUTION IN THE DATASET.