Optimizing Machine Translation for Virtual Assistants: Multi-Variant Generation with VerbNet and Conditional Beam Search
Marcin Sowański, Artur Janicki
DOI: http://dx.doi.org/10.15439/2023F8601
Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1149–1154 (2023)
Abstract. In this paper, we introduce a domain-adapted machine translation (MT) model for intelligent virtual assistants (IVA), designed to translate natural language understanding (NLU) training datasets. We use a constrained beam search to generate multiple valid translations for each input sentence; the search for the best translations is guided by a verb-frame ontology we derived from VerbNet. To assess the quality of the presented MT models, we train NLU models on these multi-verb-translated resources and compare their performance to models trained on resources translated with a traditional single-best approach. Our experiments show that multi-verb translation improves intent classification accuracy by 3.8% relative compared to single-best translation. We release five MT models that translate from English to Spanish, Polish, Swedish, Portuguese, and French, as well as an IVA verb ontology that can be used to evaluate the quality of IVA-adapted MT.
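The core idea of the abstract — a beam search that returns several valid translation variants, pruned by a verb ontology — can be illustrated with a minimal toy sketch. Everything below is invented for illustration: the tokens, the `toy_scores` distribution, and the `ontology` set are hypothetical stand-ins for the paper's neural MT model and VerbNet-derived verb-frame ontology.

```python
import heapq

def constrained_beam_search(step_scores, beam_size, verb_ontology, max_len):
    """Return up to `beam_size` finished hypotheses, each required to
    contain at least one verb from `verb_ontology` (the constraint).

    `step_scores` maps a partial hypothesis (tuple of tokens) to a dict
    of {next_token: log_prob}; the token "<eos>" ends a hypothesis.
    """
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_scores(tokens).items():
                new = tokens + (tok,)
                if tok == "<eos>":
                    # Constraint: keep only variants that realize an
                    # ontology verb; others are discarded, not returned.
                    if any(t in verb_ontology for t in tokens):
                        finished.append((new, score + logp))
                else:
                    candidates.append((new, score + logp))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
        if not beams:
            break
    # Multiple variants, best-first, instead of a single-best output.
    return heapq.nlargest(beam_size, finished, key=lambda c: c[1])

# Hypothetical toy next-token distribution for translating a music command.
ontology = {"play", "start"}

def toy_scores(prefix):
    table = {
        (): {"play": -0.1, "music": -0.5},
        ("play",): {"music": -0.2},
        ("play", "music"): {"<eos>": -0.1},
        ("music",): {"<eos>": -0.3},
    }
    return table.get(prefix, {})

variants = constrained_beam_search(toy_scores, beam_size=2,
                                   verb_ontology=ontology, max_len=4)
# Only the hypothesis containing the ontology verb "play" survives.
```

In this sketch the verb-less hypothesis `("music",)` is pruned at end-of-sequence, while `("play", "music")` passes the ontology check; returning the top-k finished beams rather than the single argmax is what yields the multiple training variants per input sentence described above.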