Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Classifying Industrial Sectors from German Textual Data with a Domain Adapted Transformer

DOI: http://dx.doi.org/10.15439/2023F6694

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 463–470

Abstract. In economic and sociological research, lists of industries and their branches are widely used to categorize data and to give an overview of different types of industries. However, many different taxonomies and classification schemes exist, owing to differing research foci as well as differing national contexts and interests. In this paper, we focus, without loss of generality, on regional data from Germany. Manual annotation of textual data is time-consuming and tedious, which gives rise to our initial research question, itself inspired by questions from the computational social sciences: how can we automatically categorize textual data, e.g. job advertisements or business profiles, by industrial sector? We present a classification approach based on a pre-trained, domain-adapted Transformer model. We find that domain-adapted models generalize better and outperform state-of-the-art non-domain-adapted Transformer models on out-of-distribution data. Additionally, we open-source two novel datasets that map textual data to WZ2008 sections and divisions, enabling further research.
