Using Word Embeddings for Italian Crime News Categorization
Giovanni Bonisoli, Federica Rollo, Laura Po
DOI: http://dx.doi.org/10.15439/2021F118
Citation: Proceedings of the 16th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 25, pages 461–470 (2021)
Abstract. Several studies have shown that the use of embeddings improves results in many NLP tasks, including text categorization. In this paper, we focus on how word embeddings can be used to categorize newspaper articles about crimes according to the type of crime they report. Our approach was tested on an Italian dataset of 15,361 crime news articles, combining different Word2Vec models with supervised and unsupervised Machine Learning categorization algorithms. The tests show very promising results.
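The general idea of categorizing articles through word embeddings can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the classifiers used, so a nearest-centroid rule with cosine similarity is assumed here purely for illustration. The three-dimensional vectors in `TOY_EMBEDDINGS`, the category labels, and all function names are hypothetical; in a realistic setup the vectors would come from a trained Word2Vec model.

```python
import math

# Hypothetical toy vectors standing in for a trained Word2Vec model.
TOY_EMBEDDINGS = {
    "furto": [0.9, 0.1, 0.0],
    "ladro": [0.8, 0.2, 0.1],
    "rubato": [0.7, 0.1, 0.2],
    "omicidio": [0.1, 0.9, 0.1],
    "ucciso": [0.0, 0.8, 0.2],
    "coltello": [0.2, 0.7, 0.1],
}


def document_vector(tokens, embeddings):
    """Average the vectors of the tokens found in the vocabulary."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known word: the document cannot be embedded
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(labelled_docs, embeddings):
    """Compute one mean document vector per category (assumed classifier)."""
    grouped = {}
    for tokens, label in labelled_docs:
        vec = document_vector(tokens, embeddings)
        if vec is not None:
            grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(tokens, centroids, embeddings):
    """Return the category whose centroid is most similar to the document."""
    vec = document_vector(tokens, embeddings)
    if vec is None:
        return None
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical labelled examples (already tokenized).
TRAINING_DOCS = [
    (["furto", "ladro"], "theft"),
    (["rubato", "auto"], "theft"),
    (["omicidio", "coltello"], "murder"),
    (["ucciso", "uomo"], "murder"),
]

CENTROIDS = fit_centroids(TRAINING_DOCS, TOY_EMBEDDINGS)
print(predict(["ladro", "rubato", "casa"], CENTROIDS, TOY_EMBEDDINGS))
```

Averaging word vectors is only one simple way to obtain a document representation; any supervised classifier or clustering algorithm could be applied to the resulting vectors in place of the centroid rule assumed above.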