Open Class Authorship Attribution of Lithuanian Internet Comments using One-Class Classifier

Algimantas Venčkauskas; Arnas Karpavičius; Robertas Damaševičius; Romas Marcinkevičius; Jurgita Kapočiūtė-Dzikienė; Christian Napoli

DOI: http://dx.doi.org/10.15439/2017F461

Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 373–382 (2017)

Abstract. The Internet can be misused by cyber criminals as a platform for conducting illegitimate activities (such as harassment, cyberbullying, and incitement of hate or violence) anonymously. As a result, authorship analysis of anonymous texts on the Internet (such as emails and forum comments) has attracted significant attention in the digital forensics and text mining communities. The main problem is the large number of possible authors, which hinders effective identification of the true author. We interpret open class author attribution as a process of expert recommendation in which the decision support system returns a list of suspected authors for further analysis by forensic experts rather than a single prediction, thus reducing the scale of the problem. We describe the task formally and present algorithms for constructing the suspected author list. For evaluation, we propose using a simple Winner-Takes-All (WTA) metric as well as a set of metrics based on the gain-discount model from the information retrieval domain (mean reciprocal rank, discounted cumulative gain, and rank-biased precision). We also propose the List Precision (LP) metric as an extension of WTA for evaluating the usability of the suspected author list. For the experiments, we use our own dataset of Internet comments in the Lithuanian language and consider the use of language-specific (Lithuanian) lexical features together with general lexical features derived from the English language. For classification, we use a one-class Support Vector Machine (SVM) classifier. The experimental results show that the usability of open class author attribution can be improved considerably by combining language-specific lexical features with general lexical features. The proposed method can also be used to reduce the number of suspected authors, thus easing the work of forensic linguists.
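
To make the ranking metrics named in the abstract concrete, the following is a minimal Python sketch that scores a single ranked list of suspected authors. It uses the standard information-retrieval definitions of reciprocal rank, discounted cumulative gain, and rank-biased precision, and assumes binary relevance (only the true author counts as relevant). The function names, the sample author labels, and the persistence value p = 0.8 are illustrative assumptions and are not taken from the paper; the paper's own formulations, including its LP metric, may differ.

```python
import math


def rank_of_true_author(suspects, true_author):
    """Return the 1-based rank of the true author, or None if absent."""
    try:
        return suspects.index(true_author) + 1
    except ValueError:
        return None


def winner_takes_all(suspects, true_author):
    """1.0 if the top-ranked suspect is the true author, else 0.0."""
    return 1.0 if suspects and suspects[0] == true_author else 0.0


def reciprocal_rank(suspects, true_author):
    """1 / rank of the true author; 0.0 if the author is not listed."""
    rank = rank_of_true_author(suspects, true_author)
    return 0.0 if rank is None else 1.0 / rank


def dcg(suspects, true_author):
    """Discounted cumulative gain with binary relevance (assumption):
    the only gain is 1 at the true author's rank, discounted by log2(rank + 1)."""
    rank = rank_of_true_author(suspects, true_author)
    return 0.0 if rank is None else 1.0 / math.log2(rank + 1)


def rank_biased_precision(suspects, true_author, p=0.8):
    """Rank-biased precision with binary relevance (assumption):
    (1 - p) * p**(rank - 1); p is an illustrative persistence value."""
    rank = rank_of_true_author(suspects, true_author)
    return 0.0 if rank is None else (1.0 - p) * p ** (rank - 1)


if __name__ == "__main__":
    # Hypothetical suspect list, ordered from most to least likely.
    suspects = ["author_a", "author_b", "author_c"]
    for metric in (winner_takes_all, reciprocal_rank, dcg, rank_biased_precision):
        print(metric.__name__, metric(suspects, "author_b"))
```

Averaging the reciprocal rank over many anonymous texts gives the mean reciprocal rank. All four scores reward placing the true author near the top of the list, but they discount lower positions at different rates.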
