A Comparison of Authorship Attribution Approaches Applied on the Lithuanian Language

Jurgita Kapočiūtė-Dzikienė; Algimantas Venčkauskas; Robertas Damaševičius

DOI: http://dx.doi.org/10.15439/2017F110

Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 347–351 (2017)

Abstract. This paper reports comparative authorship attribution results obtained on Internet comments written in Lithuanian, a morphologically complex language. We explored how machine learning and similarity-based approaches perform across different author set sizes (10, 100, and 1,000 candidate authors), feature types (lexical, morphological, and character), and feature selection techniques (feature ranking and random selection). The authorship attribution task was complicated by the characteristics of the Lithuanian language, the non-normative nature of the texts, their extreme shortness, and the large number of candidate authors. The best results were achieved with the machine learning approaches. On the larger author sets, the entire feature set composed of word-level character tetra-grams performed best.
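
For illustration only, word-level character tetra-grams can be read as character 4-grams extracted inside each word, so that no n-gram spans a word boundary. The following minimal sketch shows one possible extraction. The function name `word_char_ngrams`, the whitespace tokenization, the lowercasing, and the choice to keep tokens shorter than four characters whole are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def word_char_ngrams(text, n=4):
    """Count character n-grams taken inside each whitespace-separated token.

    Hypothetical helper: tokenization, lowercasing, and the handling of
    short tokens are assumptions, not the authors' preprocessing.
    """
    counts = Counter()
    for token in text.lower().split():
        if len(token) < n:
            # Assumption: a token shorter than n is kept as a single feature.
            counts[token] += 1
        else:
            # Slide a window of width n over the token only, never across spaces.
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    return counts


features = word_char_ngrams("labas rytas labai")
```

Here the inflected forms "labas" and "labai" share the tetra-gram "laba", which hints at why such features may suit a morphologically rich language: related word forms contribute to common sub-word features.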
