Lithuanian Author Profiling with the Deep Learning

Jurgita Kapočiūtė-Dzikienė; Robertas Damaševičius

DOI: http://dx.doi.org/10.15439/2018F22

Citation: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 15, pages 169–172 (2018)

Abstract. We address the Lithuanian author profiling task in two dimensions (AGE and GENDER) using two deep learning methods, Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), applied on top of Lithuanian neural word embeddings. We also investigate the impact of the training dataset size on author profiling accuracy. The best results are achieved with the largest datasets, containing 5,000 instances in each class. Moreover, LSTM was more effective on the smaller datasets and CNN on the larger ones. We compare the deep learning methods with traditional machine learning methods (in particular, Naive Bayes Multinomial and Support Vector Machine, with frequencies of elements as the feature representation). The comparison revealed that deep learning is not the best solution for our author profiling task.
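To illustrate the kind of traditional baseline the abstract refers to, the following is a minimal sketch of a multinomial Naive Bayes classifier over token frequencies. It is not the authors' implementation: the whitespace tokenizer, the Laplace smoothing constant `alpha`, the function names `train_multinomial_nb` and `predict`, and the toy English texts and class labels are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-token frequencies.

    alpha is an assumed Laplace smoothing constant (not from the paper).
    """
    token_counts = defaultdict(Counter)  # label -> token -> frequency
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        token_counts[label].update(tokens)
        vocab.update(tokens)

    class_sizes = Counter(labels)
    n_docs = len(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "counts": token_counts,
        "log_prior": {c: math.log(k / n_docs) for c, k in class_sizes.items()},
        "totals": {c: sum(token_counts[c].values()) for c in class_sizes},
    }


def predict(model, text):
    """Return the label with the highest posterior log-probability."""
    # Tokens never seen in training are ignored.
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        denom = model["totals"][label] + model["alpha"] * vocab_size
        score = log_prior
        for t in tokens:
            score += math.log((model["counts"][label][t] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; the real task uses Lithuanian texts and
# up to 5,000 instances per class.
toy_docs = [
    "homework exam school party",
    "school exam friends party",
    "pension budget committee law",
    "law committee budget reform",
]
toy_labels = ["younger", "younger", "older", "older"]

model = train_multinomial_nb(toy_docs, toy_labels)
print(predict(model, "exam party tonight"))
print(predict(model, "budget law"))
```

The sketch keeps the class sizes balanced, mirroring the equal number of instances per class mentioned in the abstract; the feature types actually used in the paper may differ from plain token counts.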
