Deep Learning methods for Subject Text Classification of Articles

Piotr Semberecki; Henryk Maciejewski

DOI: http://dx.doi.org/10.15439/2017F414

Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 357–360 (2017)

Abstract. This work presents a method for classifying text documents using a deep neural network with LSTM (long short-term memory) units. We tested different approaches to building the feature vectors that represent the documents to be classified: we used feature vectors constructed as sequences of the words in the documents, or, alternatively, we first converted the words into vector representations using the word2vec tool and used sequences of these vectors as document features. We evaluated the feasibility of this approach for subject classification of documents using a collection of Wikipedia articles representing 7 subject categories. Our experiments show that the LSTM network with documents represented as sequences of words coded into word2vec vectors outperformed a standard bag-of-words approach with documents represented as frequency-of-words feature vectors.
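The difference between the two document representations compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embedding table, the zero vector for unknown words, the whitespace tokenizer, and the function names are all assumptions made for illustration, and the three-dimensional vectors merely stand in for real word2vec vectors. The sketch shows that a frequency-of-words vector discards word order, while a sequence of word vectors (the kind of input an LSTM consumes step by step) preserves it.

```python
from collections import Counter

# Hypothetical toy embedding table standing in for word2vec vectors;
# real word2vec vectors typically have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "learns": [0.1, 0.7, 0.3],
    "words": [0.0, 0.3, 0.9],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed placeholder for out-of-vocabulary words


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(tokens, vocabulary):
    """Frequency-of-words vector: one count per vocabulary entry, order lost."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def embedding_sequence(tokens, embeddings, max_len):
    """Ordered sequence of word vectors, truncated or zero-padded to max_len."""
    vectors = [embeddings.get(tok, UNKNOWN) for tok in tokens[:max_len]]
    padding = [UNKNOWN] * (max_len - len(vectors))
    return vectors + padding


vocabulary = sorted(TOY_EMBEDDINGS)
doc_a = tokenize("neural network learns words")
doc_b = tokenize("words learns network neural")  # same words, reversed order

bow_a = bag_of_words(doc_a, vocabulary)
bow_b = bag_of_words(doc_b, vocabulary)
seq_a = embedding_sequence(doc_a, TOY_EMBEDDINGS, max_len=5)
seq_b = embedding_sequence(doc_b, TOY_EMBEDDINGS, max_len=5)

print(bow_a == bow_b)  # True: identical word counts
print(seq_a == seq_b)  # False: the sequences differ in word order
```

In a full pipeline, each fixed-length sequence would be passed to a recurrent classifier, whereas the bag-of-words vector would go to a conventional classifier; the sketch stops at feature construction.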
