Annals of Computer Science and Information Systems, Volume 15

Proceedings of the 2018 Federated Conference on Computer Science and Information Systems

Evaluating Combinations of Classification Algorithms and Paragraph Vectors for News Article Classification

DOI: http://dx.doi.org/10.15439/2018F110

Citation: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 15, pages 489–495 (2018)

Abstract. News companies need to automate, and make more effective, the process of writing about popular and newly occurring events. Current technologies involve robotic programs that fill in values in templates, and website listeners that notify editors when a source page changes so that the editor can read up on the change on the website itself. Editors could deliver news faster and better if they were directly provided with abstracts of the external sources, along with categorical metadata describing what the text is about. This article focuses on evaluating critical parameter modifications of four classification algorithms, Decision Tree, Random Forest, Multilayer Perceptron, and Long Short-Term Memory (LSTM), each combined with the paragraph vector algorithms Distributed Memory and Distributed Bag of Words, with the aim of categorising news articles. The results show that Decision Tree and Multilayer Perceptron are stable within a short parameter interval, while Random Forest is more dependent on its best-split and number-of-trees parameters. The most accurate model is the LSTM model.
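
As an illustration of the kind of combination evaluated here, the following minimal Python sketch (not the authors' implementation) trains a Distributed Memory paragraph vector model with gensim's Doc2Vec and feeds the resulting document vectors to a scikit-learn Random Forest classifier; the toy articles, labels, and parameter values are placeholder assumptions.

# A minimal sketch (not the authors' implementation) of one evaluated
# combination: train Distributed Memory paragraph vectors, then feed the
# document vectors to a Random Forest classifier. The articles, labels,
# and parameter values below are placeholder assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier

# Placeholder news articles with category labels.
articles = [
    ("stocks fall as markets react to rate decision", "economy"),
    ("team wins championship after dramatic final", "sports"),
    ("new phone model unveiled at tech conference", "technology"),
    ("central bank signals further tightening", "economy"),
]

corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, (text, _) in enumerate(articles)]

# dm=1 selects Distributed Memory; dm=0 would select Distributed Bag of Words.
pv_model = Doc2Vec(corpus, dm=1, vector_size=50, min_count=1, epochs=40)

X = [pv_model.dv[i] for i in range(len(articles))]
y = [label for _, label in articles]

# n_estimators is the "number of trees" parameter varied in the evaluation.
clf = RandomForestClassifier(n_estimators=100).fit(X, y)

# Infer a paragraph vector for an unseen article and classify it.
vec = pv_model.infer_vector("quarterly earnings beat expectations".split())
print(clf.predict([vec]))

Setting dm=0 switches the representation to Distributed Bag of Words, and replacing RandomForestClassifier with a Decision Tree or Multilayer Perceptron covers the other non-recurrent classifiers compared in the paper.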
