## Methodology of Constructing and Analyzing the Hierarchical Contextually-Oriented Corpora

### Nina Rizun, Yurii Taranenko

DOI: http://dx.doi.org/10.15439/2018F69

Citation: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 15, pages 505–514 (2018)

Abstract. This paper presents a methodology for constructing and analyzing hierarchical contextually-oriented corpora. The methodology comprises two steps: building the contextual component of the corpus structure, and text analysis of the contextually-oriented hierarchical corpus. The main contributions of this study are as follows: (1) the hierarchical structure of the corpus provides advanced possibilities for identifying the morphological and structural features of texts of different tonalities; (2) the contextual, morphological, and structural specificity of texts differs significantly depending on the tonality originally assigned by their authors; (3) there exist certain thought and writing-style templates under whose influence texts of various tonalities are formed. The basic features of such templates for texts of the two basic tonalities (positive and negative) could be: Contextual Structure, Morphological Types, Emotional Features, Writing Style, and Vocabulary Richness. The proposed methodology was verified in a case study on a dataset of Polish-language film reviews.
