Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 15

Proceedings of the 2018 Federated Conference on Computer Science and Information Systems

Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus

DOI: http://dx.doi.org/10.15439/2018F186

Citation: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 15, pages 485–488

Abstract. Automatic extraction of synonymous collocation pairs from text corpora is a challenging NLP task. To find collocations of similar meaning in English texts, we use logical-algebraic equations. These equations combine the grammatical and semantic characteristics of words in substantive, attributive, and verbal collocation types. We identify the grammatical characteristics of words with the Stanford POS tagger and the Stanford Universal Dependencies parser, and we use WordNet synsets to select synonymous words in collocations. The candidate synonymous word combinations are then checked for compliance with the grammatical and semantic characteristics of the proposed logical-linguistic equations. Our dataset includes more than half a million Wikipedia articles from a few portals. The experiment shows that the more frequently synonymous collocations occur in texts, the more closely related the topics of those texts are likely to be. The precision of the synonymous collocation search in our experiment is close to that reported in similar studies.
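
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' logical-algebraic equations: it assumes collocations have already been extracted as (type, head, dependent) tuples, and it replaces WordNet with a small hand-made synonym table. The names TOY_SYNSETS, synonymous, and synonymous_pairs are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# Hypothetical stand-in for WordNet synsets: each set groups words
# that this sketch treats as synonyms.
TOY_SYNSETS = [
    {"big", "large"},
    {"house", "home"},
    {"buy", "purchase"},
    {"car", "automobile"},
]


def synonymous(a, b):
    """Return True if the words are identical or share a toy synset."""
    return a == b or any(a in s and b in s for s in TOY_SYNSETS)


def synonymous_pairs(collocations):
    """Pair up collocations of the same type whose heads and dependents match.

    Each collocation is a (type, head, dependent) tuple, where type is one of
    "substantive", "attributive", or "verbal" (an assumed encoding).
    """
    pairs = []
    # Deduplicate and sort so the output order is deterministic.
    for c1, c2 in combinations(sorted(set(collocations)), 2):
        same_type = c1[0] == c2[0]
        heads_match = synonymous(c1[1], c2[1])
        dependents_match = synonymous(c1[2], c2[2])
        if same_type and heads_match and dependents_match:
            pairs.append((c1, c2))
    return pairs


EXAMPLE = [
    ("attributive", "house", "big"),
    ("attributive", "home", "large"),
    ("verbal", "buy", "car"),
    ("verbal", "purchase", "automobile"),
    ("attributive", "car", "big"),
]

for first, second in synonymous_pairs(EXAMPLE):
    print(first, "~", second)
```

In a fuller pipeline the tuples would come from a POS tagger and a dependency parser, and the synonym check would query WordNet synsets rather than a fixed table.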