Grammatical Case Based IS-A Relation Extraction with Boosting for Polish

Paweł Łoziński, Dariusz Czerski, Mieczysław Kłopotek

DOI: http://dx.doi.org/10.15439/2016F391

Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 533–540 (2016)


Abstract. Pattern-based methods of IS-A relation extraction rely heavily on so-called Hearst patterns, which are ways of expressing instance enumerations of a class in natural language. While these lexico-syntactic patterns prove quite useful, they may not capture all taxonomical relations expressed in text. Therefore, in this paper we describe a novel method of IS-A relation extraction from patterns, which uses morpho-syntactical annotations along with the grammatical case of the noun phrases that constitute the entities participating in an IS-A relation. We also describe a method for increasing the number of extracted relations, which we call pseudo-subclass boosting and which has potential application in any pattern-based relation extraction method. Experiments were conducted on a corpus of about 0.5 billion web documents in Polish.
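As an illustration of the Hearst patterns mentioned in the abstract, the sketch below matches a single English pattern of the form "class such as instance, instance and instance" and emits (instance, class) pairs. It is only a minimal example of the general idea: the regular expression, the function name `extract_is_a`, and the sample sentence are assumptions made for illustration, and it does not reproduce the case-based method for Polish or the pseudo-subclass boosting described in the paper.

```python
import re

# Hypothetical single Hearst pattern: "<class> such as <inst>, <inst> and <inst>".
# Real systems use many patterns and operate on noun phrases, not single words.
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def extract_is_a(sentence):
    """Return (instance, class) pairs found by the pattern above."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumeration on the separators allowed by the pattern.
        instances = re.split(r", | and | or ", match.group(2))
        pairs.extend((inst, hypernym) for inst in instances if inst)
    return pairs


if __name__ == "__main__":
    print(extract_is_a("He studied languages such as Polish, Czech and Slovak."))
```

A purely lexical pattern like this depends on fixed word order and marker phrases, which is one reason additional cues may be needed to capture relations that such patterns miss.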
