Supercombinator Set Construction from a Context-Free Representation of Text

Michal Sičák, Ján Kollár

DOI: http://dx.doi.org/10.15439/2016F334

Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 503–512 (2016)

Full text

Abstract. Grammars might be used for various other aspects, than just to represent a language. Grammar inference is a large field which main goal is the construction of grammars from various sources. Written text might be analysed indirectly with the use of such inferred grammars. Grammars acquired from processed text might grow into large structures as the inference process could be continuous. We present a method to decompose and store grammars into a non-redundant set of lambda calculus supercombinators. Grammars decomposition is based on their structure and each distinct element is stored only once in such a structure. We present a method that can create such a set from any context-free grammar. To prove this and to show the possible applications in the field of natural language processing we present a case study performed on samples from two books. Those samples are the entire Book of Genesis from The King James Bible and the first 24 chapters of War and peace by Tolstoy. We obtain context-free grammars with the Sequitur algorithm and then we process them with our method. The results show significant decline in the number of grammar elements in all cases.

References

S. Edelman, “On the nature of minds, or: truth and consequences,” Journal of Experimental & Theoretical Artificial Intelligence, vol. 20, no. 3, pp. 181–196, 2008. http://dx.doi.org/10.1080/09528130802319086. [Online]. Available: http://dx.doi.org/10.1080/09528130802319086
N. Chomsky, Syntactic Structures. Mouton and Co., 1957.
E. M. Gold, “Language identification in the limit,” Information and control, vol. 10, no. 5, pp. 447–474, 1967. http://dx.doi.org/10.1016/S0019-9958(67)91165-5.
C. De La Higuera, “A bibliographical study of grammatical inference,” Pattern recognition, vol. 38, no. 9, pp. 1332–1348, 2005. http://dx.doi.org/10.1016/j.patcog.2005.01.003.
L. Onnis, H. R. Waterfall, and S. Edelman, “Learn locally, act globally: Learning language from variation set cues,” Cognition, vol. 109, no. 3, pp. 423–430, 2008. http://dx.doi.org/10.1016/j.cognition.2008.10.004.
C. G. Nevill-Manning and I. H. Witten, “Identifying hierarchical structure in sequences: A linear-time algorithm,” J. Artif. Intell. Res.(JAIR), vol. 7, pp. 67–82, 1997. http://dx.doi.org/doi:10.1613/jair.374.
P. Klint, R. Lämmel, and C. Verhoef, “Toward an engineering discipline for grammarware,” ACM Trans. Softw. Eng. Methodol., vol. 14, no. 3, pp. 331–380, Jul. 2005. http://dx.doi.org/10.1145/1072997.1073000.
J. Kollár, M. Sičák, and M. Spišiak, “Towards machine mind evolution,” in Computer Science and Information Systems (FedCSIS), 2015 Federated Conference on. IEEE, 2015. http://dx.doi.org/10.15439/2015F210 pp. 985–990.
J. Kollár, M. Spišiak, and M. Sičák, “Abstract language of the machine mind,” Acta Electrotechnica et Informatica, vol. 15, no. 3, pp. 24–31, 2015. http://dx.doi.org/10.15546/aeei-2015-0025.
M. Sičák, “Higher order regular expressions,” in Engineering of Modern Electric Systems (EMES), 2015 13th International Conference on. IEEE, 2015. http://dx.doi.org/10.1109/EMES.2015.7158427 pp. 1–4.
P. W. Adriaans and M. Van Zaanen, “Computational grammar induction for linguists,” Grammars, vol. 7, pp. 57–68, 2004.
D. Klein and C. D. Manning, “Corpus-based induction of syntactic structure: Models of dependency and constituency,” in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2004. http://dx.doi.org/10.3115/1218955.1219016 pp. 478–485.
A. Stevenson and J. R. Cordy, “Grammatical inference in software engineering: an overview of the state of the art,” in Software Language Engineering. Springer, 2013, pp. 204–223. http://dx.doi.org/10.1007/978-3-642-36089-3_12
N. Carvalho, J. J. Almeida, M. J. Pereira, and P. Henriques, “Probabilistic synset based concept location,” in SLATe’12—Symposium on Languages, Applications and Technologies. Alberto Simões and Ricardo Queirós and Daniela da Cruz, 2012. doi: 10.4230/OASIcs.SLATE.2012.239 pp. 239–253.
J.-A. Asensio, N. Padilla, and L. Iribarne, “Information retrieval using an ontological web-trading model,” in Proceedings of the 2013 Federated Conference on Computer Science and Information Systems. IEEE, 2013, pp. pages 243–249.
M. Wang, X. Liu, L. Huang, B. Lang, and H. Yu, “Ontology-based concept similarity integrating image semantic and visual information,” in Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, ser. Annals of Computer Science and Information Systems, vol. 2. IEEE, 2014. http://dx.doi.org/10.15439/2014F273 pp. pages 289–296.
P. Šaloun, “From lightweight ontology to mental illness indication,” in Scientific Conference on Informatics, 2015 IEEE 13th International. IEEE, 2015. http://dx.doi.org/10.1109/Informatics.2015.7377799 pp. 9–12.
P. Szwed, “Concepts extraction from unstructured polish texts: a rule based approach,” in Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, ser. Annals of Computer Science and Information Systems, vol. 5. IEEE, 2015. http://dx.doi.org/10.15439/2015F280 pp. 355–364.
K. Jassem and L. Pawluczuk, “Automatic summarization of polish news articles by sentence selection,” in Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, ser. Annals of Computer Science and Information Systems, vol. 5. IEEE, 2015. http://dx.doi.org/10.15439/2015F186 pp. 337–341.