Improving Logical Structure Analysis of Visually Structured Documents with Textual Features

Huu-Loi Le; Nghia Luu Trong; Huyen Ngo Thanh

DOI: http://dx.doi.org/10.15439/2022R26

Citation: Proceedings of the 2022 Seventh International Conference on Research in Intelligent and Computing in Engineering, Vu Dinh Khoa, Shivani Agarwal, Gloria Jeanette Rincon Aponte, Nguyen Thi Hong Nga, Vijender Kumar Solanki, Ewa Ziemba (eds). ACSIS, Vol. 33, pages 151–156 (2022)

Abstract. This paper introduces a new model that improves logical structure analysis of visually structured documents by extending the model of Koreeda and Manning. To enrich its textual features, we define a new feature based on the font size of the text. In our observation, font size is an important indicator of a document's structure. The new font size feature is combined with visual, textual, and semantic features to train an analyzer. Experimental results on four legal datasets show that the font size feature contributes to the model and improves the F-scores. An ablation study further shows the contribution of each feature in the model.
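
The abstract does not say how the font size feature is computed, so the following is only an illustrative sketch of how font size could be turned into features for a structure analyzer. The `TextBlock` type, the `font_size_features` function, and the three feature names are hypothetical and are not taken from the paper; the sketch also assumes that font sizes have already been extracted from the PDF and that the most frequent size is the body-text size.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    font_size: float  # in points, assumed to come from a PDF extractor


def font_size_features(blocks):
    """Return one feature dict per block (hypothetical feature set)."""
    if not blocks:
        return []
    # Assumption: the most frequent font size is the body-text size.
    body_size = Counter(b.font_size for b in blocks).most_common(1)[0][0]
    features = []
    prev_size = None
    for block in blocks:
        features.append({
            # Ratio above 1 suggests a heading, below 1 a footnote or caption.
            "rel_to_body": block.font_size / body_size,
            # Size changes between consecutive blocks hint at a boundary.
            "larger_than_prev": prev_size is not None
            and block.font_size > prev_size,
            "smaller_than_prev": prev_size is not None
            and block.font_size < prev_size,
        })
        prev_size = block.font_size
    return features


blocks = [
    TextBlock("1. Definitions", 14.0),
    TextBlock("In this Agreement, the following terms apply.", 10.0),
    TextBlock("Terms defined elsewhere keep their meaning.", 10.0),
    TextBlock("Headings are for convenience only.", 10.0),
    TextBlock("2. Term", 14.0),
]
feats = font_size_features(blocks)
```

In a setup like the one the abstract describes, values of this kind would be concatenated with the visual, textual, and semantic features before training; how the authors actually encode and combine them is described in the full paper, not here.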

References

Y. Koreeda and C. Manning, “Capturing logical structure of visually structured documents with multimodal transition parser,” in Proceedings of the Natural Legal Language Processing Workshop 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 144–154. [Online]. Available: https://aclanthology.org/2021.nllp-1.15
F. Obermaier, B. Obermayer, V. Wormer, and W. Jaschensky, “About the Panama Papers,” in Süddeutsche Zeitung, 2016.
M.-T. Nguyen, D. T. Le, and L. Le, “Transformers-based information extraction with limited data for domain-specific business documents,” Engineering Applications of Artificial Intelligence, vol. 97, p. 104100, 2021.
Y. Hatsutori, K. Yoshikawa, and H. Imai, “Estimating legal document structure by considering style information and table of contents,” in New Frontiers in Artificial Intelligence, S. Kurahashi, Y. Ohta, S. Arai, K. Satoh, and D. Bekki, Eds. Cham: Springer International Publishing, 2017, pp. 270–283.
C. G. Stahl, S. R. Young, D. Herrmannova, R. M. Patton, and J. C. Wells, “DeepPDF: A deep learning approach to extracting text from PDFs,” Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States), Tech. Rep., 2018.
C. Soto and S. Yoo, “Visual detection with context for document layout analysis,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3464–3470.
Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “LayoutLM: Pre-training of text and layout for document image understanding,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1192–1200.
Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. A. F. Florêncio, C. Zhang, W. Che, M. Zhang, and L. Zhou, “LayoutLMv2: Multi-modal pre-training for visually-rich document understanding,” CoRR, vol. abs/2012.14740, 2020. [Online]. Available: https://arxiv.org/abs/2012.14740
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
C. Sporleder and M. Lapata, “Automatic paragraph identification: A study across languages and domains,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 72–79.
C. Abreu, H. Cardoso, and E. Oliveira, “FinDSE@FinTOC-2019 shared task,” in Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019). Turku, Finland: Linköping University Electronic Press, Sep. 2019, pp. 69–73. [Online]. Available: https://aclanthology.org/W19-6410
D. Ferrés, H. Saggion, F. Ronzano, and À. Bravo, “PDFdigest: An adaptable layout-aware PDF-to-XML textual content extractor for scientific articles,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
M. Ostendorf, M. Collins, S. Narayanan, D. W. Oard, and L. Vanderwende, “Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009.
S. Zhang, X. Ma, K. Duh, and B. V. Durme, “AMR parsing as sequence-to-graph transduction,” CoRR, vol. abs/1905.08704, 2019. [Online]. Available: http://arxiv.org/abs/1905.08704