Modelling an IT solution to anonymise selected data processed in digital documents
Barbara Probierz, Tomasz Jach, Jan Kozak, Radosław Pacud, Tomasz Turek
DOI: http://dx.doi.org/10.15439/2022F49
Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 715–719 (2022)
Abstract. Access to real legal documents is important for the development of both science and the judiciary. At the same time, protecting information about the citizens and organizations that appear in these documents is crucial and required by law. Therefore, data anonymisation should be carried out before the documents are distributed. Unfortunately, an effective tool that automatically anonymises documents while preserving their main concept is still lacking, especially for documents written in an inflectional language. The aim of this article is to show how important, and at the same time how difficult, it is to identify the personal or corporate data of a client, as well as other related personal data, in documents that are subject to legal protection. We conducted research to assess the usefulness of IT techniques, decision rules and patterns in the anonymisation of legal documents. The research used a set of real legal documents written in Polish, in which we identified selected types of data that need to be anonymised. The results were then assessed by field experts. Additionally, to verify the effectiveness of the proposed solution, we conducted research on a set of 50,000 false identities with names, company names, addresses and other confidential information. The collection was created using Fake Name Generator. The results of both experiments confirmed that the solution we propose is accurate even in the case of real legal documents.
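To illustrate what anonymisation based on patterns and decision rules can look like in practice, the following is a minimal Python sketch and not the authors' implementation. The identifier types (e-mail address, Polish postal code, PESEL number), the placeholder labels, and the names `PATTERNS`, `pesel_checksum_ok` and `anonymise` are assumptions introduced only for this example. A pattern proposes candidate strings, and an optional rule (here, the PESEL check digit) decides whether a candidate is actually replaced.

```python
import re

# Weights of the PESEL check-digit rule, applied to the first ten digits.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_checksum_ok(candidate: str) -> bool:
    """Return True if an 11-digit string has a valid PESEL check digit."""
    if len(candidate) != 11 or not candidate.isdigit():
        return False
    # zip stops after ten pairs, so the check digit itself is not weighted.
    total = sum(w * int(d) for w, d in zip(PESEL_WEIGHTS, candidate))
    return (10 - total % 10) % 10 == int(candidate[10])


# Each entry: (placeholder label, pattern, optional decision rule).
# These three categories are illustrative assumptions only.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), None),
    ("PESEL", re.compile(r"\b\d{11}\b"), pesel_checksum_ok),
    ("POSTCODE", re.compile(r"\b\d{2}-\d{3}\b"), None),
]


def anonymise(text: str) -> str:
    """Replace every accepted pattern match with a bracketed label."""
    for label, pattern, rule in PATTERNS:
        def replace(match, label=label, rule=rule):
            value = match.group(0)
            # Keep the original text when the decision rule rejects it.
            if rule is not None and not rule(value):
                return value
            return f"[{label}]"
        text = pattern.sub(replace, text)
    return text


if __name__ == "__main__":
    print(anonymise("PESEL 44051401359, kod 40-287, jan@example.com"))
```

Pairing each pattern with a validation rule reduces false positives: an arbitrary 11-digit number, such as a case reference, is left untouched unless its check digit is consistent. Names of people and companies, which are inflected in Polish, cannot be captured by fixed patterns like these and would require other techniques.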