Modelling an IT solution to anonymise selected data processed in digital documents

Abstract. Access to real legal documents is important for the development of both science and the judiciary. At the same time, protecting information about the citizens and organizations that appear in these documents is crucial and required by law. Therefore, the documents should be anonymised before they are distributed. Unfortunately, there is still no effective tool that automatically anonymises documents while preserving their main concept, especially for documents written in an inflectional language. The aim of this article is to show how important, and at the same time how difficult, it is to identify the personal or corporate data of a client, as well as other related personal data, in documents that are subject to legal protection. We conducted research to assess the usefulness of IT techniques, decision rules and patterns in the anonymisation of legal documents. The research used a set of real legal documents written in Polish, in which we identified selected types of data that need to be anonymised. The obtained results were then assessed by field experts. Additionally, to verify the effectiveness of the proposed solution, we conducted research on a set of 50,000 fictitious identities with names, company names, addresses and other confidential information. The collection was created using Fake Name Generator. The results of both experiments confirmed that the solution we propose is accurate even in the case of real legal documents.
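To illustrate what pattern-based anonymisation can look like in practice, the following is a minimal sketch and not the solution described in this article. The pattern list, the placeholder labels and the function name `anonymise` are assumptions made for illustration only. The sketch covers only identifiers with a fixed surface form (an e-mail address, an 11-digit PESEL-like number, a Polish postal code) and replaces each match with a bracketed label.

```python
import re

# Hypothetical patterns chosen for illustration; they are not taken from the article.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PESEL", re.compile(r"\b\d{11}\b")),            # any 11-digit number, no checksum validation
    ("POSTAL_CODE", re.compile(r"\b\d{2}-\d{3}\b")),  # Polish postal code format, e.g. 00-950
]


def anonymise(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "PESEL 44051401359, kod 00-950, jan.kowalski@example.com"
    print(anonymise(sample))
```

Fixed-format patterns of this kind do not address personal and company names, whose surface form changes with grammatical case in an inflectional language such as Polish; that harder part of the task is what motivates the combination of techniques, decision rules and patterns studied here.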


