Artificial Intelligence applications in anomaly identification detection of big database
Phan Huy Thang, Nguyen Thi Ngoc Anh
DOI: http://dx.doi.org/10.15439/2021R11
Citation: Proceedings of the 2021 Sixth International Conference on Research in Intelligent and Computing, Vijender Kumar Solanki, Nguyen Ho Quang (eds). ACSIS, Vol. 27, pages 87–92 (2021)
Abstract. Data matching is the process of finding, matching, and combining records that refer to the same entity, across several databases or within a single one. Over the previous decade, every part of the data matching process has been improved by research in disciplines such as applied statistics, data mining, machine learning, database administration, and digital libraries. With the significant advances in artificial intelligence over the same period, all aspects of the data identification process have improved, especially the accuracy of data matching. Firstly, this paper presents the data comparison process, detailing the steps of data pre-processing, comparison of the data fields of each record, classification, and quality assessment. Secondly, the paper introduces a method for extending the problem of identifying duplicate objects to big data. Thirdly, the paper discusses specific aspects of matching time for unstructured data. Moreover, a methodology for solving big data matching problems with machine learning is proposed. Finally, the proposed method is applied to database cleanup and the identification of identifier abnormalities at the national credit centre CIC, with 96% to 98% of results correct. The results are of both theoretical and practical value for business operations at CIC.
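The four steps named in the abstract (pre-processing, field comparison, classification, and quality assessment) can be illustrated with a minimal Python sketch. It is not the authors' method: the field names, the sample records, the 0.85 threshold, and the use of `difflib` string similarity are all assumptions made for illustration, and a simple threshold rule stands in for the machine learning classifier the paper proposes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(record):
    """Lowercase each field and collapse runs of whitespace."""
    return {key: " ".join(str(value).lower().split()) for key, value in record.items()}


def compare(a, b, fields):
    """Return one similarity score in [0, 1] per compared field."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def classify(scores, threshold=0.85):
    """Label a pair a match when its mean field similarity reaches the threshold.

    The threshold is an assumed value; a trained classifier would replace this rule.
    """
    return sum(scores) / len(scores) >= threshold


def find_matches(records, fields, threshold=0.85):
    """Compare every pair of records and return the index pairs labelled as matches."""
    cleaned = [preprocess(r) for r in records]
    matches = set()
    for i, j in combinations(range(len(cleaned)), 2):
        if classify(compare(cleaned[i], cleaned[j], fields), threshold):
            matches.add((i, j))
    return matches


def evaluate(predicted, actual):
    """Precision and recall of predicted match pairs against known true pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up records for illustration only; the field names are hypothetical.
records = [
    {"name": "Nguyen Van An", "id_number": "012345678"},
    {"name": "nguyen  van an", "id_number": "012345678"},
    {"name": "Tran Thi Binh", "id_number": "987654321"},
]
predicted = find_matches(records, fields=["name", "id_number"])
precision, recall = evaluate(predicted, actual={(0, 1)})
```

Comparing every pair of records grows quadratically with the number of records, so this sketch only suits small inputs; matching at big data scale would need a way to restrict which pairs are compared.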