Towards increasing F-measure of approximate string matching in O(1) complexity

Adrian Boguszewski, Julian Szymański, Karol Draszawka

DOI: http://dx.doi.org/10.15439/2016F311

Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 527–532 (2016)

Full text

Abstract. The paper analyzes existing approaches for approx- imate string matching based on linear search with Levenshtein distance, AllScan and CPMerge algorithms using cosine, Jaccard and Dice distance measures. The methods are presented and compared to our approach that improves indexing time using Locally Sensitive Hashing. Advantages and drawbacks of the methods are identified based on theoretical considerations as well as empirical evaluations on real-life dictionaries.

References

W. H. Gomaa and A. A. Fahmy, “A survey of text similarity approaches,” International Journal of Computer Applications, vol. 68, no. 13, 2013.
A. Parker et al., “Computer algorithms for plagiarism detection,” 1989.
Z. Su, B.-R. Ahn, K.-Y. Eom, M.-K. Kang, J.-P. Kim, and M.-K. Kim, “Plagiarism detection using the levenshtein distance and smith-waterman algorithm,” in Innovative Computing Information and Control, 2008. ICICIC’08. 3rd International Conference on. IEEE, 2008, pp. 569–569.
F. D. Garcia, J.-H. Hoepman, and J. Van Nieuwenhuizen, “Spam filter analysis,” in Security and Protection in Information Processing Systems. Springer, 2004, pp. 395–410.
T. Okuda, E. Tanaka, and T. Kasai, “A method for the correction of garbled words based on the levenshtein metric,” Computers, IEEE Transactions on, vol. 100, no. 2, pp. 172–178, 1976.
N. Okazaki and J. Tsujii, “Simple and efficient algorithm for approximate dictionary matching,” Proceedings of the 23rd International Conference on Computational Linguistics, pp. 851–859, 2010.
V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.
F. J. Damerau, “A technique for computer detection and correction of spelling errors,” Communications of the ACM, vol. 7, no. 3, pp. 171–176, 1964.
G. V. Bard, “Spelling-error tolerant, order-independent pass-phrases via the damerau-levenshtein string-edit distance metric,” in Proceedings of the fifth Australasian symposium on ACSW frontiers-Volume 68. Australian Computer Society, Inc., 2007, pp. 117–124.
G. Navarro, “A guided tour to approximate string matching,” ACM computing surveys (CSUR), vol. 33, no. 1, pp. 31–88, 2001.
K. Monostori, R. Finkel, A. Zaslavsky, G. Hodász, and M. Pataki, “Comparison of overlap detection techniques,” in Computational Science—ICCS 2002. Springer, 2002, pp. 51–60.
C. Badica, N. T. Nguyen, and M. Brezovan, Eds., Computational Collective Intelligence. Technologies and Applications - 5th International Conference, ICCCI 2013, Craiova, Romania, September 11-13, 2013, Proceedings, ser. Lecture Notes in Computer Science, vol. 8083. Springer, 2013.
S.-S. Choi, S.-H. Cha, and C. C. Tappert, “A survey of binary similarity and distance measures,” Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
B. Kaliski and M. Robshaw, “Message authentication with md5,” CryptoBytes (RSA Labs Technical Newsletter), vol. 1, no. 1, 1995.
A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in VLDB, vol. 99, no. 6, 1999, pp. 518–529.
M. Slaney and M. Casey, “Locality-sensitive hashing for finding nearest neighbors [lecture notes],” Signal Processing Magazine, IEEE, vol. 25, no. 2, pp. 128–131, 2008.
L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.
J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive datasets. Cambridge University Press, 2014.
SJP, http://sjp.pl/slownik/growy/, 2016, [Online; 01-07-2016].
Wikipedia, https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings, 2016, [Online: 01-07-2016].