A survey on text similarity measure

(1) Didik Dwi Prasetya (Universitas Negeri Malang, Indonesia)
(2) * Aji Prasetya Wibawa (Universitas Negeri Malang, Indonesia)
(3) Tsukasa Hirashima (Graduate School of Engineering, Hiroshima University, Japan)
*corresponding author


Measuring text similarity is an important task for determining the degree of similarity between objects. At its core, it involves finding similarities between words, sentences, and documents. Words can be similar in two ways: lexically and semantically. Many studies of text similarity have produced a variety of approaches and algorithms. This paper summarizes text similarity measures in four major groups: string-based, corpus-based, knowledge-based, and hybrid similarities. To complete the study, we also conducted a small investigation that evaluates text similarity using common algorithms representing each category.
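
As an illustration of the string-based (lexical) category only, the following minimal Python sketch computes a character-level edit distance and a token-level Jaccard similarity. The function names `levenshtein` and `jaccard` and the whitespace tokenization are assumptions made for this example; they are not taken from the paper's own experiment.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B|.
    Assumes lowercase, whitespace-separated tokens (a simplification)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaccard("the cat sat", "the cat ran"))
    # Lexical measures see no overlap between synonyms such as these,
    # which is the gap that semantic measures aim to close.
    print(jaccard("car", "automobile"))
```

Both measures compare surface form only, so a pair of synonyms with no shared characters or tokens scores as dissimilar. Corpus-based and knowledge-based measures address that case with external resources.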










Copyright (c) 2018

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan, and UTM Big Data Centre - Universiti Teknologi Malaysia
Published by Universitas Ahmad Dahlan
W : http://ijain.org
E : info@ijain.org, andri.pranolo@tif.uad.ac.id (paper handling issues)
     ijain@uad.ac.id, andri.pranolo.id@ieee.org (publication issues)


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0