The performance of text similarity algorithms

Didik Dwi Prasetya; Aji Prasetya Wibawa; Tsukasa Hirashima

doi:10.26555/ijain.v4i1.152



⁽¹⁾ Didik Dwi Prasetya

(Universitas Negeri Malang, Indonesia)
⁽²⁾* Aji Prasetya Wibawa

(Universitas Negeri Malang, Indonesia)
⁽³⁾ Tsukasa Hirashima

(Graduate School of Engineering, Hiroshima University, Japan)
*corresponding author

Abstract

Text similarity measurement compares a text with available references and indicates the degree of similarity between them. Many studies have addressed text similarity, resulting in a variety of approaches and algorithms. This paper investigates four major categories of text similarity measurement: string-based, corpus-based, knowledge-based, and hybrid. The results of the investigation show that the semantic similarity approach is more rational in finding substantial relationships between texts.
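
As an illustration of the string-based category only, the following minimal Python sketch computes a character-level edit distance and a token-level Jaccard coefficient. It is not taken from the paper: the function names `levenshtein` and `jaccard`, the whitespace tokenization, and the example strings are all assumptions made for this sketch. It is meant to suggest why purely lexical measures can miss relationships that semantic approaches capture.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (hypothetical helper)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over lowercase whitespace tokens
    (the tokenization is an assumption of this sketch)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))     # 3
    print(jaccard("the cat sat", "the cat ran"))  # 0.5
    # Synonyms share no tokens, so a lexical measure scores them as unrelated.
    print(jaccard("car", "automobile"))         # 0.0
```

In this sketch the synonyms "car" and "automobile" receive a Jaccard score of 0.0 because they share no tokens, which is the kind of case where a corpus-based or knowledge-based measure would be expected to do better.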

Keywords

Similarity measure; String-based; Corpus-based; Knowledge-based; Text Mining

DOI

https://doi.org/10.26555/ijain.v4i1.152



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
