The performance of text similarity algorithms

Didik Dwi Prasetya; Aji Prasetya Wibawa; Tsukasa Hirashima

doi:10.26555/ijain.v4i1.152


(1) Didik Dwi Prasetya (Universitas Negeri Malang, Indonesia)
(2)* Aji Prasetya Wibawa (Universitas Negeri Malang, Indonesia)
(3) Tsukasa Hirashima (Graduate School of Engineering, Hiroshima University, Japan)
* corresponding author

Abstract

Text similarity measurement compares a text with available references to indicate the degree of similarity between them. Many studies of text similarity have been conducted, resulting in a variety of approaches and algorithms. This paper investigates four major text similarity measurement approaches: string-based, corpus-based, knowledge-based, and hybrid similarity. The results of the investigation show that the semantic similarity approach is more rational in finding substantial relationships between texts.
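To make the string-based category concrete, the following is a minimal Python sketch of two common string-based measures: a character-level edit distance and a token-level Jaccard coefficient. It is an illustration only, not the implementation or the experiments used in the paper; the function names `levenshtein` and `jaccard` and the whitespace tokenization are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (hypothetical helper for illustration).

    Counts the minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`.
    """
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard coefficient (hypothetical helper for illustration).

    Assumes lowercase whitespace tokenization; returns the size of the
    token intersection divided by the size of the token union.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaccard("the cat sat", "the cat ran"))
    print(jaccard("car", "automobile"))
```

Under these assumptions, `jaccard("car", "automobile")` is 0.0 even though the two words are synonyms. Measures that look only at surface characters or tokens cannot see this kind of relationship, which is the sort of gap that corpus-based and knowledge-based (semantic) measures are meant to address.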

Keywords

Similarity measure; String-based; Corpus-based; Knowledge-based; Text Mining

DOI

https://doi.org/10.26555/ijain.v4i1.152

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
