The performance of text similarity algorithms

ABSTRACT


Introduction
A pressing problem in information management today is rapid data growth [1]. Text similarity measurement is a text mining approach that can help to overcome this problem. Finding the similarity between words is a primary stage for sentence, paragraph, and document similarity [2], and text similarity approaches make it easier for people to find relevant information. Text similarity is the backbone of successful text mining operations such as searching and information retrieval (IR), text classification, information extraction (IE), document clustering [3], sentiment analysis, machine translation, text summarization, and natural language processing (NLP).
Lexical and semantic similarity between words are essential elements of sentence, paragraph, and document similarity measurement [2]. Lexical similarity is the degree to which two given strings are similar in their character sequences. A score of one (1) means that the strings are 100% lexically identical; in contrast, a score of zero (0) indicates that the given strings have nothing in common. Semantic similarity, on the other hand, represents the likeness between texts and documents on the basis of their contextual meaning. For example, the pair "book" and "cook" has a high lexical similarity, but the two words are not semantically related. The pair "car" and "wheel" appears to have no lexical similarity, but the two words are closely related semantically because both are automotive terms.
Gomaa [2] explained three main categories of text similarity approaches but did not discuss the evaluation of algorithm performance. This paper surveys lexical and semantic text similarity measurement approaches, from the most widely used to the most recent. The study also evaluates ten of the most common algorithms, which represent each category of text similarity measure.

Method
To complete this study of text similarity, we conducted a performance investigation of text similarity algorithms. The evaluation uses three pairs of texts taken from Barron's research [4]: Pair 1 ("book", "cook"), Pair 2 ("car", "wheel"), and Pair 3 ("antique", "ancient").
In the test data, the texts in the first pair represent lexical similarity, while the second pair represents semantic similarity. The texts in the last pair are both lexically and semantically similar. The evaluation involves ten algorithms from the four categories of text similarity measure described in this paper. To test these algorithms we used several libraries, such as SimMetrics, SoftTFIDF, WS4J, and SEMILAR. In this test, we focus only on the similarity score returned by each executed algorithm.

Text similarity algorithms
Different approaches have been proposed to measure the similarity between one text and another. The methods are divided into four major groups: String-based, Corpus-based, Knowledge-based, and Hybrid text similarities, as shown in Fig. 1. These approaches are detailed in the following subsections.

String-based Similarity
String-based similarity is the oldest, simplest, and yet most popular measurement approach. This measure operates on string sequences and character composition. The two main types of string similarity functions are character-based and token-based similarity functions.
Character-based similarity is also called sequence-based or edit distance (ED) measurement. It takes two strings of characters and quantifies the similarity between them, for instance through the edit distance, which is the minimum number of single-character edit operations (insertion, deletion, and substitution) needed to transform one string into the other [5]. In other words, two strings are similar if their edit distance is smaller than a given threshold. Some examples of this approach are Hamming distance [6], Levenshtein distance [7], Damerau-Levenshtein [7], [8], Needleman-Wunsch [9], Longest Common Subsequence [10], Smith-Waterman [11], Jaro [12], Jaro-Winkler [13], and N-gram [14], [15]. Character-based measures are useful for recognizing typographical errors, but they are of little use for recognizing rearranged terms (e.g. "data analyzing" and "analyzing data") [16]. Edit distance is widely used in approximate string matching to handle inconsistencies in existing data [17].
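As an illustration of the edit distance idea, the following minimal Python sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. The function names and the normalisation into a [0, 1] similarity score (dividing by the longer string length) are assumptions made for this example; they are not taken from the libraries used in the evaluation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # (before the current character) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters of `a` into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1]; this particular normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("book", "cook"))      # 1 (substitute 'b' with 'c')
print(edit_similarity("book", "cook"))  # 0.75
print(levenshtein("car", "wheel"))      # 5 (the strings share no characters)
```

A threshold on either the distance or the normalised score can then be used to decide whether two strings count as similar.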
Term-based similarity is also known as token-based similarity because it models each string as a set of tokens. The similarity between strings can then be assessed by manipulating sets of tokens, such as words. The main idea behind this approach is to measure the similarity of two strings based on the tokens common to their token sets [18]. If the measured similarity is sufficiently high, the string pair is flagged as similar or duplicate. Term-based similarity addresses a drawback of character-based similarity when working on larger strings: character-based measures become too computationally expensive and less accurate for very large strings such as text documents [19]. This section introduces some familiar token-based similarity functions. The main characteristic of token-based similarity is its use of the overlap between two token sets to quantify likeness. The overlap is computed from exactly matching token pairs, without considering other similar tokens. Because the strings are broken into substrings, the token-based approach is useful for recognizing rearranged terms. Jaccard similarity [20], Dice's coefficient [21], Cosine similarity [22], Manhattan distance [23], and Euclidean distance [24] are some examples of these methods.
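The token-overlap idea can be sketched in a few lines of Python. The example below assumes lower-cased, whitespace-separated word tokens (a simplification chosen for illustration) and implements the usual set-based definitions of Jaccard similarity and Dice's coefficient, plus cosine similarity over token counts.

```python
import math
from collections import Counter


def tokens(text: str) -> list:
    # Whitespace tokenisation is an assumption made for this sketch.
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def dice(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 1.0


def cosine(a: str, b: str) -> float:
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


# Rearranged terms have identical token sets, so the overlap is complete.
print(jaccard("data analyzing", "analyzing data"))   # 1.0
print(dice("analyzing data", "analyzing text data"))  # 0.8
```

Because only exactly matching tokens count towards the overlap, a misspelled token contributes nothing here, which is the limitation that the hybrid measures try to address.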

Corpus-based Similarity
Corpus-based similarity uses a semantic approach. It determines the similarity between two concepts based on information extracted from large corpora. A corpus (plural: corpora) is a large collection of written or spoken text in electronic form. In a translation setting, a corpus contains a predefined set of sentences and their translations into another language, and the aim is to match the input text with text in the corpus to achieve translation [25]. Many corpus-based similarity or relatedness measures are based on concept-based resources, such as Wikipedia. Some corpus-based similarity measures are Hyperspace Analogue to Language (HAL) [26], Latent Semantic Analysis (LSA) [27], Explicit Semantic Analysis (ESA) [28], Pointwise Mutual Information (PMI), Normalized Google Distance (NGD) [29], and Extracting DIstributionally Similar words using CO-occurrence (DISCO) [30].
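To make the corpus-based idea concrete, the sketch below estimates Pointwise Mutual Information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), from co-occurrence within a sentence. The four-sentence corpus is a hypothetical toy example invented for illustration, and treating each sentence as one co-occurrence context is an assumption; real corpus-based measures rely on far larger collections.

```python
import math

# Hypothetical toy corpus; each string is treated as one co-occurrence context.
CORPUS = [
    "the car has a flat wheel",
    "the wheel of the car is new",
    "she read a book about cooking",
    "the cook wrote a book",
]


def pmi(x: str, y: str, contexts: list) -> float:
    """PMI with probabilities estimated as the fraction of contexts containing the word(s)."""
    docs = [set(c.split()) for c in contexts]
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return float("-inf")  # the words never co-occur in this corpus
    return math.log2(p_xy / (p_x * p_y))


print(pmi("car", "wheel", CORPUS))  # 1.0
print(pmi("car", "book", CORPUS))   # -inf
```

In this toy corpus "car" and "wheel" co-occur more often than chance would predict, even though the two strings share no characters.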

Knowledge-based Similarity
A knowledge-based similarity measure is a semantic similarity measure that uses information from semantic networks to identify the degree of similarity between words [31]. Knowledge-based similarity comprises semantic similarity and semantic relatedness, two concepts that have been widely discussed among researchers. Similarity holds between two interchangeable concepts, while relatedness associates concepts semantically [32]. The semantic approach uses an explicit representation of knowledge, such as the interconnection of facts, the meanings of words, and rules for drawing conclusions in specific domains. Knowledge representation schemas generally include inference rules, logical propositions, and semantic networks such as taxonomies and ontologies. Some available ontologies are WordNet, SENSUS, Cyc, UMLS, SNOMED, MeSH, GO, and STDS [33]. WordNet is the most popular ontology resource and is widely used in knowledge-based similarity measurement. It is a large lexical database of English developed in a research project at Princeton University. WordNet organizes nouns, verbs, adverbs, and adjectives into sets of synonyms (synsets), each of which represents one concept. The synsets are interlinked by both conceptual-semantic and lexical relations. The words in WordNet are structured hierarchically using hyponymy and hypernymy, so the words can easily be seen as concepts and WordNet can be interpreted as a taxonomy. Knowledge-based similarity approaches that use the WordNet ontology can be categorized into four types of measure: path-based, information content-based (IC-based), feature-based, and other types [34].

1) Path-based Measure
The principal idea of path-based measures (also known as edge-counting measures) is that the similarity between two concepts is a function of the length of the path between them and of their position in the taxonomy [35]. These measures use the length of the shortest path between concepts, as in the pioneering work of Rada et al. [36]; several measures following this approach are described in Lastra-Díaz and García-Serrano [37].
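A minimal sketch of edge counting on a taxonomy is given below. The small is-a hierarchy is a hypothetical toy example, not WordNet, and each concept is assumed to have a single parent. Alongside the shortest path length, the sketch includes the measure of Wu and Palmer, which appears in the experimental results, in its common form 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest common ancestor and the root is assumed to have depth 1.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "artifact": "entity",
    "vehicle": "artifact",
    "car": "vehicle",
    "bicycle": "vehicle",
    "publication": "artifact",
    "book": "publication",
}


def ancestors(concept: str) -> list:
    """Path from the concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lowest_common_subsumer(a: str, b: str) -> str:
    above_b = set(ancestors(b))
    for concept in ancestors(a):  # walks upwards, so the first hit is the deepest
        if concept in above_b:
            return concept
    raise ValueError("concepts do not share a root")


def path_length(a: str, b: str) -> int:
    """Number of edges on the shortest path between a and b."""
    lcs = lowest_common_subsumer(a, b)
    return ancestors(a).index(lcs) + ancestors(b).index(lcs)


def depth(concept: str) -> int:
    return len(ancestors(concept))  # the root has depth 1 in this sketch


def wu_palmer(a: str, b: str) -> float:
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(path_length("car", "bicycle"), wu_palmer("car", "bicycle"))  # 2 0.75
print(path_length("car", "book"), wu_palmer("car", "book"))        # 4 0.5
```

Shorter paths and deeper common ancestors both indicate more similar concepts, which is why "car" scores closer to "bicycle" than to "book" in this toy hierarchy.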

2) IC-based Measure
The IC-based approach takes the specificity of concepts into account in the similarity calculation. The core idea of IC-based similarity measures is to apply an information content (IC) model. The calculation depends on the frequencies of each concept and its descendants in a textual corpus [34]. The fundamental hypothesis is that a more abstract concept carries less information than a more specific one. The IC-based approach is seen as very promising and has become one of the mainstreams of research in the area, so it is still widely discussed in recent years. Recent research was conducted by Lastra-Díaz and García-Serrano [37], who introduced new ontology-based and IC-based similarity measures.
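The information content hypothesis can be illustrated with a Resnik-style calculation, one well-known IC-based measure (not necessarily one of the algorithms evaluated in this study). The IC of a concept is taken as log2(N / count), where the count includes the occurrences of all its descendants, and the similarity of two concepts is the IC of their deepest common ancestor. The taxonomy and the corpus frequencies below are hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and corpus frequencies.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "book": "entity",
}
OWN_COUNT = {"entity": 0, "vehicle": 2, "car": 5, "bicycle": 1, "book": 8}


def ancestors(concept: str) -> list:
    """Path from the concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def subsumed_count(concept: str) -> int:
    """Occurrences of the concept itself plus those of all its descendants."""
    return sum(n for c, n in OWN_COUNT.items() if concept in ancestors(c))


TOTAL = subsumed_count("entity")


def information_content(concept: str) -> float:
    # Abstract concepts subsume more occurrences and therefore carry less information.
    return math.log2(TOTAL / subsumed_count(concept))


def resnik(a: str, b: str) -> float:
    above_b = set(ancestors(b))
    lcs = next(c for c in ancestors(a) if c in above_b)
    return information_content(lcs)


print(information_content("entity"))  # 0.0 (the root carries no information)
print(resnik("car", "bicycle"))       # 1.0 (IC of "vehicle")
print(resnik("car", "book"))          # 0.0 (only the root is shared)
```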

3) Feature-based Measure
The main idea of the family of feature-based similarity measures is to use set-theoretic operations on the feature sets of concepts. A feature-based measure describes each term by a set of properties or features. Two terms are considered similar when they have more common features than distinguishing ones [38].
One classical feature-based measure is Tversky's model [39], which argues that similarity is not symmetric: the features shared between a subclass and its superclass contribute more to the similarity evaluation than those in the inverse direction. More recently, Sánchez and Batet [40] proposed using the overlap between ancestor sets to estimate the overlap between the unknown features of the concepts.
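The asymmetry in Tversky's model can be seen in a short sketch of the ratio form, |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|). The feature sets and the weights α = 0.8 and β = 0.2 are hypothetical values chosen only to illustrate the effect.

```python
def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky's ratio model; alpha and beta weight the features unique to a and to b."""
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return common / denominator if denominator else 1.0


# Hypothetical feature sets for two concepts.
car = {"wheels", "engine", "doors", "seats"}
cart = {"wheels", "seats"}

# With unequal weights, the score depends on the direction of comparison.
print(round(tversky(car, cart, alpha=0.8, beta=0.2), 4))  # 0.5556
print(round(tversky(cart, car, alpha=0.8, beta=0.2), 4))  # 0.8333

# With alpha = beta = 1 the measure is symmetric and equals the Jaccard index.
print(tversky(car, cart, alpha=1.0, beta=1.0))            # 0.5
```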

Hybrid Similarities
In addition to the three categories previously described, there are several similarity measures that cannot be assigned to any single one of those families. The idea of the hybrid approach is to combine the previously described approaches (string-based, corpus-based, and knowledge-based similarity) to reach a better metric by adopting their advantages.
Common examples of hybrid metrics are the Level2 method proposed by Monge and Elkan [41], SoftTFIDF [42], generalized edit similarity (GES), and the method of Wang et al. [5]. Monge and Elkan [41] proposed a recursive matching scheme to compare two long strings. In the implementation of this scheme the substrings are tokens, and the resulting functions are called level two distance functions. Cohen proposed the hybrid "soft" TF-IDF similarity metric, which uses the Jaro-Winkler [13] metric as the "secondary" similarity function. Wang et al. [5] also proposed a hybrid similarity function based on the token concept but, unlike classical token-based measures, they employed fuzzy matching between tokens. To quantify the similarity between tokens, this metric uses a character-based similarity function.
The most recent hybrid techniques extract semantic knowledge from the structural representation of WordNet and from statistical information on the Internet. Lin [43] proposed a novel hybrid semantic similarity measure based on linked data (LD), called TF-IDF (LD). The main idea of this algorithm is to combine a novel linked data-based TF-IDF scheme with the classical text-based cosine similarity measure. The algorithm is integrated in a semi-automatic system (Sherlock) for quiz generation using linked data and textual descriptions of RDF resources. Al-Hasan [44] proposed a new semantic similarity measure, Inferential Ontology-based Semantic Similarity (IOBSS), which takes into account the explicit hierarchical relationships and shared attributes between items in a specific domain. Atoum and Otoom [45] introduced a novel hybrid measure, tested on benchmark datasets, called text similarity measure (TSM). TSM uses information from WordNet semantic relations, such as exactly matching words, a comparison of the lengths of the sentence pair, and the similarity between a word and its reference.
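The level two idea can be sketched as follows: each token of the first string is matched against its best-scoring token in the second string using a character-level secondary similarity, and the best scores are averaged. This is the commonly cited form of the Monge-Elkan scheme. As a stand-in for the secondary function, the sketch uses the `ratio()` of Python's `difflib.SequenceMatcher`; this is an assumption made to keep the example within the standard library and is not the Jaro-Winkler function used by SoftTFIDF.

```python
from difflib import SequenceMatcher


def secondary(a: str, b: str) -> float:
    # Stand-in character-based similarity in [0, 1]; swapped in for illustration only.
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(s: str, t: str) -> float:
    """Average, over the tokens of s, of the best secondary similarity to any token of t."""
    tokens_s, tokens_t = s.lower().split(), t.lower().split()
    if not tokens_s or not tokens_t:
        return 0.0
    best_scores = [max(secondary(a, b) for b in tokens_t) for a in tokens_s]
    return sum(best_scores) / len(tokens_s)


# Rearranged tokens still match perfectly.
print(monge_elkan("data analyzing", "analyzing data"))  # 1.0

# A spelling variant lowers the score only slightly instead of breaking the match.
print(monge_elkan("data analysing", "analyzing data"))
```

Note that the score is computed from the point of view of the first string, so swapping the arguments can change the result when the strings have different numbers of tokens.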

Experimental Results
The results of the evaluation are shown in Table 1. For the first pair of texts, which represents two lexically similar terms, the algorithms in the string-based similarity approach give a high average score. The Jaro-Winkler and SoftTFIDF algorithms report the highest similarity level, with a score of 0.8333. Among the semantic similarity approaches, the highest value is 0.5000, obtained by the Wu-Palmer algorithm. This indicates that the pair ("book", "cook") has little or no semantic relevance.
For the second text pair ("car", "wheel"), almost all of the string-based approaches give a low similarity value. Lexically, these two texts clearly share no characters. Semantically, however, the terms "car" and "wheel" are closely related, and the semantic similarity approaches give a high average similarity, with the highest value, 0.9091, obtained by the Wu-Palmer algorithm.
For the third text pair, which represents terms with both lexical and semantic proximity, the maximum score is again obtained by the Wu-Palmer algorithm. For this pair, the string-based approaches also indicate similarity, but for lexical reasons. From this simple investigation, we can highlight that measuring text similarity with a semantic approach is able to reveal the relatedness between texts.
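The Jaro-Winkler score reported for the first pair can be checked with a short sketch of the textbook definition. The implementation below is written for illustration and assumes the usual parameters (prefix scale 0.1, at most four prefix characters); the library used in the evaluation may differ in its details.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # reward a shared prefix of up to four characters
        if a != b:
            break
        prefix += 1
    return score + prefix * scale * (1 - score)


print(round(jaro_winkler("book", "cook"), 4))  # 0.8333
print(jaro_winkler("car", "wheel"))            # 0.0
```

For ("book", "cook") three of the four characters match with no transpositions and there is no shared prefix, so the result is (3/4 + 3/4 + 3/3) / 3 = 0.8333, consistent with the score discussed above.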

Conclusion
This article has surveyed text similarity measurements categorized into four major groups: String-based, Corpus-based, Knowledge-based, and Hybrid similarities. The most common and familiar algorithms in each category have also been reviewed and can be grouped into lexical and semantic similarity. The results of the investigation show that the lexical similarity approach is appropriate when the purpose is to measure lexical similarity between texts while ignoring their meaning. These measurements can be used to identify duplication or plagiarism without regard to the document context. String similarity approaches are in principle language-independent, so they work well across different languages.
The semantic approach appears to bring intelligence to the measurement of similarity. It is well suited to finding texts or documents that are truly similar in the substance of their content. However, semantic similarity measures are usually language- and domain-dependent, so they are not applicable to all languages. In other words, if an ontology for a language is not yet available, it needs to be built first. Considering the text similarity approaches surveyed, semantic similarity is a very reasonable choice for finding document similarities. In future work, we intend to apply semantic similarity to text documents in order to reveal the natural relationships among terms.

Fig. 1. Four major groups of text similarity methods and algorithms
Prasetya et al. (The performance of text similarity algorithms)

Table 1. Lexical and Semantic Similarity Results