Automatic Text Summarization Using Latent Dirichlet Allocation (LDA) for Document Clustering

ABSTRACT


I. Introduction
Document summarization is the process of reducing the volume of a document into a more concise form by extracting its core content and removing terms considered unimportant, without changing the document's meaning. There are two types of document summarization: abstraction and extraction. Abstraction generates an interpretation of the original text, in which sentences are transformed into shorter sentences [1]. Extraction produces a summary by restating the passages considered to be the main topics in a simplified form [2] [3]. This research uses extractive summary features as a model for automatic document summarization.
The implementation of summarization techniques for document clustering has a significant impact, because the clustering process is usually constrained by the volume of the documents. A large document volume translates into a large document-term matrix, yet not all terms are relevant, and redundant terms can make the clustering process suboptimal [4].
In the model of automatic document summarization, the Feature Based and Latent Dirichlet Allocation algorithms can be used for the sentence reduction process [5]. Previous studies show that Feature Based algorithms, used for automatic document reduction as a feature for generating document clustering, achieve better accuracy than standard feature reduction techniques [6] [7]. Document clustering is the process of grouping a document dataset into clusters according to the similarity of document data patterns; documents without similarity are grouped into other clusters [7]. K-means is one of the best-known clustering algorithms and is frequently used to solve clustering problems by grouping data into k clusters, where the number k is defined in advance [8].

A. Web Crawler
A web crawler is one of the main components of a web search engine, and web crawlers have grown in step with the growth of the web itself. The crawler starts from a list of URLs, each called a seed. The crawler visits each URL, identifies the hyperlinks in the page, and adds them to the list of URLs to visit; this list is termed the crawl frontier. Using a set of rules and policies, the URLs in the frontier are visited individually. Pages from the internet are downloaded by the parser and the generator and stored in the database system of the search engine. The URLs are then placed in a queue, scheduled by the scheduler, and accessed one by one by the search engine whenever required. The links and related files being searched can thus be made available at a later time according to the requirements. With the help of suitable algorithms, web crawlers find the relevant links for the search engines to use further. The large amounts of data involved are stored in database systems such as DB2.

C. Clustering Method
The k-means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields. The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean $\mu_j$ of the samples in the cluster.
The means are commonly called the cluster centroids; note that they are not, in general, points from X, although they live in the same space. The k-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right) \qquad (1)$$

Inertia, or the within-cluster sum-of-squares criterion, can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:
- Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or to manifolds with irregular shapes.
- Inertia is not a normalized metric: we only know that lower values are better and zero is optimal. Moreover, in very high-dimensional spaces, Euclidean distances tend to become inflated (an instance of the so-called "curse of dimensionality"). Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.
K-means is often referred to as Lloyd's algorithm. In basic terms, the algorithm has three steps. The first step chooses the initial centroids, the most basic method being to choose k samples from the dataset X. After initialization, k-means loops between the two other steps. The first of these assigns each sample to its nearest centroid. The second creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value falls below a threshold. In other words, it repeats until the centroids no longer move significantly.
Mean shift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates and form the final set of centroids.
Given a candidate centroid $x_i$ at iteration $t$, the candidate is updated according to the following equation:

$$x_i^{t+1} = x_i^t + m(x_i^t)$$

where $N(x_i)$ is the neighborhood of samples within a given distance around $x_i$, and $m$ is the mean shift vector, computed for each centroid, that points towards a region of maximum increase in the density of points. It is computed using the following equation, which effectively updates a centroid to be the (kernel-weighted) mean of the samples within its neighborhood:

$$m(x_i) = \frac{\sum_{x_j \in N(x_i)} K(x_j - x_i)\, x_j}{\sum_{x_j \in N(x_i)} K(x_j - x_i)} - x_i$$

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.
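The update above can be sketched with a flat (uniform) kernel, where K weights every neighbor equally, so the candidate moves to the plain mean of its neighborhood. The bandwidth value and the merge threshold below are illustrative assumptions, not parameters from the paper:

```python
import numpy as np

def mean_shift(X, bandwidth, n_iter=100):
    """Flat-kernel mean shift: every sample starts as a candidate centroid."""
    centroids = X.copy()
    for _ in range(n_iter):
        moved = np.empty_like(centroids)
        for i, c in enumerate(centroids):
            # N(x_i): samples within `bandwidth` of the candidate.
            neighbors = X[np.linalg.norm(X - c, axis=1) <= bandwidth]
            # Update the candidate to the mean of its neighborhood.
            moved[i] = neighbors.mean(axis=0)
        if np.allclose(moved, centroids):
            break
        centroids = moved
    # Post-processing: merge near-duplicate candidates into final centroids.
    final = []
    for c in centroids:
        if all(np.linalg.norm(c - f) > bandwidth / 2 for f in final):
            final.append(c)
    return np.array(final)
```

With two well-separated blobs and a suitable bandwidth, all candidates collapse onto the two blob means and the post-processing step leaves exactly two centroids.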

A. Preprocessing Stages
Preprocessing stages are the stages carried out before the clustering process begins. This step is necessary so that the documents obtained from crawling are in proper form and can be used in the following process.
In this paper, three preprocessing stages are used: tokenization, stopword removal, and stemming.

1) Tokenization
The tokenization stage cuts the input string into the individual words composing a sentence. For example, the input text "Membuat campuran warna" ("making a colour mixture") yields the tokens "membuat", "campuran", and "warna".
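A minimal sketch of this stage; lowercasing is an added normalisation assumption, and whitespace splitting stands in for a full tokenizer:

```python
def tokenize(text):
    # Cut the input string into the words composing the sentence.
    # Lowercasing is a common extra normalisation step, assumed here.
    return text.lower().split()
```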

2) Stopword
In the stopword stage, words that are irrelevant for determining the topic of a document are eliminated, e.g. the words "di", "pada", "dari", "atau" ("in", "at", "from", "or"), and other such words in Bahasa Indonesia.
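This stage reduces to list-based filtering; the stopword list below is a small illustrative subset (a real system would use a fuller Bahasa Indonesia list):

```python
# Illustrative stopword list only; "dan" and "yang" ("and", "which")
# are our additions beyond the examples in the text.
STOPWORDS = {"di", "pada", "dari", "atau", "dan", "yang"}

def remove_stopwords(tokens):
    # Drop words that carry no topical information.
    return [t for t in tokens if t not in STOPWORDS]
```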

3) Stemming
Stemming is the stage of finding the root, or base word, of each word resulting from the filtering stage. For example, the word "campuran" is reduced to its root "campur", and "membuat" to its root "buat".
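A toy affix-stripping sketch of this stage; production systems for Bahasa Indonesia use dictionary-based stemmers such as the Nazief-Adriani algorithm, and the affix lists below are illustrative, not complete:

```python
# Toy illustration only: a real Indonesian stemmer uses a root-word
# dictionary and recoding rules, not bare affix stripping.
PREFIXES = ("mem", "me", "di", "ke", "ber")
SUFFIXES = ("kan", "an", "i")

def toy_stem(word):
    for p in PREFIXES:
        # Strip one prefix, keeping at least a 3-letter stem.
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        # Strip one suffix, keeping at least a 3-letter stem.
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```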

Automatic Text Summarization
Automatic text summarization produces a concise form of a document by eliminating terms considered irrelevant or redundant while keeping the document's core meaning. Even when the related document has a large volume, users can therefore understand its core meaning quickly and correctly [9] [10].

C. Feature based Method
Several feature-based method phases are used in this paper, as follows:
- Title feature

D. K-Means
In pseudocode, k-means is as follows:

    Initialize m_i, i = 1, ..., k (for example, to k randomly chosen samples x^t)
    Repeat
        For all x^t in X:
            b_i^t <- 1 if ||x^t - m_i|| = min_j ||x^t - m_j||
            b_i^t <- 0 otherwise
        For all m_i, i = 1, ..., k:
            m_i <- sum_t (b_i^t x^t) / sum_t (b_i^t)
    Until all m_i converge

The vector m contains the sample mean of each cluster, x^t refers to each of our examples, and b contains the estimated cluster labels.
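The pseudocode translates directly into a short NumPy sketch; the fixed seed and the absence of empty-cluster handling are simplifying assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize m_i to k randomly chosen samples x^t.
    m = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: b_i^t = 1 for the nearest centroid of each
        # sample, stored compactly as an index per sample.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - m[None, :], axis=2), axis=1)
        # Update step: m_i = sum_t b_i^t x^t / sum_t b_i^t
        # (the mean of the samples assigned to centroid i).
        new_m = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_m, m):  # until the m_i converge
            break
        m = new_m
    return m, labels
```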

E. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a statistical model that tries to capture the latent topics in a collection of documents. LDA was first introduced by David Blei in 2003 [7]. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. One important assumption about the LDA generative model is that the number of topics is known in advance.
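The generative assumption can be illustrated in a few lines of NumPy; the two-topic, four-word vocabulary and all probabilities below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters only: 2 topics over a 4-word vocabulary.
vocab = ["ekonomi", "pasar", "warna", "campuran"]
topics = np.array([[0.5, 0.5, 0.0, 0.0],   # topic 0: finance words
                   [0.0, 0.0, 0.5, 0.5]])  # topic 1: colour words
alpha = [0.5, 0.5]  # Dirichlet prior over per-document topic mixtures

def generate_document(n_words=10):
    # 1. Draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Draw a topic z for this word position from theta,
        # 3. then draw the word from that topic's word distribution.
        z = rng.choice(len(alpha), p=theta)
        words.append(rng.choice(vocab, p=topics[z]))
    return words
```

Inference in LDA runs this process in reverse: given only the documents, it estimates the topic-word distributions and per-document mixtures.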

F. Vector Space Model Document Representation
The Vector Space Model (VSM) transforms a document collection into a term-document matrix [9]. In figure 1, d refers to a document and w is the weight or value of each term.
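A minimal sketch of building such a term-document matrix, using raw term frequency as the weight w; the three short documents are invented examples:

```python
from collections import Counter

docs = ["membuat campuran warna",
        "campuran warna merah",
        "warna biru"]

# Vocabulary: one column per distinct term, sorted for a stable order.
terms = sorted({t for d in docs for t in d.split()})

# One row per document d; each entry w is the term's frequency in d.
matrix = [[Counter(d.split()).get(t, 0) for t in terms] for d in docs]
```

Real systems usually replace raw frequency with a TF-IDF weighting, but the matrix shape is the same.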

H. Similarity Measure
In this study, the similarity of documents was calculated by measuring the distance between two documents $d_i$ and $d_j$ using the cosine similarity formula. In the VSM, a document is represented as d = {w_1, w_2, w_3, ..., w_n}, where d is the document and w is the weight of each term in the document [14]. The similarity measure is:

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}$$

Clustering quality is evaluated per category i and cluster j through recall and precision:

$$\mathrm{Recall}(i, j) = \frac{n_{ij}}{n_i}, \qquad \mathrm{Precision}(i, j) = \frac{n_{ij}}{n_j}$$

where $n_{ij}$ is the number of documents of category i in cluster j, $n_i$ is the number of documents of category i, and $n_j$ is the number of documents in cluster j.
The F-measure is defined as:

$$F(i, j) = \frac{2 \cdot \mathrm{Precision}(i, j) \cdot \mathrm{Recall}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)}$$

The overall F-measure is the mean over categories, weighted by category size:

$$F = \sum_i \frac{n_i}{n} \max_j F(i, j)$$

where $\max_j F(i, j)$ is the maximum F-measure value of category i over all clusters j.
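The two measures can be sketched directly; the vectors and counts in the assertions are invented examples:

```python
import math

def cosine_similarity(di, dj):
    # cos(d_i, d_j) = (d_i . d_j) / (||d_i|| ||d_j||)
    dot = sum(a * b for a, b in zip(di, dj))
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    return dot / (norm_i * norm_j)

def f_measure(n_ij, n_i, n_j):
    # n_ij: documents of category i in cluster j,
    # n_i: documents of category i, n_j: documents in cluster j.
    recall = n_ij / n_i
    precision = n_ij / n_j
    return 2 * precision * recall / (precision + recall)
```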

A. Dataset
This research uses a dataset consisting of 398 articles in Bahasa Indonesia, obtained from public blog articles using a Python Scrapy crawler and scraper. The dataset was then transformed to acquire the relevant attributes, matching the input format of the document clustering algorithm.
The authors manually categorized these 398 articles into five sections: economy news, market reports, government, finance, and finance information.

B. Performance Evaluation Measure
Evaluation is done by observing the clustering results from testing the proposed method using the LDA algorithm. This study used the F-measure to assess clustering performance. The F-measure is obtained from recall and precision: recall is the ratio of retrieved relevant documents to the total number of relevant documents in the collection, while precision is the ratio of retrieved relevant documents to the total number of retrieved documents. The results are validated by comparing the evaluation results of the methods. Table 1 compares the results of several tested models and the proposed model. The results show that the LDA method improves clustering accuracy for Bahasa Indonesia documents: the highest average accuracy, 72%, was obtained using automatic document summarization with LDA at a 40% summarization ratio, compared to 66% for the traditional k-means method. Table 1 also shows that, overall, clustering with automatic text summarization performs better than clustering without it.

V. Conclusion
In this paper, we have presented LDA-based automatic text summarization for document clustering in Bahasa Indonesia. Our experiments involved a dataset of 398 articles from public blogs, collected using a Python Scrapy crawler and scraper. Comparing our summarizer with the traditional k-means and feature-based methods, the results show that the best average precision for text summarization in document clustering was produced by the LDA method. The experimental results indicate that LDA-based summarization can improve the accuracy of document clustering.