Association Rule Algorithm Sequential Pattern Discovery using Equivalent Classes (SPADE) to Analyze the Genesis Pattern of Landslides in Indonesia

I. Introduction
Indonesia is a disaster-prone country because it lies at the confluence of three major active tectonic plates: the Eurasian, Indo-Australian, and Pacific plates. According to Law No. 24 of 2007 on Disaster Management, a disaster is an event or series of events that threatens and disrupts lives and livelihoods, caused by natural, non-natural, or human factors, and that results in loss of human lives, environmental damage, loss of property, and psychological impact.
According to BNPB, landslides were the most frequent type of disaster, with 402 events recorded up to August 2015. A landslide is a downslope movement of soil, rock, or rock debris, including soil creep. It is caused by steep slopes, high rainfall, deforestation, mining activities, and erosion. The impacts of landslides include loss of property, damage to facilities such as homes and buildings, casualties, psychological trauma, economic disruption, and environmental damage [1].
Given these impacts, early mitigation requires knowing the pattern of association among the sequences of events in landslides. Searching for patterns or associative relationships in large-scale data is closely tied to data mining. Data mining is a series of processes for extracting added value from a data set in the form of knowledge that could not be found manually [2]. Sequential pattern mining is one method for finding such patterns: it obtains useful information by searching for frequent sequences, that is, particular sequences of events that occur often [3]. One algorithm for this task is Sequential Pattern Discovery using Equivalent Classes (SPADE). SPADE uses a vertical id-list representation for easy retrieval from the database, and it can find frequent sequences with only a few database scans [3]. Against this background, this research examines which patterns are formed among the sequences of landslide events using the SPADE algorithm.

II. Related Works
Disasters, and landslides in particular, have been investigated by several researchers. First, one study analyzed land use change and landslide characteristics for community-based disaster mitigation. The results show that a change in vegetation cover modifies landslide area and frequency, and that areas with changed land use have higher landslide ratios than unchanged areas. Land use management and community-based disaster prevention are therefore needed in the mountainous areas of Taiwan for hazard mitigation [4]. Second, another study analyzed landslide damage and the disaster management system in Nepal. The results show that the landslide in Nepal was caused mainly by the combined effect of high rainfall, a steep slope, and unconsolidated rock at the bed. The debris mass flowed along with the flood and caused damage downstream of the watershed. The existing landslide disaster management system in Nepal is weak, so disaster management must be considered as a part of rural development [5]. Third, a framework for regional association rule mining and scoping in spatial datasets, which could also be applied to the landslide case, obtained spatial risk patterns and risk zones of arsenic in the Texas water supply [6]. Finally, a study on mining conjunctive sequential patterns found that the newly introduced patterns have high potential for real-life applications such as the landslide case [7].

A. Association Rule
Association rules are one of the main techniques in data mining and the most commonly used for finding patterns in a data set [8]. Support is a measure that indicates how dominant an item or item set is across all transactions [9]. In this study, support is the probability that a sequence of events in a single landslide incident also occurs across the other landslide incidents [10]. The support of an item is calculated as in (1).

$$\mathrm{Support}(X) = P(X) = \frac{n(X)}{n(S)} \tag{1}$$
where $P(X)$ is the probability of event X, $n(X)$ is the number of transactions containing event X, and $n(S)$ is the number of transactions in database S. Confidence measures the strength of the relationship between items in an association rule. In this research, confidence is defined as the probability that certain items (the chronology of the landslide) occur in a single, interconnected event, given that one part of the chronology has occurred as a result of the causes of the landslide. The confidence of a combination of items is calculated as in (2).

$$\mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} \tag{2}$$

where $P(Y \mid X)$ is the conditional probability that Y occurs given that X has occurred, $P(X \cap Y)$ is the probability that X and Y occur simultaneously, and $P(X)$ is the probability that X occurs. Besides these two parameters, a better way to determine the strength of an association rule is the lift ratio. The lift ratio indicates the strength of a rule relative to the random co-occurrence of the antecedent (X) and the consequent (Y), based on the support of each, as expressed in (3).

$$\mathrm{Lift}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)\,P(Y)} \tag{3}$$

where $P(X \cap Y)$ is the probability that X and Y occur simultaneously, $P(X)$ is the probability that X occurs, and $P(Y)$ is the probability that Y occurs.
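
The three measures can be illustrated with a minimal Python sketch. The toy transactions and the function names `support`, `confidence`, and `lift` are assumptions made for illustration only; they are not the study's data or code.

```python
def support(transactions, itemset):
    """P(X): fraction of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """P(Y | X) = P(X and Y) / P(X)."""
    return support(transactions, x | y) / support(transactions, x)


def lift(transactions, x, y):
    """P(X and Y) / (P(X) * P(Y)); 1.0 means X and Y look independent."""
    joint = support(transactions, x | y)
    return joint / (support(transactions, x) * support(transactions, y))


# Hypothetical toy data: each set lists the events recorded for one incident.
transactions = [
    {"heavy_rain", "landslide", "road_blocked"},
    {"heavy_rain", "landslide"},
    {"heavy_rain", "flood"},
    {"landslide", "road_blocked"},
]

x, y = {"heavy_rain"}, {"landslide"}
sup_x = support(transactions, x)         # 3 of 4 transactions
conf_xy = confidence(transactions, x, y)  # 0.5 / 0.75
lift_xy = lift(transactions, x, y)        # 0.5 / (0.75 * 0.75)
```

In this toy data the lift is below 1, so the two events co-occur slightly less often than independence would predict.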

B. Sequence Pattern Mining
Sequential pattern mining is used for data that has an order, such as a sequence of transactions. It was first introduced by Agrawal and Srikant. The task can be described as follows: given a set of sequences, where each sequence consists of a series of elements and each element consists of a number of items, and given a user-specified minimum support, sequential pattern mining searches for all frequent subsequences, that is, subsequences whose frequency of occurrence is greater than the minimum support [11]. This problem can be solved by several methods, one of which is SPADE (Sequential Pattern Discovery Using Equivalence Classes).
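
To make the notion of subsequence frequency concrete, the following Python sketch counts how many sequences in a small database contain a given pattern. The database contents and the helper names `contains` and `sequence_support` are hypothetical and chosen only for illustration.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence.

    Both are lists of sets. Each pattern element must be a subset of a
    distinct element of the sequence, in the same order.
    """
    pos = 0
    for event in sequence:
        if pos < len(pattern) and pattern[pos] <= event:
            pos += 1
    return pos == len(pattern)


def sequence_support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


# Hypothetical toy database: four sequences, each a list of item sets.
database = [
    [{"A"}, {"B"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"B"}, {"A"}],
    [{"A"}, {"D"}],
]

sup_a_then_c = sequence_support(database, [{"A"}, {"C"}])
sup_a_then_b = sequence_support(database, [{"A"}, {"B"}])
```

With a minimum support of 1, the pattern A followed by C (support 2) would count as frequent in this toy database, while A followed by B (support 1) would not.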

C. SPADE (Sequential Pattern Discovery Using Equivalence Classes)
The SPADE algorithm (Sequential Pattern Discovery using Equivalence classes) is an algorithm for the rapid discovery of sequential patterns [11]. Here, a class is a collection of objects that share the same attributes or parameters, while frequency is the number of times data has the same value. The sequence mining problem can be stated as follows. Let I = {i1, i2, … , im} be a set of m objects (items) forming the alphabet. An event is a collection of items, denoted (i1, i2, … , ik), where each ij is an item. A sequence is an ordered list of events, denoted α = (α1 → α2 → … → αq), where each αj is an event. A sequence with k items, where k = Σj |αj|, is called a k-sequence. The steps of the SPADE algorithm for finding frequent sequences and then deriving rules from them are as follows [12].
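
As a small worked example of the k-sequence definition, the sketch below represents a sequence as a list of events (sets of items) and computes k as the total number of items. The representation and the name `sequence_size` are assumptions for illustration.

```python
def sequence_size(sequence):
    """Return k, the sum of the event sizes |alpha_j| over the sequence."""
    return sum(len(event) for event in sequence)


# The sequence (A,B) -> C has two events and three items: a 3-sequence.
alpha = [{"A", "B"}, {"C"}]
k = sequence_size(alpha)
```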

Determine frequent 1-sequence
Scan each item set in the sequence database and save an id-list for each item, consisting of (sid, eid) pairs. Then scan each id-list: every time a sid is encountered that has not been seen before, increment the support. An item is included in the frequent 1-sequences if its support is greater than min_sup.
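
This step can be sketched in Python as follows. The toy database, the helper names, and the inclusive comparison (`>=`) against min_sup are assumptions; a strict threshold would use `>` instead.

```python
from collections import defaultdict

# Hypothetical toy sequence database: sid -> list of (eid, items at that eid).
db = {
    1: [(1, {"A"}), (2, {"B"}), (3, {"C"})],
    2: [(1, {"A", "B"}), (2, {"C"})],
    3: [(1, {"B"}), (2, {"A"})],
    4: [(1, {"A"}), (2, {"D"})],
}


def build_id_lists(db):
    """Vertical format: map each item to its list of (sid, eid) pairs."""
    id_lists = defaultdict(list)
    for sid, events in db.items():
        for eid, items in events:
            for item in items:
                id_lists[item].append((sid, eid))
    return dict(id_lists)


def support_count(id_list):
    """Support = number of distinct sids in the id-list."""
    return len({sid for sid, _ in id_list})


def frequent_1_sequences(db, min_sup):
    """Keep items whose support reaches min_sup (inclusive: an assumption)."""
    id_lists = build_id_lists(db)
    return {item: idl for item, idl in id_lists.items()
            if support_count(idl) >= min_sup}


frequent_1 = frequent_1_sequences(db, min_sup=2)
```

Here A, B, and C survive with supports 4, 3, and 2, while D (support 1) is dropped.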

Determine frequent 2-sequence
The input is the set of frequent 1-sequences. Combine each frequent 1-sequence with every other frequent 1-sequence. For example, if 1-sequence A is merged with 1-sequence B, three 2-sequences are possible: A,B, where A and B appear together in the same transaction; A → B, where item B appears after item A; and B → A, where item A appears after item B. For every such merger, check whether the two id-lists share a sid; if they do, compare the eid of 1-sequence A with the eid of 1-sequence B. If the eids are equal, the id-list entry is included in the 2-sequence A,B. If the eid of B is greater than that of A, it is included in the 2-sequence A → B. If the eid of A is greater than that of B, it is included in the 2-sequence B → A. Then, as for the frequent 1-sequences, increment the support for each sid not encountered before. Finally, check whether the support of each 2-sequence is greater than min_sup; if so, it is included in the frequent 2-sequences.
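
The id-list comparison for one pair of items can be sketched as below. The two id-lists are hypothetical toy values, and the function name `join_id_lists` and the result keys are assumptions for illustration. For the two ordered cases, the eid of the later item is stored.

```python
def join_id_lists(idl_a, idl_b):
    """Join two 1-sequence id-lists into the three possible 2-sequences."""
    together, a_then_b, b_then_a = set(), set(), set()
    for sid_a, eid_a in idl_a:
        for sid_b, eid_b in idl_b:
            if sid_a != sid_b:
                continue  # only entries from the same sequence can combine
            if eid_a == eid_b:
                together.add((sid_a, eid_a))   # A,B in the same event
            elif eid_b > eid_a:
                a_then_b.add((sid_a, eid_b))   # B occurs after A
            else:
                b_then_a.add((sid_a, eid_a))   # A occurs after B
    return {
        "together": sorted(together),
        "a_then_b": sorted(a_then_b),
        "b_then_a": sorted(b_then_a),
    }


def support_count(id_list):
    """Support = number of distinct sids in the id-list."""
    return len({sid for sid, _ in id_list})


# Hypothetical id-lists of (sid, eid) pairs for items A and B.
idl_a = [(1, 1), (2, 1), (3, 2), (4, 1)]
idl_b = [(1, 2), (2, 1), (3, 1)]

joined = join_id_lists(idl_a, idl_b)
supports = {name: support_count(idl) for name, idl in joined.items()}
```

Each of the three candidate 2-sequences has support 1 in this toy input, so whether any is kept depends on the chosen min_sup.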

Determine frequent k-sequence
After the frequent 2-sequences are determined, the same process is repeated to find the next frequent sequences, the frequent k-sequences. A frequent k-sequence is obtained by joining frequent (k-1)-sequences that share the same prefix. For example, 3-sequences are formed by combining frequent 2-sequences with the same prefix, 4-sequences by combining frequent 3-sequences with the same prefix, and so on. The prefix of a sequence is obtained by removing its last item. For example, the prefix of the 4-sequence A → B → C → D is A → B → C. Each merger has three possible cases, depending on the form of the two sequences being joined. For each resulting candidate, check whether its support meets min_sup; if it does, the sequence is included in the frequent k-sequences. The search terminates when no frequent (k-1)-sequences remain that can be joined or when no further frequent k-sequences are found.
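
The prefix-based grouping can be sketched in Python as follows. For simplicity the sketch assumes every sequence is a plain chain of single items written as a tuple, so ("A", "B") stands for A → B; the names `prefix`, `prefix_classes`, and `joinable_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def prefix(seq):
    """Prefix of a sequence: the sequence without its last item."""
    return seq[:-1]


def prefix_classes(frequent):
    """Group frequent (k-1)-sequences by their shared prefix."""
    classes = defaultdict(list)
    for seq in frequent:
        classes[prefix(seq)].append(seq)
    return dict(classes)


def joinable_pairs(frequent):
    """Pairs of sequences that share a prefix and can therefore be joined."""
    pairs = []
    for members in prefix_classes(frequent).values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


# Hypothetical frequent 2-sequences: A -> B, A -> C, B -> C.
frequent_2 = [("A", "B"), ("A", "C"), ("B", "C")]
pairs = joinable_pairs(frequent_2)
```

Only A → B and A → C share the prefix A, so they form the single joinable pair; B → C has no partner in its class.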

Establishment of Rule
After all frequent sequences are found, rules are derived from them. 1-sequences are not used to establish rules because they consist of only one item. For a 2-sequence, the antecedent is the first item and the consequent is the second item.
For example, the sequence A → B yields the rule A => B. For a sequence longer than 2 (a k-sequence), the last item is the consequent and all items before it form the antecedent. For example, the 4-sequence A → B → C → D yields the rule A → B → C => D.
The confidence of each rule is then calculated. If it meets the min_conf threshold, the rule is accepted.
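
Rule establishment can be sketched as below, assuming the support count of every frequent sequence is already known. Sequences are written as tuples of single items, confidence is taken as the support of the whole sequence divided by the support of its antecedent, and the support values and the name `generate_rules` are hypothetical.

```python
def generate_rules(supports, min_conf):
    """Turn frequent sequences into rules: all-but-last item => last item."""
    rules = []
    for seq, sup in supports.items():
        if len(seq) < 2:
            continue  # 1-sequences cannot form a rule
        antecedent, consequent = seq[:-1], seq[-1]
        antecedent_sup = supports.get(antecedent)
        if not antecedent_sup:
            continue  # antecedent support unknown, skip the rule
        conf = sup / antecedent_sup
        if conf >= min_conf:
            rules.append((antecedent, consequent, conf))
    return rules


# Hypothetical support counts for frequent sequences.
supports = {
    ("A",): 4,
    ("B",): 3,
    ("A", "B"): 2,
    ("A", "B", "C"): 1,
}

rules = generate_rules(supports, min_conf=0.5)
```

Both A => B and A → B => C reach a confidence of 0.5 in this toy input and are therefore accepted at min_conf = 0.5.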

IV. Results and Discussion
The population in this research is the data on landslide occurrences in Indonesia, and the sample is the chronology of landslide events from August 2011 to June 2015. The data are secondary data obtained from the website of the Indonesian National Board for Disaster Management. The research variable is the chronology, that is, the sequence of events that occurred in a landslide. This research uses the Sequential Pattern Discovery using Equivalent Classes (SPADE) algorithm. The analysis consists of four steps, as follows:

A. Selection Data
This research uses landslide data from the Indonesian National Board for Disaster Management; examples of the data are shown in Table 1. The landslide data have many attributes, such as Date, Location, Victim, Losses, and Description. Not all of them are used in this research, so the data are preprocessed to retain only the attributes used in the study.

B. Cleaning Data
This phase cleans the data by removing what is not needed, in order to reduce data errors and duplication. For example, attributes such as Date, Location, Victim, and Losses are removed because they are not used to establish the rules. An example of the result of the cleaning process is shown in Table 2.
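
A minimal sketch of this attribute removal is given below. It assumes that each record is a dictionary keyed by the attribute names above and that only Description is kept; the record values are made-up placeholders, not entries from Table 1, and the name `select_attributes` is hypothetical.

```python
def select_attributes(records, keep=("Description",)):
    """Keep only the wanted attributes and drop records where they are empty."""
    cleaned = []
    for record in records:
        row = {name: (record.get(name) or "").strip() for name in keep}
        if all(row.values()):
            cleaned.append(row)
    return cleaned


# Placeholder records with the attribute names used in the text.
records = [
    {"Date": "d1", "Location": "loc1", "Victim": "0", "Losses": "x",
     "Description": " heavy rain, then landslide "},
    {"Date": "d2", "Location": "loc2", "Victim": "1", "Losses": "y",
     "Description": ""},
]

cleaned = select_attributes(records)
```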
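In the merging step of the frequent k-sequence search, the possible results depend on the form of the two sequences being combined. If A,B is combined with A,C, the only possible result is A,B,C. If A,B is combined with A → C, the only possible result is A,B → C. If A → B is combined with A → C, there are three possible outcomes: A → B,C, A → B → C, and A → C → B.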
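
The merge cases for two sequences that share the prefix A can be sketched as follows. The sketch assumes each sequence is described by its last item plus a label saying whether that item occurs in the same event as the prefix ("event") or after it ("sequence"). The third case is read as A → B combined with A → C, following the usual description of SPADE. The function name `merge_atoms` and the string output format are hypothetical.

```python
def merge_atoms(prefix, atom1, atom2):
    """Candidates from joining two sequences that share the same prefix.

    Each atom is (item, kind): kind "event" means the item is in the same
    event as the prefix, kind "sequence" means it occurs after the prefix.
    """
    (x, kind_x), (y, kind_y) = atom1, atom2
    if kind_x == "event" and kind_y == "event":
        return [f"{prefix},{x},{y}"]
    if kind_x == "event" and kind_y == "sequence":
        return [f"{prefix},{x} -> {y}"]
    if kind_x == "sequence" and kind_y == "event":
        return [f"{prefix},{y} -> {x}"]
    # Both items occur after the prefix: three candidate orderings.
    return [
        f"{prefix} -> {x},{y}",
        f"{prefix} -> {x} -> {y}",
        f"{prefix} -> {y} -> {x}",
    ]


case_1 = merge_atoms("A", ("B", "event"), ("C", "event"))
case_2 = merge_atoms("A", ("B", "event"), ("C", "sequence"))
case_3 = merge_atoms("A", ("B", "sequence"), ("C", "sequence"))
```

Each candidate returned here would still need its support checked against min_sup before being kept as frequent.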