Clustering stationary and non-stationary time series based on autocorrelation distance of hierarchical and k-means algorithms

ABSTRACT


I. Introduction
Time series modeling offers many methods. The most important step in developing a time series model is the first one: parameter identification using the autocorrelation function (ACF) plot of either stationary or non-stationary time series data. When the time series plot shows a trend and the ACF plot decreases slowly, the time series is non-stationary [1], [2]. This step is easy and fast for small datasets. However, it may take much longer for very large datasets. The research problem is therefore how to find a suitable method for identifying large-scale time series data.
Identification and classification of large-scale time series data has been done by clustering the time series. This type of clustering differs from the clustering of cross-section data, especially in the choice of distance measure. Manso [3] created an R package for clustering stationary time series, used the package for time series analysis combined with clustering, and proposed a time series clustering method. D'Urso and Maharaj [4] proposed a fuzzy clustering approach based on the autocorrelation function, which was applied to data with strong autocorrelation. However, calculating the distance values becomes complex and problematic for enormous datasets.
Therefore, the main contribution of this research is to find the most accurate method based on ACF distance for stationary and non-stationary time series data by comparing hierarchical clustering and the K-Means algorithm. Hierarchical clustering groups the data through a dendrogram, a cluster tree that shows the grouping at different scales [5]. In contrast, the k-means clustering algorithm forms groups around cluster centers: each object is assigned to the cluster with the closest center, and the cluster centers are updated until none of the cluster centroids changes [6]. Hierarchical clustering and k-means clustering are widely used because of their efficiency, scalability, and simplicity. The experiment was conducted on simulated data and a real data sample to assess the accuracy of both methods.

A. Stationarity Model for Time Series
The process $\{Y_t\}$ fulfills the stationarity assumption if the joint distribution of ...
In an autoregressive (AR) model, the current value of $Y_t$ is a linear combination of its own past values plus a random variable $e_t$, which represents the other factors not explained by the model.
Assuming that the series is partly AR and partly moving average (MA), we obtain a quite general time series model.
In that case, $\{Y_t\}$ is a mixed ARMA process of orders p and q.
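For reference, the three models described above are commonly written as follows. This is a sketch in standard textbook notation, assuming the sign convention in which the moving-average coefficients enter with a minus sign; the coefficient symbols $\phi_i$ and $\theta_j$ do not appear in the text above and are introduced here for illustration.

```latex
\begin{align}
\text{AR}(p):\quad & Y_t = \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + e_t \\
\text{MA}(q):\quad & Y_t = e_t - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q} \\
\text{ARMA}(p,q):\quad & Y_t = \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p}
                        + e_t - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q}
\end{align}
```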

B. Non-Stationary Model of Mean
A time series model is called non-stationary in the mean if $E(Y_t) \neq E(Y_{t-k})$, and non-stationary in the variance if $\operatorname{Var}(Y_t) \neq \operatorname{Var}(Y_{t-k})$. This paper focuses on models that are non-stationary in the mean. One such model is the Autoregressive Integrated Moving Average (ARIMA) model [9], [10]. To obtain a stationary model from a non-stationary one, the differencing method should be applied. The ARIMA model is

$$\phi_p(B)(1 - B)^d Y_t = \theta_0 + \theta_q(B)\, e_t$$
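The effect of differencing can be illustrated with a minimal Python sketch. It assumes the simplest model that is non-stationary in the mean, a random walk $Y_t = Y_{t-1} + e_t$ with standard normal noise; the function names `difference` and `random_walk` are hypothetical and chosen only for this example.

```python
import random

def difference(series, d=1):
    """Apply the differencing operator (1 - B) to a series d times."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out

def random_walk(n, seed=0):
    """Simulate Y_t = Y_{t-1} + e_t with standard normal noise e_t."""
    rng = random.Random(seed)
    y, level = [], 0.0
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        y.append(level)
    return y

walk = random_walk(150)          # non-stationary in the mean
steps = difference(walk, d=1)    # first difference recovers the noise terms
```

Each differencing pass shortens the series by one observation, so `steps` has 149 values for a walk of length 150.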

D. Cluster Time Series
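Clustering is an unsupervised learning task that aims to partition a set of unlabeled data objects into homogeneous groups, or clusters. The partition is performed so that objects in the same cluster are more similar to each other than to objects in different clusters, according to some defined criterion [12]. For time series modeling, one distance that can be used for clustering is the autocorrelation-based distance. Let $\hat{\rho}_{X_T} = (\hat{\rho}_{1,X_T}, \hat{\rho}_{2,X_T}, \dots, \hat{\rho}_{L,X_T})'$ and $\hat{\rho}_{Y_T} = (\hat{\rho}_{1,Y_T}, \hat{\rho}_{2,Y_T}, \dots, \hat{\rho}_{L,Y_T})'$ be the estimated autocorrelation vectors of $X_T$ and $Y_T$, respectively, for some $L$ such that $\hat{\rho}_{i,X_T} \approx 0$ and $\hat{\rho}_{i,Y_T} \approx 0$ for $i > L$. A distance between $X_T$ and $Y_T$ is defined as

$$d_{ACF}(X_T, Y_T) = \sqrt{(\hat{\rho}_{X_T} - \hat{\rho}_{Y_T})' \, \Omega \, (\hat{\rho}_{X_T} - \hat{\rho}_{Y_T})}$$

where $d_{ACF}(X_T, Y_T)$ is the autocorrelation distance between $X_T$ and $Y_T$, $\hat{\rho}_{X_T}$ is the estimated autocorrelation vector of $X_T$, $\hat{\rho}_{Y_T}$ is the estimated autocorrelation vector of $Y_T$, and $\Omega$ is a weight matrix. For the unweighted ACF distance, the weight matrix is the identity matrix, and the autocorrelation distance becomes

$$d_{ACFU}(X_T, Y_T) = \sqrt{\sum_{i=1}^{L} (\hat{\rho}_{i,X_T} - \hat{\rho}_{i,Y_T})^2}$$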
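The autocorrelation-based distance can be sketched in plain Python as follows. This is an illustration rather than the implementation used in the study: the function names `sample_acf` and `acf_distance` are hypothetical, and the usual sample autocorrelation estimator is assumed.

```python
import math

def sample_acf(x, max_lag):
    """Return the sample autocorrelations at lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(1, max_lag + 1)]

def acf_distance(x, y, max_lag, omega=None):
    """ACF distance between two series; omega=None means identity weights."""
    diff = [a - b for a, b in zip(sample_acf(x, max_lag),
                                  sample_acf(y, max_lag))]
    if omega is None:
        # Identity weight matrix: plain Euclidean distance of the ACF vectors.
        return math.sqrt(sum(d * d for d in diff))
    # General case: square root of the quadratic form diff' * omega * diff.
    quad = sum(diff[i] * omega[i][j] * diff[j]
               for i in range(max_lag) for j in range(max_lag))
    return math.sqrt(quad)

trend = [1, 2, 3, 4, 5, 6]
zigzag = [1, -1, 1, -1, 1, -1]
d = acf_distance(trend, zigzag, max_lag=2)
```

Because the distance depends only on the two ACF vectors, series of different lengths can be compared as long as both are longer than `max_lag`.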

E. Cluster Algorithm
Cluster analysis is a type of data mining analysis. One of its functions is to reduce the number of cases by grouping them into homogeneous clusters; it can also be used to recognize groups without prior information about the number of possible groups and their membership [13]. Hierarchical cluster analysis can be divided into two types: agglomerative and divisive. In the first step of agglomerative hierarchical clustering, each case forms its own cluster, so that the initial number of clusters equals the total number of cases [14], [15]. The present paper focuses on agglomerative hierarchical clustering with average linkage, complete linkage, and Ward linkage.
Complete linkage is a clustering method that uses the maximum distance between the data in two clusters. This measure is similar to the single linkage measure; the difference is that single linkage uses the minimum distance [16], [17]. The complete linkage formula is

$$d_{(IJ)K} = \max(d_{IK}, d_{JK})$$

where $d_{IK}$ and $d_{JK}$ are the farthest distances between clusters I and K and between clusters J and K, respectively [13].
Average linkage follows the Unweighted Pair Group Method using Arithmetic Averages (UPGMA). To overcome the limitations of single and complete linkage, [18] proposes measuring the average distance between the data. This method is intended as a natural compromise between the two linkage measures that provides a more accurate evaluation of the distance between clusters:

$$d_{(IJ)K} = \frac{\sum_{a} \sum_{b} d_{ab}}{N_{(IJ)} N_K}$$

where $d_{ab}$ is the distance from object a of cluster (IJ) to object b of cluster K, $N_{(IJ)}$ is the number of items in cluster (IJ), and $N_K$ is the number of items in cluster K.
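The complete and average linkage rules can be illustrated with a small Python sketch. The distance matrix below is hypothetical and the function names are chosen for this example only.

```python
def complete_linkage(cluster_a, cluster_b, dist):
    """Largest pairwise distance between members of two clusters."""
    return max(dist[a][b] for a in cluster_a for b in cluster_b)

def average_linkage(cluster_a, cluster_b, dist):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[a][b] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical symmetric distance matrix for three objects 0, 1, 2.
dist = [
    [0.0, 2.0, 6.0],
    [2.0, 0.0, 5.0],
    [6.0, 5.0, 0.0],
]
merged_ij = [0, 1]   # cluster (IJ), formed by merging objects 0 and 1
cluster_k = [2]      # cluster K

d_complete = complete_linkage(merged_ij, cluster_k, dist)  # max(6.0, 5.0)
d_average = average_linkage(merged_ij, cluster_k, dist)    # (6.0 + 5.0) / 2
```

In the setting of this paper, the entries of `dist` would be ACF distances between pairs of time series.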
Ward's method, also called the incremental sum of squares method, uses the within-cluster (squared) distances and the between-cluster (squared) distances. The formula for Ward's distance is given in [19].
Non-hierarchical clustering techniques are clustering methods that require the number of groups to be specified before the clustering process starts [15]. K-means is one such method: the number of clusters K must be set before running the process. Afterwards, the algorithm assigns each object to the cluster with the nearest centroid. The centroid is calculated as the mean of the objects in each cluster. The k-means procedure can be defined as:
1. Set the number of groups, k.
2. Assign every object to the centroid at the closest distance.
3. Using the ACF distance, recalculate the mean of each cluster and set it as the new centroid.
4. Repeat steps 2 and 3 until no object is reassigned.
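The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: each object is represented by its ACF vector, the unweighted ACF distance is the Euclidean distance between those vectors, and the first k objects are taken as initial centroids because the text does not specify the initialisation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, k, max_iter=100):
    """Basic k-means on feature vectors such as ACF vectors."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        # Step 4: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical ACF vectors (lags 1 and 2) for four series.
acf_vectors = [[0.9, 0.8], [0.1, 0.0], [0.85, 0.75], [0.05, -0.05]]
labels, centroids = kmeans(acf_vectors, k=2)
```

With slowly decaying ACF values in one group and values near zero in the other, the two clusters separate the hypothetical non-stationary-looking series from the stationary-looking ones.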

F. Datasets
1) Simulated Data
The simulation study is conducted by generating 7 stationary and 7 non-stationary data models. The generated stationary and non-stationary models are presented in Table 1.

Table 1. Generated stationary and non-stationary models (columns: Stationary, Non-Stationary)

Each time series model is generated with 10 different parameters, and the length of each series (t) is 150. This first gives 140 generated time series (14 models with 10 parameters each), each of length 150. The generation of the specified models is then repeated 10 times.
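The generation scheme can be sketched in Python under loud assumptions: the models of Table 1 are not reproduced here, so a stationary AR(1) model and its integrated (cumulatively summed) counterpart stand in for one stationary and one non-stationary model, and the parameter grid `phis` is hypothetical.

```python
import random

def simulate_ar1(phi, n, rng):
    """AR(1): Y_t = phi * Y_{t-1} + e_t, stationary for |phi| < 1.

    The series starts at zero and uses no burn-in period.
    """
    y, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y

def integrate(series):
    """Cumulative sum; turns a stationary series into an integrated one."""
    out, total = [], 0.0
    for v in series:
        total += v
        out.append(total)
    return out

def build_dataset(phis, n=150, seed=0):
    """One stationary and one non-stationary series per parameter value."""
    rng = random.Random(seed)
    data, labels = [], []
    for phi in phis:
        data.append(simulate_ar1(phi, n, rng))
        labels.append("stationary")
        data.append(integrate(simulate_ar1(phi, n, rng)))
        labels.append("non-stationary")
    return data, labels

# Hypothetical grid of 10 AR coefficients, not taken from Table 1.
phis = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
data, labels = build_dataset(phis)
```

Repeating this for 7 stationary and 7 non-stationary model families with 10 parameters each would give the 140 series of length 150 described above.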

2) Real Dataset
The real data used in this research are daily temperature data (°C) for 34 cities in Indonesia. The period is from January 1st to June 30th, 2016. The dataset was obtained from the website of the Indonesian Agency for Meteorology, Climatology, and Geophysics (BMKG).

A. Simulated Data Process
For each generated dataset, the accuracy of each predetermined clustering algorithm is calculated. Four algorithms are evaluated: Complete Linkage, Average Linkage, Ward Linkage, and K-Means. The weight matrix used in this research is the identity matrix, that is, the autocorrelations are not weighted.
Table 2 shows the accuracy of each algorithm. The K-Means algorithm, with an accuracy of 84.13286%, is more accurate than the hierarchical algorithms. Therefore, the K-Means algorithm is better at distinguishing stationary from non-stationary data than the other algorithms.

B. Real Dataset Process
For the real dataset, the data characteristics were identified by analyzing the time series plots and the autocorrelation function plots. Fig. 1 and Fig. 2 show the time series plots and ACF plots of stationary and non-stationary data for some cities in Indonesia. Fig. 1(a ...
Based on the identification of the time series models, 11 time series were classified as non-stationary and 23 time series as stationary (Table 3).

Table 3
Type of Data        Count
Stationary          23
Non-stationary      11
Table 4 shows the accuracy of each algorithm. The K-Means algorithm has the highest accuracy in distinguishing stationary from non-stationary data, at 85.29412%. The three hierarchical algorithms all have the same accuracy, 82.35294%.
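How such an accuracy can be computed from cluster assignments is sketched below. The exact accuracy definition used in the study is not stated, so this is an assumption: cluster labels are arbitrary, each cluster is mapped to the class that gives the most matches, and the cluster assignment in the example is hypothetical. Arithmetically, 85.29412% is consistent with 29 of 34 series being matched and 82.35294% with 28 of 34, although the text does not report these counts.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Percentage of objects matched under the best cluster-to-class mapping.

    Assumes there are no more clusters than classes.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return 100.0 * best / len(true_labels)

# 23 stationary ("S") and 11 non-stationary ("N") series, as in the text.
truth = ["S"] * 23 + ["N"] * 11
# Hypothetical assignment with five series placed in the wrong cluster.
assigned = [0] * 20 + [1] * 3 + [1] * 9 + [0] * 2

accuracy = clustering_accuracy(truth, assigned)
```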

IV. Conclusion
This research uses simulated data and clustering methods to distinguish stationary from non-stationary time series. Seven models were used to build the simulated data for each of the stationary and non-stationary cases, and each time series model was generated with ten different parameters and 150 periods. The methods were also applied to a real dataset, and the results were compared based on accuracy. The real data used in this research were daily temperature data for 34 cities in Indonesia. The experiments on the simulated data and the real dataset show that the K-Means algorithm has the highest accuracy for both stationary and non-stationary data, with an accuracy of 84.13286% on the simulated data and 85.29412% on the real dataset. Thus, it can be concluded that K-Means is the best of the compared algorithms for classifying stationary and non-stationary time series data.

Fig. 1. Time series plot and ACF plot for stationary data (with significance limits for the autocorrelations)
Fig. 2. Time series plot and ACF plot for non-stationary data

C. Autocorrelation
Autocorrelation in time series means the correlation between past and future values. For a stationary process $\{Z_t\}$, we have the mean $E(Z_t) = \mu$. The autocorrelation $\rho_k$ is a correlation value that represents the influence between time lags. As a function of k, $\rho_k$ is called the autocorrelation function (ACF) in time series analysis because it represents the correlation between $Z_t$ and $Z_{t+k}$ from the same process, separated only by k time lags [9]-[11].
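The quantities described above can be written out explicitly. The following is the standard textbook definition for a stationary process with constant mean $\mu$, with $\gamma_k$ denoting the autocovariance at lag $k$ and $\rho_k$ the autocorrelation at lag $k$; this notation is an assumption wherever the text above does not show it.

```latex
\begin{align}
\gamma_k &= \operatorname{Cov}(Z_t, Z_{t+k}) = E\left[(Z_t - \mu)(Z_{t+k} - \mu)\right] \\
\rho_k &= \frac{\gamma_k}{\gamma_0}
\end{align}
```

By construction $\rho_0 = 1$ and $|\rho_k| \le 1$ for every lag $k$.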

Table 4. Accuracy Results for the Real Dataset