Constraint-based discriminative dimension selection for high-dimensional stream clustering

Clustering data streams is an active research topic in data mining. However, the runtime of existing stream clustering algorithms increases and their performance drops as the number of dimensions grows. One way to reduce this complexity is to determine an appropriate subset of dimensions for each cluster via dimension projection. SED-Stream is an efficient clustering algorithm that supports high-dimensional data streams. The aim of this paper is to improve the performance of SED-Stream in terms of both clustering quality and execution time. To improve the clustering process, background or domain expert knowledge is integrated as “constraints” in a new algorithm, SEDC-Stream. SEDC-Stream supports the evolving characteristics of dynamic constraints, namely activation, fading, outdating and prioritization. It is able to reduce cluster splitting time and to place new incoming points into their suitable clusters. Compared to SED-Stream on three real-world stream datasets, SEDC-Stream achieves better clustering performance in terms of both purity and f-measure.


Introduction
Clustering data streams is an active research topic in data mining. Stream clustering processes data in a single pass and summarizes it in real time while using limited resources. Many techniques have been proposed for clustering data streams [1]-[7]. Nevertheless, only a small number of them have been developed for monitoring and detecting changes in the clustering structure [8]-[10]. E-Stream [9] is an evolution-based stream clustering technique that detects changes in evolving clustering structures. However, its runtime increases and its performance drops when it is applied to streams with a large number of dimensions.
The complexity of stream clustering methods increases when they are applied to data with a large number of dimensions. Dimension projection [2], [11]-[17] is one possible way to reduce the complexity of dealing with high-dimensional streams. The concept of "projected clustering" was introduced in HPStream [2]. With its dimension projection mechanism, HPStream is able to determine a specific set of dimensions for each cluster. Similar to HPStream, SED-Stream [18] was developed to deal with high-dimensional data streams. By using the standard deviation of the attributes, SED-Stream is able to select a relevant subset of cluster dimensions. Experimental results over several stream datasets demonstrate that SED-Stream is able to achieve higher clustering quality.

One way to improve the quality of the clustering result is to use domain expert knowledge [19], [20]. Semi-supervised stream clustering performs cluster analysis over data streams by using domain expert knowledge as "constraints". Although a large number of stream clustering techniques have been proposed, only a few of them work in a semi-supervised manner [21]-[23]. A conceptual model for analyzing data streams using constraints was proposed by Ruiz et al. [24]. The traditional K-means was extended to obtain the Constraint-based K-means algorithm [25]. C-Denstream [26] extends Denstream by utilizing constraints during the clustering process. An instance-level must-link constraint is defined as a pair of instances (x, y) that must be members of the same cluster [1]. Must-link constraints are integrated as background knowledge in E-Stream [9] to obtain a semi-supervised stream clustering technique named CE-Stream [27]. While adapting its clustering structure to the continuous flow of data points, CE-Stream continuously updates its constraints based on their hit-rates.
The main goal of this paper is to improve the performance of existing evolution-based stream clustering algorithms such as E-Stream [9], CE-Stream [27] and SED-Stream [18]. To deal with high-dimensional data streams, the idea is to combine dimension selection with the use of domain expert knowledge as constraints. The new algorithm is named SEDC-Stream (Constraint-based discriminative dimension selection for clustering data streams with a large number of dimensions). For dimension selection, the attributes of each cluster are projected onto its discriminative dimensions. During the progression of the data stream, SEDC-Stream locates all the active clusters while determining their discriminative attributes. Two types of instance-level constraints, must-link and cannot-link, are introduced and integrated in a semi-supervised manner. SEDC-Stream supports the evolving characteristics of dynamic constraints, namely constraint activation, fading, outdating and prioritization. SEDC-Stream is able to reduce excessive splitting during the clustering process and to place new incoming points into their suitable clusters as the data stream progresses. Compared to SED-Stream on three real-world stream datasets [28], [29], the results reveal that SEDC-Stream improves both the clustering output quality and the execution time.
To summarize, SEDC-Stream provides the following mechanisms to support high-dimensional data streams: (i) discriminative dimension selection to support the evolving clustering structure; and (ii) integration of activated, faded and obsolete instance-level constraints during the progression of data streams. SEDC-Stream outperforms existing evolution-based stream clustering algorithms such as E-Stream, CE-Stream and SED-Stream in terms of clustering quality.
The remainder of this paper is structured as follows. Section 2 is divided into two parts. Section 2.1 gives definitions and concepts related to stream clustering, discriminative dimension selection and stream constraints, while section 2.2 presents SEDC-Stream in detail. Section 3 compares the performance of SEDC-Stream and SED-Stream over three real-world stream datasets. Section 4 concludes the paper and discusses future work.

Basic Concepts
In the following, basic concepts and techniques related to stream clustering, discriminative dimension selection and stream constraints are given.

Cluster Representation
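Assume that a data stream consists of a set of data points $x_1 \dots x_k$ arriving at time stamps $T_1 \dots T_k$. Each data point $x_i$ contains $d$ dimensions, denoted as $x_i = (x_i^1 \dots x_i^d)$. During the progression of a data stream, a large number of incoming data points arrive that cannot all be stored in limited-size memory. Instead, a cluster representation is used. In Aggarwal et al. [2], a fading cluster structure (FCS) was introduced. Later, Udommanetanakit et al. [9] proposed FCH, which extends FCS by adding an $\alpha$-bin histogram to detect changes in the clustering structure. In Chairukwattana et al. [30], the notion of dimension projection was added to FCH. Finally, FCH is defined as $FCH = (FC1(t), FC2(t), W(t), BS(t), H(t))$, whose components are described as follows.
Let $N$ be the total number of data points of the cluster, $T_i$ be the time when data point $x_i$ is retrieved, and $t$ be the current time. The fading weight of data point $x_i$ is defined as $f(t - T_i)$, where $f(t) = 2^{-\lambda t}$ and $\lambda$ is the user-defined decay rate.
$FC1(t)$ is a vector of the weighted sum of each dimension at time $t$. The $j$-th dimension is
$$FC1^j(t) = \sum_{i=1}^{N} f(t - T_i)\, x_i^j.$$
$FC2(t)$ is a vector of the weighted sum of squares of each dimension at time $t$. The $j$-th dimension is
$$FC2^j(t) = \sum_{i=1}^{N} f(t - T_i)\, (x_i^j)^2.$$
$W(t)$ is the sum of the fading weights of all data points in the cluster at time $t$:
$$W(t) = \sum_{i=1}^{N} f(t - T_i).$$
$BS(t)$ is a bit vector representing the projected dimensions of the cluster at time $t$. Note that the number of projected dimensions in each cluster may be different.
$H(t)$ is an $\alpha$-bin histogram with equal intervals of data values. For the $j$-th dimension at time $t$, the $m$-th bin of the histogram is
$$H_m^j(t) = \sum_{i=1}^{N} f(t - T_i)\, y_{m,i}^j,$$
where
$$y_{m,i}^j = \begin{cases} 1, & \text{if } m \cdot r + \min(x^j) \le x_i^j \le (m + 1) \cdot r + \min(x^j), \\ 0, & \text{otherwise,} \end{cases} \qquad r = \frac{\max(x^j) - \min(x^j)}{\alpha}.$$
Note that the bin width $r$ may be different in each dimension.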
Each cluster is categorized as active or inactive when its weight is greater than or lower than the user-specified threshold active_cluster_weight, respectively. Active clusters can absorb their nearest incoming data points, which lets their clustering structure evolve. In contrast, an inactive cluster must be merged with other inactive or active clusters to become active.
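As a rough illustration of how such a cluster summary can be maintained incrementally, the sketch below keeps the weighted sums, the weighted sums of squares and the total weight of one cluster, and decays them with a fading function of the form 2^(-decay_rate * elapsed_time). The class name `FadingCluster`, its fields and the incremental decay step are assumptions made for illustration and are not taken from the SEDC-Stream implementation; the histogram and the bit vector are omitted for brevity.

```python
import math


class FadingCluster:
    """Hypothetical summary of one cluster: weighted sums, squared sums, weight."""

    def __init__(self, dims, decay_rate):
        self.decay_rate = decay_rate
        self.fc1 = [0.0] * dims   # weighted sum per dimension
        self.fc2 = [0.0] * dims   # weighted sum of squares per dimension
        self.weight = 0.0         # total fading weight of the cluster
        self.last_update = 0.0

    def fade(self, now):
        # Multiply every statistic by 2^(-decay_rate * elapsed time).
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.fc1 = [v * factor for v in self.fc1]
        self.fc2 = [v * factor for v in self.fc2]
        self.weight *= factor
        self.last_update = now

    def add(self, point, now):
        # Fade the old statistics first, then add the new point with weight 1.
        self.fade(now)
        for j, x in enumerate(point):
            self.fc1[j] += x
            self.fc2[j] += x * x
        self.weight += 1.0

    def center(self):
        return [s / self.weight for s in self.fc1]

    def radius(self):
        # Per-dimension standard deviation derived from the two weighted sums.
        return [math.sqrt(max(q / self.weight - (s / self.weight) ** 2, 0.0))
                for s, q in zip(self.fc1, self.fc2)]

    def is_active(self, active_cluster_weight):
        return self.weight > active_cluster_weight
```

Because the fading function is exponential, decaying the stored sums at every update gives the same result as recomputing each point's weight from its arrival time, which is why no individual point needs to be kept in memory.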

Distance Functions
Here, the distance functions are modified to deal with the different subsets of projected dimensions of each cluster. Recall from section 2.1.1 that $BS(t)$ is a bit vector representing the projected dimensions of a cluster at timestamp $t$. The distance functions are defined as follows.
Cluster-Point distance $CPDistance(C, x_i)$ is measured from the center of an active cluster $C$ to a data point $x_i$. For each dimension $j$, the distance is normalized by the radius (standard deviation) of the cluster in that dimension. At timestamp $t$, $CPDistance(C, x_i)$ averages these normalized per-dimension distances over the $n$ projected dimensions of cluster $C$ represented by the bit vector $BS(t)$. An incoming data point is then merged into its closest active cluster, i.e. the one with the minimum Cluster-Point distance.
Cluster-Cluster distance is measured between two cluster centers ($C_a$ and $C_b$). At timestamp $t$, it averages the per-dimension distances between the two centers over the $n$ projected dimensions represented by the bit vector $BS(t)$. Note that here $BS(t)$ corresponds to the union of the projected dimensions of the two clusters. A pair of clusters is therefore considered for merging on the basis of its Cluster-Cluster distance.
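The two distances can be sketched as follows. This is a minimal sketch under stated assumptions: the per-dimension distance is taken to be the absolute difference, every projected radius is non-zero, and at least one dimension is projected. The function names `cp_distance` and `cc_distance` and their argument layout are hypothetical.

```python
def cp_distance(center, radius, point, bits):
    """Cluster-point distance over the projected dimensions (bits[j] == 1).

    Assumption: each per-dimension term is |center - point| / radius.
    """
    dims = [j for j, b in enumerate(bits) if b]
    total = sum(abs(center[j] - point[j]) / radius[j] for j in dims)
    return total / len(dims)


def cc_distance(center_a, bits_a, center_b, bits_b):
    """Cluster-cluster distance over the union of both projected dimension sets.

    Assumption: each per-dimension term is |center_a - center_b|.
    """
    dims = [j for j in range(len(center_a)) if bits_a[j] or bits_b[j]]
    total = sum(abs(center_a[j] - center_b[j]) for j in dims)
    return total / len(dims)
```

Restricting the sums to the projected dimensions means that two clusters with different bit vectors are compared only on dimensions that at least one of them considers relevant.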

Instance-Level Stream Constraint
In the following, definitions related to instance-level stream constraints are given. A more detailed explanation of these concepts can be found in [27]. We also refer to the notation for data streams, data points and fading weights introduced in section 2.1.1.
A must-link constraint, denoted as ML, is a pair of data points $(x_i, x_j)$ that must be assigned to the same cluster.
A cannot-link constraint, denoted as CL, is a pair of data points $(x_i, x_j)$ that must not be assigned to the same cluster.
An active constraint is a must-link or cannot-link constraint in which the weights of both data points $(x_i, x_j)$ are greater than a user-specified threshold.
Constraint weight is the minimum of the fading weights of the two data points $x_i$ and $x_j$ of an active constraint. The constraint fading function is used to gradually reduce the weight of active constraints over time. The lifetime of constraints is much longer than the lifetime of data points, so the constraint fading function differs from the fading function of data points and is defined in terms of a time interval $r$.
Constraint hit-rate is the number of times a constraint is utilized within the stream clustering process. For efficient processing, constraints are sorted by their hit-rate in descending order.
Check Active and Update Constraints is a function that activates constraints before they are used. This is done by matching the constraints against the incoming data points that satisfy them. Constraints fade and become obsolete over time.
Prioritize Constraints is a function that sorts constraints by their hit-rate for efficient computation.
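The bookkeeping described by these definitions can be sketched as below. The `Constraint` record, the helper names and the use of the data-point fading function 2^(-decay_rate * elapsed_time) for both points are assumptions made for illustration; in particular, the slower constraint fading function over the time interval r is not modeled here.

```python
from dataclasses import dataclass


def fading_weight(now, arrival, decay_rate):
    # Fading weight of a point that arrived at time `arrival`.
    return 2.0 ** (-decay_rate * (now - arrival))


@dataclass
class Constraint:
    kind: str      # "ML" (must-link) or "CL" (cannot-link)
    t_i: float     # arrival time of the first point
    t_j: float     # arrival time of the second point
    hits: int = 0  # hit-rate: how often the constraint has been used

    def weight(self, now, decay_rate):
        # Constraint weight: minimum fading weight of its two points.
        return min(fading_weight(now, self.t_i, decay_rate),
                   fading_weight(now, self.t_j, decay_rate))


def active_constraints(constraints, now, decay_rate, threshold):
    # Keep only constraints whose weight is still above the threshold.
    return [c for c in constraints if c.weight(now, decay_rate) > threshold]


def prioritize(constraints):
    # Most frequently used constraints first.
    return sorted(constraints, key=lambda c: c.hits, reverse=True)
```

Taking the minimum of the two point weights means a constraint stops being active as soon as its older point has faded below the threshold.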

Discriminative Dimension Selection
When clustering high-dimensional data streams, data sparsity is a frequent problem. One possible solution is to apply the projected clustering technique proposed in [8]. For each cluster, projected clustering determines a specific subset of relevant dimensions. As a result, data points are less spread out, since their intra-cluster dissimilarity is minimized. Unfortunately, the distance between clusters (inter-cluster dissimilarity) may also shrink. The reason is that, in some dimensions, there are overlapping ranges that are covered by the radii of several clusters, as shown in Fig. 1.
Another approach to projected clustering, named "discriminative dimension selection", is proposed in [18]. Unlike traditional projected clustering, discriminative dimension selection identifies all the discriminative dimensions. A discriminative dimension is highly relevant to its cluster and clearly distinguished from the other clusters. Discriminative dimensions are determined in two steps: dimension uniqueness filtering and dimension radii ranking. Dimension uniqueness is the number of clusters whose radii overlap in that dimension. For a given dimension, if its dimension uniqueness exceeds the overlap ratio, then the dimension (and its overlapped dimensions) is filtered out, and all the remaining dimensions are ranked according to their radii. Discriminative dimensions are selected by choosing the remaining dimensions at the top $l \cdot |C|$ ranks, where $|C|$ is the number of clusters and $l$ is the average number of selected dimensions for each cluster.
Let $l$ be the average number of selected dimensions for each cluster, and $v$ be the overlap threshold. Initially, $l$ and $v$ are set to 1 and 0.7, respectively. The discriminative dimension selection mechanism and its associated bit vector are illustrated in Fig. 1 [18]. Suppose that, at time stamp $t$, the clustering output is composed of 4 clusters using 2 dimensions. First, for each cluster, the radius of each dimension is computed from its FCH. The radius of one cluster may overlap with the radii of other clusters, e.g. dimension #1 of clusters #1, #2 and #3. If the ratio of overlapped clusters is more than $v$ (0.7), then all the overlapped dimensions are filtered out. After that, the remaining dimensions are ranked (in ascending order) by their radii. Finally, the dimensions at the top $l \cdot |C|$ ranks are selected as discriminative dimensions, i.e. dimensions #1 and #2 of cluster #3 and dimension #2 of cluster #4 (in grey). Note that the number of selected discriminative dimensions may differ from cluster to cluster.
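The two steps, uniqueness filtering followed by radii ranking, can be sketched as follows. This is one possible reading of the description rather than the published procedure: the range of a cluster in a dimension is assumed to be [center - radius, center + radius], the overlap count is assumed to include the cluster itself, and the function name `select_discriminative` and its parameters are hypothetical.

```python
def select_discriminative(centers, radii, avg_dims=1, overlap_threshold=0.7):
    """Return one bit vector per cluster marking its selected dimensions."""
    n_clusters, n_dims = len(centers), len(centers[0])

    def overlaps(a, b, j):
        # Two ranges [center - radius, center + radius] intersect on dimension j.
        return abs(centers[a][j] - centers[b][j]) <= radii[a][j] + radii[b][j]

    candidates = []
    for c in range(n_clusters):
        for j in range(n_dims):
            # Step 1: count clusters (including c itself) overlapping c on j
            # and filter the dimension out when the ratio exceeds the threshold.
            count = sum(overlaps(c, other, j) for other in range(n_clusters))
            if count / n_clusters <= overlap_threshold:
                candidates.append((radii[c][j], c, j))

    # Step 2: rank the remaining (cluster, dimension) pairs by radius, ascending.
    candidates.sort()
    bits = [[0] * n_dims for _ in range(n_clusters)]
    for _, c, j in candidates[: avg_dims * n_clusters]:
        bits[c][j] = 1
    return bits
```

With four clusters, `avg_dims=1` and a threshold of 0.7, a dimension shared by three of the four clusters has an overlap ratio of 0.75 and is filtered out for those clusters, after which at most four (cluster, dimension) pairs with the smallest radii are kept.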

Use of Must-Link constraints to guide the clustering process
Must-link constraints are utilized to guide the cluster splitting process. An active cluster can be split into two clusters when two separate dense areas exist in one of its dimensions. However, this may lead to an improper split, as shown in Fig. 2. The first and second dense areas (represented by blue circles) might be split because the largest gap lies between them. Unfortunately, these two dense areas belong to the same class. To prevent such improper cluster splitting, must-link constraints are applied.
In the cluster splitting process, the FCH (Fading Cluster Histogram) is used to determine the best split point (i.e. the $s$-th bin of the histogram), as shown in Fig. 2. Here, must-link constraints are taken into account. The split is skipped if there exists at least one active must-link constraint $(x_i, x_j)$ such that $x_i$ and $x_j$ fall on opposite sides of the split point, i.e. one in the 1st to $(s-1)$-th bins of $H$ and the other in the $s$-th to last bins of $H$.
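The must-link check on a candidate split can be sketched as below. The function name `split_allowed`, the representation of constraints as pairs of point identifiers and the `bin_of` lookup are assumptions made for illustration.

```python
def split_allowed(split_bin, active_must_links, bin_of):
    """Return False when a split at split_bin would separate a must-link pair.

    split_bin: index of the first histogram bin on the right side of the split.
    active_must_links: iterable of (point_a, point_b) pairs.
    bin_of: function mapping a point to its histogram bin index.
    """
    for a, b in active_must_links:
        left_a = bin_of(a) < split_bin
        left_b = bin_of(b) < split_bin
        if left_a != left_b:  # the pair straddles the split point
            return False
    return True
```

A single straddling pair is enough to veto the split, which matches the "at least one active constraint" condition in the text.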

Use of Cannot-Link constraints to guide the clustering process
Cannot-link constraints are utilized to guide the cluster assignment of a new data point. Intuitively, a new data point is assigned to its closest cluster. However, in some circumstances, the new data point may be in conflict with the members of the closest cluster. Even a small number of inaccurately assigned data points can change the resulting clustering structure. To overcome this problem, cannot-link constraints $(x_i, x_j)$ are used to guide the clustering process towards a suitable cluster for the new incoming data point, namely the cluster with the smallest distance that does not violate any cannot-link constraint.
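A minimal sketch of this assignment rule is given below. The function name `assign_cluster`, the dictionaries used to hold distances and cluster members, and the encoding of cannot-link pairs as frozensets are hypothetical choices for illustration.

```python
def assign_cluster(point_id, distances, members, cannot_links):
    """Pick the closest cluster that holds no cannot-link partner of point_id.

    distances: dict cluster_id -> distance from the new point.
    members: dict cluster_id -> set of point ids already in that cluster.
    cannot_links: set of frozenset({a, b}) pairs.
    Returns a cluster id, or None when every cluster conflicts.
    """
    for cluster_id in sorted(distances, key=distances.get):
        conflict = any(frozenset((point_id, m)) in cannot_links
                       for m in members[cluster_id])
        if not conflict:
            return cluster_id
    return None
```

Visiting the clusters in order of increasing distance guarantees that the first conflict-free cluster found is also the nearest one.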

SEDC-Stream algorithm
This section describes the main SEDC-Stream algorithm. SEDC-Stream combines dimension selection with the use of background knowledge as constraints, as introduced in section 2.2.1 and section 2.2.2 respectively. To guide the clustering process, background knowledge is used as constraints in two scenarios, depending on the type of constraint. First, must-link constraints are used to prevent cluster splitting. In some circumstances, SEDC-Stream detects a cluster-splitting behavior that would separate the data points $(x_i, x_j)$ of an activated must-link constraint. This implies that the data points within the cluster are still related, and the splitting operation is not necessary. Second, cannot-link constraints are used in placing a new incoming data point into its suitable cluster. A new incoming data point is normally added to its closest cluster. However, if a cannot-link constraint is activated, the data point is added to the cluster with the smallest distance that does not conflict with the cannot-link constraint. For dimension selection, every active cluster is projected onto its discriminative dimensions, which are highly relevant to the cluster and clearly distinguished from the other clusters.
The main algorithm of SEDC-Stream is given in Fig. 3. In line 1, a new data point is retrieved. In line 2, constraints are activated and their weights are updated if they are satisfied by the new data point. In line 3, all clusters are faded. Clusters with a weight below the user-specified threshold (ƛ) are deleted. In line 4, to speed up execution, constraints are sorted according to their weight. In line 5, a cluster is split when the behavior inside the cluster is clearly separated, unless there is a signal from a must-link constraint (MLS). In line 6, overlapping active clusters are merged. In line 7, when the number of clusters exceeds the limit, the closest pairs of clusters are merged until the number of clusters reaches the limit. In line 8, constraints are faded and deleted if their weight is below the user-specified threshold (ƛ). In line 9, all clusters are checked to determine whether they are active. In lines 10-12, all active clusters are re-projected onto their new discriminative dimensions if the membership of the set of active clusters has changed. In lines 13-18, the incoming data point is added to its closest cluster that does not violate any cannot-link constraint, provided that its distance is within radius_factor (an input parameter). Otherwise, the incoming data point is kept as a new isolated data point. The algorithm then returns to the top and waits for a new data point.

Fig. 3. Main SEDC-Stream algorithm
The functions modified from or added to E-Stream are given in Fig. 4 and briefly described as follows.
FadingAll: All clusters are faded. Then, clusters with a small weight (i.e. less than the user-specified threshold) are deleted.
CheckSplit: In the discriminative dimensions of any active cluster, if there is a splitting point and no active must-link constraint would be separated by the split, the cluster is split. Otherwise, SEDC-Stream ignores the split.
MergeOverlapCluster: Pairs of overlapping active clusters are merged, unless they are the result of a cluster splitting process. Note that only active clusters that include new incoming data points are considered, because only they change through self-evolution.
LimitMaximumCluster: If the total number of clusters reaches the maximum_cluster threshold, the closest pair of clusters is merged repeatedly until the number of remaining clusters does not exceed maximum_cluster.
DiscriminateProjectDimension: If a new active cluster exists, this method is performed after receiving a new data point to extract the discriminative dimensions of the cluster, as described in section 2.2.1.
FindClosestCluster: An incoming data point is assigned to its closest active cluster. This is done by determining the minimum cluster-point distance (CPDistance). Note that the distance is calculated only over the discriminative dimensions of the active cluster.
After the merging process in the MergeOverlapCluster and LimitMaximumCluster procedures, there are three cases for the discriminative projected dimensions (BS) of the merged cluster. First, if both clusters are active, the result is the bitwise OR of both BS. Second, if only one cluster is active, the result is the BS of the active cluster. Otherwise, the result is all dimensions (all bits of BS are set to 1).
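The three cases for the bit vector of a merged cluster translate directly into a small helper. The function name `merged_bits` and the list-of-integers encoding of BS are assumptions made for illustration.

```python
def merged_bits(bits_a, active_a, bits_b, active_b):
    """Bit vector of the cluster obtained by merging clusters a and b."""
    if active_a and active_b:
        # Both active: bitwise OR of the two bit vectors.
        return [x | y for x, y in zip(bits_a, bits_b)]
    if active_a:
        return list(bits_a)  # only a is active: keep its projection
    if active_b:
        return list(bits_b)  # only b is active: keep its projection
    return [1] * len(bits_a)  # neither is active: use all dimensions
```

For example, merging two active clusters with bit vectors `[1, 0, 0]` and `[0, 0, 1]` gives `[1, 0, 1]`.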

Results and Discussion
In this section, the clustering quality of SEDC-Stream is demonstrated. The experiments are performed on three well-known stream datasets: the Forest Cover Type dataset [28], the KDD-99 dataset [28] and the Electricity dataset [29]. For comparison, SED-Stream [18] and SEDC-Stream were implemented in C++. All experiments were conducted on a 2.6 GHz Intel® Core i5 with 8 GB of memory. The parameter settings of SEDC-Stream are as follows: stream speed, horizon and decay rate are set to 500, 2 and 0.1, respectively, which are the same settings as for SED-Stream.

Clustering Performance Comparison
The first dataset is the Forest Cover Type dataset, which contains 581,012 instances with 10 numerical attributes and 7 classes. The first 300,000 instances are used to evaluate the performance of both clustering algorithms, and the data is split into chunks of 25,000 instances for evaluation. Fig. 5 shows that SEDC-Stream achieves a better f-measure for almost the entire clustering run and a purity comparable to that of SED-Stream. The second dataset is the KDD-99 dataset, which contains 250,000 instances with 43 attributes and 23 classes. The results in Fig. 6 show that SEDC-Stream outperforms SED-Stream in terms of f-measure with equivalent purity. The last dataset is the Electricity dataset, which contains 45,312 instances with 8 attributes and 2 classes. As shown in Fig. 7, SEDC-Stream achieves a much higher f-measure but a lower purity than SED-Stream.

Performance on Integrating Instance-level Constraints
To evaluate the performance of SEDC-Stream with respect to constraint selection, two types of constraints are used: must-link constraints and cannot-link constraints. As explained in the previous section, both types are constraints between a pair of instances. Must-link constraints (ML) indicate that the two instances should be in the same cluster, and cannot-link constraints (CL) indicate that the two instances should not be grouped into the same cluster. Constraints are very expensive to obtain; therefore, SEDC-Stream is designed to use only a small number of constraints (less than 0.01%). For example, the 300,000 instances of the Forest Cover Type dataset yield about 45 billion pairs of instances (300,000 x 299,999/2). Only 200 of these pairs (roughly 0.0000004%) are used as constraints in the SEDC-Stream algorithm. Even with such a small number of constraints, the quality of the output clustering improves, as shown in the previous section.
Because SEDC-Stream uses both must-link and cannot-link constraints, the two types are also evaluated separately in this section. The results in Fig. 8 show that using both types of constraints achieves a higher f-measure. Note that SED-Stream with or without constraints obtains a very similar level of purity, so those results are omitted. Both types of constraints achieve similar purity.
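The number of unordered pairs among n instances is n(n - 1)/2. The short sketch below checks this count for 300,000 instances and the share that 200 constraints represent; the variable names are illustrative only.

```python
n_instances = 300_000
n_constraints = 200

# Number of unordered pairs of distinct instances: n * (n - 1) / 2.
n_pairs = n_instances * (n_instances - 1) // 2

# Share of all pairs that are used as constraints, in percent.
share_percent = 100.0 * n_constraints / n_pairs

print(n_pairs)
print(f"{share_percent:.10f}%")
```

The resulting share is several orders of magnitude below the 0.01% bound stated for SEDC-Stream.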

Time Complexity
Compared to SED-Stream, SEDC-Stream requires additional time to evaluate constraint conditions and to use those constraints to maintain the output clustering. Thus, SEDC-Stream takes more time to perform the clustering than SED-Stream on most datasets, as shown in Fig. 9. On the high-dimensional datasets (i.e. the Forest Cover Type and Electricity datasets), the execution time of SEDC-Stream is clearly higher than that of SED-Stream. However, SEDC-Stream is faster than SED-Stream on the KDD-99 dataset. This can be explained by the following observations. First, only a few attributes are used in the clustering process, and only 13 of the 30 pairs of constraints meet the constraint conditions. Thus, very little additional time is needed to manage the constraints. Second, unnecessary clustering operations are dramatically reduced, because must-link constraints prevent cluster splitting and thus yield fewer output clusters.

Conclusion
This paper proposed SEDC-Stream, an evolution-based clustering algorithm for high-dimensional data streams. Two solutions have been proposed to alleviate the complexity of processing high-dimensional data streams: dimension selection and the use of constraints. Dimension selection is able to find evolving clusters with subgroups of discriminative dimensions during the progression of data streams. During the stream progression, instance-level must-link and cannot-link constraints are integrated by means of constraint prioritization, activation, fading and outdating in order to improve the clustering output.
Several research directions are possible to improve the proposed algorithm. The first concerns how background or domain expert knowledge is used in semi-supervised clustering. Pairwise constraints may not be appropriate given the dynamic nature of data streams. Exploiting background knowledge as single labeled data points (rather than pairs of points) is more appropriate for data streams. Labeled data points can be utilized immediately for determining the class of clusters and for effectively identifying the most appropriate clustering structure evolution operations. Second, an alternative and efficient version of SEDC-Stream for generating clusters of arbitrary shape is needed. Because it tries to minimize the squared error, SEDC-Stream is only able to generate spherical clusters. Clusters with arbitrary shapes are observed in many application areas of science. Online density-based and hierarchical clustering can be combined to obtain an efficient evolution-based and shape-based clustering algorithm for high-dimensional data streams.

Fig. 5. F-measure and purity of SEDC-Stream on Forest Cover Type dataset

Fig. 8. Cluster quality when only must-link or cannot-link or both constraints are allowed

Fig. 9. F-measure and purity of SEDC-Stream on electricity dataset