Constraint-based discriminative dimension selection for high-dimensional stream clustering

(1) * Kitsana Waiyamai Mail (Department of Computer Engineering, Kasetsart University, Bangkok, Thailand)
(2) Thanapat Kangkachit Mail (College of Innovative Technology and Engineering, Dhurakij Pundit University, Bangkok, Thailand)
*corresponding author

Abstract


Clustering data streams is one of active research topic in data mining. However, runtime of the existing stream clustering algorithms increases and their performance drop in the face of large number of dimensions. Complexity of the stream clustering methods is increased when perform on data with large number of dimensions. In order to reduce the clustering complexity, one possible solution consists in determining the appropriate subset of cluster dimensions via dimension projection. SED-Stream is an efficient clustering algorithm that supports high dimension data streams. The aim of this paper is to increase performance of SED-Stream in terms of both clustering quality and execution-time. In order to improve the clustering process, background or domain expert knowledge are integrated as “constraints” in SEDC-Stream. The new algorithm, SEDC-Stream, supports the evolving characteristics of the dynamic constraints which are activation, fading, outdating and prioritization. SEDC-Stream algorithm is able to reduce cluster splitting time, and place new incoming points to their suitable clusters. Compared to SED-Stream on the three real-world streams datasets, SEDC-Stream is able to generate a better clustering performance in terms of both purity and f-measure.

Keywords


Incremental stream clustering; High-dimensional data streams; Dimension selection; Projected clustering; Constraint-based clustering

   

DOI

https://doi.org/10.26555/ijain.v4i3.271
      

Article metrics

Abstract views : 185 | PDF views : 52

   

Cite

   

Full Text

Download

References


[1] C. C. Aggarwal, P. S. Yu, J. Han, and J. Wang, “A Framework for Clustering Evolving Data Streams,” 2003, pp. 81–92, doi: https://doi.org/10.1016/B978-012722442-8/50016-1.

[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A Framework for Projected Clustering of High Dimensional Data Streams,” 2004, pp. 852–863, doi: https://doi.org/10.1016/B978-012088469-8.50075-9.

[3] F. Cao, M. Estert, W. Qian, and A. Zhou, “Density-Based Clustering over an Evolving Data Stream with Noise,” 2006, pp. 328–339, doi: https://doi.org/10.1137/1.9781611972764.29.

[4] J. Gao, J. Li, Z. Zhang, and P.-N. Tan, “An Incremental Data Stream Clustering Algorithm Based on Dense Units Detection,” 2005, pp. 420–425, doi: https://doi.org/10.1007/11430919_49.

[5] S. Mansalis, E. Ntoutsi, N. Pelekis, and Y. Theodoridis, “An evaluation of data stream clustering algorithms,” Stat. Anal. Data Min. ASA Data Sci. J., vol. 11, no. 4, pp. 167–187, Aug. 2018, doi: https://doi.org/10.1002/sam.11380.

[6] M. Ghesmoune, M. Lebbah, and H. Azzag, “State-of-the-art on clustering data streams,” Big Data Anal., vol. 1, no. 1, p. 13, 2016, available at : https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0011-3.

[7] Y. Chen and L. Tu, “Density-based clustering for real-time stream data,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 133–142, doi: https://doi.org/10.1145/1281192.1281210.

[8] K. Chen and L. Liu, “HE-Tree: a framework for detecting changes in clustering structure for categorical data streams,” VLDB J., vol. 18, no. 6, pp. 1241–1260, Dec. 2009, doi: https://doi.org/10.1007/s00778-009-0134-5.

[9] K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai, “E-Stream: Evolution-Based Technique for Stream Clustering,” 2007, pp. 605–615, doi: https://doi.org/10.1007/978-3-540-73871-8_58.

[10] S. Gong, Y. Zhang, and G. Yu, “Clustering stream data by exploring the evolution of density mountain,” Proc. VLDB Endow., vol. 11, no. 4, pp. 393–405, 2017, available at : https://dl.acm.org/citation.cfm?id=3164136.

[11] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, “Fast algorithms for projected clustering,” ACM SIGMOD Rec., vol. 28, no. 2, pp. 61–72, Jun. 1999, doi: https://doi.org/10.1145/304181.304188.

[12] I. Ntoutsi, A. Zimek, T. Palpanas, P. Kröger, and H.-P. Kriegel, “Density-based Projected Clustering over High Dimensional Data Streams,” 2012, pp. 987–998, doi: https://doi.org/10.1137/1.9781611972825.85.

[13] S. Laohakiat, S. Phimoltares, and C. Lursinsap, “A clustering algorithm for stream data with LDA-based unsupervised localized dimension reduction,” Inf. Sci. (Ny)., vol. 381, pp. 104–123, Mar. 2017, doi: https://doi.org/10.1016/j.ins.2016.11.018.

[14] O. Makul and M. Ekinci, “A graph form data stream clustering approach based on dimension reduction,” in 2017 25th Signal Processing and Communications Applications Conference (SIU), 2017, pp. 1–4, doi: https://doi.org/10.1109/SIU.2017.7960504.

[15] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” ACM SIGMOD Rec., vol. 27, no. 2, pp. 94–105, Jun. 1998, doi: https://doi.org/10.1145/276305.276314.

[16] W. Meesuksabai, T. Kangkachit, and K. Waiyamai, “Evolution-Based Clustering Technique for Data Streams with Uncertainty,” Kasetsart J. (Nat. Sci.), vol. 46, pp. 638–652, 2012, available at : https://pdfs.semanticscholar.org/664b/c9c63f8d88590da15ac33d3f791e1ad9626c.pdf.

[17] I. Ahmed, I. Ahmed, and W. Shahzad, “A Novel High Dimensional and High Speed Data Streams Algorithm: HSDStream,” Int. J. Adv. Comput. Sci. Appl., vol. 7, no. 9, 2016, doi: https://doi.org/10.14569/IJACSA.2016.070952.

[18] K. Waiyamai, T. Kangkachit, T. Rakthanmanon, and R. Chairukwattana, “SED-Stream: discriminative dimension selection for evolution-based clustering of high dimensional data streams,” Int. J. Intell. Syst. Technol. Appl., vol. 13, no. 3, p. 187, 2014, doi: https://doi.org/10.1504/IJISTA.2014.065174.

[19] S. Basu, A. Banerjee, and R. J. Mooney, “Active semi-supervision for pairwise constrained clustering,” in Proceedings of the 2004 SIAM international conference on data mining, 2004, pp. 333–344, available at : https://epubs.siam.org/doi/abs/10.1137/1.9781611972740.31.

[20] P. S. Bradley, K. P. Bennett, and A. Demiriz, “Constrained k-means clustering,” Microsoft Res. Redmond, pp. 1–8, 2000, available at :http://machinelearning102.pbworks.com/f/ConstrainedKMeanstr-2000-65.pdf.

[21] K. Treechalong, T. Rakthanmanon, and K. Waiyamai, “Semi-Supervised Stream Clustering Using Labeled Data Points,” 2015, pp. 281–295, doi: https://doi.org/10.1007/978-3-319-21024-7_19.

[22] V. Antoine, N. Labroche, and V.-V. Vu, “Evidential seed-based semi-supervised clustering,” in 2014 Joint 7th International Conference on Soft Computing and Intelligent Systems (SCIS) and 15th International Symposium on Advanced Intelligent Systems (ISIS), 2014, pp. 706–711, doi: https://doi.org/10.1109/SCIS-ISIS.2014.7044676.

[23] C. Ruiz, M. Spiliopoulou, and E. Menasalvas, “Density-based semi-supervised clustering,” Data Min. Knowl. Discov., vol. 21, no. 3, pp. 345–370, 2010, doi: https://doi.org/10.1007/s10618-009-0157-y.

[24] C. R. Moreno, M. Spiliopoulou, and E. Menasalvas, “User constraints over data streams,” Knowl. Discov. from Data Streams, p. 117, 2006, available at : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.7653&rep=rep1&type=pdf#page=121.

[25] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, and others, “Constrained k-means clustering with background knowledge,” in ICML, 2001, vol. 1, pp. 577–584, available at : https://web.cse.msu.edu/~cse802/notes/ConstrainedKmeans.pdf.

[26] C. Ruiz, E. Menasalvas, and M. Spiliopoulou, “C-DenStream: Using Domain Knowledge on a Data Stream,” 2009, pp. 287–301, doi: https://doi.org/10.1007/978-3-642-04747-3_23.

[27] T. Sirampuj, T. Kangkachit, and K. Waiyamai, “CE-Stream : Evaluation-based technique for stream clustering with constraints,” in The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2013, pp. 217–222, doi: https://doi.org/10.1109/JCSSE.2013.6567348.

[28] K. Bache and M. Lichman, “UCI Machine Learning Repository, University of California, School of Information and Computer Science,” Irvine, CA, 2013, available at : http://archive.ics.uci.edu/ml.

[29] M. Harries and N. S. Wales, “Splice-2 comparative evaluation: Electricity pricing,” Citeseer, Sydnesy, 1999, available at : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.9013.

[30] R. Chairukwattana, T. Kangkachit, T. Rakthanmanon, and K. Waiyamai, “Efficient evolution-based clustering of high dimensional data streams with dimension projection,” in 2013 International Computer Science and Engineering Conference (ICSEC), 2013, pp. 185–190, doi: https://doi.org/10.1109/ICSEC.2013.6694776.




Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan , and UTM Big Data Centre - Universiti Teknologi Malaysia
Published by Universitas Ahmad Dahlan
W : http://ijain.org
E : info@ijain.org, andri.pranolo@tif.uad.ac.id (paper handling issues)
     ijain@uad.ac.id, andri.pranolo.id@ieee.org (publication issues)

View IJAIN Stats

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0