Constraint-based discriminative dimension selection for high-dimensional stream clustering

Kitsana Waiyamai; Thanapat Kangkachit

doi:10.26555/ijain.v4i3.271


(1)* Kitsana Waiyamai
(Department of Computer Engineering, Kasetsart University, Bangkok, Thailand)

(2) Thanapat Kangkachit
(College of Innovative Technology and Engineering, Dhurakij Pundit University, Bangkok, Thailand)

* corresponding author

Abstract

Clustering data streams is an active research topic in data mining. However, the runtime of existing stream clustering algorithms increases and their clustering quality drops as the number of dimensions grows. One way to reduce this complexity is to determine an appropriate subset of cluster dimensions via dimension projection. SED-Stream is an efficient clustering algorithm that supports high-dimensional data streams. The aim of this paper is to improve SED-Stream in terms of both clustering quality and execution time. To improve the clustering process, background or domain-expert knowledge is integrated as "constraints", yielding a new algorithm, SEDC-Stream. SEDC-Stream supports the evolving characteristics of dynamic constraints, namely activation, fading, outdating, and prioritization. It reduces cluster-splitting time and places newly arriving points in suitable clusters. Compared with SED-Stream on three real-world stream datasets, SEDC-Stream achieves better clustering performance in terms of both purity and F-measure.
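As an illustration of the purity measure named in the abstract, the following is a minimal Python sketch of one common definition: each cluster is credited with its most frequent class label, and the credited points are divided by the total number of points. The function name `purity` and its two list arguments are assumptions made for this example; the exact definition used in the paper is not given in the abstract and may differ (some stream-clustering work averages per-cluster purity instead).

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Weighted purity: share of points that carry the majority label of their cluster.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one point is required")

    # Count class labels inside each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)
```

For example, with two clusters of three points each, where each cluster has two points of its majority class, `purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])` credits four of the six points.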

Keywords

Incremental stream clustering; High-dimensional data streams; Dimension selection; Projected clustering; Constraint-based clustering

DOI

https://doi.org/10.26555/ijain.v4i3.271

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
