Medoid-based shadow value validation and visualization

Weksi Budiaji* (Sultan Ageng Tirtayasa University, Indonesia; University of Natural Resources and Life Sciences, Austria)
*corresponding author

Abstract


The silhouette index is a well-known internal validation criterion for clustering results. While the silhouette index is medoid-based, a centroid-based counterpart, the centroid-based shadow value (CSV), has also been developed. Although the two are similar, the CSV has the additional property that its results can be displayed as a 2-dimensional neighborhood graph. This article proposes a new internal validation index that is medoid-based and can also be visualized in a 2-dimensional plot. The proposed index behaves similarly to the silhouette index and produces a network visualization comparable to the neighborhood graph of the CSV. The network visualization has a multiplicative parameter (c) that adjusts the visibility of its edges. Because it is medoid-based, it is also a more appropriate visualization technique for any type of data than the neighborhood graph of the CSV.
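
To make the comparison concrete, the following minimal Python sketch computes the standard per-object silhouette width from a distance matrix, alongside a shadow-style value that uses only the distances to the nearest and second-nearest medoids. The second function is an illustrative assumption, not the definition given in the article: the name `medoid_shadow_sketch`, the ratio `(d2 - d1) / d2`, and the toy data are all hypothetical and chosen only so that larger values indicate a better fit, as with the silhouette.

```python
def silhouette_widths(dist, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for each object.

    dist is a symmetric distance matrix (list of lists); labels holds one
    cluster label per object. At least two clusters are required.
    """
    labels = list(labels)
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Other members of the object's own cluster.
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # singleton clusters get width 0 by convention
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        # Smallest mean distance to any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == k) / labels.count(k)
            for k in clusters
            if k != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def medoid_shadow_sketch(dist, medoids):
    """Hypothetical shadow-style value (d2 - d1) / d2 for each object.

    d1 and d2 are the distances to the nearest and second-nearest medoid.
    This formula is an assumption made for illustration only.
    """
    values = []
    for row in dist:
        d1, d2 = sorted(row[m] for m in medoids)[:2]
        values.append((d2 - d1) / d2 if d2 > 0 else 0.0)
    return values


# Toy data: four points on a line forming two well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
medoids = [0, 2]  # indices of the objects acting as medoids

sil = silhouette_widths(dist, labels)
msv = medoid_shadow_sketch(dist, medoids)
```

Both functions need only a distance matrix, which is why a medoid-based index can be applied to any data type for which a distance is defined, whereas a centroid-based value requires a mean to be computable.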

Keywords


Cluster validation; Cluster visualization; Internal criteria; Medoid; Shadow value

   

DOI

https://doi.org/10.26555/ijain.v5i2.326
      

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan, and UTM Big Data Centre - Universiti Teknologi Malaysia
Published by Universitas Ahmad Dahlan
W : http://ijain.org
E : info@ijain.org, andri.pranolo@tif.uad.ac.id (paper handling issues)
     ijain@uad.ac.id, andri.pranolo.id@ieee.org (publication issues)
