Le, Thang Viet. Clustering by graph density variation analysis (GDVA) with density-based cluster validity indices (DVI). Retrieved from https://doi.org/doi:10.7282/T3610ZD8
DescriptionCluster analysis is an important problem in data mining and machine learning. In reality, clustering has been a useful exploratory technique for many different applications such as image segmentation, document grouping, gene expression data analysis, paralog/ortholog detection, etc. Typically, clustering algorithms seek to partition a data set into subsets such that data objects in each subset are very similar, while data objects from different subsets are not so similar. Graph-based clustering methods partition proximity graphs, where nodes represent objects and edge weights indicate pairwise similarities, into subgraphs such that nodes of each subgraph are strongly connected, while nodes of different subgraphs are weakly connected. This thesis presents a computationally efficient clustering method based on graph density variation analysis, where the density of an object with respect to a set of objects is defined as its average similarity to the objects of the set. The method partitions a proximity graph using the assumption that each cluster has a core region composed of high-density nodes which are strongly interconnected by edges that have a high and comparable weight. Cores of clusters are identified by extracting nodes from the core regions, and then expanded by assigning non-core nodes to the most similar cluster. Novel density-based cluster validity indices are incorporated to measure the quality of clustering solutions so that parameter values can be determined automatically. For a direct and objective evaluation and comparison with other graph clustering techniques, random graphs are generated as input data and the results of clustering are matched against a known ground truth. The method is robust to noise and has a low complexity with two main parameters which can be used to adjust the granularity of results. Experiments show that the method is less dependent on parameters if clusters in the data are well-separated or have a dense cluster core. Good results on heterogeneous data such as multi-dimensional data, images, text and gene expression data demonstrate the effectiveness and versatility of the method.