Weighted tree-based cluster ensembles for high dimensional data
Smyth, Christine Wendy (2007) Weighted tree-based cluster ensembles for high dimensional data. PhD thesis, James Cook University.
The increasing size of datasets is particularly evident in the field of bioinformatics. It is unlikely that analyzing these large datasets with a single model will produce an accurate solution. This has led to the ensemble approach, where many models are averaged to give a consensus representation of the data. Taking a weighted average of the individual models has improved the accuracy of both classification and regression ensembles. However, weighting models within a cluster ensemble has remained relatively undeveloped because there is no gold standard available for comparison.
This thesis explores a technique for weighting cluster ensembles. A regression technique, multivariate regression trees, is shown to produce an accurate clustering solution. Each solution (tree) is then weighted purely in terms of its predictive accuracy. Several weighting strategies are trialed to determine which performs best. After each individual tree is assigned a weight, the trees’ co-occurrence matrices are obtained. The co-occurrence matrices are then aggregated, each weighted according to its tree’s predictive weight. The final result is a single weighted co-occurrence matrix.
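The aggregation step described above can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that each tree's clustering is available as a vector of cluster labels, that a co-occurrence matrix holds 1 where two observations share a cluster and 0 otherwise, and that the weights are normalized to sum to one. The function names, labelings, and weights are all hypothetical.

```python
import numpy as np


def co_occurrence(labels):
    """Return the n x n matrix with 1.0 where two observations share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def weighted_co_occurrence(labelings, weights):
    """Weighted average of the co-occurrence matrices of several clusterings.

    Assumption: weights are non-negative and are rescaled to sum to one,
    so every entry of the result lies in [0, 1].
    """
    weights = np.asarray(weights, dtype=float)
    if len(labelings) != weights.shape[0]:
        raise ValueError("need exactly one weight per clustering")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    return sum(w * co_occurrence(l) for w, l in zip(weights, labelings))


# Hypothetical example: three trees clustering four observations.
labelings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
weights = [0.5, 0.3, 0.2]  # illustrative predictive weights, not thesis results

S = weighted_co_occurrence(labelings, weights)
print(np.round(S, 2))
```

Under these assumptions, each entry of `S` is the total weight of the trees that place the two observations in the same cluster, so `S` can be read as a similarity matrix and passed to a partitioning method.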
A new technique, similarity-based k-means, is developed in order to partition the weighted co-occurrence matrix. Similarity-based k-means is demonstrated to produce accurate partitions of similarity matrices. The resulting clusters agree with the known groups in the investigated datasets.
Furthermore, this thesis develops two other techniques so that maximal information can be obtained in conjunction with the weighted cluster ensemble. The first method suggests an estimate of the natural number of clusters in a dataset, by assessing the predictive performance and variability of similarity-based k-means for various numbers of clusters. The estimates agree with the known numbers of groups within the investigated datasets. The second method elucidates the variables that define the clusters. These variables have high classification power within the studied datasets.
Therefore, this thesis presents a holistic cluster analysis: clusters are accurately unearthed within large datasets; an estimate of the natural number of clusters is obtained; and the variables important in defining the clusters are also established. The weighted cluster ensemble technique is applied to a variety of small and large datasets. All results demonstrate the power of weighting the individual models within the ensemble: the developed weighted cluster ensemble technique consistently outperforms the other techniques. The results of analyzing two DNA microarray datasets are particularly promising. The discovered clusters overlap with the known diagnoses in the datasets, and the variables deemed important in defining the clusters have previously been suggested as biomarkers.
Whilst the size of contemporary datasets presents unique statistical challenges, the potential information within them is immense. Statistical techniques must be developed in order to accurately analyze these datasets. Motivated by the success of weighted regression and classification ensembles applied to large datasets, this thesis suggests a technique of weighting models within a cluster ensemble. The results highlight the potential of weighted cluster ensembles in high dimensional settings, such as the analysis of DNA microarrays.
|Item Type:||Thesis (PhD)|
|Keywords:||large datasets, weighted regression, regression ensembles, quadratic programs, lasso, evolutionary algorithms, post processing, cluster ensembles, cluster analysis, multivariate regression trees, similarity matrices, DNA microarrays|
|FoR Codes:||01 MATHEMATICAL SCIENCES > 0104 Statistics > 010401 Applied Statistics @ 100%|
|SEO Codes:||97 EXPANDING KNOWLEDGE > 970101 Expanding Knowledge in the Mathematical Sciences @ 100%|
|Deposited On:||29 Nov 2011 09:24|
|Last Modified:||29 Nov 2011 09:24|