A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies
Hancock, Tim, Put, Raf, Coomans, Danny, Vander Heyden, Yvan, and Everingham, Yvette (2005) A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemometrics and Intelligent Laboratory Systems, 76 (2). pp. 185-196.
View at Publisher Website: http://dx.doi.org/10.1016/j.chemolab.200...
As datasets become larger, the problem of selecting variables for prediction becomes harder. The problem is to define which subset of variables produces optimum predictions. The example studied aims to predict the chromatographic retention of 83 basic drugs on a Unisphere PBD column at pH 11.7 using 1272 molecular descriptors. The goal of this paper is to compare the relative performance of recently developed data mining methods, specifically classification and regression trees (CART), stochastic gradient boosting for tree-based models (Treeboost), and random forests (RF), with common statistical techniques in chemometrics: genetic algorithms on multiple linear regression (GA-MLR), uninformative variable elimination partial least squares (UVE-PLS), and SIMPLS. The comparison is performed primarily on predictive performance, but also on the variables found to be most important for the predictions. The results of this study indicate that, individually, GA-MLR (R²=0.93) outperformed all other models. Further analysis found that combining GA-MLR and Treeboost (R²=0.98) improved these results further.
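For readers unfamiliar with the metric, the R-squared values quoted in the abstract can be read as a coefficient of determination. The sketch below is illustrative only: it assumes the common definition 1 - SS_res / SS_tot (the abstract does not say whether the reported values are fitted, cross-validated, or test-set figures), and the retention values, model names, and predictions are made up for the example, not taken from the study.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = fmean(observed)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical retention values and predictions from two hypothetical models.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predictions = {
    "model_a": [1.1, 1.9, 3.2, 3.8, 5.0],
    "model_b": [2.0, 2.0, 3.0, 4.0, 4.0],
}

# Score every model with the same metric and pick the highest scorer.
scores = {name: r_squared(observed, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)
```

Scoring every candidate with one shared function keeps the comparison like-for-like, which is the sense in which the abstract ranks the methods against each other.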
|Item Type:||Article (Refereed Research - C1)|
|Keywords:||chromatographic QSRR studies; molecular descriptor selection; CART; bagging; random forests; gradient boosting; genetic algorithms; QSRR; retention prediction|
|FoR Codes:||01 MATHEMATICAL SCIENCES > 0104 Statistics > 010401 Applied Statistics @ 50%; 03 CHEMICAL SCIENCES > 0301 Analytical Chemistry > 030106 Quality Assurance, Chemometrics, Traceability and Metrological Chemistry @ 50%|
|SEO Codes:||97 EXPANDING KNOWLEDGE > 970101 Expanding Knowledge in the Mathematical Sciences @ 100%|
|Deposited On:||12 Jun 2009 15:45|
|Last Modified:||12 Jun 2013 00:42|
|Citation Counts with External Providers:||Web of Science: 51|