The classification of cancer is a major research topic in bioinformatics. The high dimensionality and small sample size of gene expression data, however, make the classification quite challenging. Although principal component analysis (PCA) is of particular interest for high-dimensional data, it may overemphasize some aspects and ignore other important information contained in the richly complex data, because it displays only the differences in the first two- or three-dimensional PC subspace. Based on PCA, a principal component accumulation (PCAcc) method was proposed. It employs the information contained in multiple PC subspaces and improves the class separability of cancers. The effectiveness of the method was evaluated on four commonly used gene expression datasets, and the results show that it performs well for cancer classification.
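The limitation noted above, that the first two or three PCs may carry only part of the information, can be illustrated with a minimal sketch. This is plain PCA via singular value decomposition, not the PCAcc method itself, whose details the abstract does not give. The data are synthetic, and the function name `pca_explained_variance` and the matrix shape (40 samples, 500 genes) are assumptions made purely for illustration.

```python
import numpy as np

def pca_explained_variance(X):
    """Return PC scores and the fraction of variance explained by each PC.

    X has one row per sample and one column per gene (assumed layout).
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # sample coordinates in PC space
    variances = s ** 2
    ratio = variances / variances.sum()          # explained-variance fractions
    return scores, ratio

# Synthetic stand-in for a "few samples, many genes" expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))

scores, ratio = pca_explained_variance(X)
cumulative = np.cumsum(ratio)

# Fraction of total variance visible in a three-dimensional PC plot.
print(f"first 3 PCs explain {cumulative[2]:.1%} of the variance")
```

When the cumulative fraction for the first three components is small, a two- or three-dimensional PC plot discards most of the variance, which is the motivation the abstract gives for drawing on multiple PC subspaces.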
The application of Raman spectroscopic techniques combined with multivariate chemometric signal processing promises new means for the rapid, non-destructive, multidimensional analysis of metabolites, with little or no sample preparation and little sensitivity to water. However, Rayleigh scattering, fluorescence and uncontrolled variance present substantial challenges for the accurate quantitative analysis of metabolites at physiological levels in biologically varying samples. Effective strategies include the application of chemometric pretreatments for reducing Raman spectral interference. However, the arbitrary application of individual or combined pretreatment procedures can significantly alter the outcome of a measurement, thereby complicating spectral analysis. This paper evaluates and compares six signal pretreatment methods for correcting baseline variance, together with three variable selection methods for eliminating uninformative variables, all within the context of multivariate calibration models based on partial least squares (PLS) regression. Raman spectra of 90 artificial bio-fluid samples containing eight urine metabolites at near-physiological concentrations were used to test these models. The combination of multiplicative scatter correction (MSC), continuous wavelet transform (CWT), randomization test (RT) and PLS modeling gave the best performance for all the metabolites. The correlation coefficient (R) between predicted and prepared concentrations reached as high as 0.96.
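Of the pretreatments named above, multiplicative scatter correction is simple enough to sketch in a few lines. The following is the textbook formulation (regress each spectrum on a reference spectrum, then remove the fitted offset and scale), not necessarily the exact implementation used in the paper. The function name `msc`, the synthetic peak, and the offset and scale values are all assumptions for illustration; the CWT, RT and PLS steps are not shown.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction (textbook form).

    Each spectrum x is modeled as x ~ intercept + slope * reference and
    corrected to (x - intercept) / slope. The mean spectrum is used as the
    reference when none is supplied.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # Least-squares line through (reference, x); polyfit returns
        # the coefficients highest degree first: [slope, intercept].
        slope, intercept = np.polyfit(reference, x, 1)
        corrected[i] = (x - intercept) / slope
    return corrected

# Synthetic example: one Gaussian "peak" distorted by per-sample
# additive offsets and multiplicative scaling (hypothetical values).
shift = np.linspace(0.0, 1.0, 200)
pure = np.exp(-((shift - 0.5) ** 2) / 0.005)
offsets = np.array([0.0, 0.3, -0.2, 0.5])
scales = np.array([1.0, 1.4, 0.7, 2.0])
raw = offsets[:, None] + scales[:, None] * pure[None, :]

corrected = msc(raw)
```

In this idealized case every raw spectrum is an exact linear function of the mean spectrum, so the corrected spectra collapse onto that mean. Real Raman spectra also contain fluorescence backgrounds and noise, which is why the abstract pairs MSC with further steps.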