Abstract
The diversity of biological omics data provides richness of information, but also presents an analytic challenge. While there has been much methodological and theoretical development on the statistical handling of large volumes of biological data, far less attention has been devoted to characterizing their veracity and variability.
We propose a method of statistically quantifying heterogeneity among multiple groups of datasets, derived from different omics modalities over various experimental and/or disease conditions. It draws upon strategies from analysis of variance and principal component analysis in order to reduce dimensionality of the variability across multiple data groups. The resulting hypothesis-based inference procedure is demonstrated with synthetic and real data from a cell line study of growth factor responsiveness based on a factorial experimental design.
Source code and datasets are freely available at https://github.com/yangzi4/gPCA.
Supplementary data are available at Bioinformatics online.