Motivation

Analyzing genetic data to identify markers and construct predictive models is of great interest in biomedical research. However, limited by cost and sample availability, genetic studies often suffer from the “small sample size, high dimensionality” problem. To tackle this problem, an integrative analysis that collectively analyzes multiple datasets with compatible designs is often conducted. To regularize estimation and select relevant variables, penalization and other regularization techniques are routinely adopted. However, searching “blindly” over a vast number of variables, without drawing on existing knowledge, may not be efficient.

Results

We propose incorporating prior information to assist integrative analysis of multiple genetic datasets. To obtain accurate prior information, we adopt a convolutional neural network with an active learning strategy to label textual information from previous studies. The extracted prior information is then incorporated using a group LASSO-based technique. A series of simulation studies demonstrates the satisfactory performance of the proposed method. Finally, data on skin cutaneous melanoma are analyzed to illustrate its practical utility.

Availability and implementation

Code is available at https://github.com/ldz7/PAIA. The data that support the findings in this article are openly available in TCGA (The Cancer Genome Atlas) at https://portal.gdc.cancer.gov/.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.