Motivation

Interpreting disease phenotypes and genetic heterogeneity remains a major challenge in medical genomics. Although standardized disease phenotype databases have been established, computational methods for predicting gene–phenotype associations still suffer from imbalanced category distributions and a shortage of labeled data in small categories.

Results

To address the scarcity of labeled data, we propose SSLpheno, a self-supervised learning strategy for gene–phenotype association prediction. Our approach builds an attributed network that integrates protein–protein interactions and gene ontology data. We apply a Laplacian-based filter to smooth the node features and use self-supervised training to optimize the node feature representation. Specifically, we calculate the cosine similarity between feature vectors and use it to select positive and negative sample nodes, which serve as training labels for reconstruction. In the downstream task, a deep neural network performs multi-label classification of phenotypes. Our experimental results demonstrate that SSLpheno outperforms state-of-the-art methods, especially in categories with fewer annotations. Moreover, our case studies illustrate the potential of SSLpheno as an effective prescreening tool for identifying gene–phenotype associations.
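The smoothing and sample-selection steps described above can be illustrated with a minimal NumPy sketch. It is not taken from the SSLpheno repository. The function names, the filter form (I - step * L with a symmetric normalized Laplacian), the step size of 2/3, the number of filter rounds, the treatment of samples as node pairs, and the toy graph are all assumptions made for illustration only.

```python
import numpy as np

def laplacian_smooth(adj, feats, k=2, step=2 / 3):
    """Apply k rounds of the low-pass filter (I - step * L_sym) to the features.

    The filter form, k, and step are illustrative assumptions.
    """
    n = adj.shape[0]
    a = adj + np.eye(n)                               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lap = np.eye(n) - a_norm                          # symmetric normalized Laplacian
    filt = np.eye(n) - step * lap
    out = feats.astype(float)
    for _ in range(k):
        out = filt @ out
    return out

def cosine_similarity(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)            # guard against zero rows
    return unit @ unit.T

def select_pairs(sim, n_pos, n_neg):
    """Take the most similar pairs as positives and the least similar as negatives."""
    iu, ju = np.triu_indices(sim.shape[0], k=1)       # each unordered pair once
    order = np.argsort(sim[iu, ju])                   # ascending similarity
    neg = [(int(iu[t]), int(ju[t])) for t in order[:n_neg]]
    pos = [(int(iu[t]), int(ju[t])) for t in order[-n_pos:]]
    return pos, neg

# Toy attributed network (hypothetical): edges 0-1 and 2-3, two feature dimensions.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [0.1, 0.9]])

smoothed = laplacian_smooth(adj, feats)
sim = cosine_similarity(smoothed)
pos, neg = select_pairs(sim, n_pos=2, n_neg=2)
```

In this toy example the connected nodes end up with nearly identical smoothed features, so the pairs within each connected component are chosen as positives and cross-component pairs as negatives. A real pipeline would pick the selection thresholds or counts by validation rather than fixing them by hand.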

Availability and implementation

https://github.com/bixuehua/SSLpheno.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.