Summary

A concern when conducting genome-wide association studies (GWAS) is the potential for population stratification, i.e. ancestry-based genetic differences between cases and controls that, if not properly accounted for, could lead to biased association results. We developed PCAmatchR as an open-source R package for performing optimal case–control matching using principal component analysis (PCA) to aid in selecting controls that are well matched by ancestry to cases. PCAmatchR takes user-supplied PCA outputs and selects matching controls for cases using a weighted Mahalanobis distance metric that weights each principal component by the percentage of genetic variation it explains. Results from the 1000 Genomes Project data demonstrate both the functionality and the performance of PCAmatchR in selecting matching controls for case populations and in reducing inflation of association test statistics. PCAmatchR improves genomic similarity between matched cases and controls, which minimizes the effects of population stratification in GWAS analyses.
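
The matching idea described above can be illustrated with a minimal Python sketch. This is not PCAmatchR's R interface or its actual implementation: the function names `weighted_distance` and `match_controls`, the toy principal component scores, and the weights are all hypothetical. The sketch also assumes that principal component scores are uncorrelated, so the Mahalanobis distance reduces to a variance-scaled distance, and it finds the optimal assignment by brute force, which is feasible only for toy inputs.

```python
from itertools import permutations
from statistics import variance


def weighted_distance(a, b, variances, weights):
    """Variance-scaled distance between two PC score vectors, with each
    component weighted by its (assumed) share of variance explained."""
    return sum(
        w * (x - y) ** 2 / v
        for x, y, v, w in zip(a, b, variances, weights)
    ) ** 0.5


def match_controls(cases, controls, weights):
    """Return (assignment, total), where assignment[i] is the index of the
    control matched to case i and total is the summed distance.

    Assumption: PC scores are uncorrelated, so only per-component
    variances are needed (a diagonal covariance matrix).
    """
    pooled = list(cases) + list(controls)
    variances = [variance([s[k] for s in pooled]) for k in range(len(weights))]

    best_total, best_assignment = float("inf"), None
    # Brute force: every ordered choice of one distinct control per case.
    for perm in permutations(range(len(controls)), len(cases)):
        total = sum(
            weighted_distance(cases[i], controls[j], variances, weights)
            for i, j in enumerate(perm)
        )
        if total < best_total:
            best_total, best_assignment = total, perm
    return list(best_assignment), best_total


# Hypothetical toy data: two cases, three candidate controls, two PCs.
cases = [(0.0, 0.0), (1.0, 1.0)]
controls = [(0.9, 1.1), (0.1, -0.1), (5.0, 5.0)]
weights = [0.6, 0.4]  # assumed fraction of variance explained by PC1, PC2

assignment, total = match_controls(cases, controls, weights)
print(assignment)
```

In this toy input each case is paired with the control closest to it in the weighted PC space, and the distant third control is left unmatched. A real analysis would use many more samples and a dedicated optimal-matching solver rather than enumeration.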

Availability and implementation

PCAmatchR is freely available for download on GitHub (https://github.com/machiela-lab/PCAmatchR) or through CRAN (https://CRAN.R-project.org/package=PCAmatchR).

Supplementary information

Supplementary data are available at Bioinformatics online.

This work is written by US Government employees and is in the public domain in the US.