Motivation

Species and gene trees represent how species and individual loci within their genomes evolve from their most recent common ancestors. These trees are central to addressing several questions in biology relating to, among other issues, species conservation, trait evolution and gene function. Consequently, their accurate inference from genomic data is a major endeavor. One approach is to co-estimate species and gene trees from genome-wide data. Bayesian methods based on this approach already exist, but they are very slow, which limits their applicability to datasets with small numbers of taxa. The more commonly used approach is to first infer gene trees individually and then use the gene tree estimates to infer the species tree. Methods in this category rely heavily on the accuracy of the gene trees, which is often low when the dataset includes closely related species.

Results

In this work, we introduce a simple, yet effective, iterative method for co-estimating gene and species trees from sequence data of multiple, unlinked loci. In every iteration, the method estimates a species tree, uses it as a generative process to simulate a collection of gene trees, and then selects gene trees for the individual loci from among the simulated gene trees by making use of the sequence data. We demonstrate the accuracy and efficiency of our method on simulated as well as biological data, and compare them to those of existing competing methods.
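The iteration described above can be sketched as a short loop. This is only an illustrative skeleton, not the PhyloNet implementation: the function name `co_estimate`, its parameters, and the three callables it takes (`estimate_species_tree`, `simulate_gene_trees`, `score`) are hypothetical placeholders. The sketch also assumes that the species tree in each iteration is estimated from the current gene tree estimates, that some initial gene trees are supplied, and that each locus keeps the single best-scoring simulated tree; the abstract does not specify these details.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Tree = TypeVar("Tree")            # any tree representation (assumption)
Alignment = TypeVar("Alignment")  # sequence data for one locus (assumption)


def co_estimate(
    alignments: Sequence[Alignment],
    initial_gene_trees: Sequence[Tree],
    estimate_species_tree: Callable[[Sequence[Tree]], Tree],
    simulate_gene_trees: Callable[[Tree, int], List[Tree]],
    score: Callable[[Tree, Alignment], float],
    n_iterations: int = 10,
    n_simulated: int = 100,
) -> Tuple[Optional[Tree], List[Tree]]:
    """Iteratively refine a species tree and one gene tree per locus.

    All three callables are hypothetical stand-ins supplied by the caller:
    - estimate_species_tree: species tree from the current gene trees
    - simulate_gene_trees:   draw gene trees using the species tree as a
                             generative process
    - score:                 fit of a gene tree to a locus's sequence data
                             (higher is better)
    """
    gene_trees = list(initial_gene_trees)
    species_tree = None
    for _ in range(n_iterations):
        # Step 1: estimate a species tree (assumed: from current gene trees).
        species_tree = estimate_species_tree(gene_trees)
        # Step 2: simulate a collection of gene trees under that species tree.
        candidates = simulate_gene_trees(species_tree, n_simulated)
        # Step 3: for each locus, select the simulated gene tree that best
        # fits that locus's sequence data.
        gene_trees = [
            max(candidates, key=lambda tree: score(tree, alignment))
            for alignment in alignments
        ]
    return species_tree, gene_trees
```

Passing the three steps in as callables keeps the loop independent of any particular tree representation, species tree estimator, or scoring criterion, none of which are fixed by the description above.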

Availability and implementation

The method has been implemented in PhyloNet, which is publicly available at http://bioinfocs.rice.edu/phylonet.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).