EvoDB: a database of evolutionary rate profiles, associated protein domains and phylogenetic trees for PFAM-A

Abstract

The evolutionary rate at codon sites across protein-coding nucleotide sequences represents a valuable tier of information for aligning sequences, inferring homology and constructing phylogenetic profiles. However, a comprehensive resource for cataloguing the evolutionary rate at codon sites and their corresponding nucleotide and protein domain sequence alignments has not been developed. To address this gap, EvoDB (an Evolutionary rates DataBase) was compiled. Nucleotide sequences and their corresponding protein domain data, including the associated seed alignments from the PFAM-A (protein family) database, were used to estimate evolutionary rate (ω = dN/dS) profiles at codon sites for each entry. EvoDB contains 98.83% of the gapped nucleotide sequence alignments and 97.1% of the evolutionary rate profiles for the corresponding information in PFAM-A. Because the identification of codon sites under positive selection and their position in a sequence profile is usually the most sought-after information for molecular evolutionary biologists, evolutionary rate profiles were determined under the M2a model using the CODEML program in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite of software. Nucleotide sequences were validated against amino acid data to ensure high data quality. EvoDB is a catalogue of the evolutionary rate profiles and provides the corresponding phylogenetic trees, PFAM-A alignments and annotated accession identifier data. In addition, the database can be explored and queried using known evolutionary rate profiles to identify domains under similar evolutionary constraints and pressures. EvoDB is a resource for evolutionary and phylogenetic studies and presents a tier of information untapped by current databases.

Database URL: http://www.bioinf.wits.ac.za/software/fire/evodb

Introduction

Hypothesis testing in molecular evolution and phylogenetics depends upon accurate sequence alignments. These data can then be used to investigate adaptation at the sequence level by, for example, detecting protein domains, individual codon sites or branches under positive selection (1), or to develop resources for molecular evolutionary studies (2). However, protein domains and their corresponding nucleotide sequences linked to estimates of sequence evolutionary rates are not readily available. Furthermore, a comprehensive database of sequence evolutionary rate profiles, which could be probed with a query sequence of a known evolutionary rate profile, is currently unavailable.

To address this gap, EvoDB (an Evolutionary rates DataBase) was compiled. The CODEML program in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite of software (ver. 4.4) (3) was used to estimate evolutionary rates (by convention designated as ω = dN/dS), although any similar methodology (e.g. HyPhy) can be used (4). Evolutionary rates can be determined for whole sequences (the entire protein or protein domain), for lineages within a phylogenetic tree or at particular codon sites. Typically, molecular evolutionary biologists are interested in identifying specific codon sites under positive selection or adaptive evolution (5) or in the pattern of evolutionary rates at codon sites across a sequence (6). The PAML site model M2a (NSsites = 2), which uses nucleotide coding sequence data to calculate the evolutionary rate at codon sites, was therefore used in the development of this database and for explanatory purposes, although any of the alternative models can be substituted. In addition to the M2a model, M1a analyses are provided. The PFAM-A seed alignments (7) provided a suitable framework for establishing a database comprising a catalogue of the evolutionary rate at codon sites for each protein domain entry. The PANDIT (Protein and Associated Nucleotide Domains with Inferred Trees) database is a compilation of nucleic acid sequences and corresponding phylogenetic trees for the PFAM-A database; however, it has not been updated since version 17.0 of PFAM in 2005 (8). Nucleotide sequence data for each domain in the PFAM-A database are obtained from the cross-reference mapping information in the PFAM Stockholm file. This information points to a Swiss-Prot (9) protein entry, which provides the accession identifier and feature information for the corresponding GenBank (10) file.

EvoDB comprises protein domains from PFAM-A, the corresponding nucleotide sequences and estimates of their evolutionary rates based upon the PFAM-A (ver. 27.0) seed alignments. The database is a compilation of evolutionary rates linked to amino acid and nucleotide data and can be queried using evolutionary rate estimates (under model M2a). The conceptualization and implementation of such an approach have been described elsewhere (11) and a newer version is forthcoming (manuscript in review).

Methods

Implementation

The retrieval of sequences and the determination of the ω maximum-likelihood estimate (MLE) profiles using CODEML are computationally expensive. The Wits Core Cluster (ZA-WITS-CORE), comprising 13 nodes running Scientific Linux 6.3, was used to run these steps in parallel. The PFAM-A seed alignments, the GenBank database and the Swiss-Prot and TrEMBL sections of the UniProt resource (9) were made available locally. Resources were managed by TORQUE (PBS) and jobs were scheduled by Maui. Each PFAM entry was submitted as a job using custom scripts. The compilation pipeline is shown in Figure 1.

Figure 1.

Workflow for the development and compilation of EvoDB.

Sequence retrieval, validation and determination of ω MLEs profiles

Accession identifiers for protein sequences were extracted from PFAM Stockholm files using the generic per-sequence annotation ‘#=GS AC’ tag. These identifiers were used to retrieve protein sequences from the Swiss-Prot database using the ENTRET program in the EMBOSS (European Molecular Biology Open Software Suite) software package (ver. 6.3.1) (12). Sequences that could not be found in Swiss-Prot were queried in the computer-annotated TrEMBL database; those found in neither were removed from the alignments. Each Swiss-Prot protein sequence file contains mapping accession numbers for the corresponding nucleotide sequences in the ‘DR EMBL’ database cross-reference annotation. Accession and feature information from the Swiss-Prot files was used to retrieve the corresponding nucleotide files from the GenBank database (ver. 193.0) using the ENTRET program. The same accession identifier and feature information was then used to extract the coding sequences (CDS) from the GenBank files using the EXTRACTFEAT program in EMBOSS.
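The accession extraction step can be illustrated with a minimal sketch. It assumes the usual Stockholm layout of a per-sequence accession line (`#=GS <sequence name> AC <accession>`); the function name, the example identifiers and the decision to drop the version suffix are illustrative assumptions, not the authors' scripts.

```python
def extract_accessions(stockholm_text):
    """Map each sequence name to its accession from '#=GS <name> AC <acc>' lines."""
    accessions = {}
    for line in stockholm_text.splitlines():
        fields = line.split()
        # A per-sequence accession line has the form: #=GS <seqname> AC <accession>
        if len(fields) >= 4 and fields[0] == "#=GS" and fields[2] == "AC":
            name = fields[1]
            # Assumption: drop a trailing version suffix (e.g. ".2") before lookup.
            accessions[name] = fields[3].split(".")[0]
    return accessions


# Made-up example alignment; the names and accessions are hypothetical.
example = """# STOCKHOLM 1.0
#=GF ID   Example
#=GS SEQ1_TEST/10-50 AC A00001.2
#=GS SEQ2_TEST/5-45  AC B00002.1
SEQ1_TEST/10-50  MKV-LA
SEQ2_TEST/5-45   MKVTLA
//
"""

accessions = extract_accessions(example)
```

The resulting identifiers would then be passed to a retrieval tool such as ENTRET, as described above.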

The TRANALIGN program in EMBOSS finds corresponding nucleotide CDS for a protein sequence by comparing the translated nucleotide to the protein sequences in all three forward frames. TRANALIGN verified the retrieved CDS and validated the corresponding nucleotide sequences using the Swiss-Prot protein data. Validation was also repeated using the final gapped nucleotide alignment and the PFAM family alignment to ensure high quality of the sequence data.
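The idea behind this check can be sketched in a few lines. The sketch below is a simplification and not TRANALIGN itself: it translates a nucleotide sequence in the three forward frames using the standard genetic code and looks for an exact match to the protein, whereas TRANALIGN also builds the aligned nucleotide output. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc, frame=0):
    """Translate a nucleotide string from the given forward frame (0, 1 or 2)."""
    nuc = nuc.upper().replace("U", "T")
    codons = (nuc[i:i + 3] for i in range(frame, len(nuc) - 2, 3))
    # Codons with ambiguous bases are translated as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def matching_frame(nuc, protein):
    """Return the first forward frame whose translation contains the protein, else None."""
    for frame in range(3):
        if protein.upper() in translate(nuc, frame):
            return frame
    return None
```

A CDS for which `matching_frame` returns `None` would fail this simplified validation and be excluded.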

Phylogenetic trees available from PFAM were used. These were calculated with the FastTree algorithm (13), an approximately maximum-likelihood, neighbor-joining-based approach, using 100 resamples. To handle missing nucleotide sequences, a computationally less expensive approach than re-estimating the trees was adopted: the missing sequences were pruned from the existing trees. The pruning algorithm removed missing sequences by collapsing the connecting nodes; where this removed a bifurcation, the branch lengths of the pruned sequences were added to those of the remaining sequences.
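One way to picture the pruning step is the sketch below. It follows a common reading of the description, which is an assumption here: when removing a leaf leaves an internal node with a single child, that node is collapsed and its branch length is added to the surviving child. The `Node` class and function names are hypothetical and do not reproduce the authors' code.

```python
class Node:
    """Minimal rooted tree node: leaves have a name, all nodes have a branch length."""

    def __init__(self, name=None, length=0.0, children=None):
        self.name = name
        self.length = length
        self.children = children or []


def prune(node, keep):
    """Restrict the tree (in place) to leaves in `keep`; return None if none remain."""
    if not node.children:
        return node if node.name in keep else None
    kept = [c for c in (prune(child, keep) for child in node.children) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # Collapse the now-unary node: its branch length passes to the surviving child.
        only = kept[0]
        only.length += node.length
        return only
    node.children = kept
    return node


def to_newick(node):
    """Serialise a node and its descendants as a Newick fragment (without ';')."""
    label = node.name or ""
    if node.children:
        label = "(" + ",".join(to_newick(c) for c in node.children) + ")" + label
    return f"{label}:{node.length:g}"


# ((A:0.1,B:0.2):0.3,C:0.4); with B missing becomes (A:0.4,C:0.4);
tree = Node(children=[
    Node(length=0.3, children=[Node("A", 0.1), Node("B", 0.2)]),
    Node("C", 0.4),
])
pruned = prune(tree, keep={"A", "C"})
```

Because no likelihood is re-optimised, pruning of this kind is much cheaper than rebuilding each tree from the reduced alignment.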

Pruned trees and nucleotide sequences were then used to determine the ω profiles. The Bayes Empirical Bayes (1) ω MLEs at codon sites were calculated using the CODEML program (PAML ver. 4.4) (3) under the M2a model (NSsites = 2). This model assumes a single ω ratio across all branches and allows for the detection of positive selection at codon sites. MLEs for ω were extracted from the CODEML ‘rst’ output file and used to compile the ω MLE profile for each family. In addition, analysis results under the M1a (nearly neutral) model are provided for comparison.
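As an illustration of how such a run might be configured, the sketch below writes a CODEML control file requesting the M1a and M2a site models (`NSsites = 1 2`) with one ω ratio across branches (`model = 0`). Only those two settings follow from the text; the remaining values (codon frequency model, starting values, file names) are assumptions chosen for the example and are not reported by the authors.

```python
def codeml_control(seqfile, treefile, outfile="mlc"):
    """Return the text of a codeml control file for site models M1a and M2a."""
    options = {
        "seqfile": seqfile,    # gapped codon alignment
        "treefile": treefile,  # pruned tree in Newick format
        "outfile": outfile,    # main results file
        "noisy": 0,
        "verbose": 0,
        "runmode": 0,          # 0: use the user-supplied tree
        "seqtype": 1,          # 1: codon sequences
        "CodonFreq": 2,        # 2: F3x4 (assumption, not stated in the text)
        "model": 0,            # 0: one omega ratio for all branches
        "NSsites": "1 2",      # 1: M1a (nearly neutral), 2: M2a (positive selection)
        "icode": 0,            # 0: universal genetic code (assumption)
        "fix_kappa": 0,
        "kappa": 2,            # starting value (assumption)
        "fix_omega": 0,
        "omega": 1,            # starting value (assumption)
        "cleandata": 0,        # keep sites with gaps (assumption)
    }
    return "\n".join(f"{key:>10} = {value}" for key, value in options.items()) + "\n"


# Hypothetical file names for a single PFAM family.
control_text = codeml_control("PF00000.phy", "PF00000.nwk")
```

Running CODEML with such a file produces the main results file together with the ‘rst’ file from which the per-site ω estimates are read.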

Results and discussion

EvoDB is a flat file database of evolutionary rate profiles, associated gapped nucleotide alignments, phylogenetic trees and corresponding PFAM alignments for the PFAM-A seed alignments database. The database statistics are provided in Table 1. EvoDB contains a total of 501,375 nucleotide sequences; a further 176,757 (26%) could not be retrieved, mostly because of annotation errors, an increasing challenge that has not been addressed since the work of Whelan et al. (8). The corresponding phylogenetic trees, PFAM-A alignments and accession identifier data for all sequences, including those that could not be retrieved, are provided in the database. Evolutionary rate profiles were determined for 97.1% of PFAM-A entries under the M2a model. In addition to these profiles, CODEML analysis results for the M1a and M2a models are provided for comparison and hypothesis testing. Future versions of EvoDB will provide data for the M0, M7 and M8 models. The fit of the model used to determine an evolutionary profile can be assessed from the log-likelihood values or with the Likelihood Ratio Test (LRT) (3), using the CODEML ‘mlc’ and ‘rst’ files provided. While we provide the evolutionary rate profiles (under M2a) for all the domains in EvoDB, the caveat is that calculation of dN/dS may be inappropriate for sequences that have become highly diverged (for example, over millions of years) or that are very closely related. We suggest as a criterion a total branch dS between 0.1 and 0.9, as reported in the CODEML ‘mlc’ file; domains not meeting this criterion may not be appropriate for dN/dS calculation. Users of the web interface are cautioned if a domain has a sequence length of less than 100 nucleotides or a total dS value outside this range. We provide this as a guideline only, and suggest caution and further interrogation when using dN/dS profiles from domains that do not meet the criterion.

Because the sequence data and trees are also provided, different models can be run and assessed using the log-likelihood values or the LRT (3). The web interface for EvoDB was developed with PHP and JavaScript and can be queried by PFAM accession numbers or identifiers. Query results provide links to all the EvoDB data for the corresponding domain (Figure 2). The EvoDB database and release notes are available for download at http://www.bioinf.wits.ac.za/software/fire/evodb.
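The two checks described above can be sketched as follows. The LRT compares M1a with M2a using the statistic 2(lnL_M2a − lnL_M1a), conventionally referred to a chi-square distribution with two degrees of freedom, for which the upper-tail probability is exp(−x/2). The caution flag applies the guideline thresholds quoted in the text. The function names and the log-likelihood values are hypothetical; real values would be read from the ‘mlc’ files.

```python
import math


def lrt_m1a_vs_m2a(lnl_m1a, lnl_m2a):
    """Return (statistic, p-value) for the M1a vs M2a likelihood ratio test."""
    # Negative differences can arise from optimisation noise; clamp them to zero.
    stat = max(0.0, 2.0 * (lnl_m2a - lnl_m1a))
    # Upper-tail probability of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


def needs_caution(total_ds, n_nucleotides, ds_low=0.1, ds_high=0.9, min_length=100):
    """Flag a domain whose length or total dS falls outside the suggested guideline."""
    return n_nucleotides < min_length or not (ds_low <= total_ds <= ds_high)


# Hypothetical log-likelihoods for one domain.
stat, p_value = lrt_m1a_vs_m2a(lnl_m1a=-1000.0, lnl_m2a=-995.0)
```

With these example values the statistic is 10 and the p-value is exp(−5), so M2a would be preferred at conventional significance levels.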

Figure 2.

The EvoDB web interface allows for easy query and download of data. The database can be queried using PFAM-A domain identifiers and accession identifiers. The results shown here are for the tumor suppressor p53 domain. The CODEML ‘mlc’ and ‘rst’ analysis results for the M1a and M2a models are provided, together with a summary of results for viewing. Graphical plots of evolutionary rate profiles can also be viewed or downloaded in various image file formats. EvoDB provides an interface for downloading the corresponding nucleotide sequences of PFAM protein domain families.

Table 1.

Statistics of the EvoDB database representation for the PFAM-A seed alignments database

Sequence data	Pandit (number)	EvoDB (number)	Pandit (%)	EvoDB (%)
Evolutionary rate (ω MLE) profiles	n/a	13,277	n/a	97.1
Nucleotide sequence alignments	7,738	13,512	56.6	98.83
Nucleotide sequences	174,760	501,375	25.8	74

The numbers of corresponding sequence data in Pandit (Pandit-Plus) have been provided for comparison. The percentage represents comparison of EvoDB coverage to the total numbers found in the PFAM-A seed alignments database.
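The sequence-level percentages can be reproduced from the counts quoted in the text, under the assumption that the total number of sequences in the PFAM-A seed alignments equals the retrieved plus the unretrieved sequences (501,375 + 176,757). The variable names are illustrative.

```python
# Counts quoted in the text and in Table 1.
evodb_sequences = 501_375
unretrieved_sequences = 176_757
pandit_sequences = 174_760

# Assumption: the PFAM-A seed total is the sum of retrieved and unretrieved sequences.
pfam_total = evodb_sequences + unretrieved_sequences

evodb_coverage = 100 * evodb_sequences / pfam_total
unretrieved_share = 100 * unretrieved_sequences / pfam_total
pandit_coverage = 100 * pandit_sequences / pfam_total
```

Rounded, these give the 74%, 26% and 25.8% figures reported for nucleotide sequences.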

EvoDB represents a valuable resource for phylogenetic studies, and can be used to test hypotheses in molecular evolution. It represents a tier of information untapped by current databases and will complement the arsenal of tools in phylogenetic studies.

Acknowledgements

The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at are those of the author and are not necessarily to be attributed to the NRF. A.N. is supported by the Durand Foundation Scholarship for Evolutionary Biology and Phycology.

Funding

National Aeronautics and Space Administration (#NNX13AH41G to P.M.D.)—principal investigator Professor R.E. Michod (University of Arizona) (#SFH13091742708 to A.N.). Funding for open access charge: University of the Witwatersrand, Johannesburg.

Conflict of interest. None declared.

References

1. Yang, Wong W.S., Nielsen (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol., 1107–1118.

2. Dimitrieva, Anisimova (2010) PANDITplus: toward better integration of evolutionary view on molecular sequences with supplementary bioinformatics resources. Trends Evol. Biol.

3. Yang (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol., 1586–1591.

4. Pond S.L.K., Frost S.D.W., Muse S.V. (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics, 676–679.

5. Yang, Swanson W.J., Vacquier V.D. (2000) Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol., 1446–1455.

6. Durand P.M., Naidoo, Coetzer T.L. (2008) Evolutionary patterning: a novel approach to the identification of potential drug target sites in Plasmodium falciparum. PLoS One, e3685.

7. Punta, Coggill P.C., Eberhardt R.Y. et al. (2012) The Pfam protein families database. Nucleic Acids Res., D290–D301.

8. Whelan, de Bakker P.I., Quevillon et al. (2006) PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res., D327–D331.

9. Consortium T.U. (2014) Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res., D191–D198.

10. Benson D.A., Karsch-Mizrachi, Clark, Lipman D.J., Ostell, Sayers E.W. (2012) GenBank. Nucleic Acids Res., D48–D53.

11. Durand P.M., Hazelhurst, Coetzer T.L. (2010) Evolutionary rates at codon sites may be used to align sequences and infer protein domain function. BMC Bioinformatics, 151.

12. Rice, Longden, Bleasby (2000) EMBOSS: the European molecular biology open software suite. Trends Genet., 276–277.

13. Price M.N., Dehal P.S., Arkin A.P. (2010) FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One, e9490.

Author notes

Citation details: Ndhlovu,A., Durand,P.M. and Hazelhurst,S. EvoDB: a database of evolutionary rate profiles, associated protein domains and phylogenetic trees for PFAM-A. Database (2015) Vol. 2015: article ID bav065; doi:10.1093/database/bav065

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
