PhyloPro2.0: a database for the dynamic exploration of phylogenetically conserved proteins and their domain architectures across the Eukarya

Abstract

PhyloPro is a database and accompanying web-based application for the construction and exploration of phylogenetic profiles across the Eukarya. In this update article, we present six major new developments in PhyloPro: (i) integration of Pfam-A domain predictions for all proteins; (ii) new summary heatmaps and detailed level views of domain conservation; (iii) an interactive, network-based visualization tool for exploration of domain architectures and their conservation; (iv) ability to browse based on protein functional categories (GOSlim); (v) improvements to the web interface to enhance drill down capability from the heatmap view; and (vi) improved coverage including 164 eukaryotes and 12 reference species. In addition, we provide improved support for downloading data and images in a variety of formats. Among the existing tools available for phylogenetic profiles, PhyloPro provides several innovative domain-based features including a novel domain adjacency visualization tool. These are designed to allow the user to identify and compare proteins with similar domain architectures across species and thus develop hypotheses about the evolution of lineage-specific trajectories.

Database URL: http://www.compsysbio.org/phylopro/

Introduction

Phylogenetic profiling has been widely adopted as a method to visualize evolutionary conservation of genes/proteins. This approach has been facilitated by improvements in sequencing technology resulting in an ever-increasing number of fully sequenced eukaryotic genomes. Beyond the use of phylogenetic profiles to predict gene function ( 1–3 ), a number of online tools have been developed that allow users to explore and visualize phylogenetic profiles. For the most part, such tools are restricted to providing profiles for a single orthologous group of proteins (orthogroup). For example, EnsemblCompara GeneTrees, which is largely focused on vertebrates, allows the visualization of ortholog gains and losses in the context of a phylogenetic tree ( 4 ). TreeFam offers a summary tree visual, which indicates the proportion of species within lineages that possess orthologs of a selected gene ( 5 ), while EggNOG ( 6 ) and OrthoMCL ( 7 ) provide the ability to generate a taxonomic profile for a specified orthogroup as well as to identify genes with defined phylogenetic profiles. However, these tools do not allow the visualization of more than one orthogroup at a time and, in the case of EnsemblCompara and TreeFam, are largely focused on vertebrates and metazoa, respectively. On the other hand, the OMA ( 8 ) database resource, which captures orthologous relationships from 1706 complete proteomes, does offer the capacity to view taxonomic profiles for closely related orthogroups. However, it does not allow the direct comparison of potentially unrelated orthogroups; furthermore, the lack of clustering of profiles makes it difficult to infer lineage-specific innovations across large groups of genes.

Phylogeny-based methods, which are computationally intensive, provide robust support for orthology focused at the level of individual genes. PhylomeDB ( 9 ), for instance, provides phylogenetic trees and alignments as well as orthology and paralogy predictions for a seed sequence based on calling duplication and speciation events on the tree as determined by a species-overlap algorithm. This differs from the more commonly used approach of reconciling a gene tree to the species tree. As a consequence of building gene trees for every gene, multi-gene families are represented by several, independently derived trees. MetaPhOrs ( 10 ) has exploited this additional information to define a measure of reliability for ortholog predictions based on consistency. MetaPhOrs applies this approach to trees derived from PhylomeDB ( 9 ), EnsemblCompara ( 4 ), TreeFam ( 11 ) and yeast Orthogroups ( 12 ) as well as trees reconstructed from EggNOG ( 6 ), OrthoMCL ( 7 ) and COG ( 13 ) to provide phylogeny-based orthology and paralogy predictions for 4.1 million proteins in 829 fully sequenced genomes. A disadvantage remains that this focused approach comes at the expense of tools to analyze/visualize higher level patterns at the level of functionally related pathways or complexes. Recently, there has been much debate in the orthology community about whether orthology/paralogy predictions, which are traditionally genocentric, are more appropriately made at the domain level or smaller ( 14 ). Protein domains are conserved, relatively short units of selection [typically <200 amino acid residues ( 15 )] that mostly correspond to independent folding units. According to this argument, differences in the linear sequence of domains (deemed the domain architecture) can be common among orthologs and are functionally important. This is especially apparent for multicellular eukaryotes where domain architectures are often highly complex and lineage specific ( 14 , 16 ). For example, within the drosophilids, domain rearrangements were found to occur in 36% of gene families ( 17 ). Consequently, while genocentric methods remain a common approach, a re-examination of the assumptions and definitions underlying orthology prediction has resulted in a community consensus that a range of tools will continue to be required, and is indeed desirable, in addressing different orthology-based questions in a variety of contexts ( 18 ).

Underscoring their importance, ortholog databases are beginning to include domain level information, for example, the latest release of PhylomeDB ( 9 ) includes information on Pfam-A domains. However, aside from the ability to reference and view domain architectures on individual gene trees, there is currently a lack of further facility to explore domain adjacency patterns or overview domain conservation at the level of the phylogeny. The contribution of novel domains and domain combinations in driving protein evolution has been a subject of much recent interest ( 19–23 ). In addition to emphasizing the potential for differing mechanisms to contribute to domain variation in different lineages, these studies reveal that patterns of selective domain gain or loss, including the gain or loss of domain repeats, contribute to the evolutionary trajectory of species. To our knowledge, no sequence-based orthology databases yet incorporate domain-level information to help delineate orthology assignments.

Addressing this gap, we present PhyloPro2.0, a database integrating pre-calculated, Inparanoid-based orthologs. Here, we rely on Inparanoid orthology assignments as it is a well-established, BLAST-based method which has been shown to perform as well as or better than other methods across a wide range of eukaryotic genomes with both sensitivity and specificity >80% ( 24 ). Due to the reliance of PhyloPro2.0 on pairwise orthology assignments, Inparanoid is well suited to our automated workflow. Unlike existing resources that focus on individual orthogroups, PhyloPro2.0 offers the capacity to visualize and study orthology and domain phyletic profiles for large sets of genes (up to 1000 genes) and their orthologs. Further, through an interactive domain adjacency visualization tool, users are able to explore the influence of domain architectures on protein conservation. These data and tools, based on highly confident Pfam-A domain predictions ( 25 ), enable the user to link overall protein conservation with underlying domain conservation patterns. PhyloPro has proved to be a valuable resource for evolutionary and comparative studies of biological systems, such as Chromatin modification ( 26 ), Extracellular matrices ( 27 ), Apicomplexan membrane proteins ( 28 ) and vertebrate multi-domain proteins ( 23 ). PhyloPro is freely available via the web and all underlying datasets are downloadable.

PHYLOPRO2.0: generating evolutionary trajectories

Features

Since its original publication ( 29 ), PhyloPro has been continuously updated. New in the current release are Pfam domain and domain architecture conservation information and tools for their exploration across clades. The heatmap view and clustering capability featured in the protein and domain conservation views are unique to PhyloPro and provide a powerful tool for systems level assessment of broad conservation patterns. Clustering is accomplished using Cluster3.0 ( 30 ) and can be customized prior to visualization using the advanced search tool. One of the strengths of PhyloPro is the ability to visualize relationships across many closely related species. This allows the identification of inconsistencies such as the absence of orthologs across a specific lineage that may indicate issues in orthology assignment (potentially due to quality of the associated genome). Compared with the previous release, the number of reference organisms has doubled (from 6 to 12) and the number of available genomes has expanded (from 120 to 164).

Data acquisition

Focusing on 12 model organisms ( Table 1 ), the Inparanoid algorithm ( 31 ) was used to perform pairwise homology searches for each model species against 164 eukaryotes (including other model species) for which a complete genome sequence has been generated. These comprise 6 plant species, 3 green algae, 1 red alga, 4 stramenopiles, 1 haptophyte, 2 ciliates, 13 apicomplexans, 4 kinetoplastids, 1 diplomonad, 1 cryptophyte, 1 parabasalid, 1 heterolobosid, 2 amoebozoa, 1 microsporidium, 36 fungi, 5 ‘basal’ metazoa, 8 lophotrochozoa, 16 nematodes, 21 arthropods, 4 chordates, 9 vertebrates and 24 mammals. The use of Inparanoid readily facilitates the identification of so-called in-paralogs representing lineage-specific gene duplication events. Given a list of query genes from a model species (the ‘reference’ species), for each ‘target’ species, we define one of five possible homology relationships: (i) no detectable ortholog; (ii) one to one (1:1), in which a single query gene has a single ortholog in the target species; (iii) one to many (1:M), in which a single query gene has two or more orthologs in the target species; (iv) many to one (M:1), in which a query gene together with at least one additional paralog are orthologs of a single gene in the target species; and (v) many to many (M:M), in which a query gene together with at least one additional paralog are orthologs of at least two genes in the target species. The collation of these relationships for each of the 164 target species defines a phylogenetic profile for each query gene, which is stored along with the domain predictions described below in a local PostgreSQL database. We also make these datasets, together with the proteome datasets, available for download.
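The five homology relationships defined above can be illustrated with a short Python sketch. It assumes that, for each target species, we already know how many reference-species genes (the query plus any in-paralogs) and how many target-species genes fall in the same ortholog group. The function names and the input layout are hypothetical and are not taken from the PhyloPro code base.

```python
def classify_homology(n_reference_members: int, n_target_members: int) -> str:
    """Label one reference/target comparison.

    n_reference_members: the query gene plus its in-paralogs in the reference species.
    n_target_members: orthologs found in the target species (0 if none detected).
    """
    if n_reference_members < 1:
        raise ValueError("the group must contain at least the query gene")
    if n_target_members == 0:
        return "none"
    if n_reference_members == 1:
        return "1:1" if n_target_members == 1 else "1:M"
    return "M:1" if n_target_members == 1 else "M:M"


def build_profile(group_sizes: dict) -> dict:
    """Collate the labels for all target species into one phylogenetic profile.

    group_sizes maps a target species name to a pair
    (n_reference_members, n_target_members).
    """
    return {
        species: classify_homology(n_ref, n_tgt)
        for species, (n_ref, n_tgt) in group_sizes.items()
    }


# Toy input with made-up counts for three target species.
profile = build_profile({"mouse": (1, 1), "zebrafish": (1, 2), "yeast": (1, 0)})
print(profile)
```

One profile of this kind per query gene gives the rows of the heatmap described later in the article.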

Table 1. List of reference species

No.	Common name	Scientific name	Source (Date)
1	Thale cress	Arabidopsis thaliana	PlantGDB: v.173 (26/08/09)
2	Baker's yeast	Saccharomyces cerevisiae	SGD: (12/12/07)
3	Roundworm	Caenorhabditis elegans	WormBase: WS205 (30/07/09)
4	Fruit fly	Drosophila melanogaster	FlyBase: v.1.3 (25/06/09)
5	House mouse	Mus musculus	ENSEMBL: (23/11/07)
6	Human	Homo sapiens	ENSEMBL: (23/11/07)
7	Malarial parasite	Plasmodium falciparum 3D7	PlasmoDB: v.5.4 (24/09/07)
8	Toxoplasma parasite	Toxoplasma gondii ME49	ToxoDB: v.4.3 (01/11/07)
9	Zebrafish	Danio rerio	ENSEMBL: (23/11/07)
10	Fission yeast	Schizosaccharomyces pombe	SANGER: (11/05/06)
11	Leishmania parasite	Leishmania major strain Friedlin	EMBL: (24/12/07)
12	Blood fluke	Schistosoma mansoni	ENSEMBL: (31/07/14)

Domain predictions, based on Pfam-A definitions, were performed on a parallel computing platform using HMMER 3.0 with default parameters as implemented in PfamScan ( 32 ). Data flow was handled in a data processing pipeline written in-house using Perl. Note that Pfam defines six types of entries: Family, Domain, Repeat, Motifs, Coiled-Coil and Disordered ( http://pfam.xfam.org/help ). For our analysis, we only included Pfam-A definitions for entries labelled as either ‘Domains’ (defined by Pfam as a ‘structural unit’) or ‘Families’ (defined by Pfam as ‘a collection of related protein regions’), as these best fit our criteria as independent folding units. Domains found in each of the proteins in the 12 reference proteomes were compared with domains representing the full proteome of every other species. Domain architectures of target and reference orthologs were compared and classified parsimoniously as having gained or lost domains, having the same (conserved) domain architecture, or having rearrangements. Where more than one sequence of gains, losses or rearrangements was equally parsimonious, this resulted in a classification of ‘complex’ type. For purposes of the comparisons, domain order was taken into account. Adjacent domains were defined in the N-terminal to C-terminal orientation. Reverse orientations were considered to be unique (i.e. A − B ≠ B − A).
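To make the categories above concrete, the following Python sketch classifies a target architecture against its reference using a few simple rules. It is a deliberate simplification and not the parsimony procedure used in PhyloPro: here ‘complex’ is a catch-all for any pair the simple rules cannot explain, which is looser than the definition given in the text. All function names are hypothetical.

```python
from collections import Counter


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting domains only."""
    remaining = iter(long)
    return all(domain in remaining for domain in short)


def classify_architecture(reference, target):
    """Compare two domain lists, each ordered from N-terminus to C-terminus."""
    ref, tgt = list(reference), list(target)
    if ref == tgt:
        return "conserved"
    if len(tgt) > len(ref) and is_subsequence(ref, tgt):
        return "gain"  # target equals the reference plus inserted domains
    if len(tgt) < len(ref) and is_subsequence(tgt, ref):
        return "loss"  # target equals the reference minus some domains
    if Counter(ref) == Counter(tgt):
        return "rearrangement"  # same domains, different order (A-B vs. B-A)
    return "complex"  # simplification: anything needing mixed events


# Domain order matters, so A-B and B-A are distinct architectures.
print(classify_architecture(["A", "B"], ["B", "A"]))
```

Because the comparison works on ordered lists rather than sets, the reverse orientation of the same two domains is reported as a rearrangement and not as conserved.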

Functional annotations (GOSlim) for human proteins were acquired from BioMart ( 33 ). We used Ensembl 80 with default parameters and the following additional filters: Status (gene): KNOWN, Status (transcript): KNOWN, Transcript Support Level (TSL): Only, Limit to genes: with Pfscan ID(s). The frequencies of the resulting annotations were calculated using a Perl script, and the functional categories made available for bulk search were limited to those annotated to fewer than 1000 proteins, which we considered a reasonable threshold.
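The frequency filter described above was implemented by the authors in Perl; the Python sketch below shows the same idea under stated assumptions. It assumes the annotations are available as (protein identifier, GOSlim term) pairs and that each protein should be counted once per term. The function name and input layout are illustrative only.

```python
from collections import Counter


def browsable_terms(annotations, max_proteins=1000):
    """Return {term: protein_count} for terms annotated to fewer than max_proteins.

    annotations: iterable of (protein_id, goslim_term) pairs.
    """
    unique_pairs = set(annotations)  # count each protein once per term
    counts = Counter(term for _protein, term in unique_pairs)
    return {term: n for term, n in counts.items() if n < max_proteins}
```

With the default threshold, any term annotated to 1000 or more proteins is dropped from the list offered for browsing.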

Querying and browsing in PhyloPro

PhyloPro features several ways to launch a search. A quick search using default options can be performed by entering a space-separated list of gene or protein identifiers for a selected reference species of interest into the search box, selecting the type of information to return (protein conservation, domain conservation or domain adjacency) and clicking on the ‘Go’ button. For quick searches, the default reference organism for comparison corresponds to the type of the first identifier recognized among the first 10 listed genes. For example, the use of a mouse gene identifier (e.g. ENSMUSG00000034205) would result in an analysis with mouse as the reference species. Identifier types are not limited to Ensembl but reflect a variety of identifiers in use for various species, depending on cross-referencing available at the time the species was loaded. Alternatively, users have the option of choosing a functional category from a list of Gene Ontology (GO) terms to automatically populate the search list with proteins annotated to the selected term. For performance considerations, available terms are based on GOSlim annotations with a frequency cutoff of 1000 proteins.
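The way a quick search picks its reference species can be sketched as follows. This is only an illustration of the rule stated above: the prefix table is hypothetical (only the mouse prefix ENSMUSG appears in the text; the human and zebrafish Ensembl prefixes are added for the example), and the real service recognizes further identifier types that are not modelled here.

```python
# Illustrative prefix table, not PhyloPro's actual identifier mapping.
ID_PREFIX_TO_SPECIES = {
    "ENSMUSG": "Mus musculus",
    "ENSDARG": "Danio rerio",
    "ENSG": "Homo sapiens",
}


def infer_reference_species(query: str, scan_limit: int = 10):
    """Return the species of the first recognized identifier among the first
    `scan_limit` space-separated identifiers, or None if none is recognized."""
    for identifier in query.split()[:scan_limit]:
        for prefix, species in ID_PREFIX_TO_SPECIES.items():
            if identifier.upper().startswith(prefix):
                return species
    return None


print(infer_reference_species("ENSMUSG00000034205"))
```

Identifiers beyond the first ten are ignored when choosing the reference, so a list that starts with ten unrecognized names yields no default species in this sketch.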

Beyond the quick search and GO browsing capabilities, PhyloPro also offers a search option based on sequence similarities, using the well-established BLAST algorithm. It is recognized that as genomes and gene models become updated, gene and protein identifiers may become obsolete. The sequence similarity search option is included to guard against such possibilities. After selecting this option from the home page, the user is presented with a sequence similarity search page with options to run a nucleotide-based (BLASTx) or protein-based (BLASTp) search against a reference proteome of their choice. The user pastes in a set of sequences in FASTA format and, after clicking the ‘Go’ button, PhyloPro retrieves the top BLAST hit associated with each query sequence. The resulting page (protein conservation, domain conservation or domain network view as selected by the user) is then constructed from these hits. Mappings of the user sequences to the identified hits are also presented.
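Retrieving the top hit per query sequence can be sketched in Python as below. The sketch assumes BLAST+ standard tabular output (`-outfmt 6`), in which the first two columns are the query and subject identifiers and the twelfth is the bit score, and it takes ‘top hit’ to mean the highest bit score. How PhyloPro itself ranks hits is not stated in the article, so this criterion is an assumption.

```python
import csv
import io


def top_hits(blast_tabular: str) -> dict:
    """Return {query_id: (subject_id, bitscore)} with the best hit per query.

    blast_tabular: text in the 12-column tab-separated BLAST tabular format.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

The returned dictionary doubles as the mapping from user sequences to identified hits that the results page presents.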

Finally, PhyloPro also offers an advanced search option that allows users to specify a number of parameters for the analysis: (i) choose the reference species, (ii) limit the range of target species, (iii) choose the similarity metric and clustering method used for clustering the resulting heatmap (if applicable), (iv) choose the type of view (as above) and (v) upload a text file listing the proteins to be searched. Users can review the selected parameters before clicking ‘Go’ to start the search. The search behaves slightly differently depending on whether the user enters a gene identifier or a protein identifier in the search box. With a gene identifier, PhyloPro uses the longest peptide mapping to that gene as the basis for orthology prediction, whereas with a protein identifier the protein conservation profile uses the exact protein specified as the reference. By design, all domain-based views use the longest peptide for the corresponding gene as the basis for domain comparisons. PhyloPro uses a PostgreSQL ( http://www.postgresql.org/ ) database to speed the retrieval of large amounts of pre-calculated orthology and domain predictions. After a few moments the user will be taken to one of three views depending on their initial choice.
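The gene-identifier behaviour described above amounts to choosing the longest peptide among those mapped to a gene. A minimal Python sketch follows; the input layout (a dictionary from gene identifier to its peptides) and the function name are assumptions made for illustration, and ties are resolved here by taking the first peptide encountered.

```python
def longest_peptide(peptides_by_gene: dict, gene_id: str):
    """Return the protein identifier of the longest peptide for gene_id.

    peptides_by_gene maps gene_id -> {protein_id: amino_acid_sequence}.
    Returns None if the gene is unknown or has no peptides.
    """
    peptides = peptides_by_gene.get(gene_id, {})
    if not peptides:
        return None
    return max(peptides, key=lambda protein_id: len(peptides[protein_id]))
```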

The protein conservation view ( Figure 1A ) displays a heatmap in which colored tiles indicate the presence (color) or absence (black) of an ortholog of the reference organism in a given target species. The exact reference protein and target species corresponding to a particular tile are revealed by a mouse-over event, and selecting the tile displays the protein sequence of the orthologs and any predicted inparalogs arising from one to many (1:M), many to one (M:1) or many to many (M:M) predictions ( Figure 1A inset). The heatmap presented shows a subset of proteins corresponding to the GO functional category, ‘Anatomical structure formation involved in morphogenesis’ with human as the reference organism. For consistency, we use this subset as the basis for subsequent figures. Clustering of this set revealed at least three potentially interesting groupings: genes of mostly metazoan origin that have acquired additional paralogs in vertebrates (Group 1), a group including highly conserved genes with few additional paralogs (Group 2) and a smaller group consisting of some genes of mammalian origin as well as those featuring primate-specific paralogs (Group 3). Among the Group 1 proteins is SLIT2, a protein thought to act as a molecular guidance cue in cellular migration ( 34 ). Among an assortment of similar functions, SLIT1 and SLIT2 appear to be essential for midline guidance in the forebrain, acting as a repulsive signal preventing inappropriate midline crossing by axons projecting from the olfactory bulb. This may explain the occurrence of additional paralogs in vertebrates. The heatmap image, the summary analysis and the underlying sequence data may be downloaded from the view.

Figure 1.

Protein and domain conservation views. ( A ) Conservation of proteins corresponding to the GOSlim category, ‘Anatomical structure formation involved in morphogenesis’. Colored tiles indicate the presence (color) or absence (black) of an ortholog of the reference organism (in this case human) in a given target species. Species are indicated across the top, grouped by phylogeny with plants on the left. Proteins are indicated in rows on the left, clustered so that proteins with similar patterns of conservation are grouped together. The sequence of a selected human reference protein (SLIT2) and its mouse ortholog are also shown (inset). ( B ) Domain architecture conservation corresponding to the same group of proteins as in (A) above. Tile colors reflect the comparison between the reference and target domain architectures. The corresponding architectures for SLIT2 are shown (inset). Note that gene order is determined by clustering and is independent between views.

The domain conservation view ( Figure 1B ) displays a heatmap similar in layout to the protein view, with black tiles indicating the absence of an ortholog in the target species. However, here tiles are colored to indicate inferences about domain gain, loss or rearrangement resulting from a comparison of domain architecture (defined as the linear sequence of domains in the N to C terminal direction) in the reference vs. target orthologs. For simplicity, gains and losses of domains in repeats are grouped with those for single domains for purposes of coloring tiles in this view. However, a more granular categorization is captured in the summary analysis that can be downloaded along with the image and underlying domain architectures from this view. As with the protein view, a mouse-over event reveals the reference protein and target species corresponding to a particular tile. Selecting the tile displays the domain architecture for the reference and target sequence. Here, we once again focus on the mouse ortholog of the SLIT2 protein which is a red tile indicating fewer domains than the human reference. The pop-up allows us to identify that the mouse ortholog is very similar to the reference protein with a difference of one EGF domain ( Figure 1B inset). Clustering by domain conservation has revealed at least three broad categories of possible interest. The first indicates a subset of proteins (Group 4) characterized by a high variability in their domain architectures with possible clade-specific differences in the lophotrochozoa and arthropods, perhaps corresponding to morphological differences in life cycles. The second group (Group 5) appears to have largely conserved domain architectures, whereas the last group (Group 6) consists of proteins with no detectable domains. The latter may occur for reasons such as high sequence divergence or poor sequence quality in which case the domains that may be present remain below the confidence threshold.

A novel interactive domain visualization tool

Domain architectures may be further explored using the domain adjacency view ( Figure 2 ). Here, selected proteins for the reference organism are listed on the right panel along with the names of orthologous proteins among the 12 model species comprising the set of possible reference species. In the provided example, we selected the set of all reference species as the basis for this view because they represent an informative cross section of the available phylogeny. The main view consists of a directed network connecting domains (nodes) into architectures (a linear sequence of domains connected by edges in the N to C terminal direction) on the basis of their occurrence and adjacency in the set of proteins indicated in the right panel. Mousing over a protein in the panel results in a color change in the network, highlighting the domain architecture of the selected protein as well as revealing the specific orthologs of that protein (useful for a mixed pool of proteins). In this way, the domain architectures in a set of functionally related proteins and their orthologs are easily compared. Further, the network of architectures may be dynamically expanded to include neighboring domains. By selecting a species on the right panel, followed by a node in the graph representing a domain, PhyloPro will retrieve additional domain neighbors corresponding to adjacent domains in all other proteins in that species. These temporary additions are highlighted in a different color reflecting their transient nature. However, a neighbor may be permanently added to the network by right clicking it. Once added to the network in this way, the new domain may be used to seed further exploration. Finally, a path of nodes representing an ad hoc architecture can be highlighted by double clicking a series of nodes. Each time a node is double clicked it is assigned a number representing its order in this search architecture. Note that it is possible to select a node repeatedly. Clicking on the ‘Protein Search’ button will activate a search against the reference species for any proteins containing that pattern of domains. If the resulting proteins contain additional domains, both the additional proteins and their domains will be included in the view. The proteins must contain all of the selected domains in the order specified, but the pattern need not be contiguous, i.e. for a search architecture ABC a protein with architecture ADBC will match. It has been shown that domains do not necessarily need to be contiguous in order to contribute to a conserved three-dimensional fold ( 22 , 35 , 36 ). We have further discussed the importance of conserved, higher-order domain architectures elsewhere ( 23 ). As the domain adjacency view tool develops, we will seek to respond to user requests to incorporate additional features.
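The two ideas in this section, a directed adjacency network built from architectures and an ordered but non-contiguous pattern search, can be sketched in Python as follows. This is an illustration under stated assumptions and not the PhyloPro implementation: architectures are assumed to be available as N-to-C ordered lists of domain names, and all function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(architectures: dict) -> dict:
    """Build directed edges between adjacent domains.

    architectures maps protein -> list of domains ordered N to C terminal.
    Returns {(left_domain, right_domain): set of proteins containing that pair}.
    """
    edges = defaultdict(set)
    for protein, domains in architectures.items():
        for left, right in zip(domains, domains[1:]):
            edges[(left, right)].add(protein)  # direction is N -> C
    return dict(edges)


def matches_pattern(pattern, architecture) -> bool:
    """True if the pattern's domains occur in order, gaps allowed."""
    remaining = iter(architecture)
    return all(domain in remaining for domain in pattern)


def protein_search(pattern, architectures: dict) -> list:
    """Return the proteins whose architecture contains the ordered pattern."""
    return sorted(
        protein
        for protein, domains in architectures.items()
        if matches_pattern(pattern, domains)
    )


# The example from the text: search architecture ABC matches protein ADBC.
toy = {"p1": ["A", "B", "C"], "p2": ["A", "D", "B", "C"], "p3": ["A", "C", "B"]}
print(protein_search(["A", "B", "C"], toy))
```

Because the membership test consumes the architecture from left to right, a pattern that selects the same node twice (for example AA) only matches proteins in which that domain occurs at least twice.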

Figure 2.

Domain adjacency network exploration. ( A ) A domain adjacency graph for a subset of proteins corresponding to the GOSlim category, ‘Anatomical structure formation involved in morphogenesis’. Domains are shown as nodes. Edges indicate the adjacency of domain pairs (N to C terminal direction) within one or more architectures corresponding to the searched proteins listed in the side panel. For the example protein (SLIT2), the highlighted nodes indicate the domain architecture pertaining to this protein (enlargement and arrows added for emphasis). The side panel lists the orthologs of the searched proteins from which the graph has been constructed. ( B ) The area of interest has been expanded from the Laminin_G_1 node to include an additional Laminin_II domain, indicating that this pair appears in one or more additional proteins not in the original search. ( C ) Expansion continues with Laminin_II now added to the network as a permanent addition; further expansion from this domain identifies Laminin_I as a new neighbor. Selection of numbered nodes presents a green ‘Protein Search’ button, which initiates a search for additional proteins with this architecture that are not in the original list of search proteins. ( D ) The search in (C) has returned one additional protein (SLIT1), which was not in the original list of searched proteins. Exploration from LRRCT reveals LRR_4 as an adjacent neighbor. Note that multiple adjacent domains are often returned from the search, allowing one to build up a rich network in the direction of interest. Also, by selecting the ortholog in another species, differences in architectures between species may be explored and expansions may be scoped to a particular species.

Conclusions and future plans

A number of caveats are associated with orthology detection ( 37 , 38 ). First, in the absence of detailed phylogenetic analyses, domain gains, losses and shuffling events can significantly complicate orthology assignments. Second, horizontal gene transfer introduces an additional problem of xenologs which can lead to confounding outcomes. Third, the quality and coverage of genome annotation varies significantly between genome projects. Genomes of lower quality or with lower fold coverage may be associated with incomplete proteomes, giving rise to apparently missing orthologs. Finally, low quality or incomplete gene model annotations due to, for example incorrect splice sites or merging of unrelated genes can result in protein domains being missed and/or erroneous orthology assignments (for a more in depth discussion of the effects of genome annotation errors on the evaluation of domain architectures, see Ref. 17 ). While attempts have been made to define the quality of genomes based on metrics such as presence of indels ( 39 ) or expectations of gene content ( 40 ), we note that there has been no systematic evaluation of genome quality. Further, the choice of genome inclusion is also dependent on the additional value that a genome brings to an analysis (e.g. increasing phylogenetic coverage). Consequently, we chose to use published genomes that provide a good compromise between phylogenetic coverage and status of genome assembly. Reliance on the use of Pfam-defined domains, while subject to biases in the choice of organisms to generate seed alignments for the definition of domains, nonetheless provides a well-established framework to study domain evolution. However, while future versions of PhyloPro will explore the integration of additional sources of domain predictions, the user should be aware that the current reliance on Pfam definitions may result in errors, such as missed domains, in some descriptions of domain architectures.

Given the recognition of the need for standards ( 18 , 41 ), we anticipate that future updates of PhyloPro will exploit more comprehensive sources of ‘standardized’ genome assemblies that provide comparable accuracies and coverage. Efforts by the ‘Quest for Orthologs’ consortium ( http://questfororthologs.org ) have resulted in some progress ( 42 , 43 ), for example, the development of XML-based file exchange formats (SeqXML, OrthoXML) as well as benchmarks for algorithm comparison. Nevertheless, a range of methods for determining orthology exist and will likely continue to exist given different approaches for optimizing computational efficiency, scalability or for focusing on specific phylogenetic groups differing in characteristics (e.g. homogeneity/diversity, introns, multidomains) ( 10 , 18 ). The availability of large sets of complementary ortholog predictions from tree-based approaches, e.g. PhylomeDB ( 9 ), or integrated in the form of MetaPhOrs ( 10 ), highlights directions for future expansion of PhyloPro to include alternative sources of ortholog prediction as a way of increasing overall accuracy of assignments. Similarly, complementary sources for domain predictions exist, e.g. the NCBI’s Conserved Domain Database ( 44 ), SMART ( 45 ) and InterPro ( 46 ), and represent an opportunity for the incorporation of additional tracks. At the same time, integration of domain architectures into orthology prediction pipelines may offer an additional route to help resolve complex orthology relationships. Such approaches have recently been applied to decrease search space associated with exhaustive sequence comparisons ( 47 ), but have also shown promise in improving homolog assignments ( 48 ). Given that the InParanoid pipeline allows the definition of one–many and many–many orthologous relationships, it may be possible, in future studies, to infer through interrogation of domain architectures which, among the set of inparalogs presented, represents the true ortholog.

Recent discussions have highlighted the potential importance of sequence and domain-based similarity approaches for the inference of functional similarity, compared with tree-based phylogenetic approaches that appear to adhere more closely to the original definition of orthology as a pattern of inheritance ( 14 , 18 ). PhyloPro is an innovative sequence-similarity-based resource that incorporates domain-level information together with tools enabling the exploratory analysis of domain conservation across species. Applied to pathways or complexes, PhyloPro facilitates the rapid identification of core conserved elements of biological processes and potential lineage-specific innovations.
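
The distinction between core conserved elements and lineage-specific innovations can be sketched from presence/absence profiles alone. The example below is a minimal illustration under assumed inputs, not PhyloPro's own procedure: the species panel, lineage labels, orthogroup names and the three-way labelling rule are all hypothetical.

```python
from typing import Dict, Set


def classify_profile(present_in: Set[str], lineage_of: Dict[str, str]) -> str:
    """Label one orthogroup from the species in which an ortholog was detected."""
    detected = present_in & set(lineage_of)  # ignore species outside the panel
    if not detected:
        return "absent"
    if detected == set(lineage_of):
        return "core"  # detected in every species of the panel
    lineages = {lineage_of[species] for species in detected}
    return "lineage-specific" if len(lineages) == 1 else "patchy"


# Hypothetical four-species panel with assumed lineage labels.
lineage_of = {
    "H. sapiens": "Metazoa",
    "D. melanogaster": "Metazoa",
    "S. cerevisiae": "Fungi",
    "A. thaliana": "Viridiplantae",
}
# Hypothetical phylogenetic profiles: orthogroup -> species with an ortholog.
profiles = {
    "og_core": set(lineage_of),
    "og_metazoa": {"H. sapiens", "D. melanogaster"},
    "og_patchy": {"H. sapiens", "S. cerevisiae"},
}
labels = {og: classify_profile(species, lineage_of)
          for og, species in profiles.items()}
```

Because incomplete proteomes can produce apparently missing orthologs, a "patchy" or "lineage-specific" label from such a rule is best read as a hypothesis to inspect rather than as evidence of loss.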

Funding

Canadian Institutes of Health Research (MOP# 84556); the Natural Sciences and Engineering Research Council of Canada (RGPIN-2014-06664). N.L. received a postdoctoral fellowship, in part, through the Hospital for Sick Children Research Training Centre. High-performance computing resources were provided by the SciNet HPC Consortium at the University of Toronto.

Conflict of interest. None declared.

References

1. Pellegrini, Marcotte E.M., Thompson M.J. et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 4285–4288.

2. Szklarczyk, Franceschini, Wyder et al. (2015) STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res., D447–D452.

3. Cheng, Perocchi (2015) ProtPhylo: identification of protein-phenotype and protein-protein functional associations via phylogenetic profiling. Nucleic Acids Res., W160–W168.

4. Vilella A.J., Severin, Ureta-Vidal et al. (2009) EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res., 327–335.

5. Ruan, Chen et al. (2008) TreeFam: 2008 update. Nucleic Acids Res., D735–D740.

6. Powell, Forslund, Szklarczyk et al. (2014) eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res., D231–D239.

7. Chen, Mackey A.J., Stoeckert C.J. Jr. et al. (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res., D363–D368.

8. Altenhoff A.M., Skunca, Glover et al. (2015) The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res., D240–D249.

9. Huerta-Cepas, Capella-Gutierrez, Pryszcz L.P. et al. (2014) PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res., D897–D902.

10. Pryszcz L.P., Huerta-Cepas, Gabaldon (2011) MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res., e32.

11. Schreiber, Patricio, Muffato et al. (2014) TreeFam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Res., D922–D925.

12. Wapinski, Pfeffer, Friedman et al. (2007) Natural history and evolutionary principles of gene duplication in fungi. Nature, 449.

13. Tatusov R.L., Fedorova N.D., Jackson J.D. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics.

14. Gabaldon, Koonin E.V. (2013) Functional and evolutionary implications of gene orthology. Nat. Rev. Genet., 360–366.

15. Wheelan S.J., Marchler-Bauer, Bryant S.H. (2000) Domain size distributions can predict domain boundaries. Bioinformatics, 613–618.

16. Sjolander, Datta R.S., Shen et al. (2011) Ortholog identification in the presence of domain architecture rearrangement. Brief. Bioinform., 413–422.

17. Y.C., Rasmussen M.D., Kellis (2012) Evolution at the subgene level: domain rearrangements in the Drosophila phylogeny. Mol. Biol. Evol., 689–705.

18. Dessimoz, Gabaldon, Roos D.S. et al. (2012) Toward community standards in the quest for orthologs. Bioinformatics, 900–904.

19. Basu M.K., Carmel, Rogozin I.B. et al. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Res., 449–461.

20. Basu M.K., Poliakov, Rogozin I.B. (2009) Domain mobility in proteins: functional and evolutionary implications. Brief. Bioinform., 205–216.

21. Forslund, Sonnhammer E.L. (2012) Evolution of protein domain architectures. Methods Mol. Biol., 856, 187–216.

22. Bornberg-Bauer, Alba M.M. (2013) Dynamics and adaptive benefits of modular protein evolution. Curr. Opin. Struct. Biol., 459–466.

23. Cromar, Wong K.C., Loughran et al. (2014) New tricks for “old” domains: how novel architectures and promiscuous hubs contributed to the organization and evolution of the ECM. Genome Biol. Evol., 2897–2917.

24. Chen, Mackey A.J., Vermunt J.K. et al. (2007) Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One, e383.

25. Finn R.D., Bateman, Clements et al. (2014) Pfam: the protein families database. Nucleic Acids Res., D222–D230.

26. Xiong et al. (2010) The evolutionary landscape of the chromatin modification machinery reveals lineage specific gains, expansions, and losses. Proteins, 2075–2089.

27. Cromar G.L., Xiong, Chautard et al. (2012) Toward a systems level view of the ECM and related proteins: a framework for the systematic definition and analysis of biological systems. Proteins, 1522–1544.

28. Kono, Herrmann, Loughran N.B. et al. (2012) Evolution and architecture of the inner membrane complex in asexual and sexual stages of the malaria parasite. Mol. Biol. Evol., 2113–2132.

29. Xiong, Song et al. (2011) PhyloPro: a web-based tool for the generation and visualization of phylogenetic profiles across Eukarya. Bioinformatics, 877–878.

30. de Hoon M.J., Imoto, Nolan et al. (2004) Open source clustering software. Bioinformatics, 1453–1454.

31. Remm, Storm C.E., Sonnhammer E.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol., 314, 1041–1052.

32. Punta, Coggill P.C., Eberhardt R.Y. et al. (2012) The Pfam protein families database. Nucleic Acids Res., D290–D301.

33. Smedley, Haider, Durinck et al. (2015) The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res., W589–W598.

34. Safran, Chalifa-Caspi, Shmueli et al. (2003) Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res., 142–146.

35. Uliel, Fliess, Unger (2001) Naturally occurring circular permutations in proteins. Protein Eng., 533–542.

36. Fliess, Motro, Unger (2002) Swaps in protein sequences. Proteins, 377–387.

37. Kuzniar, van Ham R.C., Pongor et al. (2008) The quest for orthologs: finding the corresponding gene across genomes. Trends Genet., 539–551.

38. Ruano-Rubio, Poch, Thompson J.D. (2009) Comparison of eukaryotic phylogenetic profiling approaches using species tree aware methods. BMC Bioinformatics, 383.

39. Meader, Hillier L.W., Locke et al. (2010) Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res., 675–684.

40. Simão F.A., Waterhouse R.M., Ioannidis et al. (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 3210–3212.

41. Chain P.S., Grafham D.V., Fulton R.S. et al. (2009) Genomics. Genome project standards in a new era of sequencing. Science, 326, 236–237.

42. Gabaldon, Dessimoz, Huxley-Jones et al. (2009) Joining forces in the quest for orthologs. Genome Biol., 403.

43. Sonnhammer, Gabaldón, Wilter Sousa da Silva et al. (2014) Big data and other challenges in the quest for orthologs. Bioinformatics, 2993–2998.

44. Marchler-Bauer, Derbyshire M.K., Gonzales N.R. et al. (2015) CDD: NCBI's conserved domain database. Nucleic Acids Res., D222–D226.

45. Letunic, Doerks, Bork (2015) SMART: recent updates, new developments and status in 2015. Nucleic Acids Res., D257–D260.

46. Mitchell, Chang H.Y., Daugherty et al. (2015) The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res., D213–D221.

47. Bitard-Feildel, Kemena, Greenwood J.M. et al. (2015) Domain similarity based orthology detection. BMC Bioinformatics, 154.

48. Song, Sedgewick R.D., Durand (2007) Domain architecture comparison for multidomain homology identification. J. Comput. Biol., 496–516.

Author notes

Present address: Hongyan Song, Computing and Communications Services, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada.

Citation details: Cromar,G.L., Zhao,A., Xiong,X. et al. PhyloPro2.0: a database for the dynamic exploration of phylogenetically conserved proteins and their domain architectures across the Eukarya. Database (2016) Vol. 2016: article ID baw013; doi:10.1093/database/baw013

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
