Annotation of functional sites with the Conserved Domain Database

Derbyshire, Myra K.; Lanczycki, Christopher J.; Bryant, Stephen H.; Marchler-Bauer, Aron

doi:10.1093/database/bar058

Abstract

The overwhelming fraction of proteins whose sequences have been collected in comprehensive databases may never be assessed for function experimentally. Commonly, putative function is assigned based on similarity to experimentally characterized homologs, either on the level of the entire protein or for single evolutionarily conserved domains. The annotation of individual sites provides more detailed insights regarding the correspondence between sequence and function, as well as context for the interpretation of sequence variation and the outcomes of experiments. In general, site annotation has to be extracted from the published literature, and can often be transferred to closely related sequence neighbors. The National Center for Biotechnology Information's Conserved Domain Database (CDD) provides a system for curators to record functional (such as active sites or binding sites for cofactors) or characteristic sites (such as signature motifs), which are conserved across domain families, and for the transfer of that annotation to protein database sequences via high-confidence domain matches. Recently, CDD curators have begun to sort-site annotations into seven categories (active, polypeptide binding, nucleic acid binding, ion binding, chemical binding, post-translational modification and other) and here we present a first comparative analysis of sites obtained via domain model matches, juxtaposed with existing site annotation encountered in high-quality data sets. Site annotation derived from domain annotation has the potential to cover large fractions of protein sequences, and we observe that CDD-based site annotation complements existing site annotation in many cases, which may, in part, originate from CDD's curation practice of collecting sites conserved across diverse taxa and supported by evidence from multiple 3D structures.

Introduction

The Conserved Domain Database (CDD) (1) is a manually curated protein annotation resource developed and maintained by the National Center for Biotechnology Information (NCBI). CDD collects a large set of protein and protein domain models, as multiple sequence alignments and derived position-specific score matrices (PSSMs), and uses RPS-BLAST (2), a variant of the widely used PSI-BLAST algorithm (3), to match protein database sequences with these family models. While the majority of models are imported from external sources, the CDD curation team is revisiting larger protein domain superfamilies to establish finer-grained hierarchical classifications that are based on phylogenetic analysis and supported by the published literature, functional annotation, domain architecture and taxonomic distribution. While characterizing individual subfamilies, curators also record conserved functional sites and evidence for those sites, in a way so that sites can be mapped onto protein sequences using pre-computed protein-model alignments as collected in the Conserved Domain Architecture Retrieval Tool (CDART) database (4). CDD-based site annotation is readily visible on Entrez's GenPept summary pages for proteins and in graphical views (Figure 1), and it is being distributed via NCBI's Reference Sequence protein data sets (5). More recently, CDD site annotation is used to verify and rank clusters of interactions observed in 3D structures as presented by the Inferred Biomolecular Interactions Server (IBIS) resource (6), where such clusters can be used to infer interactions for proteins sequence similar to those with known 3D structure. CDD site annotation is also visible in the domain mapping of disease mutations (DMDMs) resource, where it can be contrasted with known disease mutations and polymorphisms (7).

Figure 1.

Open in new tab Download slide

Entrez Protein graphical sequence view for SwissProt sequence P28845.3, gi|118569. At the bottom of the view, site annotation (labeled ‘site Features’) from CDD and as encountered in the original record are visible on top of each other. Note that CDD annotates the homodimerization interface, substrate and cofactor binding sites and active site as relatively large sets of disjoint residue positions. The homodimer interface annotation is not present in the original annotation, but it provides unique labeling of glycosylation sites.

SwissProt, as maintained by the UniProt Knowledgebase (8), is a resource that provides high-quality manually curated annotation of protein sequences. SwissProt-annotated sequences are tracked by NCBI's Entrez protein database, including the site annotation provided by the source data. Here, we present a study that examines a subset of the SwissProt-based sequences tracked by Entrez, namely those already covered by NCBI-curated domain models, and compares site annotations that originate from CDD with annotation originating from SwissProt.

Conserved domain site annotation

The curation of domain models in CDD aims at characterizing protein domain superfamilies as collections of sequence fragments related by common evolutionary descent, organized into multiple sequence alignments and split into subfamilies that reflect ancient gene duplication events and subsequent divergent evolution. Curation of CDD-conserved domain hierarchies has been explained in previous manuscripts (9). Typically, a domain subfamily is created and annotated if it is supported by phylogenetic analysis and contains member sequences from diverse organisms, suggesting an origin several hundred million years in the past. To this end, curators compute and examine sequence tree displays, to select robust branches and will consider taxonomic distribution, domain architecture, protein annotation and existing/external classifications. CDD curators make extensive use of protein 3D structure, when available, as in-house curation tools are tightly coupled to the Entrez 3D structure database Molecular Modeling Database (MMDB) (10) and structure neighboring data computed with Vector Alignment Search Tool (VAST) (11), and the associated 3D viewer Cn3D (12) is the main alignment viewing and editing tool. From examining patterns of sequence conservation, the published literature, and the 3D structures of complexes that may contain proteins interacting with binding partners, curators often notice and record the location of functional sites or motifs characteristic for a domain family. Sites are recorded as addresses on the multiple sequence alignment models that describe the domain family, and this mapping is being transferred into the coordinates of the PSSMs that are used to scan the protein sequence database. From an alignment of a protein sequence to a PSSM, the site coordinates can be again transferred onto the protein sequence itself. This is only done if the mapping of the site is near complete; partially aligned sites are not used to infer sites on protein sequences. Functional sites associated with a domain model are only mapped onto proteins with high-scoring-specific hits to that model. Sites are recorded with a short name, such as ‘active site’ or ‘ATP binding site’. Although common site names are now being selected from a list of pre-defined expressions, the name is stored as free text and can be modified by the curators as they see fit. We have recently started to assign site types and to retrofit existing models with site-type definitions. CDD deliberately picked a small number of seven generic site types, so that the majority of annotations that we will come across can be sorted into the seven types in a straightforward manner. The site types were also selected to match the IBIS classification of interaction sites (6), as CDD curators use IBIS in the curation work flow. Curators pick common site names from a small set of pre-defined and generic options (such as ‘active site’ or ‘dimerization interface’), but also refer to the literature when deciding on a site name, and are free to choose very specific names if deemed appropriate. The site types used in CDD are listed in Table 1.

Table 1.

Site types and names as defined in Conserved Domain Database models and as mapped onto protein sequences in Entrez

Type designation	Examples of common names	Counts
Active	Active site, catalytic site	3300
Polypeptide binding	Dimer interface, oligomer interface	3020
Nucleic acid binding	DNA binding site, RNA binding site	482
Ion binding	Ca binding site, Zn binding site	1500
Chemical binding	ATP binding site, NAD(P) binding site	3310
PTM	Glycosylation site, phosphorylation site	104
Other	Walker A/P-loop, activation loop	4439^a

Type designation	Examples of common names	Counts
Active	Active site, catalytic site	3300
Polypeptide binding	Dimer interface, oligomer interface	3020
Nucleic acid binding	DNA binding site, RNA binding site	482
Ion binding	Ca binding site, Zn binding site	1500
Chemical binding	ATP binding site, NAD(P) binding site	3310
PTM	Glycosylation site, phosphorylation site	104
Other	Walker A/P-loop, activation loop	4439^a

The counts reflect the numbers of site annotations recorded on CDD models in the most recent release, v2.32.

^aNote that sites without any explicit alternative type assignment are flagged ‘other’; as site typing is an ongoing process, this number reflects models that still need to be revisited more than the actual fraction of sites that cannot be sorted into a more specific category.

Open in new tab

Table 1.

Site types and names as defined in Conserved Domain Database models and as mapped onto protein sequences in Entrez

Type designation	Examples of common names	Counts
Active	Active site, catalytic site	3300
Polypeptide binding	Dimer interface, oligomer interface	3020
Nucleic acid binding	DNA binding site, RNA binding site	482
Ion binding	Ca binding site, Zn binding site	1500
Chemical binding	ATP binding site, NAD(P) binding site	3310
PTM	Glycosylation site, phosphorylation site	104
Other	Walker A/P-loop, activation loop	4439^a

Type designation	Examples of common names	Counts
Active	Active site, catalytic site	3300
Polypeptide binding	Dimer interface, oligomer interface	3020
Nucleic acid binding	DNA binding site, RNA binding site	482
Ion binding	Ca binding site, Zn binding site	1500
Chemical binding	ATP binding site, NAD(P) binding site	3310
PTM	Glycosylation site, phosphorylation site	104
Other	Walker A/P-loop, activation loop	4439^a

The counts reflect the numbers of site annotations recorded on CDD models in the most recent release, v2.32.

^aNote that sites without any explicit alternative type assignment are flagged ‘other’; as site typing is an ongoing process, this number reflects models that still need to be revisited more than the actual fraction of sites that cannot be sorted into a more specific category.

Open in new tab

Curators also record evidence together with the conserved site annotation, which is presented to CDD users via conserved domain summary pages. Evidence may be free text comments, references to journal articles or structure evidence, which contains instructions for highlighting a site in a particular 3D structure used in the model, together with a binding partner that exemplifies the biological significance of the site annotation.

Conserved sites are annotated only if it seems reasonable to assume that the site is present in all or nearly all sequence fragments specifically annotated by the respective model. Mapping of sites via homologous relationship will undoubtedly generate false annotation, but that fraction is expected to be small if (1) site annotation is restricted to well-conserved motifs that are linked to the generic function of the domain family, and (2) a conservative procedure is used to qualify a match for mapping sites. Consequently, site annotation in CDD is restricted to sites that tend to be well conserved in divergent evolution. It is evident from Table 1 that relatively few post-translational modification (PTM) sites have been recorded, for example, as these tend to evolve rather quickly and are often not associated with the structurally conserved core segments of conserved domains, which constitute the bulk of CDD's alignment models. The low number of PTM sites is most likely due to the lack of conservation between sites in a single domain model; their annotation would require further fine-grained subfamily classification, as curators only annotate sites that appear conserved in all or nearly all representative sequences of a domain model.

Specific domain hits and site mapping

The collection of domain models in CDD is redundant, as CDD mirrors several external resources. It is quite common to have the same domain family described by models from three or four different sources, and if hierarchical classifications of diverse superfamilies are available, dozens of models may provide overlapping annotation for a particular region on a protein. To deal with this redundancy, CDD presents a simplified default view of domain search results: models describing homologous families are grouped together into superfamily clusters, and the annotation with a superfamily cluster is presented instead of the single model that happened to score the best hit. However, if the highest ranked hit was scored by an NCBI-curated model, and that score exceeds a model-specific threshold, (13) the ‘specific hit’ is presented on top of the superfamily annotation. CDD follows simple rules for mapping site annotations onto protein sequences: functional sites associated with a domain model are only mapped onto proteins with high scoring-specific hits to that model. If only a superfamily annotation is shown, but if the set of redundant hits includes an NCBI-curated model, site annotation is mapped from the root node of the conserved domain hierarchy that model came from—annotating only the most generic sites that are presumed conserved across the entire superfamily.

Methods

A subset of NCBI's Entrez protein sequence records contain site annotation provided by the originating source database. For the analysis presented here, we chose to use sequences that are flagged as originating from SwissProt. Sixty-six percent of all SwissProt sequences in Entrez/protein had site annotation from some source; two-thirds of these had hits to specific CDD-curated domain models; ∼45% of all SwissProt records had such specific hits. We focused the analysis on the latter, SwissProt sequences that had specific domain annotation from CDD, meaning that at least one sequence region comes with high-confidence identification of a conserved domain, which may also include mapped site annotation. This restricts the analysis to a set of protein domain families that have undergone curation by CDD staff to date, and it results in 233,722 sequences (as of September 2011). Site annotations in those sequences were collected, including the site type assigned in each case. Pre-existing (non-CDD) site annotation, which was interpreted as stemming from the SwissProt curation effort, is categorized into a larger set of 12 site types in Entrez, which reflects the site typing undertaken by curatorial staff at the source database, while CDD-based site annotation uses the 7 types outlined in Table 1. We defined two sites from different sources as overlapping if they shared one or more residue coordinate on the protein sequence. In the analysis presented below, we did not try to map site types between CDD and Entrez/protein.

Results and conclusions

CDD maps site annotation onto several million proteins in Entrez. Figure 2 presents the site annotation coverage for the subset analyzed in this manuscript.

Figure 2.

Open in new tab Download slide

The 233 722 protein sequences we analyzed can be categorized based on the source of site annotation. A small number, 1.32% of the SwissProt sequences with specific hits to NCBI-curated domain models, do not have any site annotation. The 1.62% have site annotation only from SwissProt, and 11.16% have CDD site annotation that appears redundant (overlaps with existing SwissProt annotation). For the remaining 85.9%, CDD provides some unique site annotation, and for about one-third of the sequences CDD provides the only site annotation.

It seems evident that CDD site annotations contribute to a large fraction of the proteins that are covered by the current curation effort. Of the 1 491 437 individual site annotations we tracked, just a little more than half (53.3%) came from mapping of CDD sites, and they are spread across 97% of the sequences in the set, reflecting the fact that the majority of NCBI-curated domain models do also come with functional site annotation. In more than half of the proteins, some or all of the CDD annotation overlaps with annotation provided by SwissProt, but CDD also contributes unique sites, and sometimes the only site annotation available at this point. Figures 3 and 4 detail the distributions of site annotations according to the assigned site type, for CDD and SwissProt, accordingly.

Figure 3.

$The 794 228 site annotations on protein sequences we analyzed, which were generated via mapping to CDD models, can be categorized based on the site type assigned by CDD. A large fraction of sites is assigned type ‘0’ or ‘other’, as the typing of all previously recorded sites has not been completed. These are not shown here. CDD annotates only a small number of PTM sites, as these are rarely conserved across somewhat diverse domain families. The bars are colored according to the overlap with SwissProt sites (irrespective of the SwissProt site type). It appears that polypeptide-binding sites, those conferring protein–protein interactions, are most often uniquely annotated by CDD.$

Open in new tab Download slide

The 794 228 site annotations on protein sequences we analyzed, which were generated via mapping to CDD models, can be categorized based on the site type assigned by CDD. A large fraction of sites is assigned type ‘0’ or ‘other’, as the typing of all previously recorded sites has not been completed. These are not shown here. CDD annotates only a small number of PTM sites, as these are rarely conserved across somewhat diverse domain families. The bars are colored according to the overlap with SwissProt sites (irrespective of the SwissProt site type). It appears that polypeptide-binding sites, those conferring protein–protein interactions, are most often uniquely annotated by CDD.

Figure 4.

Open in new tab Download slide

The 697 209 site annotations encountered on the protein sequences we analyzed, which originate from the SwissProt curation effort, categorized based on the site type assigned in Entrez/protein. The bars are colored according to the overlap with CDD-generated sites (irrespective of the CDD site type). It appears that PTM sites, those summarized under the ‘modified’ and ‘glycosylation’ types, are most often uniquely annotated by SwissProt.

While there is a large degree of overlap between CDD-generated site annotation and SwissProt-generated annotation, we notice that the two data sources also complement each other to a certain degree; for ∼33% of the SwissProt sequences with specific CDD domain annotation, CDD provides the only site annotation. Individual sequence curation—and inference of sites between close homologs—can record the presence of functional sites that are not conserved across more diverse families. The comparative analysis of protein 3D structure complexes, on the other hand, enables CDD curators to record the positions of interfaces with which macromolecules interact, including homo and hetero-oligomerization interfaces. It may be helpful to consider both sources of annotation in the study of protein function and the design of experiments, so as to benefit from curation work approaching the issue from different angles.

The strength of CDD's approach is that conserved sites can be annotated on large numbers of protein sequences with relatively little effort, as a single model may provide ‘specific domain hits’ to hundreds or thousands of protein sequences. Naturally, this will also lead to a higher incidence of false positive annotation. We are in the process of implementing curation software that allows for conditional functional sites: curators will be able to specify the amino acid residue types that are allowed in selected positions of a functional site. Consequently, sites will be only mapped onto sequences if the site address matches such a defined sequence motif that is associated with known or proven function. While this is expected to reduce the incidences of false annotation, it will be particularly useful for annotating sites that are known as not strictly conserved across all sequences that define a domain family, such as PTM sites.

Feedback with respect to inaccurate site annotation or supporting and conflicting experimental evidence is welcome and concerns can be addressed efficiently via the CDD curation pipeline.

Funding

This work was funded by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health/DHHS. Funding for open access charge: Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS. Comments, suggestions, and questions are welcome and should be directed to: info@ncbi.nlm.nih.gov.

Conflict of interest. None declared.

Acknowledgements

We thank the Conserved Domain Curators for compiling the site annotations analyzed in this work, Farideh Chitsaz, Noreen Gonzales, Marc Gwadz, Fu Lu, Gabriele Marchler, James Song, Narmada Thanki, Roxanne Yamashita, Chanjuan Zheng, as well as the CDD alumni Anastasia Nikolskaya, Raja Mazumder, Natalie Fedorova, Aviva Jacobs, B. Sridhar Rao, Sona Vasudevan, Luning Hao, Jodie Yin, Dmitri Krylov, Asba Tasneem, Zhaoxi Ke, Mikhail Mullokandov, Marina Omelchenko, John Jackson, John Anderson, Cynthia Robertson and Carol DeWeese-Scott. We thank Renata C. Geer for assistance with preparing figures.

References

1

Marchler-Bauer

A

,

Lu

S

,

Anderson

JB

, et al.

CDD: a Conserved Domain Database for the functional annotation of proteins

,

Nucleic Acids Res.

,

2011

, vol.

39

Database Issue

(pg.

D225

-

D229

)

2

Marchler-Bauer

A

,

Panchenko

AR

,

Shoemaker

BA

, et al.

CDD: a database of conserved domain alignments with links to three-dimensional structure

,

Nucleic Acids Res.

,

2002

, vol.

30

(pg.

281

-

283

)

3

Altschul

SF

,

Madden

TL

,

Schäffer

AA

, et al.

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

,

Nucleic Acids Res.

,

1997

, vol.

25

(pg.

3389

-

3402

)

4

Geer

LY

,

Domrachev

M

,

Lipman

DJ

, et al.

CDART: protein homology by domain architecture

,

Genome Res.

,

2002

, vol.

12

(pg.

1619

-

1623

)

5

Pruitt

KD

,

Tatusova

T

,

Klimke

W

, et al.

NCBI Reference Sequences: current status, policy and new initiatives

,

Nucleic Acids Res.

,

2009

, vol.

37

Database Issue

(pg.

D32

-

D36

)

6

Shoemaker

BA

,

Zhang

D

,

Thangudu

RR

, et al.

Inferred Biomolecular Interaction Server – a web server to analyze and predict protein interacting partners and binding sites

,

Nucleic Acids Res.

,

2010

, vol.

38

Database issue

(pg.

D518

-

D524

)

7

Peterson

TA

,

Aladey

A

,

Santana-Cruz

I

, et al. ,

Bioinformatics

, vol.

26

(pg.

2459

-

2459

)

8

Magrane

M

,

UniProt Consortium

.

UniProt Knowledgebase: a hub of integrated protein data

,

Database

,

2011

, vol.

2011

March 29 2011, doi:10.1093/database/bar009

Google Scholar

OpenURL Placeholder Text

WorldCat

9

Marchler-Bauer

A

,

Anderson

JB

,

Cherukuri

PF

, et al.

CDD: a Conserved Domain Database for protein classification

,

Nucleic Acids Res.

,

2005

, vol.

33

Database Issue

(pg.

D192

-

D196

)

10

Wang

Y

,

Addess

KJ

,

Chen

J

, et al.

MMDB: annotating protein sequences with Entrez's 3D-structure database

,

Nucleic Acids Res.

,

2007

, vol.

35

Database Issue

(pg.

D298

-

D300

)

11

Gibrat

JF

,

Madej

T

,

Bryant

SH

.

Surprising similarities in structure comparison

,

Curr. Opin. Struct. Biol.

,

1996

, vol.

6

(pg.

377

-

385

)

12

Wang

Y

,

Geer

LY

,

Chappey

C

, et al.

Cn3D: sequence and structure views for Entrez

,

Trends Biochem. Sci.

,

2000

, vol.

25

(pg.

300

-

302

)

13

Fong

J

,

Marchler-Bauer

A

.

Protein subfamily assignment using the Conserved Domain Database

,

BMC Res. Notes

,

2008

, vol.

1

pg.

114

Published by Oxford University Press on behalf of US Government 2012.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
December 2016	2
January 2017	2
February 2017	3
March 2017	5
April 2017	10
May 2017	3
June 2017	3
July 2017	2
August 2017	4
October 2017	5
November 2017	1
December 2017	20
January 2018	19
February 2018	7
March 2018	17
April 2018	18
May 2018	15
June 2018	18
July 2018	17
August 2018	26
September 2018	15
October 2018	6
November 2018	26
December 2018	12
January 2019	10
February 2019	18
March 2019	26
April 2019	27
May 2019	34
June 2019	15
July 2019	20
August 2019	22
September 2019	29
October 2019	20
November 2019	29
December 2019	24
January 2020	23
February 2020	25
March 2020	11
April 2020	31
May 2020	23
June 2020	24
July 2020	26
August 2020	13
September 2020	9
October 2020	5
November 2020	16
December 2020	22
January 2021	9
February 2021	6
March 2021	22
April 2021	18
May 2021	18
June 2021	16
July 2021	7
August 2021	5
September 2021	10
October 2021	10
November 2021	25
December 2021	8
January 2022	15
February 2022	5
March 2022	9
April 2022	11
May 2022	14
June 2022	18
July 2022	6
August 2022	13
September 2022	9
October 2022	23
November 2022	13
December 2022	10
January 2023	17
February 2023	7
March 2023	5
April 2023	9
May 2023	3
June 2023	3
July 2023	5
August 2023	12
September 2023	5
October 2023	1
November 2023	14
December 2023	17
January 2024	16
February 2024	28
March 2024	14
April 2024	10

Article Contents

Annotation of functional sites with the Conserved Domain Database

Abstract

Introduction

Conserved domain site annotation

Specific domain hits and site mapping

Methods

Results and conclusions

Funding

Acknowledgements

References

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Annotation of functional sites with the Conserved Domain Database

Abstract

Introduction

Conserved domain site annotation

Specific domain hits and site mapping

Methods

Results and conclusions

Funding

Acknowledgements

References

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only