Gramene QTL database: development, content and applications

Ni, Junjian; Pujar, Anuradha; Youens-Clark, Ken; Yap, Immanuel; Jaiswal, Pankaj; Tecle, Isaak; Tung, Chih-Wei; Ren, Liya; Spooner, William; Wei, Xuehong; Avraham, Shuly; Ware, Doreen; Stein, Lincoln; McCouch, Susan

doi:10.1093/database/bap005

Abstract

Gramene is a comparative information resource for plants that integrates data across diverse data domains. In this article, we describe the development of a quantitative trait loci (QTL) database and illustrate how it can be used to facilitate both forward and reverse genetics research. The QTL database contains the largest online collection of rice QTL data in the world. Using flanking markers as anchors, QTLs originally reported on individual genetic maps have been systematically aligned to the rice sequence where they can be searched as standard genomic features. Researchers can determine whether a QTL co-localizes with other QTLs detected in independent experiments and can combine data from multiple studies to improve the resolution of a QTL position. Candidate genes falling within a QTL interval can be identified and their relationship to particular phenotypes can be inferred based on functional annotations provided by ontology terms. Mutations identified in functional genomics populations and association mapping panels can be aligned with QTL regions to facilitate fine mapping and validation of gene–phenotype associations. By assembling and integrating diverse types of data and information across species and levels of biological complexity, the QTL database enhances the potential to understand and utilize QTL information in biological research.

Introduction

Gramene is a comparative genome database for plants that integrates information about genetic and physical maps, sequences, markers, germplasm resources, genes, proteins, pathways and phenotypes (1–3). Users can browse or query the database to discover relationships between genes and phenotypes of interest. They can draw on data from multiple plant species to compare and contrast the characteristics of genes, genomes, pathways and phenotypes.

This article describes the annotation of quantitative trait loci (QTL) within Gramene and illustrates the way this resource can be used to identify genes and regulatory sequences underlying QTLs using both forward and reverse genetics approaches. The QTL database is a one-of-a-kind resource; it contains the largest online collection of rice QTL data in the world and serves as a repository for the international rice research community. It facilitates the investigation of QTLs in co-linear regions in other cereals and enables researchers to identify sequences and QTLs associated with similar traits or phenotypes across a wide range of plant species through the use of controlled vocabularies. By providing an integrated set of tools, it offers plant biologists and geneticists a way of exploring the relationship between genome variation and complex forms of phenotypic variation (4–7).

In Gramene, a QTL is identified as a region of the genome that is predicted to contain a gene or genes associated with a specific trait. QTL mapping involves the analysis of a population(s) where individual plants, lines or families within the population have been characterized for a set of well-distributed molecular marker polymorphisms as well as for one or more quantitative traits. A QTL is declared when there is a statistical association between the segregation of a molecular polymorphism(s) and a measurable phenotype, using the individual segregants within the population as replicates. The phenotype of interest may be a feature of the whole plant, an organ or a tissue, or it may be characterized as a feature associated with the DNA, RNA or protein. The objective of QTL analysis is to identify the position and relative importance of genetic factors that collectively determine a trait or a phenotype of interest (6,8).

QTL curation contributes to the functional annotation of the rice genome. QTLs, in effect, function as genomic placeholders; they flag positions in the genome that harbor genes underlying traits of interest. QTL mapping has been particularly relevant to the agricultural community because it provides a way of genetically dissecting quantitative variation found in naturally occurring germplasm resources and offers insight into the linkage and epistatic relationships among genes and QTLs controlling diverse traits of interest. Furthermore, plant breeders are able to make direct use of QTL results for marker-assisted selection in breeding programs. QTL analysis is also used by molecular geneticists as a first step in map-based cloning studies and it provides quantitative geneticists and evolutionary biologists with a global view of gene network architecture, allowing them to identify key rate-limiting steps associated with quantitative variation (5,7). By reducing the search space, QTL information makes it easier to identify individual genes underlying quantitative traits and provides global information about the location and relative importance of each genetic factor (4,9). QTLs are highly informative because they have integrative power to connect diverse domains of information in plant biology. This is vital to understanding the biology of complex traits and serves a critical function in database curation and design.

One of the most pressing reasons to curate QTLs is that, in recent years, major resources have been invested in plant and animal QTL research worldwide; this has generated a large volume of QTL information in the published literature. The value of an information resource or central repository that can assemble and integrate QTL information across species and levels of biological complexity is underscored by the fact that the information is reported in different formats, and the data are highly heterogeneous and often fragmentary. Public databases have a responsibility to harness as much information as possible and organize it into useful online resources for use by diverse research communities (8,10,11).

A majority of the rice QTLs published between 1994 and 2007 have been curated from the literature and are currently available in the Gramene database. This was accomplished by extracting phenotypic information from highly heterogeneous textual descriptions and codifying it, using structured vocabularies and ontologies (2,10,11). Meanwhile, the thousands of diverse molecular markers that were used to map QTLs in hundreds of different populations were associated with sequences that could be aligned to the rice genome. As a result, the complex relationships between phenotypes and genotypes encoded by each QTL have now been organized into a network of properties and information in the context of a genome that is searchable and retrievable within the Gramene database.

Materials and methods

QTL data sources

Literature

To identify publications reporting rice QTLs, Gramene curators queried public library databases including PubMed, Agricola, BIOSIS Previews and CAB abstracts using keywords such as ‘Oryza/rice and quantitative trait loci/QTL’. After screening search results for relevance, each publication to be curated was assigned a unique reference ID and all curated QTLs were referenced to the original paper. All in all, between 1994 and 2007, thousands of research papers were scanned for QTL data and deposited in the Gramene literature database. The QTL data extracted from the literature included QTL names, symbols, traits, associated co-localized and neighboring markers, parental strains, types of crosses and other pertinent information.

Information to be entered into the database was extracted based on a set of priorities. The top priorities included information required to establish the genome position of a QTL, and information describing the trait or phenotype associated with each QTL. Trait descriptions were mapped to controlled vocabularies including the Trait Ontology (TO), Plant Ontology (PO) and Plant Growth Stage Ontology (GSO) (10–12). A detailed standard operating procedure (SOP) for QTL curation designed for curators and researchers who are at the initial stages of setting up a QTL database is available at http://ascus.plbr.cornell.edu/~gramene/qtl_ms/sop/.

Integration of QTLs from MaizeGDB and GrainGenes

In collaboration with MaizeGDB (http://www.maizegdb.org/) and GrainGenes (http://wheat.pw.usda.gov/GG2/index.shtml), QTLs from maize, wheat, barley and oats, originally curated by those databases (13,14), have been integrated into Gramene. The schemas of those two databases and Gramene were compared, and the correspondences between data entries were identified. Additional curation was performed as necessary to meet standards set by Gramene.

Defining a single QTL in Gramene

Each experimentally defined QTL is treated as an independent entity in Gramene. This is designed to reduce the number of tracking/merging issues confronted by the curators and it ensures that QTLs are treated as statistical hypotheses rather than as confirmed genetic entities. In this way, if QTLs were reported in two different papers (even from the same group and with similar genetic materials), or if two QTLs were detected for the same trait in the same location but based on experiments in different years or locations, or if QTLs were identified by different QTL analysis methods (e.g. one-way Analysis of variance (ANOVA) versus Composite Interval Mapping), each QTL citation is curated as an independent entity, even where the authors reported them as a single QTL. This allows database users to weigh the evidence for the existence of a QTL based on the data; they can assess the number of times a putative QTL is reported in a similar location in different experiments. This approach avoids the assumption that a single genetic factor is inevitably responsible for a particular phenotype, leaving room for confirmation once a specific gene or functional allele(s) has been cloned and characterized.
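Because each report is stored as an independent entity, the weight of evidence for a putative QTL can be estimated by counting overlapping reports. The following Python sketch illustrates one way a user might do this; it is not Gramene's implementation. The record layout, field names, accessions and positions are hypothetical, and 'similar location' is simplified to interval overlap on the same chromosome for the same trait.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QtlReport:
    """One independently curated QTL report (hypothetical record layout)."""
    accession: str      # illustrative accession, not a real Gramene entry
    trait: str          # trait symbol, e.g. "PTHT" for plant height
    chromosome: int
    start: float        # interval start, in one shared coordinate system
    stop: float         # interval stop, with start <= stop
    reference_id: str   # illustrative identifier of the source paper


def overlaps(a: QtlReport, b: QtlReport) -> bool:
    """True when two reports share trait and chromosome and their intervals intersect."""
    return (
        a.trait == b.trait
        and a.chromosome == b.chromosome
        and a.start <= b.stop
        and b.start <= a.stop
    )


def supporting_reports(query: QtlReport, reports: List[QtlReport]) -> List[QtlReport]:
    """Return the other reports that overlap the query QTL."""
    return [r for r in reports if r.accession != query.accession and overlaps(query, r)]


# Illustrative data only.
reports = [
    QtlReport("AQA001", "PTHT", 1, 100.0, 200.0, "ref1"),
    QtlReport("BQB002", "PTHT", 1, 150.0, 250.0, "ref2"),
    QtlReport("CQC003", "PTHT", 1, 300.0, 400.0, "ref3"),
    QtlReport("DQD004", "PTHT", 2, 150.0, 250.0, "ref4"),
    QtlReport("EQE005", "GRYLD", 1, 150.0, 250.0, "ref5"),
]
support = supporting_reports(reports[0], reports)
print([r.accession for r in support])
```

In this toy dataset only the second report supports the first: the others differ in position, chromosome or trait.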

QTL nomenclature

Each QTL is assigned a unique accession identifier in Gramene, e.g. ‘NQA001’. In this case, the first character (N) serves as a tracking mechanism to identify the curator who handled the information; Q stands for QTL; the third character (A) indicates the reference; and the following three digits are used to distinguish different QTLs from the same publication. This nomenclature is useful because it provides a mechanism for tracking all QTL/sets of QTL managed by a particular curator or all QTLs that correspond to a particular study or paper. This system also addresses management issues within the schema and most importantly provides a mechanism that allows users to query and download QTLs from the database.
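The accession layout described above can be sketched as a small parser. This Python example assumes exactly the layout given in the text (one curator character, the letter Q, one reference character, three digits); accessions in the live database may not all follow this simplified pattern. The function and field names are hypothetical.

```python
import re
from typing import NamedTuple


class QtlAccession(NamedTuple):
    curator_code: str    # first character: tracks the curator
    reference_code: str  # third character: indicates the reference
    serial: int          # three digits: distinguishes QTLs from one publication


# Assumed layout: <curator letter> Q <reference letter> <three digits>
_ACCESSION_PATTERN = re.compile(r"([A-Z])Q([A-Z])(\d{3})")


def parse_qtl_accession(accession: str) -> QtlAccession:
    """Split an accession such as 'NQA001' into its documented parts."""
    match = _ACCESSION_PATTERN.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a QTL accession in the assumed layout: {accession!r}")
    curator, reference, serial = match.groups()
    return QtlAccession(curator, reference, int(serial))


parsed = parse_qtl_accession("NQA001")
print(parsed.curator_code, parsed.reference_code, parsed.serial)
```

Grouping parsed accessions by `curator_code` or `reference_code` gives the per-curator and per-study tracking described above.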

Curation of genetic information in a QTL study

Curation of genetic information associated with QTLs is done in two stages in Gramene: QTL map data are first compiled, and QTL intervals are then assigned, as outlined below.

QTL map data

Information about a QTL map is filled out by a curator, who first verifies whether the same population and genetic map information may have been used in a prior dataset that has already been curated in Gramene. If this information is already present in the database, it must be checked to see whether there were any modifications to the Map Set between studies. To determine this, the curator must compare the information in both studies to determine whether they are identical in terms of population size and structure, male and female parental identity, type and number of markers and the corresponding map display. If the Map Set is identical to a previously curated Map Set, the data already in the database are used as the reference map. If the details of the Map Set are significantly different, a new data file is created based on the current paper. Discrete positions for all markers on the map must be entered into a map file. In many cases, a published paper may not report the exact map position/interval distances for all markers. In such cases the curator contacts the corresponding or first author of the paper to obtain the required information.
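The Map Set comparison described above is a manual curation step, but its logic can be sketched in Python. The sketch below is illustrative only: the record fields mirror the criteria listed in the text, the names and values are hypothetical, and 'significantly different' is simplified to strict inequality on any field, which a curator would apply with more judgment.

```python
from dataclasses import dataclass, fields
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class MapSet:
    """Map Set attributes compared between studies (hypothetical layout)."""
    population_size: int
    population_structure: str   # e.g. "RIL" or "F2"
    female_parent: str
    male_parent: str
    marker_types: FrozenSet[str]
    # marker name -> (linkage group, position in cM)
    marker_positions: Dict[str, Tuple[int, float]]


def differing_fields(curated: MapSet, candidate: MapSet) -> List[str]:
    """List the attributes on which two Map Sets disagree."""
    return [
        f.name
        for f in fields(MapSet)
        if getattr(curated, f.name) != getattr(candidate, f.name)
    ]


def can_reuse_reference_map(curated: MapSet, candidate: MapSet) -> bool:
    """Reuse the curated map only when no attribute differs (simplifying assumption)."""
    return not differing_fields(curated, candidate)


# Illustrative data only.
existing = MapSet(120, "RIL", "ParentA", "ParentB", frozenset({"SSR"}),
                  {"mk1": (1, 0.0), "mk2": (1, 12.5)})
same = MapSet(120, "RIL", "ParentA", "ParentB", frozenset({"SSR"}),
              {"mk1": (1, 0.0), "mk2": (1, 12.5)})
larger = MapSet(180, "RIL", "ParentA", "ParentB", frozenset({"SSR"}),
                {"mk1": (1, 0.0), "mk2": (1, 12.5)})
print(can_reuse_reference_map(existing, same))
print(differing_fields(existing, larger))
```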

Assignment of QTL intervals

The second important type of information extracted from a published paper is the QTL interval. This interval represents the search space associated with the QTL and within which any gene underlying the QTL is expected to reside. If the QTL interval has been clearly determined (i.e. the paper specifies that the QTL extends from marker a to b, or from cM x to cM y on the map), the positions of the interval for the QTL are used as the feature start (upper) and feature stop (lower) positions along the chromosome. If the QTL interval was not clearly delineated in the paper, but a ‘peak marker’ was identified, the positions of the two most closely linked flanking markers (on either side of the QTL peak) are used as the start and stop positions for that QTL (the upper marker position as start position, and lower marker as stop position). If only a single marker is mentioned in association with a QTL (i.e. results from single point analysis), that marker's position alone is used as both the start and stop position for the QTL. The linkage group or chromosome number to which a particular QTL is mapped is also included in the data table.
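The three interval rules above form a simple order of precedence, sketched here in Python. This is a minimal illustration, not the curators' tooling: the function and parameter names are hypothetical, positions are taken to be cM on one linkage group, and the 'upper' position is assumed to be the smaller cM value.

```python
from typing import Optional, Tuple


def assign_qtl_interval(
    reported_interval: Optional[Tuple[float, float]] = None,
    flanking_markers: Optional[Tuple[float, float]] = None,
    single_marker: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (start, stop) in cM following the precedence described in the text."""
    if reported_interval is not None:
        # Rule 1: the paper states the interval explicitly.
        start, stop = sorted(reported_interval)
    elif flanking_markers is not None:
        # Rule 2: only a peak is given, so use the two closest flanking markers.
        start, stop = sorted(flanking_markers)
    elif single_marker is not None:
        # Rule 3: single point analysis, start and stop coincide.
        start = stop = single_marker
    else:
        raise ValueError("no positional information available for this QTL")
    return (start, stop)


print(assign_qtl_interval(reported_interval=(35.0, 12.5)))
print(assign_qtl_interval(flanking_markers=(10.0, 20.0)))
print(assign_qtl_interval(single_marker=42.0))
```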

Anchoring of genetic intervals to genomic sequence

The mapping of rice QTLs to genomic positions has been standardized by Gramene as follows. QTLs are defined by flanking or closely linked molecular markers identified from the original published QTL map. When the markers are restriction fragment length polymorphisms (RFLPs), their nucleotide sequence is obtained and BLAT (BLAST-like alignment tool) (15) is run to obtain the genome position of each RFLP marker. If the markers are microsatellites, or simple sequence repeats (SSRs), the primer sequences are obtained and e-PCR is performed to determine the marker positions on the sequenced genome. In cases where a critical RFLP or SSR marker cannot be mapped unequivocally to the rice genome, but one or more other markers in the interval associated with a particular QTL can be mapped to the expected genomic region, the mappable marker is used to anchor the QTL to the genome and its position defines the location of the QTL on the genome map.
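The anchoring logic, including the fallback for unmappable markers, can be sketched as follows. The Python example assumes that marker positions have already been obtained (for instance from BLAT or e-PCR output) and are supplied as a dictionary containing only unambiguously mapped markers on the QTL's chromosome. The function name, marker names and coordinates are hypothetical, and this is an illustration rather than Gramene's pipeline.

```python
from typing import Dict, Optional, Sequence, Tuple


def anchor_qtl(
    flanking: Tuple[str, str],
    interval_markers: Sequence[str],
    genome_positions: Dict[str, int],
) -> Optional[Tuple[int, int]]:
    """Project a QTL onto genome coordinates (bp) using its markers.

    flanking: names of the upper and lower flanking markers.
    interval_markers: names of other markers lying in the QTL interval.
    genome_positions: marker name -> bp position, unambiguous hits only.
    """
    upper, lower = flanking
    if upper in genome_positions and lower in genome_positions:
        # Preferred case: both flanking markers are placed on the genome.
        anchors = [genome_positions[upper], genome_positions[lower]]
    else:
        # Fallback: use whichever markers of the interval could be mapped.
        candidates = (upper, lower, *interval_markers)
        anchors = [genome_positions[m] for m in candidates if m in genome_positions]
    if not anchors:
        return None  # the QTL cannot be anchored to the sequence
    return (min(anchors), max(anchors))


# Illustrative positions only.
positions = {"mkA": 1000, "mkB": 5000, "mkC": 3000}
print(anchor_qtl(("mkA", "mkB"), ["mkC"], positions))
print(anchor_qtl(("mkX", "mkB"), ["mkC"], positions))
print(anchor_qtl(("mkX", "mkY"), ["mkC"], positions))
```

When only one marker can be placed, the start and stop coincide, which matches the single-marker case described for genetic intervals.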

Curation of phenotypic information in a QTL study

The curators at Gramene have established protocols for extracting and encoding the highly heterogeneous phenotypic information associated with QTL studies using a combination of ontologies, evidence codes and free text assignments. These protocols are outlined below.

Development of vocabularies to describe plant phenotype

Gramene curators are developers and collaborators of the Plant Ontology Consortium (POC) (http://www.plantontology.org/). The POC develops and maintains controlled vocabularies or ontologies for plant anatomy (PO) and plant growth and development stages (GSO) for the purpose of annotation (10,11,16). Additional vocabularies/ontologies used in Gramene, such as the Environmental Ontology (EO) and Trait Ontology (TO), were developed in-house. The TO (12) is based on phenotypic assays and vocabulary used widely in the agronomy and plant breeding communities. These ontologies are under continuous maintenance, with new terms added as curators or users request them for annotation of new phenotypic traits.

Association to ontology terms

Each QTL is associated with a trait name corresponding to a term in the TO. Trait symbols are derived from the trait name, e.g. plant height carries the symbol PTHT, and are used to display the QTL position on a linkage group. The QTL is displayed as a feature on a map in the comparative map viewer, CMap (http://gmod.org/wiki/CMap) and in the rice genome browser. There are nine trait categories corresponding to agronomic or plant breeding classifications, and they include Yield, Grain Quality, Biotic Stress, Abiotic Stress, Sterility/Fertility, Vigor, Anatomy, Development and Biochemistry.

The published symbol for the QTL corresponds to the name or symbol described by the author in the original publication. Each QTL is also associated with an anatomical portion of the plant and a specific growth or developmental stage corresponding to the organ or tissue and time of development in which the trait was evaluated. In Gramene, terms from PO (10) and GSO (11) are used to describe these anatomical features and developmental stages. In addition, the environmental conditions and any supplemental treatment(s) used to determine the phenotype are recorded using terms from the EO.

Use of evidence codes

In Gramene, all QTL annotations are supported by the use of evidence codes. These codes indicate what data are available in the literature to support a variety of inferences, including ‘Inferred by association of genotype from phenotype’ (IAGP), ‘Inferred by curator’ (IC) and ‘Starting material’ (SM). In addition to the ones developed by the Gene Ontology database, a few evidence codes have been developed in-house. Evidence codes are also used to denote specific associations to ontology terms based on information in the published paper describing the QTL.

Database schema

The schema developed for the QTL database and all QTL data are downloadable from the Gramene ftp site: ftp://ftp.gramene.org/pub/gramene/CURRENT_RELEASE/data/database_dump/mysql-dumps.

Results

Phenotypes associated with QTLs are evaluated using specific genetic materials that provide contrasting phenotypic states. Historically, bi-parental populations derived from controlled crosses were used for identifying QTLs, but increasingly QTLs are being identified via association or linkage disequilibrium mapping (17). In any QTL study, specific traits or phenotypic features are assayed under a defined set of environmental conditions. A genetic study describing the relationship between phenotype and genotype will embody a series of interactions among loci and alleles in the genetic background, as well as interactions between genotype and the environment(s) in which the population is assayed. The curator tackles the complex problem of describing the relationship among each different data element (gene, allele, genetic population, phenotype and environment) and linking each element to all other entries in the database using a combination of bioinformatic tools, ontologies and free text.

QTL data acquisition and significance

The first plant QTL paper was published in 1988 (18), and since that time thousands of plant QTL studies have been published, including 617 papers reporting rice QTLs, 454 on wheat and 364 on maize. Figure 1 summarizes the number of QTL papers cited for rice, wheat, maize, tomato and Arabidopsis in the four major reference databases over the last 21 years. In Gramene, more than 230 papers have been curated for rice. This number is only 38% of the total number of rice publications because the Gramene curators impose certain requirements on what is to be included in the database. Their criteria include: (i) availability of mapset information; (ii) use of sequence-based markers for anchoring QTL to the genome (amplified fragment length polymorphism (AFLP) and random amplification of polymorphic DNA (RAPD) markers, for example, cannot be aligned); (iii) availability of published information in English. After eliminating papers that did not meet these criteria, the number of qualifying rice QTL papers was reduced to just under 40% of the total.

Figure 1.

Number of published QTL papers for rice, wheat, maize, Arabidopsis and tomato between 1987 and 2007. Graph showing the steady increase in publications reporting QTLs in five major plant species between 1987 and 2007 based on nonredundant data from four publicly available literature databases, PubMed, Agricola, CAB Abstracts and BIOSIS Previews.

QTL statistics in Gramene Build 28

The QTL module in Gramene contributes to the functional annotation of the rice genome, a process that involves continuous layering of information onto the sequenced genome, by delineating the genome into thousands of specific regions that have a high probability of containing genes controlling quantitative traits. Active curation of QTLs in the Gramene database began in 2003 and since then more than 11 000 QTLs have been curated belonging to rice, pearl millet, foxtail millet, maize, wheat, barley and oat (Table 1). The number of QTLs curated for rice is much higher than those from other species, reflecting Gramene's priority to focus on rice as the first sequenced crop genome and corresponding to the higher rate of QTL identification in rice. The number of rice QTLs that have been projected onto each of the 12 rice chromosomes is given in Table 2. The rapid increase in the number of published QTL papers (Figure 1) requires a consistent effort to source this information, integrate it into the database and keep it current. QTL curation is labor intensive and there are few experts prepared to handle this aspect of data curation, so information is extracted in phases and new methods are being developed to help automate this procedure. Extensive quality control protocols have been developed and put into practice to ensure accuracy of the curated data.

Table 1.

Summary of QTL and associated features in the Gramene database for 10 cereal species

Updated based on build 28.	Rice	Maize	Wheat	Tetraploid wheat	Oat	Barley	Sorghum	Pearl millet	Foxtail millet	Wild rice	Total
QTLs	8646	1747	23	8	375	299	136	284	65	41	11624
TO: traits^a	237	77	10	3	7	30	19	27	2	10	332
TO: trait categories^b	9	8	3	2	5	7	5	6	1	5	9
PO: structure^c	38	19	5	2	6	8	7	15	2	5	48
PO: growth-stage^d	19	9	6	2	5	5	7	7	2	5	20
Map sets^e	89	8	9	2	1	8	2	5	1	1	126
Parental germplasm^f	91	4	4	2	2	7	3	12	2	2	108
Co-localized markers^g	30950	3615	73	14	888	335	334	1031	87	42	37369
Neighboring markers^h	16422	3120	37	14	561	258	558	535	122	74	21671
Curated papers	246	56	11	2	1	9	2	6	1	1	335

^aTO: traits—the number of unique phenotypic traits defined by TO terms that have been used to annotate QTLs.

^bTO: trait categories—the nine categories of traits named in the TO; this higher order node in the TO serves to cluster related traits.

^cPO: structure—the number of anatomy terms used to describe QTLs (total number of unique terms = 48).

^dPO: growth-stage—the number of growth-stage terms used to describe QTLs (total number of unique terms = 20).

^eMap Sets—the number of unique mapping population marker datasets used in QTL studies.

^fParental germplasm—the number of different strains or accessions used as parents in QTL mapping studies.

^gCo-localized markers—markers that map within QTL intervals; >37 000 markers have been curated and used to anchor QTLs to the sequence map of rice.

^hNeighboring markers—markers flanking QTL intervals; >21 000 neighboring markers have been curated and are used to construct comparative maps.
