Opportunities for text mining in the FlyBase genetic literature curation workflow

Table 1

Data-type flags used in literature curation (taken from an article by Bunt et al., 2012; see Ref. 2)

Data-type flags	Data presented in an article
Drosophila reagents
New allele or aberration	Generation of a new classical allele or chromosomal aberration in a Drosophilid genome
New transgene	Generation of a new transgenic construct
Gene characterization
Initial characterization	Initial characterization of a Drosophilid gene
Merge of gene reports	Evidence suggesting the merge of two or more FlyBase gene reports
Gene rename	New gene symbol or name for an existing gene in FlyBase
Expression
Expression in a wild-type background	New temporal or spatial expression data of any D. melanogaster gene in a wild-type background
Expression in a mutant background	Expression data of any D. melanogaster gene in a mutant background, or after environmental perturbation
Phenotypes and interactions
Phenotypic analysis	Novel phenotypic data
Physical interaction	Physical interactions involving D. melanogaster proteins or nucleic acids
Genome annotation data
Changes to D. melanogaster gene model	New experimental data relevant to D. melanogaster gene model structure
Changes to non-D. melanogaster gene model	New experimental data relevant to the gene model structure of non-D. melanogaster Drosophilid genes
Mapping of features to genome	D. melanogaster molecular mapping data
Cis-regulatory elements defined	Experimental definition of cis-regulatory elements of D. melanogaster genes

The data-type flags added during skim curation are stored in an internal database and used to generate a priority list for full curation. Our current practice is to target articles with the broadest range of curatable data (i.e. the highest number of flags) first: the more flags an article has, the more relevant data types it contains and the higher its position in our priority list. The system is flexible, so we can change how the flags are used to prioritize articles as circumstances change (such as an increase in a particular type of data or funding changes). Using this approach, each individual article is prioritized based solely on its relevance to FlyBase users.

In ‘full curation’, literature curators extract genetic entities and related molecular and phenotypic data from the results (text, tables and figures) and methods sections of an article. Only primary data are extracted, so no data are curated from the introduction or discussion, or when they are attributed to another article. Additional types of data include GO annotations, genetic interactions (e.g. alleles of different genes), allelic interactions (e.g. alleles of the same gene), aberrations (e.g. deficiencies), allele classes (e.g. hypomorphs) and construct data (e.g. transgenic lines).
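The flag-count prioritization described above can be illustrated with a minimal sketch. It is not the FlyBase implementation: the `Article` class, the article identifiers and the optional `weights` argument (one possible way to express that the use of flags can change with circumstances) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    ref_id: str                                  # hypothetical article identifier
    flags: set = field(default_factory=set)      # data-type flags from skim curation


def priority_list(articles, weights=None):
    """Order articles so that those with the most (weighted) flags come first."""
    weights = weights or {}

    def score(article):
        # Every flag counts 1 unless an (assumed) weight overrides it.
        return sum(weights.get(flag, 1) for flag in article.flags)

    # sorted() is stable, so articles with equal scores keep their input order.
    return sorted(articles, key=score, reverse=True)


articles = [
    Article("paper_A", {"New transgene"}),
    Article("paper_B", {"New allele or aberration", "Phenotypic analysis",
                        "Physical interaction"}),
    Article("paper_C", {"Gene rename", "Phenotypic analysis"}),
]

ranked = [a.ref_id for a in priority_list(articles)]
reweighted = [a.ref_id for a in priority_list(articles, {"New transgene": 5})]
print(ranked)
print(reweighted)
```

Keeping the scoring function separate from the sort means a change in priorities only requires a different `weights` mapping, not a change to the ordering logic.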

Data capture methods

Curators interact with an article in differing ways. Depending on the length of the article, some curators read the article from the computer screen, while others print the article out and highlight phrases with a marker. Often, the ‘find’ function of a PDF viewer is used to search for multiple occurrences of the same term during ‘on screen’ curation. Some curators skim through a whole article first in order to get a good overview of the work and then re-read it more thoroughly to extract the detail. Others start curation from a specific section (not necessarily the abstract or the results section) and then move to another section in search of additional information about a specific concept, for example, a particular transgenic construct or gene function. Each curator may have a different method by which they curate an article, but all curators follow strict guidelines as to what type and depth of data are curated and any inconsistencies in curation are identified in regular curator meetings.

Data from an article are added to structured template files using a range of valid symbols (and recorded synonyms), controlled vocabularies (CVs) (Table 2) and free text. These structured text file templates are known as proformae. These proformae roughly correspond to the report pages found on the FlyBase website. For genetic curation, the proformae are arranged in a hierarchy, starting with a single publication proforma, followed by gene proformae and nested allele proformae as necessary, then by proformae for other genetic entities as required (Figure 2a, proformae organization). Several proformae are bundled together to form a ‘curation record’ for each article. Curation records are opened, edited and saved in a text editor.

Figure 2

Literature curation into proformae. Text files composed of various proformae are used to capture data from the literature. (A) The proformae are ordered such that each curation record has to start with a publication proforma, so all objects mentioned subsequently can be attributed to the relevant publication. Allele proformae are added underneath the parent gene proforma, so all allele information can be related back to the parent gene. (B) Proformae are split into four different types of fields. The fields start with an exclamation mark (for processing) and each field has a field code, e.g. GA1a is the allele symbol field (all fields in the allele proforma are coded GAx).
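To make the field structure concrete, the following sketch parses field lines from a curation record. Only two details come from the text: fields start with an exclamation mark, and allele fields carry codes of the form GAx (e.g. GA1a). The exact line layout (`! CODE. label :value`) and the gene field code `G1a` are assumptions for illustration and may not match real proformae.

```python
import re

# Assumed layout: "! <field code>. <label> :<value>"
FIELD_RE = re.compile(
    r"^!\s*(?P<code>[A-Z]+\d+[a-z]?)\.\s*(?P<label>[^:]*):(?P<value>.*)$"
)


def parse_field(line):
    """Return a dict for a proforma field line, or None for any other line."""
    match = FIELD_RE.match(line)
    if match is None:
        return None
    code = match.group("code")
    return {
        "code": code,
        # Leading capital letters identify the proforma type (e.g. GA = allele).
        "prefix": re.match(r"[A-Z]+", code).group(0),
        "label": match.group("label").strip(),
        "value": match.group("value").strip(),
    }


record = """\
! G1a. Gene symbol :dpp
! GA1a. Allele symbol :dpp[EP2232]
not a field line
"""

fields = [f for f in map(parse_field, record.splitlines()) if f is not None]
for f in fields:
    print(f["code"], f["prefix"], f["value"])
```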

Table 2

The main CVs used in literature curation

Ontology	Example search term (CV ID)
fly_anatomy	dMP2 neuron (FBbt:00001602)
fly_development	Pupal stage P6 (FBdv:00005353)
Term qualifier	Nutrition conditional (FBcv:0000714)
Phenotypic class	Smell perception defective (FBcv:0000404)
Sequence ontology	Engineered_foreign_gene (SO:0000281)
Origin of mutation	P-element activity (FBcv:0000486)
Allele class	Amorphic allele (FBcv:0000688)
Cellular component	Germ cell nucleus (GO:0043073)
Molecular function	Satellite DNA binding (GO:0003696)
Biological process	mRNA processing (GO:0006397)

Each proforma is composed of a number of fields, which are split into four main types (Figure 2b, proforma fields). The first are the symbol and ID fields. These contain the unique ID and official FlyBase symbol for the genetic entity of interest. These fields also include the symbols and names used to identify the object in the literature, so users can identify an object even if FlyBase labels the object with a different symbol/name to that used in a particular article. An example of this is the gene ‘Wrinkled’, which is often identified as ‘head involution defective’ in the literature.

The second type of field houses CV lines constructed from one or more of the 16 ontologies (structured hierarchies) used in FlyBase (Table 2). CV terms are used to annotate fields for many objects in FlyBase. For example, we use Sequence Ontology (SO) (3) terms to denote the type of gene and GO terms (4) to annotate these genes for molecular function, cellular component and biological process. We also use CVs to record both the class of a phenotype and the anatomical part that is affected. Phenotypes are attached to alleles and genotypes (combinations of alleles, or alleles and chromosomal aberrations). An example piece of text from which we have extracted phenotype CV terms and free text is shown in Figure 3.
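A simple structural check on CV terms written in the style of Table 2, i.e. a term name followed by its ID in parentheses, might look like the sketch below. The ID prefixes are those shown in Table 2, but the digit counts are inferred from the Table 2 examples alone, and the function name is hypothetical. A real check would also look the ID up in the ontology itself.

```python
import re

# Number of digits per ID prefix, inferred from the Table 2 examples only.
ID_DIGITS = {"FBbt": 8, "FBdv": 8, "FBcv": 7, "SO": 7, "GO": 7}

TERM_RE = re.compile(
    r"^(?P<name>.+?)\s*\((?P<prefix>[A-Za-z]+):(?P<number>\d+)\)$"
)


def parse_cv_term(text):
    """Split 'term name (PREFIX:digits)' into (name, ID); raise ValueError if malformed."""
    match = TERM_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in 'name (PREFIX:digits)' form: {text!r}")
    prefix, number = match.group("prefix"), match.group("number")
    if prefix not in ID_DIGITS:
        raise ValueError(f"unknown ID prefix: {prefix}")
    if len(number) != ID_DIGITS[prefix]:
        raise ValueError(f"{prefix} IDs are assumed to have {ID_DIGITS[prefix]} digits")
    return match.group("name"), f"{prefix}:{number}"


print(parse_cv_term("dMP2 neuron (FBbt:00001602)"))
print(parse_cv_term("mRNA processing (GO:0006397)"))
```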

Figure 3

Phenotype curation. Example data entries for a section of text [taken from an article by Baines (2003), see Ref. 5]. First, we identify the object we are ascribing the phenotype to, then we concisely curate the phenotype as free text, relating it to the object (which is placed between ‘at sign’ symbols as these symbols are hyperlinked). We then annotate the phenotype to CV terms, in this case, to terms from our ‘phenotypic class’ and ‘fly_anatomy’ ontologies.

The third type of field is the so-called ‘SoftCV’ field, which uses a limited set of terms to describe a feature in a semi-controlled manner. The terms used are ‘structured sentences’ and do not form an ontology. An example of this type of field is the ‘Molecular lesions’ field in the allele proforma, which captures amino acid changes with the prefix ‘Amino acid replacement’.

The fourth type of field houses free text descriptions. These fields provide a level of detail that cannot be captured by CV terms and describe phenotypes and molecular lesions in a more human readable form. An example of this field is the phenotype free text field, where we record the detail of a phenotype (by paraphrasing rather than cutting and pasting from the article) that is associated with an allele or genotype (Figure 3).

Once genetic curation is complete, each curation record is checked using our in-house software, where each line is assessed for the correct structure and CV. The record is also checked for coherency between the fields, so for example, if a gene is renamed, we confirm that the new gene symbol has not been used before, and that a line is added in another field to attribute the rename to that particular article. Once a curation record has been fully checked for clashes both within the record and with the existing database, it is integrated with existing data in the FlyBase Chado database (6). Currently, we update the public website with a new release of the data six times a year, at roughly 2-month intervals.
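The gene-rename coherency check mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the in-house software: the dictionary keys `new_symbol` and `rename_attributed_to_article` and the function name are hypothetical stand-ins for the relevant proforma fields.

```python
def check_rename(gene_entry, known_symbols):
    """Return a list of problems for one gene entry (an empty list means it passes)."""
    problems = []
    new_symbol = gene_entry.get("new_symbol")
    if new_symbol is None:
        return problems  # no rename in this entry, nothing to check

    # Symbols are compared case-sensitively, since symbols can differ only in case.
    if new_symbol in known_symbols:
        problems.append(f"symbol '{new_symbol}' is already in use")

    # The rename must also be attributed to the article in another field.
    if not gene_entry.get("rename_attributed_to_article", False):
        problems.append("rename is not attributed to this article")
    return problems


known = {"dl", "Dl", "ras", "Ras85D"}
print(check_rename({"new_symbol": "Dl"}, known))
print(check_rename({"new_symbol": "dL", "rename_attributed_to_article": True}, known))
```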

Problems in curation

Manual curation from the literature is hard. It is a time-consuming task that can take an experienced curator several hours for a standard article, and it typically takes 6 months to train a FlyBase Genetic Literature Curator. Even then, curation is a social process, with curators seeking advice among the group. In this section, we outline some of the problems we encounter in curation and describe how they could affect the use of text mining in FlyBase.

The most common problem encountered during curation is an ambiguous genetic entity (gene, mutant allele, transgene, etc.). This situation can arise when neither a unique identifier (such as a FlyBase gene identifier (FBgn) or a computed gene (CG) number for genes) nor an accurate and explicit reference for a mutant or transgenic line is given. Ambiguity is a particular problem when a generic symbol/name is used (e.g. ‘Actin’ or UAS-Notch), or when a symbol/name is used that is a synonym for a different entity (e.g. ‘ras’ is the current FlyBase symbol for the ‘raspberry’ gene, FBgn0003204, but is often used in the literature to refer to the ‘Ras85D’ gene, FBgn0003205). A further issue is that some symbols differ only in the case of the first character, for example, the gene symbols ‘dl’ (dorsal) and ‘Dl’ (Delta). These ambiguities can usually be resolved by searching for associated details about the entity in the article (e.g. the use of a specific mutant allele can identify the gene being discussed) or by consulting the supplemental information for additional details. Sometimes we have to do some analysis ourselves, such as performing a BLAST search using any sequence data present in the article or supplementary files, or executing an in-house script to report those entities used by a specified author in previously curated articles. As a final step, if we cannot resolve a problem, we email the corresponding author for clarification. If the ambiguity still cannot be resolved, then a curator will either associate a generic/unspecified entry for that entity with the article, or else omit the entity and add a (non-public) note to the curation record explaining the situation, with the hope that future publications will resolve the issue.
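The synonym and case-sensitivity problems can be made concrete with a small lookup sketch. The two FBgn identifiers and the ‘ras’ example are taken from the text above. The table layout, the function names and the placeholder keys used for dorsal and Delta (whose identifiers are not given here) are assumptions for illustration.

```python
from collections import defaultdict

# Toy symbol table; the dorsal/Delta keys are placeholders, not real identifiers.
GENES = {
    "FBgn0003204": {"symbol": "ras", "synonyms": []},           # raspberry
    "FBgn0003205": {"symbol": "Ras85D", "synonyms": ["ras"]},   # 'ras' used in the literature
    "placeholder-dorsal": {"symbol": "dl", "synonyms": []},
    "placeholder-Delta": {"symbol": "Dl", "synonyms": []},
}


def build_index(genes):
    """Map each symbol or synonym (case-sensitively) to the genes it may denote."""
    index = defaultdict(set)
    for gene_id, info in genes.items():
        index[info["symbol"]].add((gene_id, "current symbol"))
        for synonym in info["synonyms"]:
            index[synonym].add((gene_id, "synonym"))
    return index


def candidates(index, mention):
    """Return every gene a mention could refer to; more than one means ambiguous."""
    return sorted(index.get(mention, set()))


index = build_index(GENES)
for mention in ["ras", "dl", "Dl", "DL"]:
    hits = candidates(index, mention)
    status = "ambiguous" if len(hits) > 1 else "unique" if hits else "unknown"
    print(mention, status, hits)
```

A lookup like this can only flag the ambiguity. Resolving it still depends on the contextual evidence described above, such as the alleles used in the article.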

One of the more esoteric problems found in curation is the fact that multiple relationships exist between the curated data types. For example, the ‘dpp^EP2232 allele’ is caused by the ‘P{EP}dpp^EP2232 insertion’ and disrupts the ‘dpp gene’. This can cause problems for text-mining assisted curation, as the data can be attributed to the wrong object due to sentence structure or the requirement of background or contextual knowledge found in other parts of the article. In cases like this, detailed knowledge of the FlyBase proforma and curation rules, as well as a good knowledge of Drosophila biology, is necessary to ensure the correct proforma field is filled in. This is one of the reasons why we believe text-mining methods will assist manual curation rather than replace it in the near term.

Curation, therefore, is a complex process. However, when broken down into discrete components, we believe, from experience, that the use of text-mining tools can be beneficial.

Use of text-mining tools in curation

We hope that, over the coming years, we will be able to integrate text mining into multiple areas within our curation pipeline and practice, to help streamline the process. In this section, we will outline our current progress and define our aims for the near future.

FlyBase began interacting with the text-mining community in 2002 when we were involved in the KDD CUP challenge (7), a forerunner to the BioCreative initiative (8). In 2008, we developed a natural language processing (NLP) system that marked up HTML versions of an article for gene/allele mentions and associated phenotypes (9). This ‘PaperBrowser’ tool, written in Java and built on top of a web browser, is equipped with two navigation mechanisms called PaperView and EntitiesView. These are organized in terms of the document sectioning and possible relations between groups of words (noun phrases). More specifically, PaperView lists gene symbols (such as ‘btl’ or ‘Vang’) in the order in which they appear in each section of the article, while EntitiesView lists noun phrases related to the gene symbols, such as ‘the wg pathway’. Clicking on a node in either PaperView or EntitiesView redirects the PaperBrowser tool to the sentence that contains the corresponding gene symbol or noun phrase, thereby allowing the curator to navigate around the article to those sections involving the entity of interest, for example, a particular mutant allele or transgene. Used in conjunction with a simple curation interface, the PaperBrowser tool improved article navigational efficiency (the number of navigation events needed before the data are extracted) by ∼58% and provided curators with enhanced utility (the number of non-navigated events needed outside of the highlighted areas) by over 74% compared to using the ‘find’ function in a PDF viewer. The PaperBrowser system is a good proof of concept, demonstrating that text mining can be successfully integrated with the FlyBase article-by-article curation system, which is something we would like to explore further in the future.

FlyBase has collaborated with WormBase (10) and Textpresso (11) since 2004, when we were consulted on the development of Textpresso for Fly, including the creation of fly-specific vocabularies to use in the ‘Categories’ searches. In our recent collaboration, we are exploring the use of support vector machine (SVM) methods to triage primary research articles into categories based on our skim/author curation flags (12). We have trained the SVM to triage articles for new alleles, new transgenes and gene renames. We have done this by generating positive and negative training sets, composed of articles that we have already curated (where we know whether they contain a particular data type or not). The positive training set contains between 500 and 1000 articles, while the negative training set contains over 2000 articles. We plan to re-train the SVM for these flags periodically, to account for changes in the literature and to ensure continued low false-positive/negative rates. We are in the process of training the SVM for some of the other data triage flags, in the hope of using the SVM to triage those articles that have not been curated by the authors through our FTYP tool.
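The idea of training a classifier on positive and negative article sets can be illustrated with a toy linear SVM. The text does not say which SVM implementation, kernel or features are used, so everything below is an assumption: bag-of-words features, a linear model trained with Pegasos-style sub-gradient steps on the hinge loss, and four invented one-sentence ‘articles’ standing in for the real training sets.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lower-cased whitespace tokens."""
    return Counter(text.lower().split())


def dot(weights, features):
    return sum(weights.get(tok, 0.0) * count for tok, count in features.items())


def train_linear_svm(examples, lam=0.1, epochs=20):
    """Pegasos-style sub-gradient descent on the regularised hinge loss.

    examples: list of (text, label) pairs, label +1 (flag applies) or -1.
    """
    data = [(featurize(text), label) for text, label in examples]
    weights = {}
    t = 0
    for _ in range(epochs):
        for features, label in data:       # fixed order keeps the sketch deterministic
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            violated = label * dot(weights, features) < 1.0
            shrink = 1.0 - 1.0 / t         # equals 1 - eta * lam (regularisation step)
            weights = {tok: shrink * w for tok, w in weights.items()}
            if violated:                   # margin violated: move towards this example
                for tok, count in features.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * label * count
    return weights


def predict(weights, text):
    return 1 if dot(weights, featurize(text)) > 0 else -1


# Invented toy training data for a 'new allele' flag.
TRAINING = [
    ("we generated a novel mutant allele by EMS mutagenesis", 1),
    ("this review summarises recent progress", -1),
    ("a new hypomorphic allele was isolated in our screen", 1),
    ("previously published stocks were obtained from the stock centre", -1),
]

weights = train_linear_svm(TRAINING)
print(predict(weights, "a novel allele"))
print(predict(weights, "recent review"))
```

With real articles the two classes share most of their vocabulary, so feature selection, held-out evaluation and periodic re-training (as described above) matter far more than they do in this toy setting.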

We can envisage text mining being used at multiple points within our curation workflow. Further to our current SVM work, text mining could be a useful adjunct to our manual GO annotation process. There are still many genes that lack functional information and, minimally, automated text mining could be used to identify articles with functional data for these genes. Following the example set by WormBase (13), we hope to use text mining to generate suggested annotations for specific genes (subject to curator review) for at least GO cellular component. We are about to embark on curating disease associations and anticipate that the Textpresso disease category could also be helpful for this aspect of curation.

While SVM methods may replicate the triage aspect of skim curation, in order to fully automate the process, we need to also identify the genes mentioned in each publication. Textpresso and the Genetics Society of America (GSA), in collaboration with WormBase, SGD (14) and FlyBase, have developed a journal article mark-up pipeline that links GSA journal articles and FlyBase gene symbols and IDs (15). False positives are discovered (at a rate of 12%, with false negatives at 27%) and resolved by a curator through a manual quality control step, with the final marked-up document assessed by the authors as part of the proofing process.

Building on our collaboration with WormBase and the GSA, we hope to use text mining to extract genetic entity symbols from all Drosophila-related literature. This, in combination with a document triage system, would form a text-mining equivalent of our skim curation, combining symbol extraction with document triaging. We are also interested in using text mining to suggest anatomy CV terms for specific genes from particular regions of text, along with gene and allele symbols and data triage flags. This would not extract the data and populate proformae, but simply highlight the text for the curator to assess and extract if appropriate. This would build on the success of the PaperBrowser system and accelerate manual curation of an article.
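Highlighting symbol mentions for a curator, rather than extracting data automatically, can be sketched with a case-sensitive whole-token search. This is only an illustration: the function name and example sentence are invented, and a real system would need a full symbol and synonym list plus the disambiguation steps discussed earlier.

```python
import re


def find_mentions(text, symbols):
    """Return (start, end, symbol) for each case-sensitive whole-token mention."""
    if not symbols:
        return []
    # Try longer symbols first so a long symbol is preferred over a shorter prefix.
    ordered = sorted(symbols, key=len, reverse=True)
    pattern = "|".join(re.escape(symbol) for symbol in ordered)
    # Lookarounds stop matches inside longer alphanumeric tokens.
    regex = re.compile(rf"(?<![A-Za-z0-9])(?:{pattern})(?![A-Za-z0-9])")
    return [(m.start(), m.end(), m.group(0)) for m in regex.finditer(text)]


sentence = "Loss of dl, but not Dl, altered btl expression."
mentions = find_mentions(sentence, {"dl", "Dl", "btl"})
for start, end, symbol in mentions:
    print(start, end, symbol)
```

Returning character offsets instead of rewriting the text leaves the article untouched, so a viewer can highlight each span and the curator decides what to extract.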

Summary

Over the last 20 years, FlyBase has had to adapt and change to keep abreast of changes in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. A major challenge over the years has been dealing with the massive increase in data that biologists are generating, both in genomic studies and in the published literature. While it may be too ambitious to anticipate text mining replacing human curators for full curation, we can envisage a time when text mining will be regularly used in FlyBase curation. Our aim in writing this article was to inform the text-mining community about our curation pipeline and practice. We hope that this will encourage collaboration with FlyBase so that we can put technologies in place that will help curators, both in FlyBase and in other databases, to deal with this challenge.

Funding

The National Human Genome Research Institute at the National Institutes of Health [P41 HG00739] and the Medical Research Council, UK [G1000968]. Funding for open access charge: the National Human Genome Research Institute, the National Institutes of Health [P41 HG00739].

Conflict of interest. None declared.

Acknowledgements

The author thanks the BioCreative group for the invitation to give a talk at the BioCreative Workshop 2012 on the FlyBase genetic literature curation workflow, from which this article stems. He also thanks colleagues at FlyBase, particularly Steven Marygold, David Osumi-Sutherland and Gillian Millburn, for critical reading of the article, and members of FlyBase Cambridge for their work in the design of the curation pipeline and discussions on the use of text mining in curation.

References

1. McQuilton, St Pierre, Thurmond and the FlyBase Consortium (2012) FlyBase 101: the basics of navigating FlyBase. Nucleic Acids Res., D706–D714.
2. Bunt, Grumbling, Marygold, et al. (2012) Directly e-mailing authors of newly published papers encourages community curation. Database, doi: 10.1093/database/bas024.
3. Eilbeck, Lewis, Mungall, et al. (2005) The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol., R44.
4. The Gene Ontology Consortium (2011) The Gene Ontology: enhancements for 2011. Nucleic Acids Res., D559–D564.
5. Baines R.A. (2003) Postsynaptic protein kinase A reduces neuronal excitability in response to increased synaptic excitation in the Drosophila CNS. J. Neurosci. 23: 8664–8672.
6. Mungall, Emmert and the FlyBase Consortium (2007) A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics, i337–i346.
7. Yeh, Hirschman, Morgan (2002) Background and overview for KDD Cup 2002 task 1: information extraction from biomedical articles. SIGKDD Explor. Newsl.
8. Arighi, Krallinger, et al. (2011) Overview of the BioCreative III workshop. BMC Bioinformatics, 12 (Suppl. 8).
9. Karamanis, Seal, Lewin, et al. (2008) Natural language processing in aid of FlyBase curators. BMC Bioinformatics, 193.
10. Yook, Harris, Bieri, et al. (2012) WormBase 2012: more genomes, more data, new website. Nucleic Acids Res., D735–D741.
11. Müller H-M, Kenny, Sternberg (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol., e309.
12. Fang, Schindelman, Van Auken, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics.
13. Van Auken, Jaffery, Chan, et al. Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics, 228.
14. Cherry, Hong, Amundsen, et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res., D700–D705.
15. Rangarajan, Schedl, Yook, et al. Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics, 175.