Abstract

In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.

Introduction

As a way to cope with the constantly increasing generation of results in molecular biology, some organizations maintain various types of databases that aim at collecting the most significant information in a specific area. For example, UniProt/SwissProt (1) collects information on all known proteins. IntAct (2) is a database collecting protein–protein interactions. The Comparative Toxicogenomics Database (CTD) collects associations between chemicals and genes in order to support the study of the effects of environmental chemicals on health (3). Most of the information in these databases is derived from the primary literature by a process of manual annotation known as ‘literature curation’. Text mining solutions are increasingly requested to support the process of curation of biomedical databases.

Several community-run evaluations have been organized in the past few years in order to assess the advancement of the field and stimulate new developments. Some of the best known are BioCreative (4), BioNLP (5) and CALBC (6). The 2012 BioCreative edition included, in particular, a task aiming at supporting the triage process for the Comparative Toxicogenomics Database. In this article, we describe the approach used for our participation in the triage task of the BioCreative 2012 challenge and the results obtained.

The triage task is the first step of the curation process for several biological databases: it aims at selecting and prioritizing the articles to be curated in the rest of the process. In BioCreative 2012, the task organizers provided a chemical entity to be used as an entry point of the curation process, and a list of articles to be prioritized according to that chemical.

Our solution to this task has been implemented under the assumption that articles should be considered relevant if they are related to the target entity provided as input and additionally, their relevance should be increased by the presence of interactions in which the target chemical is involved.

The work presented here is part of the OntoGene project (http://www.ontogene.org/), which aims at improving biomedical text mining through the usage of advanced natural language processing techniques. Our approach is based on accurate processing of the input articles by a pipeline of advanced NLP tools, which perform increasingly complex tasks, from sentence splitting and tokenization up to term recognition, phrase chunking and syntactic analysis (7, 8).

In the context of the SASEBio project (Semi-Automated Semantic Enrichment of the Biomedical Literature), the OntoGene group has also developed a user-friendly interface (ODIN: OntoGene Document INspector) which presents the results of the text mining pipeline in an intuitive fashion and allows a deeper interaction of the curator with the underlying text mining system (9).

In the rest of this article, we first explain how our existing OntoGene relation mining system has been customized for the CTD dataset (‘Information extraction’ section), and then how it has been integrated with a conventional information retrieval (IR) system (Lucene) for the purpose of the triage task (‘Integration with a standard IR system’ section). We also provide a brief overview of our ODIN curation interface (‘The ODIN interface’ section), an evaluation of the results obtained by the integrated system in the shared task (‘Evaluation’ section) and a discussion on current and future work (‘Discussion’ section).

Information extraction

In this section, we describe the OntoGene Text Mining pipeline, which is used to (i) provide all basic pre-processing (e.g. tokenization) of the target documents, (ii) identify all mentions of domain entities and normalize them to database identifiers and (iii) extract candidate interactions. We then describe in detail a machine-learning approach used to obtain an optimized scoring of candidate interactions based upon global information from the set of interactions existing in the CTD database (excluding data from the test set).

Pre-processing and detection of domain entities

The OntoGene Text Mining pipeline was used in order to transform the input documents into a richly annotated XML format, which is the basis of our relation extraction algorithm. The assumption was that from this format we could derive information useful to improve document ranking and therefore provide a solution for the triage task, which could improve on a conventional IR approach. In a previous work (10), we showed that the inclusion of PubMed metadata, such as the list of chemical substances as well as the annotated MeSH descriptors and qualifiers, improves the detection of important relations and enhances term recognition coverage. Therefore, we added such metadata from the PubMed XML files as a textual list at the end of each abstract. In the OntoGene text mining pipeline, sentence and token boundaries of the enriched abstracts are identified using the LingPipe framework (more information can be found at http://alias-i.com/lingpipe).

In this section, we describe in particular our approach to named entity recognition, i.e. the problem of detecting names of relevant domain entities in biomedical literature (genes, chemicals and diseases for CTD) and grounding them to widely accepted identifiers assigned by the original database.

Terms, i.e. preferred names and synonyms, are automatically extracted from the original CTD database and stored in a common internal format, together with their unique identifiers, as obtained from the original resource. An efficient lookup procedure is used to annotate any mention of a term in the documents with the IDs to which it corresponds. A term normalization step is used to take into account a number of possible surface variations of the terms. The same normalization is applied to the list of known terms at the beginning of the annotation process, when it is read into memory, and to the candidate terms in the input text, so that a matching between variants of the same term becomes possible despite the differences in the surface strings. In case the normalized strings match exactly, the input sequence is annotated with the IDs of the reference terms and no further disambiguation on concepts is done at this point. For more technical details of the OntoGene term recognizer, see (11).
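The lookup procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the OntoGene term recognizer itself: the normalization rule (lowercasing and removing everything except letters and digits), the maximum span length and the synonym ‘PGE2’ are hypothetical choices made for the example; only the identifier MESH_D015232 is taken from this article.

```python
import re
from collections import defaultdict

def normalize(term):
    """Assumed normalization: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())

def build_index(terms):
    """Map each normalized surface form to the set of IDs it may denote."""
    index = defaultdict(set)
    for name, concept_id in terms:
        index[normalize(name)].add(concept_id)
    return index

def annotate(tokens, index, max_len=4):
    """Return (start, end, ids) for each token span whose normalized form is known.

    No disambiguation is attempted: a span receives every ID of the matching term.
    """
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            key = normalize(" ".join(tokens[start:end]))
            if key in index:
                hits.append((start, end, sorted(index[key])))
    return hits

# Illustrative term list; the synonym 'PGE2' is a hypothetical entry.
terms = [("Prostaglandin E2", "MESH_D015232"), ("PGE2", "MESH_D015232")]
index = build_index(terms)
hits = annotate(["inhibited", "prostaglandin", "E2", "synthesis"], index)
print(hits)  # [(1, 3, ['MESH_D015232'])]
```

Because the same `normalize` function is applied to the term list and to the candidate spans, ‘Prostaglandin E2’ and ‘prostaglandin E2’ meet at the same key even though their surface strings differ.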

Detection of interactions

As a baseline approach, it is possible to generate candidate interactions among domain entities on the basis of their co-occurrence in a given text span (typically one or more sentences or an even larger observation window). Such an approach might achieve a sufficient recall but suffers from low precision. In order to obtain better precision it is possible to take into account the syntactic structure of the sentence, or the global distribution of interactions in the original database. In this section, we describe in detail how candidate interactions are ranked by our system, according to their relevance for CTD curation, by exploiting the vast amount of curated articles in the CTD database.

For the entities in the CTD database a context window of one sentence for candidate relation generation is too restrictive. In an evaluation limited to those PubMed articles from CTD with explicit evidence for at most 12 relations we found the following distribution: for about 32% of all relations from the CTD, where our term recognizer was able to detect both participating entities, there was no sentence containing both entities in the PubMed abstract. Given these numbers, we chose to use a context window of the entire abstract for candidate pair generation.
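The effect of the context window can be illustrated with a small sketch. It assumes that entity recognition has already produced, for each sentence, a list of entity identifiers; the identifiers used here are hypothetical placeholders.

```python
from itertools import combinations

def candidate_pairs(entity_ids):
    """All unordered pairs of distinct entity IDs in one context window."""
    return sorted(combinations(sorted(set(entity_ids)), 2))

def sentence_window_pairs(sentences):
    """Pairs whose two entities co-occur in at least one sentence."""
    pairs = set()
    for sentence in sentences:
        pairs.update(candidate_pairs(sentence))
    return sorted(pairs)

def abstract_window_pairs(sentences):
    """Pairs whose two entities co-occur anywhere in the abstract."""
    return candidate_pairs([e for sentence in sentences for e in sentence])

# Hypothetical abstract: two sentences, three entities.
abstract = [["chem:A", "gene:B"], ["dis:C"]]
print(sentence_window_pairs(abstract))  # [('chem:A', 'gene:B')]
print(abstract_window_pairs(abstract))
# [('chem:A', 'dis:C'), ('chem:A', 'gene:B'), ('dis:C', 'gene:B')]
```

The abstract-wide window recovers pairs that never share a sentence, at the price of more candidates to rank, which is why the scoring step that follows matters.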

An initial ranking of the candidate relations can be generated on the basis of the frequency of occurrence of the respective entities only:

$$\mathrm{relscore}_{\mathrm{freq}}(e_1, e_2) = \frac{f(e_1) + f(e_2)}{f(E)}$$

where $f(e_1)$ and $f(e_2)$ are the number of times the entities $e_1$ and $e_2$ are observed in the abstract, while $f(E)$ is the total count of all identifiers in the abstract. Previous experiments for the extraction of protein–protein interactions from PubMed abstracts (8) and more recent experiments on the PharmGKB database (12) have shown that giving a ‘boost’ of ∼10 to the entities contained in the title produces a measurable improvement of ranking of the results.
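As a rough illustration of this frequency-based ranking, the sketch below assumes that the score is the sum of the two entity counts divided by the total count of identifiers, and that the title boost is applied by counting each title occurrence 10 times. Both points are assumptions made for the example; the function names are hypothetical.

```python
from collections import Counter

TITLE_BOOST = 10  # boost of about 10 for title occurrences, as reported in the text

def boosted_counts(title_ids, abstract_ids, boost=TITLE_BOOST):
    """Count identifier occurrences, weighting each title occurrence by `boost`."""
    counts = Counter(abstract_ids)
    for entity in title_ids:
        counts[entity] += boost
    return counts

def frequency_score(e1, e2, counts):
    """(f(e1) + f(e2)) / f(E), where f(E) is the total count of all identifiers."""
    total = sum(counts.values())
    return (counts[e1] + counts[e2]) / total if total else 0.0

# Hypothetical article: entity A occurs in the title, A, B, B and C in the abstract body.
counts = boosted_counts(["A"], ["A", "B", "B", "C"])
print(frequency_score("A", "B", counts))  # 13/14
print(frequency_score("B", "C", counts))  # 3/14
```

With the boost, the pair containing the title entity is ranked well above a pair of entities that are mentioned only in the body.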

This simple approach can be further optimized if we apply a supervised machine-learning method for scoring the probability that an entity is part of a relation which was manually curated and inserted into the CTD database. There are two key motivations for this approach. First, we need to lower the scores of false positive relations which are generated by overly broad entities (frequent but not very interesting). The goal is to model some global properties of the curated CTD relations. Second, we want to penalize false positive concepts that our term recognizer detects. In order to deal with such cases, we need to condition the entities by their normalized textual form $t$. The combination of a term $t$ and one of its valid entities $e$ is noted as $t{:}e$.

For example, according to the term database of the CTD, the word ‘PTEN’ (phosphatase and tensin homolog) may denote nine different diseases (autistic disorder; carcinoma, squamous cell; glioma; hamartoma syndrome, multiple; head and neck neoplasms; melanoma; prostatic neoplasms; endometrial neoplasms; craniofacial abnormalities), apart from denoting the gene ‘PTEN’. Using the techniques described below we can automatically derive the relevancy of the concepts related to the word ‘PTEN’ from the corpus of manually curated CTD relations. Doing so leads to a result which clearly prefers the interpretation of ‘PTEN’ as a gene.

Next, we define a predicate $\mathrm{gold}(A, e)$, which is true (i.e. 1) for an article $A$ if there is at least one relation in the gold standard of which entity $e$ is part, and false (i.e. 0) otherwise. We estimate the overall probability $P(\mathrm{gold} \mid t{:}e)$, where $t{:}e$ denotes a normalized term $t$ together with one of its valid entities $e$, with the help of the maximum entropy modeling tool megam (13). For training, we use the set of CTD-referenced PubMed articles having not more than 12 manually curated relations (the threshold of 12 relations is motivated by the observation that the more relations an article has, the less probable it is to find them by processing the abstracts only), additionally removing all articles which are part of the BioCreative training and test set for the respective dataset. This results in 22 319 articles for the training set, containing 69 320 curated relations. For the test set, we used 22 825 articles with 71 064 relations.

For unseen normalized terms $t$, i.e. terms not present in the training data, the maximum entropy classifier would assign a low default probability based on the distribution of all training instances. However, we can specify better back-off probabilities if we take into account the admissible entity/entities $e$ of term $t$. Our current back-off model works as follows: if the entity $e$ of an unseen term $t$ is seen in the article, the averaged probability of all seen term–entity pairs is used. Otherwise, the averaged probability of all entities of the same type as $e$ is used.
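One possible reading of this back-off scheme is sketched below. It is an interpretation, not the authors' implementation: it assumes that ‘seen’ refers to the training data, that the first back-off averages over the seen term–entity pairs of the same entity, and that the second averages over all seen pairs whose entity has the same type. All identifiers and probability values are invented for the example.

```python
from statistics import mean

def gold_probability(term, entity, entity_type, pair_probs, entity_types):
    """Probability for a term-entity pair, with two back-off levels.

    pair_probs:   {(term, entity): probability} as estimated on training data
    entity_types: {entity: type} for the entities seen in training
    """
    if (term, entity) in pair_probs:
        return pair_probs[(term, entity)]
    # Back-off 1: unseen term, but the entity was seen with other terms.
    same_entity = [p for (_, e), p in pair_probs.items() if e == entity]
    if same_entity:
        return mean(same_entity)
    # Back-off 2: unseen entity, average over seen entities of the same type.
    same_type = [p for (_, e), p in pair_probs.items()
                 if entity_types.get(e) == entity_type]
    return mean(same_type) if same_type else 0.0

# Hypothetical training estimates.
pair_probs = {("pten", "gene:G1"): 0.8, ("mmac1", "gene:G1"): 0.6, ("pten", "dis:D1"): 0.1}
entity_types = {"gene:G1": "gene", "dis:D1": "disease"}

seen = gold_probability("pten", "gene:G1", "gene", pair_probs, entity_types)
unseen_term = gold_probability("tep1", "gene:G1", "gene", pair_probs, entity_types)
unseen_entity = gold_probability("xyz", "gene:G2", "gene", pair_probs, entity_types)
```

In this toy setting, an unseen synonym of a well-attested gene inherits the average of that gene's seen pairs (about 0.7) instead of a low default probability.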

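The score of an entity $e$ in an article $A$ is the sum of all zone-boosted term frequencies (as mentioned earlier, occurrences in the title are counted 10 times) weighted by their gold probability:

$$\mathrm{score}(e, A) = \sum_{t{:}e \,\in\, A} f_{\mathrm{boost}}(t{:}e, A) \cdot P(\mathrm{gold} \mid t{:}e)$$

where $t{:}e$ ranges over the term–entity pairs for $e$ found in $A$, $f_{\mathrm{boost}}(t{:}e, A)$ is the zone-boosted frequency of term $t$ in $A$ and $P(\mathrm{gold} \mid t{:}e)$ is the estimated probability that the pair takes part in a curated relation.

Having determined the individual score for each entity $e$, we compute the relation score as the harmonic mean of its component scores:

$$\mathrm{relscore}(e_1, e_2, A) = \frac{2 \cdot \mathrm{score}(e_1, A) \cdot \mathrm{score}(e_2, A)}{\mathrm{score}(e_1, A) + \mathrm{score}(e_2, A)}$$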

In our previous work on relation ranking (10), the relation score was taken as a sum of the concept scores. By performing systematic cross-validation experiments on all CTD articles, we noticed that using the harmonic mean improves the results considerably. In order to make the relation scores comparable between different articles we normalize all relation scores for a given BioCreative dataset. For the normalization step, all relation candidate scores of a dataset are linearly scaled to a value between 0 and 1.
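The scoring steps described in this section can be put together in a short sketch. The function names, the data layout and the probability values are assumptions made for the example. The linear scaling is implemented here as a division by the maximum score of the dataset, which is one way to obtain values between 0 and 1; the exact scaling used in the system is not specified in the text.

```python
def entity_score(occurrences, gold_prob, title_boost=10):
    """Sum of zone-boosted term occurrences, each weighted by its gold probability.

    occurrences: list of (term, zone) mentions of one entity in one article,
                 where zone is "title" or "abstract"
    gold_prob:   {term: probability} for this entity (hypothetical values)
    """
    score = 0.0
    for term, zone in occurrences:
        weight = title_boost if zone == "title" else 1
        score += weight * gold_prob[term]
    return score

def relation_score(score1, score2):
    """Harmonic mean of the two entity scores."""
    total = score1 + score2
    return 2 * score1 * score2 / total if total > 0 else 0.0

def normalize_scores(scores):
    """Scale all relation scores of a dataset linearly by the maximum score."""
    highest = max(scores.values())
    return {rel: value / highest for rel, value in scores.items()} if highest > 0 else dict(scores)

# The harmonic mean rewards balanced pairs: both pairs below have the same sum (5.0).
unbalanced = relation_score(4.0, 1.0)  # 1.6
balanced = relation_score(2.5, 2.5)    # 2.5
normalized = normalize_scores({"r1": unbalanced, "r2": balanced})
```

Under a plain sum both relations would tie at 5.0, whereas the harmonic mean ranks the balanced pair clearly higher, so a relation is not carried by a single very frequent entity.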

Integration with a standard IR system

A conventional IR system (Lucene) is used to provide a baseline document ranking from which a classification can be derived by selection of a threshold. Information derived from the OntoGene pipeline, and from the ranking process described in the previous section, is then added as additional features in order to improve the baseline ranking generated by the IR system [the integration of the various components is performed using mainly JRuby (http://jruby.org/), through which the Lucene API is accessed].

Terminology-aware tokenization

The IR system processes the documents in the standard way, selecting different boost values for title and abstract: 10 for title, 3 for abstract, just as in the CTD reference system (notice that the boosting mentioned here is internal to the IR system, while in the previous section we mentioned a similar boosting factor for the OntoGene pipeline). Experiments with different boost values for title and abstract did not show any statistically significant change in the MAP scores, probably because most of the information is in the abstract, not in the title: the existence of relevant information in the title typically implies relevant information in the abstract.

The only significant technical change to Lucene pre-processing is the replacement of the ‘StandardAnalyzer’ component (which is the default analyzer for English, responsible for tokenization, stemming, etc.) with our own tokenization results, as delivered by the OntoGene pipeline. The advantage of this approach is that we can flexibly treat recognized technical terms as individual tokens and map together their synonyms (14). In other words, after this step all known synonyms of a term will be treated as identical by the IR system.

The ‘StandardAnalyzer’ component is replaced by a simple transformation of the XML output of the pipeline into a format suitable for internal processing by Lucene. In particular, tokens and terms as recognized by the pipeline are transformed into Lucene ‘token’ data objects. Whenever a domain entity (denoted by the Term element in the XML representation) is found, it is replaced by a ‘normalized’ version of the token sequence (term normalization involves concatenation of the lowercase version of all tokens into a single token, plus some minor ad-hoc changes that depend on the type of the term). At the same position, a new token with the text of the concept identifier is added to the input stream as seen by the IR system.

For example:

<W C="VBN" id="W151" o1="758" o2="767">inhibited</W>
<Term allvalues="MESH_D015232:chem" id="TW152W153" matched="prostaglandine2" type="chem">
  <W C="NN" id="W152" o1="768" o2="781">prostaglandin</W>
  <W C="NN" id="W153" o1="782" o2="784">E2</W>
</Term>
<W C="NN" id="W154" o1="785" o2="794">synthesis</W>

will be converted to the following (square brackets denote token boundaries):

[inhibited] [prostaglandin E2] [synthesis]
            [MESH_D015232]

Synonymous terms (as identified by the pipeline) are mapped to their unique identifiers (for this experiment the term identifier provided by the CTD database), which in the example above is a MeSH term. The initial search is conducted by mapping the target chemical to the corresponding identifier, which is then used as a query term for the IR system application.
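The transformation from the annotated XML into a token stream can be approximated as follows. This is a minimal Python sketch rather than the JRuby code of the actual system: the wrapping `<S>` element, the comma as separator of multiple values in `allvalues` and the `(position, text)` output format are assumptions made for the example, and the type-specific ad-hoc normalization changes mentioned above are omitted.

```python
import xml.etree.ElementTree as ET

# The fragment from the example above, wrapped in a hypothetical <S> element.
SNIPPET = """<S>
<W C="VBN" id="W151" o1="758" o2="767">inhibited</W>
<Term allvalues="MESH_D015232:chem" id="TW152W153" matched="prostaglandine2" type="chem">
<W C="NN" id="W152" o1="768" o2="781">prostaglandin</W>
<W C="NN" id="W153" o1="782" o2="784">E2</W>
</Term>
<W C="NN" id="W154" o1="785" o2="794">synthesis</W>
</S>"""

def to_token_stream(xml_text):
    """Return (position, text) pairs for the IR system.

    A Term yields its normalized form (lowercased tokens, concatenated) and,
    at the same position, one extra token per concept identifier.
    """
    stream, position = [], 0
    for node in ET.fromstring(xml_text):
        if node.tag == "W":
            stream.append((position, node.text))
        elif node.tag == "Term":
            normalized = "".join(w.text.lower() for w in node.iter("W"))
            stream.append((position, normalized))
            for value in node.get("allvalues", "").split(","):
                if value:
                    stream.append((position, value.split(":")[0]))
        else:
            continue  # ignore any other element
        position += 1
    return stream

for position, text in to_token_stream(SNIPPET):
    print(position, text)
# 0 inhibited
# 1 prostaglandine2
# 1 MESH_D015232
# 2 synthesis
```

In Lucene terms, two tokens at the same position correspond to a second token emitted with a position increment of 0, which is how synonyms are conventionally stacked in a token stream.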

Relation-based query expansion

Participants in the shared task were not only required to provide an optimized ranking of target documents, but also to deliver other relevant entities (genes, diseases and chemicals) mentioned in each abstract. The quality of the delivered entities was used as part of the overall evaluation. As described in the ‘Information extraction’ section, the OntoGene pipeline is not only capable of delivering an optimized tokenization; it can also be used to annotate all relevant entities and to generate candidate interactions, which can be directly used for curation purposes by CTD curators.

Although the definition of the task did not require the participants to deliver candidate interactions, we worked under the assumption that documents which contain relevant interactions would be relevant themselves. When another term is often seen in relation with the target term, it is probably important for the target. This statistical information can be used to adjust the ranking of the documents.

The organizers provided for each target chemical a set of articles to be ranked by the participants. The OntoGene pipeline delivers candidate interactions as part of its standard output for each single document. Each interaction is assigned a score in the interval (0,1].

All relations that involve a term equivalent to the target (the target or one of its synonyms) were considered. From these relations, we extracted the interacting entity (the second term in those interactions). An expanded query was then created, combining the original search term with all other entities which are seen to interact with it in the target abstract. The additional query terms are weighted according to the normalized score of the interactions from which they are extracted.

As an example, suppose two documents (Document 1 and Document 2) contain the interactions schematically represented in the first two columns below (an interaction is represented as a triple of two arguments and a probability):

| Document 1 | Document 2 | Expansion terms with score |
|---|---|---|
| A C 1 | A B 1 | C 1 from doc 1 |
| B C 0.7 | B D 0.42 | B 0.75 from doc 1 (score 0.5) and doc 2 (score 1) |
| A B 0.5 | | D 0.4 from doc 1 |
| A D 0.4 | | |

If the target term is A, the relations that involve A are relevant (A C, A B and A D in Document 1; A B in Document 2), which gives us new search terms to be added to the query, listed in the third column with their normalized weights (sum of scores divided by the number of relations).
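The computation of the expansion terms in this example can be reproduced with a short sketch. The function name and the representation of interactions as `(argument 1, argument 2, score)` triples are assumptions made for the illustration; the weights follow the rule stated above (sum of scores divided by the number of relations).

```python
from collections import defaultdict

def expansion_terms(target, documents):
    """Weight each entity interacting with `target` by its mean interaction score.

    documents: list of documents, each a list of (arg1, arg2, score) triples
    """
    scores = defaultdict(list)
    for relations in documents:
        for arg1, arg2, score in relations:
            if arg1 == target:
                scores[arg2].append(score)
            elif arg2 == target:
                scores[arg1].append(score)
    return {term: sum(values) / len(values) for term, values in scores.items()}

doc1 = [("A", "C", 1.0), ("B", "C", 0.7), ("A", "B", 0.5), ("A", "D", 0.4)]
doc2 = [("A", "B", 1.0), ("B", "D", 0.42)]
weights = expansion_terms("A", [doc1, doc2])
print(weights)  # {'C': 1.0, 'B': 0.75, 'D': 0.4}
```

The relations B C and B D are ignored because they do not involve the target A; B receives the mean of its two scores, 0.5 and 1.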

In the search process, Lucene compares the expanded query with all the entities that are found in any given document. We have experimentally verified on the training data that this query expansion process improves the average MAP scores from 0.622 to 0.694.

The ODIN interface

The results of the OntoGene text mining system are made accessible through a curation system called ODIN, which allows a user to dynamically inspect the results of their text mining pipeline. A previous version of ODIN was used for participation in the ‘interactive curation’ task of the BioCreative III competition (15). This was an informal task without a quantitative evaluation of the participating systems. However, the curators who used the system commented positively on its usability for practical curation tasks. An experiment in interactive curation has been performed in collaboration with curators of the PharmGKB database (16, 17). The results of this experiment are described in (12), which also provides further details on the architecture of the system.

More recently, we adapted ODIN to the aims of CTD curation, allowing the inspection of PubMed abstracts annotated with CTD entities and showing the interactions extracted by our system. Once an input term has been selected, the system will generate a ranking for all the articles that might be relevant for the target term. Figure 2 shows the results provided by the system for the input chemical ‘amsacrine’. The PubMed identifier and the title of each article are provided, together with the relevancy score as computed by the system. The PubMed identifier field is also an active link, which when clicked brings the user to the ODIN interface for the selected article. Figure 3 shows a screenshot of this interface.

Figure 1

General architecture of the OntoGene system. The OntoGene pipeline delivers a richly annotated version of the original document. For the experiments described in this article, we made use of (i) tokens, (ii) domain entities and (iii) relations.

Figure 2

ODIN interface: entry page.

Figure 3

ODIN interface: entity annotations and candidate interactions on a sample PubMed abstract.

At first access the user will be prompted for a ‘curator identifier’, which can be any string. Once inside, ODIN’s two panels are visible: on the left the article panel, on the right the results panel. The panel on the right has two tabs: concepts and interactions. In the ‘concepts’ tab a list of terms/concepts is presented. Selecting any of them will highlight the terms in the article. In the ‘interactions’ tab the candidate interactions detected by the system are shown. Selecting any of them will highlight the evidence in the document.

All items are active. Selecting any concept or interaction in the results panel will highlight the supporting evidence in the article panel. Selecting any term in the article panel prompts the opening of a new panel on the right (annotation panel), where the specific values for the term can be modified (or removed) if needed. It is also possible to add new terms by selecting any token or sequence of tokens in the article.

Evaluation

In order to generally assess the upper limit of our relation recognition system, we evaluated the coverage of the term recognizer on all CTD-referenced articles containing at most 12 curated relations.

Table 1 describes the coverage of term recognition for concepts and relations in the experimental data, and shows that we find about three-fourths of all entities. However, the upper limits for relation detection are not the same for all relation types. Relations involving chemicals have lower coverage rates (66–69%, compared with about 74% for disease–gene relations), which is a disadvantage for the CTD triage task.

Table 1

| Category | Total | Found (%) |
|---|---|---|
| Disease | 12 639 | 9502 (75.18) |
| Chemical | 38 523 | 30 129 (78.21) |
| Gene | 39 150 | 29 199 (74.58) |
| Total | 90 312 | 68 830 (76.21) |
| dis-gen | 6956 | 5126 (73.69) |
| che-dis | 12 154 | 8356 (68.75) |
| che-gen | 52 746 | 34 883 (66.13) |
| Total | 71 856 | 48 365 (67.13) |

Table 2 shows the final results obtained on the training (top) and test (bottom) document sets using the online evaluation tool provided by the organizers of the shared task.

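Table 2

| Set | Term | MAP | Genes | Chemicals | Diseases |
|---|---|---|---|---|---|
| Training | Doxorubicin | 0.800 | 0.167 | 0.843 | 0.793 |
| Training | Indomethacin | 0.936 | 0.331 | 0.834 | 0.725 |
| Training | Raloxifene | 0.798 | 0.244 | 0.818 | 0.778 |
| Training | Amsacrine | 0.655 | 0.603 | 0.689 | 0.500 |
| Training | Aniline | 0.543 | 0.625 | 0.561 | 0.524 |
| Training | 2-Acetylaminofluorene | 0.643 | 0.412 | 0.845 | 0.421 |
| Training | Aspartame | 0.365 | 0.686 | 0.756 | 0.720 |
| Training | Quercetin | 0.853 | 0.463 | 0.646 | 0.653 |
| Test | Cyclophosphamide | 0.708 | 0.396 | 0.880 | 0.646 |
| Test | Phenacetin | 0.809 | 0.716 | 0.467 | 0.667 |
| Test | Urethane | 0.650 | 0.365 | 0.871 | 0.633 |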

In the BioCreative 2012 shared task 1, the OntoGene pipeline proved once again its flexibility and efficiency by delivering very effective entity recognition. In particular, our system had the best recognition rate for genes and diseases and the second best for chemicals, leading to the overall best results, as can be seen in Figure 4 (18) [reproduced with permission from the author]. The query expansion approach used in combination with a standard IR system in order to generate the final article ranking did not perform as well in the test phase as the result of the training phase would have suggested. This might have been caused by overfitting to the training data.

Figure 4

Official results of the BioCreative 2012 competition (task 1: ‘triage for the CTD database’). OntoGene was identified as ‘Group 116’. Reproduced from (18).

Discussion

The OntoGene text mining pipeline provides an efficient system for the extraction of entities and relationships from the biomedical literature, as shown by the results discussed in the previous section. Additionally, the ODIN curation interface provides a user-friendly environment for the integration of information derived from the text mining tools into a curation framework.

The OntoGene system has not only been successful in several community-organized evaluations, but it has also been applied in an industrial context, within the NIBR-IT unit of Novartis Pharma AG. At Novartis, scientific annotation is gaining more and more importance. In most recent applications the usage of controlled vocabularies has become mandatory. However, most of the data are still in legacy systems. This is the reason why curation of legacy data and documentation is of crucial importance. Currently, a major focus is being placed on metadata recovery and the curation of a large variety of data repositories containing valuable knowledge in terms of assay data, scientific documentation or clinical data. The main business driver behind this initiative is that the company has a treasury of knowledge but cannot make use of it because the data are not semantically normalized.

The NIBR-IT unit of Novartis has been using ODIN to annotate textual data from legacy repositories. This application benefits considerably from the fact that the OntoGene framework is open and can easily be customized. This allows the usage of internal terminologies for lexical extraction. The legacy documents were pre-annotated with a customized pipeline and the results displayed using ODIN. The ODIN graphical user interface allows for the verification and falsification of annotation results by selecting or deselecting identified concepts. In addition, new terms can be added manually to the annotations, assigned to the appropriate concept class and then fed into controlled vocabularies, thus improving the extraction results of the next annotation cycle.

One of the limitations of the text mining system described above is that it does not provide the type of the detected interactions. This can be a shortcoming for some applications. For example, in the BioCreative 2012 triage task, the capacity of the system to provide a ‘curated action term’ was one of the factors contributing to the overall result.

The OntoGene system performs a complete syntactic analysis of each sentence in the input documents. In most cases, it is relatively easy to recover from such analysis the information which is necessary to provide a relation type. For example, Figure 5 shows a simplified representation of the analysis of the sentence ‘The neuronal nicotinic acetylcholine receptor alpha7 (nAChR alpha7) may be involved in cognitive deficits in Schizophrenia and Alzheimer’s disease.’ from PubMed abstract 15695160. This sentence expresses two interactions between a gene (nAChR) and the diseases Schizophrenia and Alzheimer. From the graphical representation, it can be intuitively seen that the word which indicates the interaction, the verb ‘involved’, can be recovered as the uppermost node at the intersection of the syntactic paths leading to the arguments. Interaction verbs can then be used to infer a suitable CTD action code.
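The idea of recovering the interaction word as the node where the syntactic paths of the two arguments meet can be sketched as a lowest-common-ancestor search in a dependency tree. The tree below is a strongly simplified, hypothetical rendering of the example sentence (one head per word, no labels) and is not the actual output of the OntoGene parser.

```python
def path_to_root(node, head_of):
    """List of nodes from `node` up to the root of the dependency tree."""
    path = [node]
    while node in head_of:
        node = head_of[node]
        path.append(node)
    return path

def connecting_head(arg1, arg2, head_of):
    """First node shared by the upward paths of both arguments, or None."""
    ancestors = set(path_to_root(arg1, head_of))
    for node in path_to_root(arg2, head_of):
        if node in ancestors:
            return node
    return None

# Hypothetical dependency tree: each word maps to its syntactic head.
head_of = {
    "receptor": "involved",
    "deficits": "involved",
    "Schizophrenia": "deficits",
    "disease": "Schizophrenia",
}

print(connecting_head("receptor", "Schizophrenia", head_of))  # involved
print(connecting_head("receptor", "disease", head_of))        # involved
```

For both gene–disease pairs the paths meet at ‘involved’, which is the kind of head word that could subsequently be mapped to an action code.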

Figure 5

Example of syntactic analysis of a sentence as performed by the OntoGene parser. Reprinted from Journal of Biomedical Informatics, Volume 45, Issue 5, Fabio Rinaldi, Gerold Schneider, Simon Clematide, ‘Relation Mining Experiments in the Pharmacogenomics Domain’, pages 851–861, 2012, with permission from Elsevier.

Table 3 shows the highest-scoring head words from a small subset of 93 CTD documents. The table legend explains how the various factors which contribute to the final score (rightmost column) are computed. Notice that the value ‘P’ is often greater than 1, as it is not a probability value, but a relative score.

Table 3

| Head | Term | F = f(Head) | A = f(All) | P = F/A | log(F) × log(A) × P × ¬term |
|---|---|---|---|---|---|
| Play | 0 | 25 | 17 | 1.47 | 13.41 |
| Treat | 0 | 24 | 17 | 1.41 | 12.71 |
| Bind | 0 | 18 | 9 | 2.00 | 12.70 |
| Inhibit | 0 | 41 | 48 | 0.85 | 12.28 |
| Constitute | 0 | 13 | 3 | 4.33 | 12.21 |
| Demonstrate | 0 | 30 | 30 | 1.00 | 11.57 |
| Exhibit | 0 | 16 | 11 | 1.45 | 9.67 |
| Reveal | 0 | 20 | 19 | 1.05 | 9.29 |
| 2t | 0 | 11 | 4 | 2.75 | 9.14 |
| Quinine | 1 | 18 | 1 | 18.00 | 0.00 |
| Phytoestrogen | 1 | 7 | 6 | 1.17 | 0.00 |
| Thalidomide | 1 | 6 | 15 | 0.40 | 0.00 |

Relation labels are shown in the first column. The second column is a boolean value indicating whether the head word is itself a term. The third column (‘F’) shows the number of times the head word is seen in a relevant path (notice that the same head word can occur in multiple relevant paths). The fourth column (‘A’) shows the number of times the word occurs in the document collection. The next column shows the ratio between the preceding two values. The final column gives a weighted score that combines the previous factors.
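The weighted score in the final column of Table 3 can be recomputed from the other columns. The sketch below assumes that the score is the product log(F) × log(A) × P × ¬term and that the logarithm is the natural logarithm; the base is not stated in the text, but this choice reproduces the tabulated values for the rows checked here.

```python
import math

def head_score(f_head, f_all, is_term):
    """log(F) * log(A) * (F / A), set to 0 when the head word is itself a term.

    Assumes f_head and f_all are positive counts.
    """
    if is_term:
        return 0.0
    p = f_head / f_all
    return math.log(f_head) * math.log(f_all) * p

print(round(head_score(25, 17, False), 2))  # 13.41  ('play')
print(round(head_score(18, 9, False), 2))   # 12.7   ('bind')
print(head_score(18, 1, True))              # 0.0    ('quinine')
```

The ¬term factor removes head words that are domain terms themselves (such as ‘quinine’), so that only candidate interaction words are ranked.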


The head words in Table 3 correspond closely to the trigger words used in annotation tasks that rely on relation labels, such as BioNLP (5). They contain few false positives (e.g. ‘2t’ in Table 3), and they can often be mapped well to CTD action codes. For example, ‘bind’, ‘inhibit’, ‘reduce’, ‘block’, ‘downregulate’, ‘metabolize’, ‘expression’, ‘activate’, ‘regulate’ and ‘express’ map to CTD action codes or BioNLP labels. Many heads refer to the investigator’s conclusion (‘demonstrate’, ‘show’, ‘assess’, ‘find’, ‘reveal’, ‘explain’, ‘suggest’) or to methodology (‘treat’, ‘exhibit’). Some are underspecified (e.g. ‘play’, which comes from ‘play a role in’), and some are only syntactic operators (e.g. ‘appear’, ‘ability’). Some are semantically ambiguous: for example, ‘contribute’ can equally be part of an investigator’s conclusion or a syntactic operator (e.g. ‘contributes to the activation’). Completing the mapping of these head words to CTD action codes will require biological expertise.

Conclusions

In this article, we have described our approach to ranking biomedical abstracts for the triage task of the CTD curation process. The distinguishing characteristic of the approach is that it gives priority to the identification of candidate interactions, which are then used as additional weighting factors in a conventional IR-based system.

The OntoGene pipeline is capable of delivering all information relevant to CTD curation: entities with their database references, interactions, and interaction terms. In the shared task, however, we decided to exclude the computation of interaction terms because there was insufficient time for customization. The results of the system are accessible through an intuitive interactive interface, which will be further customized for CTD curation.

Funding

The Swiss National Science Foundation (grant 100014-118396/1); Novartis Pharma AG, NIBR-IT, Text Mining Services, Switzerland.

Conflict of interest. None declared.

Acknowledgements

We wish to thank the anonymous reviewers for their valuable suggestions.

References

1. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res., 2007, vol. 35 (pg. D193-D197).
2. Hermjakob H, Montecchi-Palazzi L, Lewington C, et al. IntAct: An open source molecular interaction database. Nucleic Acids Res., 2004, vol. 32 Suppl. 1 (pg. D452-D455).
3. Mattingly CJ, Rosenstein MC, Colby GT, et al. The Comparative Toxicogenomics Database (CTD): a resource for comparative toxicological studies. J. Exp. Zool. A Comp. Exp. Biol., 2006, vol. 305 (pg. 689-692).
4. Krallinger M, Vazquez M, Leitner F, et al. The protein-protein interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 2011, vol. 12 Suppl. 8, pg. S3.
5. Cohen BK, Demner-Fushman D, Ananiadou S, et al. Proceedings of the BioNLP June 2009 Workshop. 2009. Association for Computational Linguistics, Boulder, Colorado.
6. Rebholz-Schuhmann D, Yepes A, Li C, et al. Assessment of NER solutions against the first and second CALBC silver standard corpus. J. Biomed. Semantics, 2011, vol. 2 Suppl. 5, pg. S11.
7. Rinaldi F, Schneider G, Kaljurand K, et al. An environment for relation mining over richly annotated corpora: The case of GENIA. BMC Bioinformatics, 2006, vol. 7 Suppl. 3, pg. S3.
8. Rinaldi F, Kappeler T, Kaljurand K, et al. OntoGene in BioCreative II. Genome Biol., 2008, vol. 9 Suppl. 2, pg. S13.
9. Rinaldi F, Clematide S, Garten Y, et al. Using ODIN for a PharmGKB re-validation experiment. Database, 2012, 2012: article ID bas021; doi:10.1093/database/bas021.
10. Clematide S, Rinaldi F. Ranking relations between diseases, drugs and genes for a curation task. J. Biomed. Semantics, 2012, vol. 3 Suppl. 3, pg. S5.
11. Rinaldi F, Kaljurand K, Saetre R. Terminological resources for text mining over biomedical scientific literature. J. Artif. Intel. Med., 2011, vol. 52 (pg. 107-114).
12. Rinaldi F, Schneider G, Clematide S. Relation mining experiments in the pharmacogenomics domain. J. Biomed. Inform., 2012, vol. 45 (pg. 851-861).
13. Hal Daumé III. Notes on CG and LM-BFGS optimization of logistic regression.
14. Rinaldi F, Dowdall J, Hess M, et al. Terminology as knowledge in answer extraction. Proceedings of the 6th International Conference on Terminology and Knowledge Engineering (TKE02), 2002. Nancy, France, 28–30 August 2002, pp. 107–113.
15. Arighi C, Roberts P, Agarwal S, et al. BioCreative III interactive task: an overview. BMC Bioinformatics, 2011, vol. 12 Suppl. 8, pg. S4.
16. Klein KE, Chang JT, Cho MK, et al. Integrating genotype and phenotype information: An overview of the PharmGKB project. Pharmacogenomics J., 2001, vol. 1 (pg. 167-170).
17. Sangkuhl K, Berlin DS, Altman RB, Klein TE. PharmGKB: Understanding the effects of individual genetic variants. Drug Metabol. Rev., 2008, vol. 40 (pg. 539-551).
18. Wiegers TC, Davis AP, Mattingly CJ. Collaborative biocuration-text mining development task for document prioritization for curation. Database, 2012, article ID bas037; doi:10.1093/database/bas037.

Author notes

Citation details: Fabio Rinaldi, Simon Clematide, Simon Hafner, Gerold Schneider, Gintarė Grigonytė, Martin Romacker, and Therese Vachon. Using the OntoGene pipeline for the triage task of BioCreative 2012. Database (2012) Vol. 2012: article ID bas053; doi:10.1093/database/bas053.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.