iSimp in BioC standard format: enhancing the interoperability of a sentence simplification system

Currently, iSimp can detect six types of simplification constructs: coordination, relative clause, apposition, introductory phrase, subordinate clause and parenthetical element. For a more detailed description of sentence simplification, as well as its challenges (e.g. attachment ambiguities, boundary detection and nested constructs), we refer the reader to (1).

In comparison with the works using parse trees or dependency graphs, iSimp uses shallow parsing and recursive transition networks to detect all forms of simplifications. Figure 2 shows the workflow of the system. iSimp first tokenizes the text, and then it splits each sentence into a sequence of nonoverlapping chunks. The detection of various simplification constructs is based on the chunks, and from these, iSimp generates simplified sentences. Three types of chunks were investigated here: noun phrases, verb groups and prepositional phrases.

Figure 2.

The workflow of iSimp.

iSimp scans the phrase sequence from left to right. Whenever a trigger word of a simplification construct is found (e.g. ‘and’ for coordination or ‘which’ for relative clause), we attempt to identify the simplification construct using transition networks. If a stop state of the network is reached, a simplification construct has been detected successfully. We extended the network to address nested constructs. For an in-depth description of this process, we refer the reader to (1).
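The scan can be pictured with a toy example. The following Python sketch is not the iSimp implementation: the chunk labels (NP, VG, CC), the three-state network for a two-conjunct coordination and all function names are assumptions made purely for illustration.

```python
# Toy transition network for a two-conjunct coordination: X and X,
# where both conjuncts X have the same chunk type. Illustrative only.
TRIGGERS = {"and", "or"}

# TRANSITIONS[state] maps an input symbol to the next state.
TRANSITIONS = {
    "START": {"CONJUNCT": "SEEN_FIRST"},
    "SEEN_FIRST": {"TRIGGER": "SEEN_TRIGGER"},
    "SEEN_TRIGGER": {"CONJUNCT": "STOP"},
}


def symbol(chunk, conjunct_type):
    """Translate a (chunk_type, text) pair into a network input symbol."""
    chunk_type, text = chunk
    if chunk_type == "CC" and text in TRIGGERS:
        return "TRIGGER"
    if chunk_type == conjunct_type:
        return "CONJUNCT"
    return "OTHER"


def run_network(chunks, start, conjunct_type):
    """Return (first, last) chunk indices if the stop state is reached."""
    state, i = "START", start
    while state != "STOP" and i < len(chunks):
        state = TRANSITIONS[state].get(symbol(chunks[i], conjunct_type))
        if state is None:  # no transition for this symbol: detection fails
            return None
        i += 1
    return (start, i - 1) if state == "STOP" else None


def scan(chunks):
    """Scan left to right and try the network at every trigger word."""
    found = []
    for i, (chunk_type, text) in enumerate(chunks):
        if chunk_type == "CC" and text in TRIGGERS and i > 0:
            span = run_network(chunks, i - 1, chunks[i - 1][0])
            if span is not None:
                found.append(span)
    return found


chunks = [
    ("NP", "Active Raf-2"),
    ("VG", "phosphorylates"),
    ("CC", "and"),
    ("VG", "activates"),
    ("NP", "MEK1"),
]
print(scan(chunks))
```

On these chunks the sketch reports the chunk span covering ‘phosphorylates and activates’. A noun phrase followed by ‘and’ and a verb group would not reach the stop state and is therefore rejected.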

iSimp generates simplified sentences by combining various simplification constructs. To illustrate the problem, consider the following sentence:

E4. Active Raf-2 [ _{coordination} phosphorylates and activates] MEK1, [ _{relative clause} which in turn [ _{coordination} phosphorylates and activates] the MAP kinases signal regulated kinases, [ _{coordination & appositive} ERK1 and ERK2]]. (PMID-8557975)

iSimp is able to generate several simple sentences from (E4). Five of them are shown below:

E5. (a) Active Raf-2 phosphorylates MEK1.
(b) MEK1 in turn phosphorylates ERK1.
(c) MEK1 in turn phosphorylates ERK2.
(d) The MAP kinases signal regulated kinases is an ERK1.
(e) The MAP kinases signal regulated kinases is an ERK2.

Sometimes, iSimp will introduce new words in the simplified sentence to keep it grammatically correct. For example, in (E5d) and (E5e), we put ‘is an’ between the appositive clause and the singular noun phrase it refers to, to form a new sentence. Adding new words to the corpus is one of the factors that distinguish iSimp from other applications that enhance BioC.
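The combinatorial nature of this step can be sketched in a few lines of Python. The template representation (a list mixing fixed strings with lists of alternatives) and the function name expand are hypothetical. The sketch only approximates how sentences such as (E5a)–(E5c) arise from the coordinations in (E4) and is not iSimp's generation procedure.

```python
from itertools import product


def expand(template):
    """Expand a template into simple sentences.

    Each template item is either a fixed string or a list of alternative
    strings (e.g. the conjuncts of one coordination).
    """
    slots = [item if isinstance(item, list) else [item] for item in template]
    return [" ".join(choice) + "." for choice in product(*slots)]


# Hand-built templates for the two clauses of (E4).
main_clause = ["Active Raf-2", ["phosphorylates", "activates"], "MEK1"]
relative_clause = ["MEK1 in turn", ["phosphorylates", "activates"], ["ERK1", "ERK2"]]

for sentence in expand(main_clause) + expand(relative_clause):
    print(sentence)
```

Each coordination multiplies the number of outputs by its number of conjuncts, so the two templates yield two and four sentences, respectively.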

iSimp in BioC format

Because sentence simplification requires a unique schema to add new text in the corpus, we designed a BioC tag set for annotating and sharing the simplification results. Figures 3 and 4 show the key file used in iSimp to define the semantics associated with the data.

Figure 3.

The key file used in iSimp to define the simplification constructs associated with the data.

Figure 4.

The key file used in iSimp to define the simplified sentences associated with the data.

We use the annotation element to mark up the simplification construct components, and we use the relation element to specify how these components are related. In the latter, we further specify the name of the simplification type (e.g. coordination, relative clause, etc.), as well as roles for each component in the relation using the node element (e.g. ‘conjunct’ and ‘conjunction’ for the coordination, ‘referred noun phrase’ and ‘appositive’ for the apposition). For example, Figure 5 shows the coordination ‘phosphorylates and activates’ in BioC format. This coordination contains two conjuncts (‘phosphorylates’ and ‘activates’) and one conjunction (‘and’). Some attributes, like the location elements, are not shown in this figure for lack of space.

Figure 5.

An example of sentence simplification annotation in BioC format. The coordination contains two conjuncts (‘phosphorylates’, ‘activates’) and one conjunction (‘and’). Some attributes, like the location elements, are not shown for the sake of space.
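For readers who prefer code to figures, the following Python sketch builds a comparable BioC fragment with the standard library. The element and attribute names (annotation, location, relation, node, refid, role) follow the BioC DTD, and the role names come from the description above. The infon key "type", the identifiers T0–T2 and R0, and the helper add_annotation are assumptions for illustration and may differ from the iSimp key files.

```python
import xml.etree.ElementTree as ET

SENTENCE = "Active Raf-2 phosphorylates and activates MEK1."


def add_annotation(sentence_elem, ann_id, phrase):
    """Append a BioC <annotation> covering the first occurrence of phrase."""
    ann = ET.SubElement(sentence_elem, "annotation", id=ann_id)
    offset = SENTENCE.index(phrase)  # the sentence offset is 0 in this example
    ET.SubElement(ann, "location", offset=str(offset), length=str(len(phrase)))
    ET.SubElement(ann, "text").text = phrase
    return ann


sentence = ET.Element("sentence")
ET.SubElement(sentence, "offset").text = "0"
ET.SubElement(sentence, "text").text = SENTENCE

# One annotation per component of the coordination.
add_annotation(sentence, "T0", "phosphorylates")
add_annotation(sentence, "T1", "and")
add_annotation(sentence, "T2", "activates")

# The relation names the construct and assigns a role to each component.
relation = ET.SubElement(sentence, "relation", id="R0")
ET.SubElement(relation, "infon", key="type").text = "coordination"  # assumed infon key
for refid, role in [("T0", "conjunct"), ("T1", "conjunction"), ("T2", "conjunct")]:
    ET.SubElement(relation, "node", refid=refid, role=role)

print(ET.tostring(sentence, encoding="unicode"))
```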

As mentioned before, iSimp generates new simplified sentences. This poses an additional challenge to the integration of the BioC format, as such cases were not directly addressed in the original design of BioC (2). Hence, we designed and proposed a new way of using the BioC framework. Figure 6 shows an example of simplified sentences in the BioC format (left), as well as the corresponding text file (right) with locations highlighted. We include both original and simplified sentences in the BioC file. The offsets of the original sentences are the same as in the original text. However, the offsets of the simplified sentences start with the offset of the next character after the last character in the original document (offset of document + length of document). This new collection could then be treated as the input collection for the next step in the NLP pipeline.

Figure 6.

An example of simplified sentences in BioC format (left) and the corresponding text file (right) with locations highlighted.
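The offset rule (offset of document + length of document) can be illustrated with a short Python sketch. The function name and the single separator character assumed between consecutive simplified sentences are our own choices for this example. Only the starting offset of the first simplified sentence is taken from the description above.

```python
def place_simplified(original_text, document_offset, simplified_sentences):
    """Assign offsets to simplified sentences appended after the original text.

    Returns a list of (offset, sentence) pairs.
    """
    # First free position: offset of document + length of document.
    next_offset = document_offset + len(original_text)
    placed = []
    for sentence in simplified_sentences:
        placed.append((next_offset, sentence))
        # Assumption: one separator character between appended sentences.
        next_offset += len(sentence) + 1
    return placed


original = "Active Raf-2 phosphorylates and activates MEK1."
simplified = ["Active Raf-2 phosphorylates MEK1.", "Active Raf-2 activates MEK1."]

for offset, sentence in place_simplified(original, 0, simplified):
    print(offset, sentence)
```

Because the appended sentences never overlap the original character range, annotations on the original text keep their offsets unchanged.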

To link text in simplified sentences to that in the original sentence, we introduce the ‘equ’ (equivalence) relation. Figure 7 shows an example of an equivalence relation, in which we link ‘phosphorylates’ back to the original sentence. This way phrases in the simplified sentences can be mapped back to the corresponding phrases in the original sentence. Equivalence relations can be used to ensure that downstream applications recognize the duplicated nature of such ‘equivalent’ phrases and do not report the same information multiple times in the end. Implementation of this mechanism was feasible owing to the extensibility of the BioC format.

Figure 7.

An example showing ‘equ’ (equivalence) relations in iSimp-generated BioC file.
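A downstream application might use such relations roughly as follows. In this Python sketch, the relation and node elements follow the BioC DTD, whereas the infon key "type", the role names "original" and "simplified", and the annotation identifiers are placeholders rather than the vocabulary actually used by iSimp.

```python
import xml.etree.ElementTree as ET


def make_equ_relation(rel_id, original_id, simplified_id):
    """Build a BioC <relation> linking two annotation ids as equivalent."""
    relation = ET.Element("relation", id=rel_id)
    ET.SubElement(relation, "infon", key="type").text = "equ"  # assumed infon key
    # The role names below are placeholders, not iSimp's actual vocabulary.
    ET.SubElement(relation, "node", refid=original_id, role="original")
    ET.SubElement(relation, "node", refid=simplified_id, role="simplified")
    return relation


def canonical_ids(annotation_ids, equ_relations):
    """Map each annotation id to its original counterpart, dropping duplicates."""
    to_original = {}
    for relation in equ_relations:
        nodes = {n.get("role"): n.get("refid") for n in relation.findall("node")}
        to_original[nodes["simplified"]] = nodes["original"]
    seen, result = set(), []
    for ann_id in annotation_ids:
        canonical = to_original.get(ann_id, ann_id)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


equ = make_equ_relation("R1", "T0", "T10")
print(canonical_ids(["T0", "T10", "T2"], [equ]))
```

Here the mention T10 in a simplified sentence is folded back onto T0 in the original sentence, so the same phrase is reported only once.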

Online iSimp with BioC

For various NLP/TM applications to use sentence simplification, we have made iSimp available online. It adopts the BioC format and supports two interfaces.

Users can submit a document in the standard BioC format, which is described in Figure 5 of (2). The format requires a document to be specified as a sequence of sentences where the offsets are specified with respect to the whole document. Given the input file, iSimp will output the list of sentences marked with simplification constructions. Moreover, iSimp will append the simplified sentences to the marked input sentences, and provide this output as a zip file for download. For displaying different sentence simplification aspects, we have also developed a web interface where users can provide sentences in plain text and iSimp will output the sentences marked with simplification constructions directly in the browser.

To support interoperable machine-to-machine interaction with other applications, iSimp can be accessed by enclosing the BioC file in the POST requests. The iSimp Web server will accept and process one sentence per request and send back simplification constructs and simplified sentences in an all-in-one BioC file. This will guarantee the response time and avoid loading overly large BioC files. To submit sentences in one BioC file (BioC sentence Document type definition (DTD)), users can use the following format: http://research.bioinformatics.udel.edu/isimp/biocsentence?biocfile=BioCFileContent .
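A client call could be prepared along the following lines. This Python sketch only builds the request and does not send it. The endpoint and the biocfile parameter are taken from the URL above, while the form-encoded body, the Content-Type header and the placeholder BioC content are assumptions, and the service may no longer be reachable at this address.

```python
from urllib.parse import urlencode
from urllib.request import Request

ISIMP_URL = "http://research.bioinformatics.udel.edu/isimp/biocsentence"


def build_request(bioc_xml):
    """Wrap a BioC document (as a string) in a POST request."""
    # Assumption: the BioC content is sent as a form field named 'biocfile'.
    body = urlencode({"biocfile": bioc_xml}).encode("utf-8")
    return Request(
        ISIMP_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder content; a real call would pass a one-sentence BioC file.
req = build_request("<collection></collection>")
print(req.get_method(), req.full_url)
# Sending would be: urllib.request.urlopen(req).read()
```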

Results and discussion

Evaluation on RE system

To examine the usefulness of iSimp, we considered a very simple rule-based RE system. The first relation we focused on was the phosphorylation relation between the trigger and the theme (substrate) as defined in the GE corpora. We used straightforward rules, as shown below, where X is a noun phrase in which the protein or protein product appears as a headword:

phosphorylate (or, phosphorylates, phosphorylated, phosphorylating) X
phosphorylation of X
X phosphorylation
[ _{noun phrase} phosphorylated X]

These rules are able to match straightforward mentions of phosphorylation in text. However, they fail to find mentions of phosphorylation in complex sentences, like the one shown in (E4). In contrast, the first rule applies to the simplified sentences (E5a)–(E5c) and extracts <phosphorylates, MEK1>, <phosphorylates, ERK1> and <phosphorylates, ERK2>. As long as the rules for extraction are precise, the simplification step will help improve the recall of the system, without hurting the precision.
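The effect can be reproduced with a small regular-expression approximation of the first three rules. In this Python sketch, X is reduced to a protein name from a supplied list instead of a full noun phrase with a protein headword, and the fourth rule is omitted because it needs chunk boundaries. It is therefore a simplification of the RE system, not a reimplementation.

```python
import re


def build_rules(proteins):
    """Compile regex approximations of the first three rules."""
    x = r"(?P<theme>" + "|".join(re.escape(p) for p in proteins) + r")"
    return [
        # phosphorylate / phosphorylates / phosphorylated / phosphorylating X
        re.compile(r"\b(?P<trigger>phosphorylat(?:es?|ed|ing))\s+" + x + r"\b"),
        # phosphorylation of X
        re.compile(r"\b(?P<trigger>phosphorylation)\s+of\s+" + x + r"\b"),
        # X phosphorylation
        re.compile(r"\b" + x + r"\s+(?P<trigger>phosphorylation)\b"),
    ]


def extract(sentence, proteins):
    """Return (trigger, theme) pairs matched by any rule."""
    pairs = []
    for rule in build_rules(proteins):
        for match in rule.finditer(sentence):
            pairs.append((match.group("trigger"), match.group("theme")))
    return pairs


proteins = ["MEK1", "ERK1", "ERK2"]
original = ("Active Raf-2 phosphorylates and activates MEK1, which in turn "
            "phosphorylates and activates the MAP kinases signal regulated "
            "kinases, ERK1 and ERK2.")
simplified = [
    "Active Raf-2 phosphorylates MEK1.",
    "MEK1 in turn phosphorylates ERK1.",
    "MEK1 in turn phosphorylates ERK2.",
]

print(extract(original, proteins))
for sentence in simplified:
    print(extract(sentence, proteins))
```

The original sentence yields no pair because ‘and activates’ separates each trigger from its theme, whereas each simplified sentence yields exactly one pair.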

We evaluated iSimp in terms of the impact it had on the performance of the RE system. Thus, we compared the results obtained by the RE system when using versus not using iSimp. The BioC XML format and schema described in the previous section were used to transfer the original data to iSimp and the RE system as well as to transfer the enhanced data from iSimp to the RE system. Besides adding and removing iSimp from the pipeline, no additional changes were made to the steps involved in the pipeline. This not only shows the interoperability of iSimp, but also proves that our proposed mechanism of using the BioC framework works as expected.

We tested this basic RE system on the BioNLP-ST 2011 GE task training corpus (14). Precision/Recall/F-value without simplification were 97.32/78.38/86.83 versus 97.42/81.62/88.82 with simplification. These results show that with the help of iSimp, the recall gap of 21.62 was reduced by 15% to 18.38, without introducing precision errors. In our previous and ongoing work (15), we have observed similar improvements in recall for various RE tasks.
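The reported figures can be checked with a few lines of arithmetic. The Python sketch below assumes that the F-value is the balanced harmonic mean of precision and recall, and that the recall gap is 100 minus recall. Both assumptions are consistent with the numbers above.

```python
def f_value(precision, recall):
    """Balanced F-value (harmonic mean of precision and recall)."""
    return 2 * precision * recall / (precision + recall)


p_without, r_without = 97.32, 78.38
p_with, r_with = 97.42, 81.62

# F-values reported in the text: 86.83 and 88.82.
print(round(f_value(p_without, r_without), 2))
print(round(f_value(p_with, r_with), 2))

# Recall gap before and after simplification, and its relative reduction.
gap_without = 100 - r_without
gap_with = 100 - r_with
reduction = (gap_without - gap_with) / gap_without
print(round(gap_without, 2), round(gap_with, 2), round(100 * reduction))
```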

In this exercise, we did not include agents because the GE corpora do not annotate them. However, because the above rules are most likely to be affected by noun phrase coordination, we believe simplification will benefit agent extraction as well.

This exercise also illustrates the ability of sentence simplification to keep rules simple and yet achieve good results. Because patterns for simplification and RE are orthogonal, we do not need to multiply rules to consider all their combinations. An alternative way, as shown in the above example, is to treat sentence simplification as an independent task, and not for a particular RE. This way, we can focus on simple rules only. Sentence simplification is then applied to increase the recall of the original system.

Simplification-annotated corpora in the BioC format

We provide a corpus marked with simplification constructs, using the BioC format (http://research.bioinformatics.udel.edu/isimp/corpus.html). This corpus can be used by others to evaluate the performance of iSimp or other sentence simplifiers. The corpus consists of 130 Medline abstracts mentioning proteins and genes, with a total of 1199 sentences. The corpus contains three BioC files: (i) Medline abstracts of raw text, (ii) sentences that are split using the OpenNLP sentence detector and (iii) annotations of simplification constructs at the sentence level. Key files are also provided with additional information that describes the meaning of tags used in the BioC files and the annotation schema. The corpus uses the same DTD provided by BioC for validation.

Additionally, we have converted the BioNLP-ST 2011 GE corpus to the BioC format for our evaluation purposes, and this corpus can also be downloaded from the link given above.

Conversion script

We provide a script to convert the BioNLP-ST corpus to the BioC format (https://bitbucket.org/udbiotmgroup/bionlp2bioc). The original text files (.txt) are split based on ‘newline’, and the various parts are stored into passage elements. Entities (in .a1 files) and event triggers (in .a2 files) are stored into appropriate passages based on their positions in the text files. Target annotations (in .a2 files), including events, relations, event modifications and equivalences, are recorded at the document level. If the annotation is marked by more than one continuous span of characters, the script creates several location elements. This also shows the generalizability of the BioC format, which allows multi-segmented annotations.
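The core of such a conversion can be sketched as follows. This Python example is not taken from the bionlp2bioc script. It assumes the tab-separated text-bound annotation layout of the standoff files (identifier, then type and character offsets, then covered text), with multiple spans separated by semicolons, and all function names are hypothetical.

```python
def split_passages(text):
    """Split a text on newlines and record the offset of each passage."""
    passages, offset = [], 0
    for line in text.split("\n"):
        passages.append((offset, line))
        offset += len(line) + 1  # +1 for the newline removed by split
    return passages


def parse_textbound(line):
    """Parse a line such as 'T1<TAB>Protein 0 5;10 14<TAB>text'."""
    ann_id, middle, covered_text = line.rstrip("\n").split("\t")
    ann_type, span_part = middle.split(" ", 1)
    spans = [tuple(int(n) for n in part.split()) for part in span_part.split(";")]
    return ann_id, ann_type, spans, covered_text


def to_locations(spans):
    """One BioC location (offset, length) per contiguous span."""
    return [{"offset": start, "length": end - start} for start, end in spans]


def passage_index(passages, start):
    """Index of the passage that contains character position start."""
    index = 0
    for i, (offset, _) in enumerate(passages):
        if offset <= start:
            index = i
    return index


text = "Raf-2 activates MEK1\nMEK1 activates ERK1 and ERK2."
passages = split_passages(text)
ann_id, ann_type, spans, covered = parse_textbound("T3\tProtein 36 40\tERK1")
print(passages)
print(to_locations(spans), passage_index(passages, spans[0][0]))
```

An annotation with two spans, for example ‘0 5;16 20’, would produce two location entries, which mirrors the multi-segmented case described above.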

Conclusion

In this study, we enhanced our sentence simplifier system, iSimp, to fully adopt the BioC format. We defined a unique BioC tag set for annotating simplification results and proposed a schema, which allows simplified sentences to be included in the BioC annotation file and be treated as part of the original collection. The proposed schema differs from the standard schema in that it can include words that are not part of the original text.

To illustrate the usefulness of iSimp with BioC, we examined its impact on a basic RE system. Evaluation on the BioNLP-ST 2011 GE task training corpus showed that, with sentence simplification provided by iSimp, recall increased by 3.24 percentage points (from 78.38 to 81.62), which corresponds to a 15% reduction in recall error, without introducing precision errors. The corpora converted into the BioC format were made publicly available together with the conversion script. Additionally, corpora we had previously developed for evaluating simplification performance of iSimp were made available in the BioC format, which may be used as public benchmarking corpora.

The corpora and the online demo of iSimp, using the BioC format, are available at http://research.bioinformatics.udel.edu/isimp/ .

Acknowledgements

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Funding

Research reported in this article was supported by the National Library of Medicine of the National Institutes of Health under award number G08LM010720. This material is also based upon work supported by the National Science Foundation under Grant No. DBI-1062520. Funding for open access charge: National Science Foundation (DBI-1062520).

Conflict of interest. None declared.

References

1. Peng, Tudor, C.O., Torii, C.H., Vijay-Shanker (2012) iSimp: a sentence simplification system for biomedical text. In: Proceedings of the 2012 IEEE International Conference on Bioinformatics and Biomedicine, Philadelphia, PA, pp. 211–216.

2. Comeau, D.C., Doğan, R.I., Ciccarese, et al. (2013) BioC: a minimalist approach to interoperability for biomedical text processing. Database (Oxford), 2013, bat064.

3. Chandrasekar, Doran, Srinivas (1996) Motivations and methods for text simplification. In: Proceedings of the 16th Conference on Computational Linguistics, Vol. 2, Copenhagen, Denmark, pp. 1041–1044.

4. Torii, Arighi, Wang, et al. (2013) Text mining of protein phosphorylation information using a generalizable rule-based approach. ACM Conference on Bioinformatics, Computational Biology and Biomedical, Washington DC, USA, pp. 201–210.

5. Miwa, Sætre, Miyao, Tsujii (2010) Entity-focused sentence simplification for relation extraction. In: Proceedings of International Conference on Computational Linguistics, Beijing, China, pp. 788–796.

6. Buyko, Faessler, Wermter, Hahn (2011) Syntactic simplification and semantic enrichment–Trimming dependency graphs for event extraction. Comput. Intell., 610–644.

7. Huang, Zhu (2005) A hybrid method for relation extraction from biomedical literature. Int. J. Med. Inform., 443–455.

8. Jonnalagadda, Gonzalez (2010) BioSimplify: an open source sentence simplification engine to improve recall in automatic biomedical information extraction. AMIA Annu. Symp. Proc., 2010, 351–355.

9. Ong, Damay, Lojico, Tarantan (2007) Simplifying text in medical literature. J. Res. Sci. Comput. Eng., –.

10. Comeau, D.C., Liu, Doğan, R.I., Wilbur, W.J. (2013) Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus. In: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, Bethesda, MD, USA, pp. –.

11. Doğan, R.I., Comeau, D.C., Yeganova, Wilbur, W.J. (2013) Finding abbreviations in biomedical literature: three BioC-compatible modules and tree BioC-formatted corpora. In: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, Bethesda, MD, USA, pp. –.

12. Khare, Wei, C.H., Mao, Leaman (2013) Improving interoperability of text mining tools with BioC. In: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, Bethesda, MD, USA, pp. –.

13. Lai, Dai, J.C., Tsai, R.T. (2013) A biomedical semantic role labeling BioC module for BioCreative IV. In: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, pp. –.

14. Kim, J.D., Nguyen, Wang, Tsujii, Takagi, Yonezawa (2012) The Genia event and protein coreference tasks of the BioNLP shared task 2011. BMC Bioinformatics, (Suppl 11).

15. Tudor, C.O., Vijay-Shanker (2012) RankPref: ranking sentences describing relations between biomedical entities with an application. In: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, Montreal, Canada, pp. 163–171.