CD-REST: a system for extracting chemical-induced disease relation in literature Open Access

The entity and context information features used for the sentence-level classifier C_S and the document-level classifier C_D

#	Name	Gloss	C_S	C_D
Entity information
1	Entity mention	Bag of words & bigrams of the entity mentions	√	√
2	Chemical first	Is chemical the first entity in the sentence	√
3	MeSH Ids	The corresponding MeSH IDs of each entity	√	√
4	Core chemical	Whether target chemical is a core chemical	√	√
Context information
5	Before	Bag of words & bigrams before the entities	√
6	Between	Bag of words & bigrams between the entities	√
7	After	Bag of words & bigrams after the entities	√
8	Same sentence	Whether the $< c, d >$ pair locates in the same sentence		√
9	Adjacent sentences	Whether the $< c, d >$ pair locates in adjacent sentences		√
10	More than two sentences	Whether the $< c, d >$ pair crosses more than two sentences		√
11	Match terms ( i )	Whether the words between the entities contains any term in terms ( i ) that indicated the i nduced relation, such as “caused”, “induced” etc.	√	√
12	Match terms ( h )	Whether the sentence contains d has any term in terms ( h )that indicated the h older of d , e.g. “patient”, “groups” and “rats” etc.	√	√(if feature 8 or 9 is true)

#	Name	Gloss	C_S	C_D
Entity information
1	Entity mention	Bag of words & bigrams of the entity mentions	√	√
2	Chemical first	Is chemical the first entity in the sentence	√
3	MeSH Ids	The corresponding MeSH IDs of each entity	√	√
4	Core chemical	Whether target chemical is a core chemical	√	√
Context information
5	Before	Bag of words & bigrams before the entities	√
6	Between	Bag of words & bigrams between the entities	√
7	After	Bag of words & bigrams after the entities	√
8	Same sentence	Whether the $< c, d >$ pair locates in the same sentence		√
9	Adjacent sentences	Whether the $< c, d >$ pair locates in adjacent sentences		√
10	More than two sentences	Whether the $< c, d >$ pair crosses more than two sentences		√
11	Match terms ( i )	Whether the words between the entities contains any term in terms ( i ) that indicated the i nduced relation, such as “caused”, “induced” etc.	√	√
12	Match terms ( h )	Whether the sentence contains d has any term in terms ( h )that indicated the h older of d , e.g. “patient”, “groups” and “rats” etc.	√	√(if feature 8 or 9 is true)

Table 1.

The entity and context information features used for the sentence-level classifier C_S and the document-level classifier C_D

#	Name	Gloss	C_S	C_D
Entity information
1	Entity mention	Bag of words & bigrams of the entity mentions	√	√
2	Chemical first	Is chemical the first entity in the sentence	√
3	MeSH Ids	The corresponding MeSH IDs of each entity	√	√
4	Core chemical	Whether target chemical is a core chemical	√	√
Context information
5	Before	Bag of words & bigrams before the entities	√
6	Between	Bag of words & bigrams between the entities	√
7	After	Bag of words & bigrams after the entities	√
8	Same sentence	Whether the $< c, d >$ pair locates in the same sentence		√
9	Adjacent sentences	Whether the $< c, d >$ pair locates in adjacent sentences		√
10	More than two sentences	Whether the $< c, d >$ pair crosses more than two sentences		√
11	Match terms ( i )	Whether the words between the entities contains any term in terms ( i ) that indicated the i nduced relation, such as “caused”, “induced” etc.	√	√
12	Match terms ( h )	Whether the sentence contains d has any term in terms ( h )that indicated the h older of d , e.g. “patient”, “groups” and “rats” etc.	√	√(if feature 8 or 9 is true)

#	Name	Gloss	C_S	C_D
Entity information
1	Entity mention	Bag of words & bigrams of the entity mentions	√	√
2	Chemical first	Is chemical the first entity in the sentence	√
3	MeSH Ids	The corresponding MeSH IDs of each entity	√	√
4	Core chemical	Whether target chemical is a core chemical	√	√
Context information
5	Before	Bag of words & bigrams before the entities	√
6	Between	Bag of words & bigrams between the entities	√
7	After	Bag of words & bigrams after the entities	√
8	Same sentence	Whether the $< c, d >$ pair locates in the same sentence		√
9	Adjacent sentences	Whether the $< c, d >$ pair locates in adjacent sentences		√
10	More than two sentences	Whether the $< c, d >$ pair crosses more than two sentences		√
11	Match terms ( i )	Whether the words between the entities contains any term in terms ( i ) that indicated the i nduced relation, such as “caused”, “induced” etc.	√	√
12	Match terms ( h )	Whether the sentence contains d has any term in terms ( h )that indicated the h older of d , e.g. “patient”, “groups” and “rats” etc.	√	√(if feature 8 or 9 is true)

Table 2.

Features extracted by incorporating knowledge bases

#	Name	Gloss
MeSH features
1	Categories of d	All direct or indirect hypernyms of d
2	Categories of c	All direct or indirect hypernyms of c
3	Has a specific disease	Whether the document has a more specific disease
4	Has a general disease	Whether the document has a more general disease
MEDI features
5	$r (< c, d >)$	Relation of $< c, d >$ in MEDI: null or treatment
6	$r (< c, d >)$	Relation of $< c, d >$ in MEDI’s high precision subset
SIDER features
7	$r (< c, d >)$	Relation of $< c, d >$ in SIDER: null, treatment or aderver-drug-reaction
8	$r (< c, d >)$	Relation of $< c, d >$ in SIDER subset confirmed by FDA Adverse Event Reporting System ( 26 )
9	$i s A D R (d)$	Whether d is an adverse drug event in SIDER
CTD features
10	$r (< c, d >)$	Relation of $< c, d >$ in CTD: null, inferred-association, therapeutic or marker/mechanism
11	$i s I n d u c e d (d)$	Whether d has a marker/mechanism association with any chemicals in CTD

#	Name	Gloss
MeSH features
1	Categories of d	All direct or indirect hypernyms of d
2	Categories of c	All direct or indirect hypernyms of c
3	Has a specific disease	Whether the document has a more specific disease
4	Has a general disease	Whether the document has a more general disease
MEDI features
5	$r (< c, d >)$	Relation of $< c, d >$ in MEDI: null or treatment
6	$r (< c, d >)$	Relation of $< c, d >$ in MEDI’s high precision subset
SIDER features
7	$r (< c, d >)$	Relation of $< c, d >$ in SIDER: null, treatment or aderver-drug-reaction
8	$r (< c, d >)$	Relation of $< c, d >$ in SIDER subset confirmed by FDA Adverse Event Reporting System ( 26 )
9	$i s A D R (d)$	Whether d is an adverse drug event in SIDER
CTD features
10	$r (< c, d >)$	Relation of $< c, d >$ in CTD: null, inferred-association, therapeutic or marker/mechanism
11	$i s I n d u c e d (d)$	Whether d has a marker/mechanism association with any chemicals in CTD

These features were used for both C_S and C_D classifiers

Table 2.

Features extracted by incorporating knowledge bases

#	Name	Gloss
MeSH features
1	Categories of d	All direct or indirect hypernyms of d
2	Categories of c	All direct or indirect hypernyms of c
3	Has a specific disease	Whether the document has a more specific disease
4	Has a general disease	Whether the document has a more general disease
MEDI features
5	$r (< c, d >)$	Relation of $< c, d >$ in MEDI: null or treatment
6	$r (< c, d >)$	Relation of $< c, d >$ in MEDI’s high precision subset
SIDER features
7	$r (< c, d >)$	Relation of $< c, d >$ in SIDER: null, treatment or aderver-drug-reaction
8	$r (< c, d >)$	Relation of $< c, d >$ in SIDER subset confirmed by FDA Adverse Event Reporting System ( 26 )
9	$i s A D R (d)$	Whether d is an adverse drug event in SIDER
CTD features
10	$r (< c, d >)$	Relation of $< c, d >$ in CTD: null, inferred-association, therapeutic or marker/mechanism
11	$i s I n d u c e d (d)$	Whether d has a marker/mechanism association with any chemicals in CTD

#	Name	Gloss
MeSH features
1	Categories of d	All direct or indirect hypernyms of d
2	Categories of c	All direct or indirect hypernyms of c
3	Has a specific disease	Whether the document has a more specific disease
4	Has a general disease	Whether the document has a more general disease
MEDI features
5	$r (< c, d >)$	Relation of $< c, d >$ in MEDI: null or treatment
6	$r (< c, d >)$	Relation of $< c, d >$ in MEDI’s high precision subset
SIDER features
7	$r (< c, d >)$	Relation of $< c, d >$ in SIDER: null, treatment or aderver-drug-reaction
8	$r (< c, d >)$	Relation of $< c, d >$ in SIDER subset confirmed by FDA Adverse Event Reporting System ( 26 )
9	$i s A D R (d)$	Whether d is an adverse drug event in SIDER
CTD features
10	$r (< c, d >)$	Relation of $< c, d >$ in CTD: null, inferred-association, therapeutic or marker/mechanism
11	$i s I n d u c e d (d)$	Whether d has a marker/mechanism association with any chemicals in CTD

These features were used for both C_S and C_D classifiers

Context Information : uni- and bi-gram of words before, between and after the target chemical and disease entities. Also, the presence of trigger words (e.g. induce) in the sentence was also used as features.

Entity Information : mentions and normalized values of the target chemical and disease entities. In addition, we defined a binary feature called “core chemical.” If a chemical entity occurs in the title or it is the most frequently mentioned chemical in the abstract, we define it as a “core chemical.”

Information from domain knowledge : the existing domain knowledge of the target chemicals and diseases. We explored four different knowledge bases: MeSH, CTD, MEDication Indication Resource (MEDI) ( 27 ) and Side Effect Resource (SIDER) ( 28 ). We converted all terms (chemicals/drugs and disease/ADRs) in the MEDI and SIDER into MeSH ID using UMLS. As shown in Table 2 , we extracted all relations of the chemical–disease pair in the CTD, MEDI and SIDER as features. Chemicals or diseases from the same category are more likely to have similar biological properties. Thus, we extracted category-related features for each entity from its MeSH hierarchical tree structures, which were represented by several MeSH Tree Numbers (TN). Take the disease “retrograde amnesia” as an instance, all direct and indirect hypernyms, i.e. “C10”, “C10.597”, “C10.597.606” and “C10.597.606.525”, were extracted as categories by parsing its MeSH TN “C10.597.606.525.100”. In addition, based on the MeSH Tree Structures, we also re-visited the document to query whether the document had a more specific (hyponym) or general (hypernym) disease than the target disease. For example, “retrograde amnesia (C10.597.606.525.100.150)” is more specific than “amnesia (C10.597.606.525.100)”. Therefore, we were able to extract two binary features for each disease to denoting whether the source document has diseases more specific or general than the target disease.

Document-level classifier

The C _D classifier utilized document-level information as well as domain knowledge to classify the relations between chemicals and diseases at the document level. The C _D used above three groups of features from the C _s . As shown in Table 1 , compared to C _s , C _D also used the co-occurrence information of the target chemical and disease entities, but did not use the uni- and bi-gram features as in the C _s .

Machine learning

For both sentence- and document-level relation classification, we employed SVMs algorithm and used the LIBSVM ( 29 ) package for SVMs implementation.

Training corpus generation

The training of the document-level classifier was straightforward as the relations were annotated at the document level in the gold standard. However, we needed to construct sentence-level annotations to train the sentence-level classifier. We extracted all sentences that had at least one chemical–disease pair, denoted as $< c, d >$ , and generated the sentence-level annotations according to the document-level annotations by following a simple rule: a sentence-level relation pair $< c, d >$ would be annotated as “ true ” if and only if the $< c, d >$ pair is in the document-level annotations; otherwise, the < c, d > pair would be annotated as “ false ”.

Experiments and evaluation

We developed our machine-learning models using the training set and optimized the parameters using the development set. Then we combined the training and the development datasets to build the final models.

NER and normalization

We tried two different approaches: ( 1 ) NER-S: trained two separate CRFs models, one for disease entities and the other for chemical entities, and ( 2 ) NER-U: trained a unified CRFs model for both disease and chemical entities. In the NER-S approach, additional external corpora were also investigated. We used the BioCreative IV CHEMDNER corpus ( 30 ) for chemical NER and the NCBI Disease Corpus ( 31 ) for disease NER.

Relation extraction

The CID task in the BioCreative V CDR Track was designed to extract CDRs in an end-to-end setting, in which predicted chemicals and diseases were provided as inputs to the relation extraction system. To better understand the performance of the relation extraction system, we also evaluated and reported the performance of the CDR extraction system using the gold-standard chemical and disease entities as the inputs. Three different strategies for generating chemical–disease pairs were used: ( 1 ) $C_{S}$ , which applies $C_{S}$ for those $< c, d >$ pairs located in the same sentences only; ( 2 ) $C_{D}$ , which applies $C_{D}$ for all $< c, d >$ pairs in the same document; and ( 3 ) C _S + C _D , a combination strategy of C _S and C _D in which the union set of the two classifiers’ predictions were used as our system’s predictions. Moreover, we evaluated the contribution of features from different domain knowledge bases.

Evaluation metrics

The evaluation metrics of the CDR track include F -score (F), precision (P) and recall (R). For DNER, the evaluation scores were calculated based on tuples of the document ID and the disease concept ID. In addition to the concept-level evaluation scores, we further reported P, R and F on the mention-level using exact span matching. This evaluation setting was also used for CNER. For the CID task, the evaluation scores were calculated based on 3-tuple composed of document ID, chemical and disease concept ID. Please refer to the task description ( 22 ) for more details.

Results

Table 3 shows the performance of the CD-REST on chemical and disease NER and normalization task. The NER-S approach, which trained individual models for CNER and DNER, outperformed the NER-U approach that combined the chemical and disease entities recognition in one model. The best performance of DNER was achieved by the NER-S approach that used the CDR corpus only for model training. The best performance for CNER was achieved by the NER-S approach that used both the CDR corpus and the BioCreative IV CHEMDNER corpus for model training.

Table 3.

Performance of the CD-REST in the CNER and DNER tasks on the test set with different approaches

Task	# Run	Approach	Training dataset	Concept-level			Mention-level
Task	# Run	Approach	Training dataset	P	R	F	P	R	F
CNER	1	U	V	0.8850	0.9115	0.8980	0.9278	0.8858	0.9063
	2	S	V	0.8941	0.9112	0.9027	0.9339	0.8819	0.9072
	3	S	V+IV	0.9010	0.9199	0.9103	0.9376	0.8698	0.9024
DNER	1	U	V	0.8254	0.8395	0.8324	0.8648	0.8230	0.8434
	2*	S	V	0.8312	0.8395	0.8353	0.8689	0.8210	0.8443
	3	S	V+N	0.8158	0.8355	0.8255	0.8636	0.8232	0.8429

Task	# Run	Approach	Training dataset	Concept-level			Mention-level
Task	# Run	Approach	Training dataset	P	R	F	P	R	F
CNER	1	U	V	0.8850	0.9115	0.8980	0.9278	0.8858	0.9063
	2	S	V	0.8941	0.9112	0.9027	0.9339	0.8819	0.9072
	3	S	V+IV	0.9010	0.9199	0.9103	0.9376	0.8698	0.9024
DNER	1	U	V	0.8254	0.8395	0.8324	0.8648	0.8230	0.8434
	2*	S	V	0.8312	0.8395	0.8353	0.8689	0.8210	0.8443
	3	S	V+N	0.8158	0.8355	0.8255	0.8636	0.8232	0.8429

U: the NER-U approach; S: the NER-S approach; V: the BioCreative V CDR Corpus; IV: the BioCreative IV CHEMDNER Corpus; N: the NCBI Disease Corpus. * was the best run the CD-REST achieved on DNER task in the CDR challenge. DNER Run #3 was not submitted to the challenge. Where applicable, the best performance in each category is highlighted in bold.

Table 3.

Performance of the CD-REST in the CNER and DNER tasks on the test set with different approaches

Task	# Run	Approach	Training dataset	Concept-level			Mention-level
Task	# Run	Approach	Training dataset	P	R	F	P	R	F
CNER	1	U	V	0.8850	0.9115	0.8980	0.9278	0.8858	0.9063
	2	S	V	0.8941	0.9112	0.9027	0.9339	0.8819	0.9072
	3	S	V+IV	0.9010	0.9199	0.9103	0.9376	0.8698	0.9024
DNER	1	U	V	0.8254	0.8395	0.8324	0.8648	0.8230	0.8434
	2*	S	V	0.8312	0.8395	0.8353	0.8689	0.8210	0.8443
	3	S	V+N	0.8158	0.8355	0.8255	0.8636	0.8232	0.8429

Task	# Run	Approach	Training dataset	Concept-level			Mention-level
Task	# Run	Approach	Training dataset	P	R	F	P	R	F
CNER	1	U	V	0.8850	0.9115	0.8980	0.9278	0.8858	0.9063
	2	S	V	0.8941	0.9112	0.9027	0.9339	0.8819	0.9072
	3	S	V+IV	0.9010	0.9199	0.9103	0.9376	0.8698	0.9024
DNER	1	U	V	0.8254	0.8395	0.8324	0.8648	0.8230	0.8434
	2*	S	V	0.8312	0.8395	0.8353	0.8689	0.8210	0.8443
	3	S	V+N	0.8158	0.8355	0.8255	0.8636	0.8232	0.8429

Table 4 shows the performance of different approaches on the CID task in the end-to-end setting and the gold-standard setting. The C _S₊ C _D approach outperformed individual classifiers ( C _S or C _D ), achieving the highest F -scores of 0.5853 in the end-to-end setting and 0.6716 when gold-standard chemical and disease entities were used.

Table 4.

The performance of the CD-REST in the CID task using the end-to-end setting (CNER #1, DNER #1) and the gold-standard setting on the test set with different approaches. Where applicable, the best performance in each category is highlighted in bold.

Approach	End-to-end			Gold-standard
Approach	P	R	F	P	R	F
C_S	0.6424	0.4381	0.5209	0.6763	0.5487	0.6059
C_D	0.6412	0.5047	0.5648	0.6836	0.6182	0.6493
C_S + C_D	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Approach	End-to-end			Gold-standard
Approach	P	R	F	P	R	F
C_S	0.6424	0.4381	0.5209	0.6763	0.5487	0.6059
C_D	0.6412	0.5047	0.5648	0.6836	0.6182	0.6493
C_S + C_D	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Table 4.

Approach	End-to-end			Gold-standard
Approach	P	R	F	P	R	F
C_S	0.6424	0.4381	0.5209	0.6763	0.5487	0.6059
C_D	0.6412	0.5047	0.5648	0.6836	0.6182	0.6493
C_S + C_D	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Approach	End-to-end			Gold-standard
Approach	P	R	F	P	R	F
C_S	0.6424	0.4381	0.5209	0.6763	0.5487	0.6059
C_D	0.6412	0.5047	0.5648	0.6836	0.6182	0.6493
C_S + C_D	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Table 5 shows the performance of the CD-REST on the test set with features from different knowledge base features, based on the best performing strategy ( C _S + C _D ). All features from knowledge bases improved the system’s performance. It is also not surprising that CTD improved the performance most, comparing with other knowledge bases, as CTD is the knowledge base for chemical-induced diseases.

Table 5.

Results of the CD-REST with + approach on the test set using the end-to-end setting (CNER Run #1, DNER Run #1) and the gold-standard setting, when different sets of knowledge base features were used. The best results are highlighted in bold.

Feature set	End-to-end			Gold-standard
Feature set	P	R	F	P	R	F
Entity + Context	0.5160	0.3640	0.4268	0.5960	0.4400	0.5073
Entity + Context + MeSH	0.5155	0.4222	0.4641	0.5842	0.5140	0.5469
Entity + Context + MeSH + MEDI	0.5206	0.4278	0.4696	0.5953	0.5244	0.5576
Entity + Context + MeSH + MEDI + SIDER	0.5308	0.4372	0.4794	0.6086	0.5310	0.5671
Entity + Context + MeSH + MEDI + SIDER + CTD	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Feature set	End-to-end			Gold-standard
Feature set	P	R	F	P	R	F
Entity + Context	0.5160	0.3640	0.4268	0.5960	0.4400	0.5073
Entity + Context + MeSH	0.5155	0.4222	0.4641	0.5842	0.5140	0.5469
Entity + Context + MeSH + MEDI	0.5206	0.4278	0.4696	0.5953	0.5244	0.5576
Entity + Context + MeSH + MEDI + SIDER	0.5308	0.4372	0.4794	0.6086	0.5310	0.5671
Entity + Context + MeSH + MEDI + SIDER + CTD	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Table 5.

Feature set	End-to-end			Gold-standard
Feature set	P	R	F	P	R	F
Entity + Context	0.5160	0.3640	0.4268	0.5960	0.4400	0.5073
Entity + Context + MeSH	0.5155	0.4222	0.4641	0.5842	0.5140	0.5469
Entity + Context + MeSH + MEDI	0.5206	0.4278	0.4696	0.5953	0.5244	0.5576
Entity + Context + MeSH + MEDI + SIDER	0.5308	0.4372	0.4794	0.6086	0.5310	0.5671
Entity + Context + MeSH + MEDI + SIDER + CTD	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Feature set	End-to-end			Gold-standard
Feature set	P	R	F	P	R	F
Entity + Context	0.5160	0.3640	0.4268	0.5960	0.4400	0.5073
Entity + Context + MeSH	0.5155	0.4222	0.4641	0.5842	0.5140	0.5469
Entity + Context + MeSH + MEDI	0.5206	0.4278	0.4696	0.5953	0.5244	0.5576
Entity + Context + MeSH + MEDI + SIDER	0.5308	0.4372	0.4794	0.6086	0.5310	0.5671
Entity + Context + MeSH + MEDI + SIDER + CTD	0.6186	0.5553	0.5853	0.6580	0.6857	0.6716

Table 6 shows performance of the CD-REST on the CID task using different combinations of CNER and DNER. Among all the combinations, the Run #1 achieved the highest F 1-score of 0.5853. To our surprise, the Run #3, which combined the best CNER module (CNER #2) and the best DNER module (DNER #2), was outperformed by Run #1. Therefore, we further examined the two runs by calculating the “relation coverage”—defined as the number of gold standard relations covered by the predicted entities. A relation is labelled as covered if both the chemical entity and the disease entity were identified. We compared the relation coverage of the two runs based on the gold standard and found that the Run #1 covered 10 more relations than the Run #3, suggesting that the Run #1 could capture more in-relation entities than the Run #3.

Table 6.

The performance of the CD-REST with C_S + C_D approach on the CID task using different combinations of CNER and DNER. Where applicable, the best performance in each category is highlighted in bold.

#	# CNER Run	# DNER Run	P	R	F
1	1	1	0.6186	0.5553	0.5853
2	2	2	0.6216	0.5516	0.5845
3	3	2	0.6255	0.5422	0.5809
4	2	3	0.6193	0.5525	0.5840
5	3	3	0.6231	0.5413	0.5793

#	# CNER Run	# DNER Run	P	R	F
1	1	1	0.6186	0.5553	0.5853
2	2	2	0.6216	0.5516	0.5845
3	3	2	0.6255	0.5422	0.5809
4	2	3	0.6193	0.5525	0.5840
5	3	3	0.6231	0.5413	0.5793

Table 6.

The performance of the CD-REST with C_S + C_D approach on the CID task using different combinations of CNER and DNER. Where applicable, the best performance in each category is highlighted in bold.

#	# CNER Run	# DNER Run	P	R	F
1	1	1	0.6186	0.5553	0.5853
2	2	2	0.6216	0.5516	0.5845
3	3	2	0.6255	0.5422	0.5809
4	2	3	0.6193	0.5525	0.5840
5	3	3	0.6231	0.5413	0.5793

#	# CNER Run	# DNER Run	P	R	F
1	1	1	0.6186	0.5553	0.5853
2	2	2	0.6216	0.5516	0.5845
3	3	2	0.6255	0.5422	0.5809
4	2	3	0.6193	0.5525	0.5840
5	3	3	0.6231	0.5413	0.5793

During the challenge, we developed a rule-based post-processing module, which improved the performance on the development corpus. However, adding the post-processing module actually hurt the performance. Our best submission in the challenge (using strategy in CID Run #1 with the post-processing module) achieved the highest F -score (0.5703) among all teams, which is lower than the score reported in this article.

We examined the efficiency of CD-REST system using a computer with 32 GB RAM and a 3.7 GHz 4-core processor. It took about 450 s to process the whole test set for relation extraction. The average processing time for one abstract was <1 s. However, the web service took more time since it only processed one document per request ( 22 ).

Discussion

In this study, we developed CD-REST, an end-to-end system to extract chemical-induced disease relations from biomedical literature by incorporating domain knowledge into machine-learning models. Our system achieved the best performance among 18 participating teams and 46 submitted runs in the challenge of the BioCreative V CDR Track. Our results demonstrated the feasibility of incorporating domain knowledge into machine-learning-based approaches for CDR extraction.

System performance comparison and analysis

NER-S vs. NER-U

As shown in Table 3 , NER-S, which trained individual classifiers for chemicals and disease, respectively, outperformed the NER-U approach, which combines chemical and disease entities into one model. We noticed that the NER-S approach always achieved a higher precision while maintaining a comparable recall. In general, a unified NER model built for all entities will benefit from the dependencies among different types of entities. However, the unified model performed worse in this study, probably due to the low dependence between the chemical and disease entities.

Performance comparison among C _S , C _D and their combination

In our experiments, the document-level classifier $C_{D}$ outperformed the sentence-level classifier $C_{S}$ in both the end-to-end setting and the gold-standard settings (see Table 4 ). One obvious reason is that the $C_{S}$ discarded the chemical-induced disease pairs across multiple sentences, which accounted for ∼30% of the CID relations in the corpus ( 24 ). Moreover, the automatically generated corpus for the $C_{S}$ approach was based on a simple assumption, and it contained many false positive instances. The combination of $C_{D}$ and $C_{S}$ achieved the highest F -score of 0.5853 and 0.6716 on the end-to-end setting and the gold-standard setting, respectively ( Table 4 ). Regarding the performance of individual classifiers, the $C_{S}$ achieved an F -score of 0.5209 and the $C_{D}$ achieved an F -score of 0.5648, respectively. These F -scores were still among the top-ranked submissions in the BioCreative V CDR challenge.

The contribution of features from domain knowledge bases

The features derived from domain-specific knowledge bases improved the CDR extraction performance. As illustrated in Table 5 , domain knowledge played a critical role in CDR extraction. The contribution from different knowledge bases varied. The features derived from the CTD yielded the most improvement, which is not surprising, as CTD is the database for chemical-induced diseases. We also noticed that the category-related features derived from the MeSH improved performance on the recall.

Error analysis

For the NER and normalization task, the incorrectly recognized boundaries of mentions caused a significant performance drop, especially for disease entities. Our system achieved an F -score of over 0.90 on disease recognition under relaxed matching which allows for boundaries overlapping. Most of the boundary errors were caused by missing modifiers in disease mentions, such as course and severity. For example, our system detected “hepatic failure” instead of “end-stage hepatic failure,” “hepatitis” instead of “acute hepatitis” and “liver injury” instead of “drug-induced liver injury.” One limitation of our system is that we did not handle abbreviations well at this time. For example, in “indomethacin (IDM)”, although the long form mention “IDM” was correctly recognized, our system missed “IDM” as a chemical in following sentences. Errors caused by missing abbreviations occurred for both diseases and chemicals.

There are various types of errors for the CID task. First, implicit relations that are across multiple sentences are difficult to detect. Another type of error was related to disease granularities. For example, there was explicit evidence in the abstract that chemical X induced disease Y. However, in gold standard, a relation pair <X, Z> was extracted instead of <X, Y> in many cases, because Z was a more specific disease of Y. Moreover, the errors propagated from the NER and normalization step also reduced the performance of the end-to-end system. As we seen from Table 4 , the performance of the system increased ∼10%, when the gold-standard entities were used.

Conclusion

In this study, we incorporated machine-learning algorithms with domain-specific knowledge to build an end-to-end system for chemical-induced disease relation extraction, which consists of a disease and chemical NER and normalization module and a chemical-induced disease relation extraction module. In the BioCreative V CDR Track, our system achieved the highest performance on the CID task, indicating the feasibility of the proposed approaches for chemical-induced disease relation extraction.

Acknowledgements

This project was supported by Cancer Prevention & Research Institute of Texas (CPRIT) Rising Star Award (CPRIT R1307), the National Library of Medicine of the National Institutes of Health under Award Number 2R01LM010681-05, the National Institute of General Medical Sciences under Award Numbers 1R01GM103859 and 1R01GM102282. The first author (J.X.) is partially supported by the National Nature and Science Foundation of China (NSFC 61203378).

Conflict of interest . None declared.

References

Davis

A.P.

Grondin

C.J.

Lennon-Hopkins

. et al. . (

2015

)

The Comparative Toxicogenomics Database’s 10th year anniversary: update 2015

Nucleic Acids Res

D914

–

D920

Arighi

C.N.

Roberts

P.M.

Agarwal

. et al. . (

2011

)

BioCreative III interactive task: an overview

BMC Bioinformatics

S4.

Wiegers

T.C.

Davis

A.P.

Mattingly

C.J.

(

2012

)

Collaborative biocuration—text-mining development task for document prioritization for curation

Database

2012

bas037

Wiegers

T.C.

Davis

A.P.

Mattingly

C.J.

(

2014

)

Web services-based text-mining demonstrates broad impacts for interoperability and process simplification

Database

2014

bau050

Savova

G.K.

Masanz

J.J.

Ogren

P.V

. et al. . (

2010

)

Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications

J. Am. Med. Inf. Assoc

507

–

513

Aronson

A.R.

Lang

F.M.

(

2010

)

An overview of MetaMap: historical perspective and recent advances

J. Am. Med. Inf. Assoc

229

–

236

Bodenreider

(

2004

)

The unified medical language system (UMLS): integrating biomedical terminology

Nucleic Acids Res

D267

–

D270

Lowe

D.M.

Sayle

R.A.

(

2015

)

LeadMine: a grammar and dictionary driven approach to entity recognition

J. Cheminfo

S5.

Jiang

Chen

Liu

. et al. . (

2011

)

A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries

J. Am. Med. Inf. Assoc

601

–

606

Leaman

Wei

C.H.

(

2015

)

tmChem: a high performance approach for chemical named entity recognition and normalization

J. Cheminfo

S3.

Rocktaschel

Weidlich

Leser

(

2012

)

ChemSpot: a hybrid system for chemical named entity recognition

Bioinformatics

1633

–

1640

Zhang

Wang

Tang

. et al. . (

2014

) UTH_CCB: a report for SemEval 2014—Task 7 analysis of clinical text. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) . ACL, Dublin, Ireland, pp. 802–806.

Bach

Badaskar

(

2007

)

A Review of Relation Extraction

Language Technologies Institute, Carnegie Mellon University

Google Preview

Zhou

Zhong

(

2014

)

Biomedical relation extraction: from binary to complex

Comput. Math. Methods Med

2014

298473.

Chen

E.S.

Hripcsak

. et al. . (

2008

)

Automated acquisition of disease drug knowledge from biomedical and clinical documents: an initial study

J. Am. Med. Inf. Assoc

–

Mao

J.J.

Chung

Benton

. et al. . (

2013

)

Online discussion of drug side effects and discontinuation among breast cancer survivors

Pharmacoepidemiol. Drug Saf

256

–

262

Khoo

C.S.G.

Chan

Niu

(

2000

) Extracting causal knowledge from a medical database using graphical patterns. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics . Association for Computational Linguistics, Hong Kong, pp. 336–343.

Ben Abacha

Zweigenbaum

(

2011

)

Automatic extraction of semantic relations between medical entities: a rule based approach

J. Biomed. Semant

S4.

Wang

(

2014

)

Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature

J. Biomed. Inf

191

–

199

Rosario

Hearst

M.A.

(

2004

) Classifying semantic relations in bioscience texts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics . Association for Computational Linguistics, Barcelona, Spain, pp. 430.

Gurulingappa

Mateen-Rajput

Toldo

(

2012

)

Extraction of potential adverse drug events from medical case reports

J. Biomed. Semant

15.

Wei

C.H.

Peng

Leaman

. et al. . (

2015

) Overview of the BioCreative V chemical disease relation (CDR) task. In: the fifth BioCreative Challenge Evaluation Workshop , Sevilla, Spain.

Lafferty

J.D.

McCallum

Pereira

F.C.N.

(

2001

) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning . Morgan Kaufmann Publishers Inc., Burlington, pp. 282–289.

Sun

Johnson

. et al. . (

2015

) Annotating chemicals, diseases, and their interactions in biomedical literature. In: The fifth BioCreative Challenge Evaluation Workshop , Sevilla, Spain.

Collobert

Weston

Bottou

. et al. . (

2011

)

Natural language processing (Almost) from scratch

J. Mach. Learn. Res

2493

–

2537

Shang

(

2014

)

Integrating Domain Knowledge to Improve Signal Detection from Electronic Health Records for Pharmacovigilance

School of Biomedical Informatics. The University of Texas Health Science Center at Houston

Houston

Google Preview

Wei

W.Q.

Cronin

R.M.

. et al. . (

2013

)

Development and evaluation of an ensemble resource linking medications to their indications

J. Am. Med. Inf. Assoc

954

–

961

Kuhn

Campillos

Letunic

. et al. . (

2010

)

A side effect resource to capture phenotypic effects of drugs

Mol. Syst. Biol

343.

Chang

C.C.

Lin

C.J.

(

2011

)

LIBSVM. A library for support vector machines

ACM Trans. Intell. Syst. Technol

–

Krallinger

Leitner

Rabal

. et al. . (

2015

)

CHEMDNER: the drugs and chemical names extraction challenge

J. Cheminf

S1.

Dogan

R.I.

Leaman

(

2014

)

NCBI disease corpus: a resource for disease name recognition and concept normalization

J. Biomed. Inf

–