Abstract

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof. For NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission (F1 scores: NER 93.53, +3.09; EL 83.87, +9.73; RE 46.18, +15.67; ND 38.86, +14.90). While the performance of the NER and EL models is reasonably high, the RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/.

Database URL: https://github.com/janinaj/e2eBioMedRE/

Introduction

Biomedical publications are a rich and valuable source of scientific knowledge. Keeping up with the ever-growing literature and generating insights from it is an essential yet challenging task in the workflows of researchers and other stakeholders of biomedical science. Automated methods for extracting and organizing the literature knowledge could improve the efficiency of such workflows and expedite the construction of knowledge bases, accelerating scientific discovery and enhancing understanding of disease and health [1]. Natural language processing (NLP) and text mining methods have long been proposed for information extraction from the literature, and substantial progress has been made over the last two decades in tasks such as named entity recognition (NER) and relation extraction (RE) [2]. Despite the progress in certain areas, the use of biomedical NLP tools in real-world applications remains limited, partly due to modest performance and methods that do not generalize well [3].

Shared task competitions, such as the BioCreative challenges [4, 5] and the BioNLP event extraction shared tasks [6, 7], have stimulated much of the progress made in biomedical NLP by providing sizable benchmark corpora and bringing together communities to improve the state of the art in specific NLP tasks. BioCreative, launched in 2004, has been one of the long-term, sustained efforts to advance biomedical NLP through challenges. It has led to the development of many datasets and NLP approaches over the years that address various information extraction tasks related to the biology literature, including gene mention identification [8] and extraction of chemical-induced diseases [9]. In its latest edition (BioCreative VIII), one of the shared tasks (BioRED Track) focused on biomedical NER and RE, as well as named entity linking (EL) and novelty detection (ND). This track expands upon earlier BioCreative competitions that addressed one or a small number of entity and relation types [8, 9] by considering the BioRED dataset [10], which includes six entity types (Disease, Chemical, Gene, Species, Cell Line, and Variant) and eight relation types that hold between these entity types (Association, Positive Correlation, Negative Correlation, Bind, Comparison, Conversion, Cotreatment, and Drug Interaction). The novelty of each relation is also considered. Extracting such knowledge from the literature could support many downstream tasks, such as biocuration [11], pharmacovigilance [12], and literature-based discovery [13].

In this study, we build on our submission to the BioCreative VIII BioRED Track, which demonstrated competitive performance [14], and present an enhanced end-to-end system that performs all four tasks in a pipeline architecture. For each task, we enhance our previous approach or experiment with alternative approaches. Specifically, for NER, we compare our token classification model based on Bidirectional Encoder Representations from Transformers (BERT) to a BERT-based span classification model, and retrain the better-performing model by incorporating additional datasets into the training process. For EL, we use a hybrid approach that includes a convolutional neural network (CNN) for diseases and chemicals [15] and an external entity linker, PubTator 3.0 [16], for other types. For RE and ND, we extend the BERT-based Princeton University Relation Extraction system (PURE) model [17] to handle document-level relations. The combination of the token classification model for NER, CNN-augmented EL, and the modified PURE model for RE and ND yields our best overall results, showing a substantial improvement over the shared task submission. Using additional datasets in the training process appears to account for most of the performance improvements.

Related work

NER and RE from the biomedical literature are foundational tasks in biomedical NLP. Some established systems aim to provide broad coverage of NER and RE; they have focused on rule-based methods and leveraged rich semantic resources, such as the Unified Medical Language System [18], to extract concepts and semantic relation triples from biomedical abstracts [19, 20]. Most current NER and RE approaches are based on supervised machine learning and are trained on corpora that often include a small number of entity or relation types, which limits their usefulness for practical purposes [10]. In recent years, deep learning architectures have led to state-of-the-art performance on biomedical NER and RE corpora. These methods often involve fine-tuning domain-specific pretrained language models based on the Transformer architecture, such as BioBERT [21] and PubMedBERT (also known as BioMedBERT) [22], on corpora annotated for biomedical entities [23, 24] and relations [25, 26]. As these corpora are limited in scope, broad-coverage rule-based tools, such as MetaMap [19] and SemRep [20], have remained popular for downstream tasks in the biomedical domain despite their modest performance. More recently, in-context learning based on generative large language models (LLMs), such as GPT-3, has also been explored, although these methods generally underperform fine-tuning approaches [27–29].

Current methods for biomedical NER typically use a token classification approach based on the BIO scheme or formulate NER as a span classification task, where spans up to a particular length are considered independently as input [30]. State-of-the-art results on NER benchmark datasets have been reported using domain-specific BERT models [22, 31]. It has been acknowledged that models trained on a single dataset do not generalize well [3, 32], and various methods, such as multi-task learning on several datasets [33], have been explored to address this problem. In contrast, Luo et al. [32] recently merged multiple datasets into a single sequence labeling task via task-oriented tagging labels and used a PubMedBERT-CRF model to obtain state-of-the-art NER results on the BioRED dataset [10].

Supervised biomedical RE is generally formulated as a binary or multi-class classification task, where domain-specific BERT models provide the foundation for classification of entity pairs into one of the predefined relation types [2]. Training corpora for biomedical RE often focus on one or a small number of relation types [25, 26, 34]. BioRED [10] is more comprehensive and includes six entity and eight relation types, although some of these relation types are rare in the dataset. Lai et al. [35] combined and aligned several RE datasets with BioRED and have shown improvement in model performance, particularly for the rare labels. While RE is often performed in a pipeline approach where NER is followed by RE [17], joint learning approaches have also been explored [15, 36, 37].

EL is a particularly important task in biomedical NLP, as it normalizes entity mentions to standard concept identifiers in knowledge bases and helps aggregate the evidence across documents [38]. Early work on biomedical EL focused on rule-based approaches [19, 39, 40], which are efficient and interpretable, but generally underperform more recent supervised machine learning approaches, such as DNorm [41]. With the advent of deep learning, neural network architectures, such as CNN-based ranking [42], Bidirectional Long Short-Term Memory [43], and BERT-based models [44, 45], have been applied to biomedical EL, yielding improved performance.

Some web-based tools and Application Programming Interfaces have enabled rapid NER and EL for biocuration of scientific articles. For example, PubTator [11] provides easy access to named entities and their database identifiers in PubMed abstracts, and has been extended to full-text articles in PubMed Central (PubTator Central [46]) and relations (PubTator 3.0 [16]). These tools focus on a limited set of entity and relation types.

Methods

We propose a pipeline approach for our end-to-end RE system, where the four tasks (NER, EL, RE, and ND) are performed sequentially (Fig. 1). We describe our approach for each step below.

Figure 1. An end-to-end pipeline for RE and ND. Article title: Ca2+ dependence of the Ca2+-selective TRPV6 channel (abstract omitted for clarity).

Dataset

We primarily use the BioRED dataset [10, 47] for training and evaluation. The latest version of the BioRED dataset [47], in the BioCreative VIII BioRED Track, contains a total of 1000 PubMed abstracts split into training (600) and test sets (400). We split the training set in accordance with the track; 500 abstracts are used as the training set and 100 abstracts serve as the development set.

Table 1 shows the distribution of entity mentions for each data split. The training set contains 3280 unique entities (based on controlled vocabulary identifiers): 1379 (42.04%) are genes, 675 (20.58%) are diseases, 580 (17.68%) are chemicals, 546 (16.65%) are sequence variants, 43 (1.31%) are species, and 57 (1.74%) are cell lines. The development set contains 985 unique entities of which 400 (40.61%) are genes, 245 (24.87%) are diseases, 171 (17.36%) are chemicals, 137 (13.91%) are sequence variants, 11 (1.12%) are species, and 21 (2.13%) are cell lines. Test set annotations are not publicly available; therefore, we use the counts from the shared task overview paper [47], which does not provide unique entity counts.

Table 1. Distribution of entity mentions in the BioRED dataset. Knowledge base refers to the vocabulary used for grounding the entities of specific types.

| Entity type | Knowledge base | Train | Dev | Test | Total |
|---|---|---|---|---|---|
| Gene | NCBI Gene [62] | 5517 | 1180 | 5728 | 12 425 |
| Disease | MEDIC [56] | 4628 | 917 | 3641 | 9186 |
| Chemical | MeSH [63] | 3675 | 754 | 2592 | 7021 |
| Species | NCBI Taxonomy [64] | 1799 | 393 | 1774 | 3966 |
| Variant | dbSNP [65] | 1140 | 241 | 1525 | 2906 |
| Cell line | Cellosaurus [66] | 125 | 50 | 140 | 315 |
| Total | | 16 884 | 3535 | 15 400 | 35 819 |

Table 2 shows the distribution of relation types for the training, development, and test sets. Associations, positive correlations, and negative correlations make up 95.74% of all relations; the other relation types are rare in the dataset. It is also important to note that BioRED relations do not have directionality, in contrast to the relations targeted by most other RE models.

Table 2. Distribution of relations by type in the BioRED dataset

| Relation type | Train | Dev | Test | Total |
|---|---|---|---|---|
| Association | 2752 | 635 | 2759 | 6146 |
| Positive correlation | 1441 | 325 | 1751 | 3517 |
| Negative correlation | 979 | 171 | 1192 | 2342 |
| Cotreatment | 41 | 14 | 172 | 227 |
| Bind | 80 | 9 | 136 | 225 |
| Comparison | 33 | 6 | 13 | 52 |
| Conversion | 3 | 1 | 13 | 17 |
| Drug interaction | 11 | 2 | 0 | 13 |
| Total | 5340 | 1163 | 6036 | 12 539 |

Named entity recognition

BERT-based token classification model

We first framed the NER task as a token classification problem, utilizing the BIO scheme to mark each token as beginning (B-), inside (I-), or outside of an entity (O), with corresponding type information such as B-Chemical and I-CellLine. We fine-tuned the pretrained BioMedBERT model [22] and used a softmax layer to predict token labels. The maximum sequence length was 512 tokens, and cross-entropy was used as the loss function. We performed a grid search to optimize hyperparameters: learning rate (1e-04, 2e-04, 3e-04, 1e-05, 2e-05, 3e-05, 1e-06, 2e-06, 3e-06), batch size (8, 16), and epochs (1–50). The optimal combination was a learning rate of 3e-05, a batch size of 16, and 22 epochs. The training and development sets described above were used for training and evaluation, respectively.
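For illustration, the following is a minimal sketch of this setup using the HuggingFace transformers library; the checkpoint identifier, label set, and example sentence are assumptions for demonstration, not our exact training script.

```python
# Minimal sketch of BIO token classification (illustrative, not our exact
# training script); assumes the HuggingFace transformers library.
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO label set for the six BioRED entity types: O plus B-/I- per type.
types = ["Gene", "Disease", "Chemical", "Species", "Variant", "CellLine"]
labels = ["O"] + [f"{p}-{t}" for t in types for p in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

# Checkpoint name is an assumption (BioMedBERT's hub id may differ).
ckpt = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    ckpt, num_labels=len(labels), id2label=id2label, label2id=label2id)

# Encode one abstract, truncated to the 512-token limit; during training,
# word-level BIO tags are aligned to subwords and scored with cross-entropy.
enc = tokenizer("Mutations in BRCA2 confer susceptibility to breast cancer.",
                truncation=True, max_length=512, return_tensors="pt")
logits = model(**enc).logits                      # (1, seq_len, num_labels)
pred = [id2label[i] for i in logits.argmax(-1)[0].tolist()]
```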

Inspired by the finding that combining multiple biomedical NER datasets can improve model performance [32], we also sought to incorporate multiple annotated datasets during training. We used the same datasets as All-in-one Named Entity Recognition (AIONER) [32], as our focus is on the same entity types: GNormPlus [48] and NLM-Gene [49] for genes; NCBI Disease [23] and BC5CDR [25] for diseases; NLM Chem [49] and BC5CDR [25] for chemicals; Species-800 [50] for species; BioID [51] for cell lines; and tmVar3 [52] for variants. We excluded Linnaeus [53], a species dataset, because it includes full-text publications, which exceed the maximum sequence length, and because a lower F1 score was reported when it was added [32]. This resulted in a total of nine datasets, including the BioRED dataset. Both the training and test sets of these additional datasets were used for training the final model.

With the exception of BioRED, which has annotations for all entity types, and BC5CDR, which has chemical and disease annotations, all other datasets include a single entity type. This may confuse the model, as relevant entity mentions may be unannotated in a dataset that does not focus on that entity type. Consider the following example from the GNormPlus dataset: ‘Germline mutations of the human BRCA2 gene confer susceptibility to breast cancer.’ As GNormPlus includes gene annotations only, the disease ‘breast cancer’ is not annotated. To mitigate this issue, we employed the approach outlined in Luo et al. [32], enclosing the input sentence in special tokens that indicate the entity types for which the model should generate predictions (e.g. <ALL> … </ALL>, <GENE> … </GENE>, <CHEMICAL-DISEASE> … </CHEMICAL-DISEASE>). In addition, instead of using a single outside token (O), we used task-specific outside labels (e.g. O-ALL, O-GENE, O-CHEMICAL-DISEASE). The ALL designation is reserved for the BioRED dataset, which contains annotations for all entity types.
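The tagging scheme can be illustrated with a small helper (a sketch under our reading of Luo et al. [32]; the function name is illustrative):

```python
# Sketch of dataset-aware tagging for multi-corpus NER training
# (illustrative helper, following the scheme described above).
def wrap_example(tokens, bio_tags, task_tag):
    """Enclose a sentence in task tokens and specialize the O label.

    task_tag is, e.g., "ALL" for BioRED, "GENE" for GNormPlus, or
    "CHEMICAL-DISEASE" for BC5CDR.
    """
    # Task-specific outside labels (O-GENE, etc.) keep unannotated
    # entity types in single-type corpora from penalizing the model.
    o_tag = f"O-{task_tag}"
    wrapped_tokens = [f"<{task_tag}>"] + tokens + [f"</{task_tag}>"]
    wrapped_tags = ([o_tag] + [t if t != "O" else o_tag for t in bio_tags]
                    + [o_tag])
    return wrapped_tokens, wrapped_tags

tokens = ["BRCA2", "mutations", "cause", "breast", "cancer"]
tags = ["B-Gene", "O", "O", "O", "O"]  # GNormPlus annotates genes only
print(wrap_example(tokens, tags, "GENE"))
```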

PURE span classification model

We also explored a span classification approach to NER. This involves passing token spans, up to a predefined length, to the model for named entity type classification. For this task, we leveraged an existing NER model, PURE [17], which uses BERT to generate contextualized token representations. The contextualized embeddings of candidate token spans are then fed into a two-layer feedforward network for classification. We set the maximum token span length to 30, reflecting the maximum entity span length in the training set. Additionally, we conducted hyperparameter tuning, exploring various learning rates (1e-04, 2e-04, 3e-04, 1e-05, 2e-05, 3e-05, 1e-06, 2e-06, 3e-06) and numbers of epochs (1–50). A learning rate of 1e-05 and 39 epochs yield the best F1 score on the development set. Throughout our experiments, we kept the batch size constant at 32.
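Candidate span enumeration, the core of this formulation, can be sketched as follows (the two-layer classifier over span embeddings is omitted):

```python
# Sketch of candidate span enumeration for span classification NER,
# as in PURE [17]; the span classifier itself is omitted.
def enumerate_spans(tokens, max_len=30):
    """Yield all (start, end) token spans of up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start, min(start + max_len, len(tokens))):
            yield start, end  # inclusive token boundaries

tokens = "Ca2+ dependence of the Ca2+ - selective TRPV6 channel".split()
candidates = list(enumerate_spans(tokens))
# Each candidate span is classified as one of the six entity types or none.
```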

Post-processing named entity predictions

Lastly, to further improve the predicted entities, we applied the following post-processing rules (sketched in code after the list):

  1. If two entity mentions of the same type appear consecutively with no whitespace in between, we combine them into a single entity mention (e.g. ‘A’ and ‘(1)-adenosine receptor’ vs ‘A(1)-adenosine receptor’).

  2. If two disease entities are separated by a single character that is not a slash (/), we combine them (e.g. ‘benign’ and ‘tumor’ vs ‘benign tumor’).
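A sketch of these two rules, assuming mentions are (start, end, type) character spans sorted by position:

```python
# Sketch of the two post-processing rules; assumes character offsets
# with exclusive end positions and mentions sorted by start offset.
def merge_adjacent(mentions, text):
    merged = []
    for m in mentions:
        if merged:
            p_start, p_end, p_type = merged[-1]
            gap = text[p_end:m[0]]  # text between consecutive mentions
            # Rule 1: same type with no whitespace in between.
            rule1 = p_type == m[2] and gap == ""
            # Rule 2: two diseases separated by one non-slash character.
            rule2 = (p_type == m[2] == "Disease"
                     and len(gap) == 1 and gap != "/")
            if rule1 or rule2:
                merged[-1] = (p_start, m[1], p_type)  # merge the spans
                continue
        merged.append(m)
    return merged
```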

Entity linking

We initially experimented with different existing approaches to EL: PubTator Central [46], BERN2 [54], and PubTator 3.0 [16]. Overall, PubTator 3.0 performed better than the other methods. However, we also recognized that there was room for improving upon PubTator 3.0, especially with Disease and Chemical entities. Improvement for these entity types could also significantly impact the downstream RE task, as most relations in the dataset involve Disease and Chemical entities. For these two entity types, we trained relatively lightweight CNN models with residual connections (ResCNN) [15], as described below.

ResCNN for disease and chemical entity linking

A typical EL approach is to train an encoder that maps entity mentions and entity names from controlled vocabularies into the same embedding space, and then to use each encoded query mention to retrieve similar entity names from the vocabulary based on a similarity metric, such as cosine similarity.
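This retrieval step can be sketched as follows, with a generic encode function standing in for the trained encoder:

```python
# Sketch of embed-and-retrieve entity linking; encode() is a stand-in
# for the trained encoder and is assumed to return unit-length vectors.
import numpy as np

def link(mention, synonym_index, encode):
    """Return the concept ID whose synonym best matches the mention.

    synonym_index: list of (synonym_string, concept_id) pairs built
    from the controlled vocabulary (plus training ID-mention pairs).
    """
    query = encode(mention)
    names = np.stack([encode(name) for name, _ in synonym_index])
    scores = names @ query  # cosine similarity for unit vectors
    return synonym_index[int(scores.argmax())][1]
```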

To improve the EL performance for Disease and Chemical entities, we employ ResCNN [15] as our encoder. This architecture was motivated by the observation that the performance of a BERT-based EL model is nearly identical even when the order of input tokens is shuffled or the attention scope is limited, which suggests that a CNN model capturing local interactions might perform just as well. ResCNN has been shown to perform comparably to BERT-based EL models [44, 45], while using about 1/100 of their parameters.

The ResCNN-based EL architecture consists of a token embedding layer, an encoding layer, and a pooling layer. The token embedding layer tokenizes the input and initializes the embeddings using a BERT-based model’s contextualized embeddings, which are frozen during training. The encoding layer includes multiple blocks with convolutional filters of varying sizes [55]. A position-wise fully connected feedforward network and a residual connection are applied to each block. Lastly, max pooling [55] is used in the pooling layer to obtain the final vector representations. We conducted a greedy search to tune some hyperparameters for each model (Table 3). For the other hyperparameters, we used the default ResCNN settings: 300 filters for the convolutional network and 100 training epochs, with evaluation every 5 epochs.
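A sketch of one encoding block under our reading of this architecture (PyTorch; layer sizes are illustrative and may differ from the released implementation):

```python
# Sketch of a ResCNN-style encoding block (illustrative dimensions).
import torch
import torch.nn as nn

class ResCNNBlock(nn.Module):
    def __init__(self, dim=256, kernel_sizes=(1, 3, 5), n_filters=300):
        super().__init__()
        # Parallel convolutions of varying widths capture local token
        # interactions at different granularities.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes)
        self.proj = nn.Linear(n_filters * len(kernel_sizes), dim)
        # Position-wise feedforward network applied after the convolutions.
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, x):              # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)          # convolve over the sequence axis
        h = torch.cat([conv(h) for conv in self.convs], dim=1)
        h = self.proj(h.transpose(1, 2))
        return x + self.ffn(h)         # residual connection
```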

Table 3. Search space for hyperparameter tuning of ResCNN-based models for disease and chemical mentions

| Hyperparameter | Search space |
|---|---|
| Pooling type | Max+, Mean, Attention* |
| Learning rate | 1e-3*+, 5e-3, 1e-4, 3e-4, 5e-4 |
| # Encoders | 3+, 4*, 5 |
| Feature size | 128, 256*+, 512 |
| Dropout rate | 0.1, 0.25*+, 0.5 |

*Denotes the optimal hyperparameter for the ResCNN-Disease model; +denotes the optimal hyperparameter for the ResCNN-Chemical model.

We used the latest versions of the Merged Disease Vocabulary [56] and the Comparative Toxicogenomics Database (CTD) [57] to extract synonym-ID pairs and build indexes for Disease and Chemical concepts, respectively. To fully leverage the ID-mention pairs from the training set, we also added them to the index before evaluating EL on the development set. In addition, to train the models for Disease and Chemical concepts, we augmented BioRED with the NCBI Disease corpus [23] and the BC5CDR corpus [25], respectively. We also searched for the optimal initial vector representation for ResCNN by initializing it with BioMedBERT (same as PubMedBERT) [22] and BioLinkBERT [31] embeddings. We report the EL performance using different training sets (original vs augmented) and embedding layers (BioMedBERT vs BioLinkBERT).

In line with previous work [15, 39, 44], we use top-k accuracy to report task-specific performance for the ResCNN-based EL models. We note that the EL evaluation in the BioCreative VIII BioRED Track is conducted at the document level by matching (PMID, entity type, id) tuples; if a mention is mapped to multiple identifiers, they are counted as multiple tuples.
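For illustration, this document-level scoring reduces to micro precision/recall/F1 over sets of such tuples (a hypothetical helper, not the official evaluation script):

```python
# Sketch of document-level EL scoring over (PMID, entity_type, concept_id)
# tuples; illustrative, not the official BioCreative evaluation script.
def tuple_prf(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```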

PubTator 3.0

For the remaining entity types (Gene, Species, Variant, Cell Line), we simply leveraged PubTator 3.0 [16]. PubTator 3.0 uses AIONER [32] for NER and normalizes predicted mentions using GNorm2 [58] for genes and species, tmVar3 [52] for variants, and TaggerOne [59] for cell lines. As we use our own NER modules, some PubTator entity mention spans do not exactly align with our predicted mention spans. To overcome this issue, we allow partial matching in order to fully leverage the normalization predictions from PubTator 3.0. Furthermore, we build a look-up dictionary from the PubTator 3.0 predictions so that we can also normalize entity mentions that the ResCNN-based method is unable to resolve to a concept identifier.
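The partial-match fallback can be sketched as follows (function names and data layout are illustrative):

```python
# Sketch of accepting a PubTator 3.0 identifier for a predicted mention
# when the character spans overlap and the entity types agree.
def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def normalize_with_pubtator(mention, pubtator_mentions):
    """mention: (start, end, type); pubtator_mentions: (start, end, type, id)."""
    for start, end, etype, concept_id in pubtator_mentions:
        if etype == mention[2] and overlaps(mention[0], mention[1], start, end):
            return concept_id
    return None  # fall back to the PubTator-derived look-up dictionary
```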

Relation extraction

We adapted the PURE model [17], originally comprising two separately fine-tuned BERT models for NER and RE. We utilized the RE model, which uses the generated entity representations for labeling entity pairs with a relation type (or no relation). For this purpose, all tokens belonging to an entity mention are enclosed with marker tokens denoting the entity type and whether the entity is the subject or object of a relation [e.g. (SUBJ:GENE)]. For each entity, the embedding of its corresponding marker token (from the last hidden state of the BERT model) is taken as its representation. The embeddings of each possible entity pair are concatenated and passed to the classification layer, which predicts the pair’s relation type. Cross-entropy loss is used for updating the model weights. PURE performs sentence-level extraction, assumes a single mention for each entity in each instance, and is designed for unidirectional relations. In contrast, the BioRED dataset contains full abstracts, multiple mentions of the same entity are common, and most relation types are bidirectional (e.g. Y is associated with X is equivalent to X is associated with Y). Therefore, we made several key updates to the model:

  • Directionality: we removed subject and object designations to render relations bidirectional. For a given entity pair ENTITY1, ENTITY2, we generated two embeddings: [ENTITY1, ENTITY2, ENTITY1 x ENTITY2], and [ENTITY2, ENTITY1, ENTITY2 x ENTITY1], each corresponding to the concatenation of two entity representations and their element-wise product. These concatenated embeddings are individually passed to the relation classifier. The loss is the sum of the cross-entropy losses of both relation representations. To address the bidirectionality during prediction, the logits of both representations are summed up.

  • Multiple mentions: we tag multiple mentions of the same entity, and each entity mention has its own corresponding marker token. For prediction, however, we select the pair of mentions (one for ENTITY1 and one for ENTITY2) that best helps with classification. Our intuition is that not all mentions are important for identifying the relation, and that irrelevant mentions may introduce unnecessary noise for the model. We take the dot product of each mention pair’s representations, which reflects the importance of that pair for classifying the relation, select the mention pair with the highest dot product (i.e. max pooling), and use it as the final relation representation for a given entity pair (see the sketch after this list).

  • Entity type markers: we also remove the distinction between different entity types for our marker tokens; i.e. instead of using [ENTITY- GENE] as a marker token, we only used [ENTITY]. Our initial experiments showed that including the entity type information in the marker token was not helpful for relation classification.
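The sketch below illustrates the first two modifications, bidirectional scoring and mention-pair max pooling, in PyTorch (the classifier and dimensions are illustrative, not our exact implementation):

```python
# Sketch of bidirectional relation scoring with mention-pair max pooling.
import torch

def pair_logits(m1, m2, classifier):
    """m1, m2: marker-token embeddings for all mentions of each entity,
    shapes (n1, d) and (n2, d); classifier maps a (3d,) vector to logits."""
    # Select the mention pair with the highest dot product (max pooling);
    # the dot product reflects how informative a pair is for the relation.
    scores = m1 @ m2.T                     # (n1, n2) mention-pair scores
    i, j = divmod(int(scores.argmax()), m2.size(0))
    e1, e2 = m1[i], m2[j]
    # Direction-specific representations; summing their logits makes the
    # prediction symmetric in the two entities.
    fwd = torch.cat([e1, e2, e1 * e2])
    bwd = torch.cat([e2, e1, e2 * e1])
    return classifier(fwd) + classifier(bwd)
```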

We utilized BioMedBERT as the base pretrained model and fine-tuned it with the following hyperparameters tuned using grid search: epochs (5), learning rate (3e-05), batch size (32), and optimizer (Adam). We selected BioMedBERT over BioLinkBERT as the pretrained model because BioMedBERT produced better performance in initial experiments. Additionally, to improve model robustness, we use projected gradient descent attacks [60] during training. After the model’s weights are updated using the combined loss, we perturb the token embeddings three times, adding noise, and train the model to correctly classify relations using the perturbed input.
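One perturbation step in the spirit of PGD can be sketched as follows (the radius and step size are illustrative assumptions):

```python
# Sketch of one PGD step on an embedding perturbation delta [60];
# eps and alpha are illustrative values.
import torch

def pgd_step(delta, grad, eps=1.0, alpha=0.3):
    """Move delta along the loss gradient, then project it back onto
    an L2 ball of radius eps around the clean embeddings."""
    delta = delta + alpha * grad / (grad.norm() + 1e-12)
    if delta.norm() > eps:
        delta = eps * delta / delta.norm()
    return delta.detach()

# In training, after the clean update, this step is repeated three times:
# add delta to the token embeddings, classify the perturbed input,
# backpropagate, and update delta from its gradient.
```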

Novelty detection

We used a similar approach for the ND task, with two notable changes: (i) we did not include negative examples for training (as the input entity pairs already have an identified relation) and (ii) we used a different entity representation. Instead of picking the best pair of entity mentions, we weigh all mentions based on their importance for the ND task by applying logsumexp pooling [61], a smooth version of max pooling, over all mentions of the entities in the entity pair. This generates a single vector for each entity, and we concatenate the two vectors to obtain the final relation representation. Our intuition is again that some mentions are more important than others; in this case, however, we still allow all mentions to contribute to the novelty prediction. As with the RE model, we performed hyperparameter tuning using grid search. We trained the final models with the following hyperparameters: epochs (4), learning rate (2e-5), batch size (32), and optimizer (Adam).
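A minimal sketch of this pooling step in PyTorch:

```python
# Sketch of logsumexp pooling over mention embeddings for ND, a smooth
# approximation of max pooling [61].
import torch

def logsumexp_pool(mention_embeds):
    """mention_embeds: (n_mentions, d) -> (d,) entity representation."""
    return torch.logsumexp(mention_embeds, dim=0)

# The pooled vectors of the two entities are concatenated to form the
# relation representation fed to the novelty classifier.
```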

Results

Development set results

Named entity recognition

The NER performance of the BERT-based token classification and PURE span classification models is shown in Table 4. Scores include the NER post-processing step described above. The token classification model demonstrated higher F1 scores across most entity types, while the span classification model outperforms it on Cell Line entities. Overall, Species obtains the highest score on all metrics when only the BioRED dataset is used for training, while the performance is lowest for Disease and Cell Line entities.

Table 4. NER performance on the development set for the models trained on the BioRED dataset. Token = BERT-based token classification; Span = PURE span classification.

| Entity type | Token P | Token R | Token F1 | Span P | Span R | Span F1 |
|---|---|---|---|---|---|---|
| Gene | 95.00 | 91.86 | 93.41 | 93.79 | 92.20 | 92.99 |
| Disease | 86.70 | 86.04 | 86.37 | 84.03 | 84.95 | 84.49 |
| Chemical | 89.24 | 93.50 | 91.32 | 89.03 | 92.57 | 90.77 |
| Species | 97.46 | 97.71 | 97.59 | 96.73 | 97.96 | 97.35 |
| Variant | 88.98 | 87.14 | 88.05 | 86.53 | 87.97 | 87.24 |
| Cell line | 82.35 | 84.00 | 83.17 | 90.91 | 80.00 | 85.11 |
| All | 91.25 | 90.92 | 91.09 | 89.99 | 90.58 | 90.29 |

Because of its higher overall performance, we opted to train the BERT-based token classification model on the combined dataset. Table 5 shows the results when all nine NER datasets are used for training: overall performance metrics increase by more than 2 percentage points, demonstrating the effectiveness of additional training data. For Species, the F1 score is slightly lower (−0.09 points) due to a loss in precision. We obtained the largest increases for Cell Line and Variant, the two least frequent entity types in the BioRED dataset, particularly in precision (about 15 and 9 percentage points, respectively). With this increase, Variant performance surpasses that of the Species type.

Table 5. NER performance on the development set for the token classification model trained on all nine NER datasets

| Entity type | Precision | Recall | F1 |
|---|---|---|---|
| Gene | 96.09 | 93.64 | 94.85 |
| Disease | 90.15 | 88.88 | 89.51 |
| Chemical | 93.20 | 92.75 | 92.95 |
| Species | 95.82 | 99.24 | 97.50 |
| Variant | 97.92 | 97.51 | 97.71 |
| Cell line | 97.78 | 88.00 | 92.63 |
| All | 94.05 | 93.01 | 93.53 |

To determine whether the performance difference between the token classification model trained only on BioRED and the one trained on all nine datasets was statistically significant, we performed bootstrap resampling, sampling 100 abstracts with replacement 1000 times. We calculated the overall F1 scores of these samples and compared the differences in scores. We found a statistically significant difference between the performance of these models (mean difference: 1.54; 95% CI: 0.61, 2.85).
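A sketch of this bootstrap comparison (f1_a and f1_b are stand-ins for scoring each model on a sample of abstracts):

```python
# Sketch of bootstrap resampling over development abstracts; f1_a/f1_b
# are stand-ins for computing each model's overall F1 on a sample.
import random

def bootstrap_diff(abstracts, f1_a, f1_b, n_iter=1000, size=100):
    diffs = []
    for _ in range(n_iter):
        sample = random.choices(abstracts, k=size)  # with replacement
        diffs.append(f1_a(sample) - f1_b(sample))
    diffs.sort()
    mean = sum(diffs) / len(diffs)
    ci = (diffs[int(0.025 * n_iter)], diffs[int(0.975 * n_iter)])
    return mean, ci  # cf. mean 1.54, 95% CI (0.61, 2.85) reported above
```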

Entity linking

Table 6 shows the EL performance with PubTator 3.0 and ResCNN using predicted named entities as input, as well as gold entities. PubTator 3.0 works well for Species entities but is less successful with other entity types. Using ResCNN-based EL models for Disease and Chemical entities improves EL performance for these entities. The improvement is especially large for Disease entities (more than 9 percentage points) and smaller for Chemical entities (about 1.5 points). Using gold entities generally leads to minor improvements, except for the Chemical type, where the difference is larger (more than 8 percentage points), which indicates that further improving chemical entity recognition could have a significant impact on downstream tasks.

Table 6. EL performance based on predicted entities, along with the performance when gold entities are provided. All scores are based on the development set. PT3 = PubTator 3.0 only; +ResCNN = hybrid approach; GOLD = hybrid approach with gold entities.

| Entity type | PT3 P | PT3 R | PT3 F1 | +ResCNN P | +ResCNN R | +ResCNN F1 | GOLD P | GOLD R | GOLD F1 |
|---|---|---|---|---|---|---|---|---|---|
| Gene | 86.87 | 78.90 | 82.69 | 86.87 | 78.90 | 82.69 | 87.59 | 79.36 | 83.27 |
| Disease | 78.31 | 80.81 | 79.54 | 87.11 | 90.41 | 88.73 | 86.70 | 90.99 | 88.79 |
| Chemical | 87.37 | 78.64 | 82.78 | 83.19 | 85.45 | 84.30 | 91.93 | 93.18 | 92.55 |
| Species | 99.12 | 99.12 | 99.12 | 99.12 | 99.12 | 99.12 | 100.00 | 99.12 | 99.56 |
| Variant | 66.10 | 56.12 | 60.70 | 66.10 | 56.12 | 60.70 | 66.67 | 57.55 | 61.78 |
| Cell line | 85.00 | 77.27 | 80.95 | 85.00 | 77.27 | 80.95 | 82.61 | 86.36 | 84.44 |
| Total | 83.50 | 78.65 | 81.00 | 85.37 | 82.42 | 83.87 | 87.20 | 84.46 | 85.81 |

Table 7 shows the impact of augmenting the BioRED training set with external EL datasets (NCBI Disease [23] and BC5CDR [25]) and of initializing token embeddings from different pretrained models (BioMedBERT [22] and BioLinkBERT [31]) for ResCNN training. The results show that additional training data consistently enhances EL performance. In terms of initial vector representations, BioMedBERT embeddings perform better with the BioRED training set, while BioLinkBERT outperforms BioMedBERT with the additional datasets, although the differences are relatively minor. To determine whether the difference in performance between models trained on BioRED and those trained with the additional datasets is statistically significant, we used McNemar’s test, which showed that the performance differences were statistically significant at the 99% significance level.

Table 7. Evaluation of the impact of using additional datasets and different token embedding initializations on ResCNN-based EL models on the 100 samples of the BioCreative development set

| Entity type | Embedding | Training set | Acc@1 | Acc@5 | Acc@10 | Acc@20 |
|---|---|---|---|---|---|---|
| Disease | BioMedBERT | Original | 79.95 | 88.65 | 90.24 | 93.14 |
| Disease | BioMedBERT | +NCBI&CDR^a | 83.11 | 89.77 | 92.35 | 93.67 |
| Disease | BioLinkBERT | Original | 78.89 | 88.39 | 91.29 | 93.14 |
| Disease | BioLinkBERT | +NCBI&CDR^a | 83.91 | 90.77 | 92.35 | 94.20 |
| Chemical | BioMedBERT | Original | 88.50 | 91.00 | 93.00 | 93.50 |
| Chemical | BioMedBERT | +CDR^a | 92.50 | 95.50 | 96.50 | 96.50 |
| Chemical | BioLinkBERT | Original | 88.00 | 92.00 | 92.00 | 92.50 |
| Chemical | BioLinkBERT | +CDR^a | 93.50 | 95.00 | 96.00 | 97.50 |

^a Indicates a statistically significant difference from the model trained with the original data (99% significance level).

Relation extraction

Table 8 shows the results of the PURE-based RE model when gold standard entities and entity IDs are used as the model input. The performance is highest for Positive Correlation and Cotreatment, although the latter has only a few instances in the development set. Among the most common relation types, Association lags behind Positive Correlation and Negative Correlation. There were no predictions for the other rare labels, Conversion and Drug Interaction. When the relation types are ignored (i.e. binary relation classification), the model achieves 82.24 precision, 74.45 recall, and 78.15 F1 score, suggesting that distinguishing relation types is challenging for the model.

Table 8. RE model performance on the development set using gold standard entities and entity IDs

| Relation type | Precision | Recall | F1 |
|---|---|---|---|
| Association | 67.43 | 54.32 | 60.17 |
| Positive correlation | 69.80 | 74.92 | 72.27 |
| Negative correlation | 68.54 | 69.71 | 69.12 |
| Comparison | 100.00 | 33.33 | 50.00 |
| Bind | 62.50 | 55.56 | 58.82 |
| Conversion | 0.00 | 0.00 | 0.00 |
| Cotreatment | 100.00 | 57.14 | 72.73 |
| Drug interaction | 0.00 | 0.00 | 0.00 |
| All | 68.60 | 62.10 | 65.19 |

Table 9 shows the results of the RE model when the predicted entities are used as model input (i.e. end-to-end RE pipeline). There is a 19 percentage point drop in F1 score (about 17 point drop in precision and 20 point drop in recall), indicating that errors in the previous two tasks significantly impact RE performance. The performance drop is similar for three relation types that occur in substantial numbers (Association, Positive Correlation, and Negative Correlation). About half of the predicted relations are incorrect; within these erroneous relations, 23.6% are incorrectly predicted as another relation type while 76.4% are due to non-related entities.

Table 9. RE model performance on the development set using predicted entities and IDs

| Relation type | Precision | Recall | F1 |
|---|---|---|---|
| Association | 48.55 | 36.85 | 41.90 |
| Positive correlation | 55.83 | 48.62 | 51.97 |
| Negative correlation | 52.10 | 50.88 | 51.48 |
| Comparison | 100.00 | 33.33 | 50.00 |
| Bind | 100.00 | 33.33 | 50.00 |
| Conversion | 0.00 | 0.00 | 0.00 |
| Cotreatment | 37.50 | 21.43 | 27.27 |
| Drug interaction | 0.00 | 0.00 | 0.00 |
| All | 51.48 | 41.87 | 46.18 |

Novelty detection

Table 10 shows the performance of our PURE-based ND model. When the gold standard relations are known, the model predicts the novelty of about 80% of the relations accurately. Using gold standard entities and IDs and assessing the accuracy of the predicted relations and their novelty, the performance is about 10 percentage points lower than predicting relations alone (55.71 F1 score vs 65.19 F1 in Table 8). Lastly, in the end-to-end pipeline (NER-EL-RE-ND) where the model input consists only of the abstract text, there is about a 7 percentage point drop, compared to predicting relations only (38.89 F1 vs 46.18 F1 in Table 9).

Table 10. ND model performance on the development set

| Input | Precision | Recall | F1 |
|---|---|---|---|
| Gold standard relations | 81.86 | 81.86 | 81.86 |
| Gold standard entities | 58.13 | 53.48 | 55.71 |
| Abstract text only | 43.34 | 35.25 | 38.89 |

Test set results

We also generated predictions on the test set using RE and ND models trained on the combination of the training and development sets. This setting corresponds to Subtask 1 of the BioRED Track and uses gold standard entities and IDs. Table 11 shows the performance of these models on the test set. Compared to our shared task system [14], the enhanced RE model yields an F1 score about 3 percentage points higher (55.61 vs 52.76), and the end-to-end RE + ND model increases the F1 score by about 2 percentage points (41.66 vs 39.71). Our RE model performed best on relations between Chemical and Gene entities, obtaining an F1 score of 64.07. The lowest performance was on Chemical/Variant relations (37.03 F1). These results are likely due to the abundance of Chemical/Gene relations and the scarcity of Chemical/Variant relations in the training and development sets.

Table 11. RE and ND performance on the test set using gold standard entities and entity IDs

| Task | Precision | Recall | F1 |
|---|---|---|---|
| RE | 56.72 | 54.54 | 55.61 |
| RE + ND | 42.48 | 40.86 | 41.66 |

Discussion

We enhanced our shared task system by including additional datasets in training and by extensive hyperparameter tuning. As the system follows a pipeline approach, errors in earlier stages can propagate, lowering performance in later steps. By improving performance in NER and EL, we were able to observe an improvement in the downstream tasks of RE and ND. Table 12 shows a side-by-side comparison of the evaluation results of our best shared task submission [14] and our current system on the development set, revealing substantial improvement at each step of the pipeline. We note that the test set results can only be evaluated via CodaLab, which covers only the RE and RE + ND tasks; therefore, we are unable to assess the performance difference in NER and EL on the test set. As noted above, the improvement in RE and RE + ND performance on the test set was about 3 and 2 percentage points, respectively.

Table 12. Comparison with our previous shared task results

| Pipeline step | Previous F1 | New F1 | Change |
|---|---|---|---|
| NER | 90.44 | 93.53 | +3.09 |
| NER + EL | 74.14 | 83.87 | +9.73 |
| NER + EL + RE | 30.51 | 46.18 | +15.67 |
| NER + EL + RE + ND | 23.96 | 38.86 | +14.90 |

While the main architecture of our best NER model did not change, hyperparameter tuning and the inclusion of multiple NER datasets improved the results, which underscores the importance of large annotated datasets and hyperparameter tuning for deep neural network models. We improved our EL performance by adopting a more accurate entity linker (PubTator 3.0 [16]) than the linkers we used previously (BERN2 [54] and PubTator Central [46]) and by training specialized CNN models for Disease and Chemical entities. As the performance of PubTator 3.0 was relatively low for these entity types, and they very often serve as relation arguments, they were considered the entity types that could benefit most from specialized models. Incorporating additional EL datasets into model training was also found to be beneficial. Using PubTator 3.0 alone led to an F1 increase of about 7 percentage points (from 74.14 to 81.00), while the CNN models led to another increase of about 3 points (from 81.00 to 83.87). Significantly, EL recall increased from 67.01 to 82.42, indicating that substantially more entities were linked to their corresponding identifiers, setting the stage for better RE and ND. One somewhat anomalous result relates to Variant entities, for which we obtain the highest NER performance (97.71 F1) but the lowest EL performance (60.70). This is especially surprising because PubTator 3.0, which we used for variant EL, reports very high performance for this entity type (98.48 F1) [16]. A preliminary analysis suggests that there may be some differences in how the same variants are normalized in the ground truth data versus by PubTator 3.0, and in the EL evaluation in the shared task versus the PubTator 3.0 study.

As we expected, performance improvements in NER and EL increased the performance of the RE and ND models, even though these models did not change from our shared task system except for additional hyperparameter tuning. We note that, concurrently with our work, Lai et al. [35] incorporated additional RE datasets to improve RE performance on the BioRED dataset; thus, the approach we used for NER and EL could be further extended to RE. There are no similar datasets for the ND task, so other approaches could be explored for ND, such as leveraging the abstract structure for novel information (i.e. novel information is less likely to appear in introductory sections of the abstract).

We are unable to directly compare our enhanced pipeline with the results of the other shared task systems. Instead, we compare our NER + EL + RE results with PubTator 3.0 [16], which was officially released after the shared task (Table 13). Note that PubTator 3.0 does not perform ND. PubTator 3.0 yields 52.87 precision, 40.36 recall, and 45.77 F1 score on the development set. This is on par with our system: PubTator 3.0 performs a bit better in precision (1.39 points), while our model performs slightly better in recall (1.51 points) and F1 score (0.41 points). More specifically, PubTator 3.0 performs better on Association and Negative Correlation relations, while our model shows better performance on Positive Correlation and Bind. The performance difference between PubTator 3.0 and our end-to-end RE model is statistically significant (McNemar’s test, 99% significance level).

Table 13. End-to-end RE performance using PubTator 3.0

| Relation type | Precision | Recall | F1 |
|---|---|---|---|
| Association | 56.13 | 38.63 | 45.76 |
| Positive correlation | 45.23 | 40.30 | 42.63 |
| Negative correlation | 58.11 | 49.71 | 53.58 |
| Comparison | 36.36 | 66.67 | 47.06 |
| Bind | 50.00 | 11.11 | 18.18 |
| Conversion | 0.00 | 0.00 | 0.00 |
| Cotreatment | 66.67 | 28.57 | 40.00 |
| Drug interaction | 0.00 | 0.00 | 0.00 |
| All | 52.87 | 40.36 | 45.77 |

Overall, RE remains challenging on the BioRED dataset. The low performance of end-to-end RE systems can be attributed to several factors. First, even though there are eight relation types, the great majority of the relations belong to three types (Association, Positive Correlation, and Negative Correlation), and there are few examples of the other relation types. The most common of these (Association) is particularly heterogeneous, and our models had modest performance on this type compared to the correlation types. Our model (and PubTator 3.0) did not yield positive predictions for two rare types. Given that most biologically relevant relations are mechanistic (causal), the lack of directionality in BioRED relations could be problematic for downstream uses. The BioRED relations are document-level, which is a more challenging setting than sentence-bound RE, although arguably more relevant. More comprehensive RE datasets reflecting the complexity of biological processes and better-performing RE models are needed to enable practical use of biomedical RE models and tools.

Error analysis

To shed more light on the shortcomings of our pipeline, we performed an error analysis of our NER, EL, and RE predictions on the development set. NER predictions from the BERT-based model were categorized as exact matches (boundaries and entity type match, n = 3336), partial matches (boundaries partially match, n = 67), and complete misses (n = 132). Among the complete misses, 92% belonged to the Disease (n = 58), Gene (n = 30), and Chemical (n = 34) types. Overall, the Disease entity type had the most mismatches. Future work could focus on improving the performance for Disease and Gene entity types in particular.

We analyzed the errors made by the ResCNN-Disease model, adopting the error types from BioSyn [44]: Incomplete Synset (the input mention differs significantly from the synonyms of the identifier), Contextual Entity (the mention and a synonym are identical but have different identifiers), Overlapped Entity (word overlap between the mention and the predicted candidate), Abbreviation (the abbreviation cannot be resolved), and Hypernym/Hyponym (the mention and the concept identifier have a hypernym/hyponym relation). Table 14 shows examples and frequencies of the error cases. More than half of the errors are due to hypernym/hyponym relations, followed by overlapped entities. Hypernym/hyponym errors could be considered less severe (especially when a hyponym mention is mapped to a hypernym concept). Incomplete synonym sets could be addressed through better modeling of the semantic similarity between mentions and synonyms. Contextual entity errors are challenging because ResCNN only takes the named entity mention as input, not the surrounding context, so it may favor an identifier with more surface similarity over an identifier that is a better conceptual match.

Table 14. EL error examples and frequency counts on the development set

| Error type | Input mention | Predicted name | Gold concept name | Frequency |
|---|---|---|---|---|
| Incomplete Synset | hyper locomotion | neurologic locomotion disorders | hyperkinesis | 2 (3.4%) |
| Contextual Entity | colon cancer | colon cancer | colorectal neoplasms | 5 (8.6%) |
| Overlapped Entity | postoperative analgesia | congenital analgesia | pain postoperative | 9 (15.5%) |
| Abbreviation | veds | ved | ehlers danlos syndrome | 5 (8.6%) |
| Hypernym | dyskinesia | dyskinesia | dyskinesia drug induced | 12 (20.7%) |
| Hyponym | familial melanoma | melanoma | familial melanoma | 20 (34.5%) |
| Others | hyperthermia | hyperthermia | fever | 5 (8.6%) |

In RE, 43.45% of precision errors involved type errors; in other words, the model correctly identified that the two entities are related but assigned the incorrect relation type. The rest of the errors are false positives involving non-related entities. On the other hand, 31.98% of the recall errors are due to incorrect relation type assignment, while the majority (68.02%) are due to the model failing to identify relationships between entities. Relations involving variants obtained the lowest recall: 39.09% of relations where at least one of the entities is a variant were missed by our model, including all variant–variant relations (10 instances). Our model missed 27.79% and 24.74% of relations involving diseases and genes, respectively. Chemical relations obtained the highest recall; only 17.9% of relations involving at least one chemical were missed by the model.

The majority of relation type confusion cases occurred among Association, Positive Correlation, and Negative Correlation; there were only four instances of type confusion involving other relation types. Considering the most common relation type (Association), we found that 7.71% and 3.62% of Association relations were predicted as Positive Correlation and Negative Correlation, respectively, while 35.27% were missed by the model entirely. Interestingly, Positive Correlation relations were mislabeled as Association considerably more often than Negative Correlation relations were. These results indicate the need to focus on associative relations in future work.
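
The following is a minimal sketch of how the precision/recall error partition and type-confusion counts above could be computed. It assumes document-level relations are represented as (PMID, entity ID, entity ID, relation type) tuples with unordered entity pairs; the function name and representation are illustrative assumptions, not the exact code of our pipeline.

```python
from collections import Counter

def relation_error_breakdown(gold, pred):
    """Partition document-level RE errors into type confusion vs.
    missed/spurious entity pairs.

    gold, pred: iterables of (pmid, entity_id_1, entity_id_2, rel_type)
    tuples; entity order within a pair is not meaningful.
    """
    def pair_key(rel):
        pmid, e1, e2, _ = rel
        return (pmid, *sorted((e1, e2)))  # normalize the unordered pair

    gold_types = {pair_key(r): r[3] for r in gold}
    pred_types = {pair_key(r): r[3] for r in pred}

    counts = Counter()
    confusion = Counter()  # (gold type, predicted type) -> count
    for key, ptype in pred_types.items():
        if key not in gold_types:
            counts["spurious pair"] += 1    # precision error: pair not related
        elif gold_types[key] != ptype:
            counts["type confusion"] += 1   # pair found, wrong relation type
            confusion[(gold_types[key], ptype)] += 1
    for key in gold_types:
        if key not in pred_types:
            counts["missed pair"] += 1      # recall error: relation not found
    return counts, confusion
```

Under such a partition, a type-confused pair contributes to both the precision-error and recall-error tallies, consistent with reporting the type-error share separately for precision and recall as above.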

Conclusion

We presented an enhanced end-to-end pipeline for biomedical RE and ND. Compared to our BioCreative VIII BioRED Track submission, our pipeline demonstrates substantial performance improvements across all four tasks (NER, EL, RE, and ND). In particular, enhancements to our NER and EL methods, including the use of additional datasets, also improved the performance of the downstream RE and ND tasks. Despite the considerable gains in pipeline performance, which is now on par with PubTator 3.0, much room for improvement remains, especially for the RE and ND tasks.

The BioRED dataset and the BioCreative VIII BioRED Track are significant steps in expanding biomedical RE from a few relation types to a more comprehensive set of relevant relation types and practical use cases. However, our work also highlights some important challenges: relation annotations are skewed toward a few types and may not be sufficiently specific, and the document-level relation formulation, while flexible, complicates the interpretation of relations and predictions. Further enhancements to the dataset could facilitate more accurate and useful systems for information extraction from the biomedical literature.

Conflict of interest

The funders had no role in the study design; the collection, analysis, and interpretation of data; the writing of the report; or the decision to submit the article for publication. The contents are those of the authors and do not necessarily represent the official views of, nor an endorsement by, the Office of Research Integrity/OASH/US Department of Health and Human Services or the US Government.

Funding

This study was supported in part by the ORI of the HHS (grant number: ORIIR220073), the National Library of Medicine of the National Institutes of Health (award number R01LM014709), and a University of Illinois Personalized Nutrition Initiative Seed Grant.

Data availability

The BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/. The code and models generated in this study are available at https://github.com/janinaj/e2eBioMedRE/.

References

1. Jensen L-J, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006;7:119–29. https://doi.org/10.1038/nrg1768

2. Zhao S, Su C, Lu Z et al. Recent advances in biomedical literature mining. Briefings Bioinf 2021;22:bbaa057. https://doi.org/10.1093/bib/bbaa057

3. Kühnel L, Fluck J. We are not ready yet: limitations of state-of-the-art disease named entity recognizers. J Biomed Semant 2022;13:26. https://doi.org/10.1186/s13326-022-00280-6

4. Hirschman L, Yeh A, Blaschke C et al. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinf 2005;6:S1. https://doi.org/10.1186/1471-2105-6-S1-S1

5. Chen Q, Allot A, Leaman R et al. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database 2022;2022:baac069. https://doi.org/10.1093/database/baac069

6. Kim J-D, Ohta T, Pyysalo S et al. Overview of BioNLP'09 shared task on event extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. pp. 1–9. Boulder, CO: Association for Computational Linguistics, 2009.

7. Nédellec C, Bossy R, Kim J-D et al. Overview of BioNLP shared task 2013. In: Proceedings of the BioNLP Shared Task 2013 Workshop. pp. 1–7. Sofia, Bulgaria: Association for Computational Linguistics, 2013.

8. Smith L, Tanabe LK, Ando RJ et al. Overview of BioCreative II gene mention recognition. Genome Biol 2008;9:1–19. https://doi.org/10.1186/gb-2008-9-s2-s2

9. Wei C-H, Peng Y, Leaman R et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database 2016;2016:baw032. https://doi.org/10.1093/database/baw032

10. Luo L, Lai P-T, Wei C-H et al. BioRED: a rich biomedical relation extraction dataset. Briefings Bioinf 2022;23:bbac282. https://doi.org/10.1093/bib/bbac282

11. Wei C-H, Kao H-Y, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 2013;41:W518–W522. https://doi.org/10.1093/nar/gkt441

12. Harpaz R, Callahan A, Tamang S et al. Text mining for adverse drug events: the promise, challenges, and state of the art. Drug Safety 2014;37:777–90. https://doi.org/10.1007/s40264-014-0218-z

13. Henry S, McInnes BT. Literature based discovery: models, methods, and trends. J Biomed Informat 2017;74:20–32. https://doi.org/10.1016/j.jbi.2017.08.011

14. Sarol MJ, Hong G, Kilicoglu H. UIUC-BioNLP @ BioCreative VIII BioRED Track. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models. New Orleans, LA: AMIA 2023 Annual Symposium, 2023.

15. Lai T, Ji H, Zhai C. BERT might be overkill: a tiny but effective biomedical entity linker based on residual convolutional neural networks. In: Findings of the Association for Computational Linguistics: EMNLP 2021. pp. 1631–39, 2021.

16. Wei C-H, Allot A, Lai P-T et al. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024;52:W540–6. https://doi.org/10.1093/nar/gkae235

17. Zhong Z, Chen D. A frustratingly easy approach for entity and relation extraction. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 50–61. Online: Association for Computational Linguistics, 2021.

18. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–D270. https://doi.org/10.1093/nar/gkh061

19. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inf Assoc 2010;17:229–36. https://doi.org/10.1136/jamia.2009.002733

20. Kilicoglu H, Rosemblat G, Fiszman M et al. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinf 2020;21:1–28. https://doi.org/10.1186/s12859-020-3517-7

21. Lee J, Yoon W, Kim S et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234–40. https://doi.org/10.1093/bioinformatics/btz682

22. Gu Y, Tinn R, Cheng H et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc 2021;3:1–23.

23. Doğan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Informat 2014;47:1–10. https://doi.org/10.1016/j.jbi.2013.12.006

24. Krallinger M, Leitner F, Rabal O et al. CHEMDNER: the drugs and chemical names extraction challenge. J Cheminf 2015;7:1–11. https://doi.org/10.1186/1758-2946-7-S1-S1

25. Li J, Sun Y, Johnson RJ et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016;2016:baw068. https://doi.org/10.1093/database/baw068

26. Herrero-Zazo M, Segura-Bedmar I, Martínez P et al. The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. J Biomed Informat 2013;46:914–20. https://doi.org/10.1016/j.jbi.2013.07.011

27. Jimenez Gutierrez B, McNeal N, Washington C et al. Thinking about GPT-3 in-context learning for biomedical IE? Think again. In: Findings of the Association for Computational Linguistics: EMNLP 2022. pp. 4497–512, 2022.

28. Chen Q, Du J, Hu Y et al. Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. arXiv 2023. https://doi.org/10.48550/arXiv.2305.16326

29. Wadhwa S, Amir S, Wallace B. Revisiting relation extraction in the era of large language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. pp. 15566–89. Toronto, Canada: Association for Computational Linguistics, 2023.

30. Wadden D, Wennberg U, Luan Y et al. Entity, relation, and event extraction with contextualized span representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 5784–89. Hong Kong, China: Association for Computational Linguistics, 2019.

31. Yasunaga M, Leskovec J, Liang P. LinkBERT: pretraining language models with document links. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. pp. 8003–16. Dublin, Ireland: Association for Computational Linguistics, 2022.

32. Luo L, Wei C-H, Lai P-T et al. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 2023;39:btad310. https://doi.org/10.1093/bioinformatics/btad310

33. Crichton G, Pyysalo S, Chiu B et al. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinf 2017;18:1–14. https://doi.org/10.1186/s12859-017-1776-8

34. Krallinger M, Rabal O, Akhondi SA et al. Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the Sixth BioCreative Challenge Evaluation Workshop. vol. 1. pp. 141–46, 2017.

35. Lai P-T, Wei C-H, Chen Q et al. BioREx: improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Informat 2023;146:104487. https://doi.org/10.1016/j.jbi.2023.104487

36. Eberts M, Ulges A. Span-based joint entity and relation extraction with transformer pre-training. In: Proceedings of the 24th European Conference on Artificial Intelligence. pp. 2006–13. Santiago de Compostela, Spain: IOS Press, 2020.

37. El-Allaly E-D, Sarrouti M, En-Nahnahi N et al. An attentive joint model with transformer-based weighted graph convolutional network for extracting adverse drug event relation. J Biomed Informat 2022;125:103968. https://doi.org/10.1016/j.jbi.2021.103968

38. French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Informat 2023;137:104252. https://doi.org/10.1016/j.jbi.2022.104252

39. D'Souza J, Ng V. Sieve-based entity linking for the biomedical domain. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. pp. 297–302. Beijing, China: Association for Computational Linguistics, 2015.

40. Liu H, Wu ST, Li D et al. Towards a semantic lexicon for clinical natural language processing. AMIA Annu Symp Proc 2012;2012:568–76.

41. Leaman R, Islamaj Doğan R, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 2013;29:2909–17. https://doi.org/10.1093/bioinformatics/btt474

42. Li H, Chen Q, Tang B et al. CNN-based ranking for biomedical entity normalization. BMC Bioinf 2017;18:79–86. https://doi.org/10.1186/s12859-017-1805-7

43. Phan MC, Sun A, Tay Y. Robust representation learning of biomedical names. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 3275–85. Florence, Italy: Association for Computational Linguistics, 2019.

44. Sung M, Jeon H, Lee J et al. Biomedical entity representations with synonym marginalization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 3641–50. Online: Association for Computational Linguistics, 2020.

45. Liu F, Shareghi E, Meng Z et al. Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4228–38. Online: Association for Computational Linguistics, 2021.

46. Wei C-H, Allot A, Leaman R et al. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2019;47:W587–W593. https://doi.org/10.1093/nar/gkz389

47. Islamaj R, Wei C-H, Lai P-T et al. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Zenodo, 2023.

48. Wei C-H, Kao H-Y, Lu Z. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Biomed Res Int 2015;2015:918710. https://doi.org/10.1155/2015/918710

49. Islamaj R, Leaman R, Kim S et al. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021;8:91. https://doi.org/10.1038/s41597-021-00875-1

50. Pafilis E, Frankild S-P, Fanini L et al. The species and organisms resources for fast and accurate identification of taxonomic names in text. PLoS One 2013;8:e65390. https://doi.org/10.1371/journal.pone.0065390

51. Arighi C, Hirschman L, Lemberger T et al. Bio-ID track overview. In: Proceedings of the BioCreative VI Challenge Evaluation Workshop. vol. 482, p. 376, 2017.

52. Wei C-H, Allot A, Riehle K et al. tmVar 3.0: an improved variant concept recognition and normalization tool. Bioinformatics 2022;38:4449–51. https://doi.org/10.1093/bioinformatics/btac537

53. Gerner M, Nenadic G, Bergman CM. Linnaeus: a species name identification system for biomedical literature. BMC Bioinf 2010;11:1–17. https://doi.org/10.1186/1471-2105-11-85

54. Sung M, Jeong M, Choi Y et al. BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics 2022;38:4837–39. https://doi.org/10.1093/bioinformatics/btac598

55. Kim Y. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1746–51. Doha, Qatar: Association for Computational Linguistics, 2014.

56. Davis AP, Wiegers TC, Rosenstein MC et al. MEDIC: a practical disease vocabulary used at the comparative toxicogenomics database. Database 2012;2012:bar065. https://doi.org/10.1093/database/bar065

57. Davis AP, Wiegers TC, Johnson RJ et al. Comparative toxicogenomics database (CTD): update 2023. Nucleic Acids Res 2023;51:D1257–D1262. https://doi.org/10.1093/nar/gkac833

58. Wei C-H, Luo L, Islamaj R et al. GNorm2: an improved gene name recognition and normalization system. Bioinformatics 2023;39:btad599. https://doi.org/10.1093/bioinformatics/btad599

59. Leaman R, Lu Z. TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics 2016;32:2839–46. https://doi.org/10.1093/bioinformatics/btw343

60. Madry A, Makelov A, Schmidt L et al. Towards deep learning models resistant to adversarial attacks. In: Proceedings of the Sixth International Conference on Learning Representations. pp. 2–6. Vancouver, Canada: OpenReview.net, 2018.

61. Jia R, Wong C, Poon H. Document-level n-ary relation extraction with multiscale representation learning. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. vol. 1, pp. 3693–704. Minneapolis, MN: Association for Computational Linguistics, 2019.

62. Brown GR, Hem V, Katz KS et al. Gene: a gene-centered information resource at NCBI. Nucleic Acids Res 2015;43:D36–D42. https://doi.org/10.1093/nar/gku1055

63. Lipscomb CE. Medical subject headings (MeSH). Bulletin Med Libr Assoc 2000;88:265.

64. Federhen S. The NCBI taxonomy database. Nucleic Acids Res 2012;40:D136–D143. https://doi.org/10.1093/nar/gkr1178

65. Sherry ST, Ward M-H, Kholodov M et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11. https://doi.org/10.1093/nar/29.1.308

66. Bairoch A. The Cellosaurus, a cell-line knowledge resource. J Biomol Techniques: JBT 2018;29:25. https://doi.org/10.7171/jbt.18-2902-002

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.