Tiago Almeida, Richard A A Jonker, Rui Antunes, João R Almeida, Sérgio Matos, Towards discovery: an end-to-end system for uncovering novel biomedical relations, Database, Volume 2024, 2024, baae057, https://doi.org/10.1093/database/baae057
Abstract
Biomedical relation extraction is an ongoing challenge within the natural language processing community. Its application is important for understanding scientific biomedical literature, with many use cases, such as drug discovery, precision medicine, disease diagnosis, treatment optimization and biomedical knowledge graph construction. Therefore, the development of a tool capable of effectively addressing this task holds the potential to improve knowledge discovery by automating the extraction of relations from research manuscripts. The first track in the BioCreative VIII competition extended the scope of this challenge by introducing the detection of novel relations within the literature. This paper describes our participating system, which initially focused on jointly extracting and classifying novel relations between biomedical entities. We then describe its subsequent advancement to an end-to-end model. Specifically, we enhanced our initial system by incorporating it into a cascading pipeline that includes a tagger and linker module. This integration enables the comprehensive extraction of relations and classification of their novelty directly from raw text. Our experiments yielded promising results: our tagger module attained state-of-the-art named entity recognition performance, with a micro F1-score of 90.24, while our end-to-end system achieved a competitive novelty F1-score of 24.59. The code to run our system is publicly available at https://github.com/ieeta-pt/BioNExt.
Database URL: https://github.com/ieeta-pt/BioNExt
Introduction
Biomedical relation extraction is essential for understanding the vast and ever-growing body of biomedical literature. By identifying connections between diseases, drugs, genes and sequence variants, we can enhance clinical decision-making through detailed drug–disease interactions. In addition, it accelerates the discovery of new drug targets, keeps knowledge bases current with the latest research and streamlines the retrieval of biomedical information (31).
Historically, the majority of datasets used for biomedical relation extraction have focused on sentence-level analysis, limiting their scope to single relations. While this is a valuable strategy, it fails to capture the complexity and depth of relationships present in the biomedical literature. The introduction of the BioRED dataset provided a more comprehensive and challenging framework (27, 44). BioRED extends beyond single, sentence-level extractions to encompass multiclass relation classification, including the identification of novel relationships.
Central to the process of automated relation extraction is the need for accurate Named Entity Recognition (NER) and entity normalization. Effective relation extraction cannot take place without these initial steps. NER involves identifying mentions of biomedical entities within the text, while entity normalization aligns these mentions with unique identifiers in a standardized vocabulary, ensuring that entities are consistently recognized across different documents (53). These steps establish a foundation from which meaningful relationships between entities can be identified and analyzed, and they show how NER, entity normalization and relation extraction combine to enable the automated analysis of biomedical literature.
Considering these factors, we propose an innovative end-to-end system designed to address the challenge of multiclass relation classification. This system is built upon a cascading pipeline framework, seamlessly integrating three specialized modules: (i) the “Tagger,” dedicated to NER; (ii) the “Linker,” tasked with assigning entities to standard vocabularies; and (iii) the “Extractor,” focused on relation extraction and novelty detection. Delving deeper, our “Tagger” follows state-of-the-art methodologies by training a transformer-based model with a Masked Conditional Random Field (CRF) (29). For entity linking, we adopted a dual-searching approach, combining exact match with semantic search to find codes in the standard vocabularies. Lastly, for the “Extractor,” we propose a joint model capable of simultaneously predicting relations while also assessing their novelty. This joint approach offers significant efficiency gains, notably eliminating the redundancy of maintaining multiple models.
Our system, named Biomedical Novelty Extractor (BioNExt), was initially tested during the BioCreative VIII Track 1 (BioRED) challenge (3, 27). The challenge was structured around two principal tasks: (i) relation extraction and novelty detection and (ii) end-to-end relation extraction and novelty detection. Initially, our efforts were focused on the first task, leading to the development of our “Extractor” module. However, this paper represents an extension of this preliminary work to also address the second task of this challenge. We have broadened the scope of our system to include the “Tagger” and “Linker” modules, thereby enhancing the capabilities of the system to identify all relationships and showcasing a comprehensive solution to the demands of advanced biomedical text mining. In summary, our main contributions in this paper are the following:
An end-to-end model capable of identifying six types of entities, normalizing them to standard knowledge bases, extracting relations between the entities and classifying these relations as novel. Furthermore, we release the full model pipeline as an open source, enabling users to easily run it locally: https://github.com/ieeta-pt/BioNExt.
To the best of our knowledge, we present the first exploratory usage of Large Language Models (LLMs) as few-shot learners for performing sequence variant annotation.
The introduction of an innovative training methodology that simultaneously addresses the learning of relations and novelty.
Background
Driven by the exponential growth of biomedical literature, the use of Natural Language Processing (NLP) for biomedical knowledge discovery has increasingly become a focal point of scientific research. The main goal in this domain is to extract insights from unstructured data, which can further the understanding of complex biological systems, disease mechanisms and potential therapeutic targets. The nature of biomedical knowledge increases the complexity of this task; however, it can be decomposed into three critical components: NER, entity linking, and relation extraction and classification.
NER
In the biomedical field, NER aims to extract structured information from the extensive corpus of unstructured texts. This process involves the identification and categorization of key biomedical entities, such as genes, diseases and chemicals. Formally, given a sequence of tokens |$s = w_1, w_2, {\ldots}, w_n$|, the objective of NER is to generate a series of tuples |$(I_s, I_e, t)$|. Each tuple represents a named entity found in s, where |$I_s, I_e \in [1,n]$| denote the starting and ending indices of the entity within the sequence and t corresponds to the type of entity, categorized according to a predetermined set of types (15).
Traditional, more straightforward strategies use dictionaries and regular expressions to identify entities in text (12). With the advent of machine learning, more sophisticated approaches were proposed, in which NER was framed as a sequence labeling task, with each token labeled as part of an entity (35). This fostered the early adoption of classification models for NER, also allowing for the detection of longer entities spanning multiple words.
The effectiveness of these sequence labeling models is further bolstered by their use of tagging schemas and sequence classification techniques. The beginning, inside, outside (BIO) schema is widely adopted due to its simplicity in marking the start, continuation and nonentity portions of text. Other tagging variants are the BILOU (beginning, inside, last, outside, unit) and IOBES (inside, outside, beginning, end, single) schemas, which additionally distinguish the last token of multitoken entities and single-token entities. Some authors report better NER results when employing the more detailed IOBES scheme (14, 58), whereas others did not observe significant improvements over the BIO tagging scheme (35). The BIO scheme remains the most commonly used tagging scheme in the NER literature.
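To make these schemas concrete, the following minimal sketch encodes a short example sentence under both BIO and IOBES; the tokenization and labels are purely illustrative and not taken from any dataset.

```python
# Illustrative only: whitespace tokens and hand-assigned labels.
tokens = ["BRCA1", "mutations", "increase", "breast", "cancer", "risk", "."]

# BIO: B-* opens an entity, I-* continues it, O marks non-entity tokens.
bio_labels = ["B-Gene", "O", "O", "B-Disease", "I-Disease", "O", "O"]

# IOBES additionally marks single-token entities (S-*) and the last token
# of a multi-token entity (E-*).
iobes_labels = ["S-Gene", "O", "O", "B-Disease", "E-Disease", "O", "O"]

for tok, bio, iobes in zip(tokens, bio_labels, iobes_labels):
    print(f"{tok:10s} {bio:10s} {iobes}")
```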
Neural networks have been heavily used in combination with CRFs for NER. In particular, bidirectional long short-term memory networks have been used extensively in the literature (11, 24, 35). In more detail, CRF classifiers add another layer of sophistication to NER models (76). By considering the dependencies between sequential tags, CRFs ensure the logical coherence of identified entities, commonly employing the BIO tagging scheme. In other words, these models take into account the previous predictions when making the next prediction in a (token-level) sequence.
The evolution of computational techniques has significantly advanced the state-of-the-art in biomedical NER, especially by the integration of transformer-based models, which enabled more accurate entity recognition (2, 32, 65).
Ensemble methods and postprocessing rules represent additional strategies to enhance NER accuracy (8). By aggregating predictions from multiple models or iterations, ensemble methods can mitigate individual model biases or errors, leading to more reliable entity recognition. Postprocessing steps, which apply domain-specific heuristics, further refine the model’s output, correcting common mistakes and resolving ambiguities inherent in biomedical entity recognition.
Entities detected by NER may be domain-specific or general. For example, in the biomedical domain an entity can refer to a disease or a chemical, whereas in the general domain it can refer to a person, place or object.
In 2023, the National Center for Biotechnology Information (NCBI) released AIONER, a state-of-the-art tool for recognizing entities of different types at once, resulting in improved robustness (45). The authors proposed the all-in-one tagging scheme to accommodate different entity classes from multiple datasets, and trained the model on several datasets, including the BioRED dataset.
Entity linking
Named entity linking or named entity normalization refers to the task of assigning unique identifiers from standard terminologies to named entities. This step usually follows the NER task, in which entity text mentions have already been detected within a text. In the biomedical domain, there are several knowledge resources to aid entity linking for different entity types such as genes and diseases, with many of these terminologies being included in the Unified Medical Language System (9). For example, NCBI Gene contains information on genes (10), and the Medical Subject Headings (MeSH) vocabulary contains unique identifiers for biomedical and health-related concepts including diseases and chemical substances (42). These vocabularies are commonly employed to link entities to their unique identifiers, which is helpful for downstream tasks such as relation extraction.
Different shared tasks have been conducted, and several datasets have been published throughout the years for biomedical entity linking, owing to its inherent importance and difficulty (21). Regarding the linking of entities found in the biomedical scientific literature, BioCreative has been the foremost effort, having organized multiple challenges since 2004 spanning entity normalization for various biomedical concepts (26, 36, 75). In the clinical domain, 2019 n2c2 Track 3 showcased a challenge to normalize medical concepts (problems, treatments and tests) using Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) and RxNorm vocabularies (47). Similarly, the ShARe/CLEF 2013, SemEval-2014 Task 7 and SemEval-2015 Task 14 shared tasks focused on normalization of disorders mapped to SNOMED CT (20, 55, 56). Identical challenges have been organized by the Text Mining Unit at the Barcelona Supercomputing Center, where entities of different types, found in Spanish clinical narratives, are normalized to SNOMED CT: PharmaCoNER for pharmacological substances (22), DisTEMIST for diseases (51), MedProcNER for medical procedures (40) and SympTEMIST for symptoms (41). The SMM4H 2017 shared task tackled normalization of adverse drug reactions, found in social media text, to MedDRA concepts (61).
Traditional approaches for entity linking rely on dictionaries or gazetteers, containing a list of terms mapped to their unique identifiers, where direct string matching, regular expressions or heuristics are employed. More advanced methods create numerical representations (embeddings) for the entity mentions and for the terms found in standard vocabularies associated with unique identification codes. The aim is then to compute a concept vector for a given entity mention and identify the nearest vector among all concepts in the vocabulary. Typically, the comparison of these vectors is achieved through cosine similarity. Construction of these concept vectors typically relies on shallow neural models or transformer-based language models such as SapBERT for text representation (43, 50). State-of-the-art tools for NER and normalization of multiple biomedical entities include BERN2 and HunFlair2, which rely on transformer-based models (60, 66).
With special focus on the biomedical domain, NCBI published PubTator in 2012 as a tool for assisting manual curation, such as annotating biomedical concepts in PubMed abstracts (71, 72). Since 2013, PubTator has provided preannotations for different biomedical concept types with state-of-the-art performance for assisting manual biocuration (73). Over the years, PubTator has evolved as a web-based text mining system for assisting biocuration and has been shown to improve both the efficiency and accuracy of manual curation. In 2019, PubTator Central was published, improving upon its predecessor and allowing users to retrieve and view bioconcept annotations in PubMed abstracts and full-text articles through a renewed web interface (69). It used state-of-the-art text mining systems for identifying six concept types: genes (proteins), genetic variants (mutations), diseases, chemicals, species and cell lines.
Recently, PubTator 3.0 was released with numerous improvements, including not only entity annotations but also semantic relationships (68). Its entity recognition performance was also improved compared with PubTator Central (also known as PubTator2). PubTator 3.0 includes 12 relation types such as association, cause, drug interaction, inhibition, stimulation, treatment and others. Its entities are identified using the previously proposed AIONER system (45) and are linked to standard vocabularies using a variety of tools: GNorm2 is used to normalize genes to NCBI Gene and species to NCBI Taxonomy (74), TaggerOne normalizes diseases to MeSH and cell lines to Cellosaurus (37), chemicals are normalized using the NLM-Chem tagger to MeSH identifiers (28) and tmVar3 normalizes genetic variants using NCBI dbSNP identifiers (rs#) or tmVar normalized forms (70).
Relation extraction and classification
Biomedical relation extraction involves identifying semantic relationships between entities mentioned in biomedical texts. The primary goal of relation extraction is to uncover pairs of entities within the text that exhibit some form of semantic connection. These entities could represent various biomedical concepts such as proteins, genes, diseases, chemicals and their various interactions. It is worth noting that, although typically a relation is marked between two entities, a relation can involve more than two entities. Overall, relation extraction plays a pivotal role in tasks such as knowledge base construction, information retrieval and biomedical literature mining (25, 30).
Relation classification builds upon the extracted pairs of entities by aiming to discern the specific type of relationship that exists between them. Once the entities participating in a relation have been identified, relation classification seeks to categorize these relations. In the context of biomedical texts, these relations can encompass a wide range of interactions, such as protein–protein interactions (33), drug–disease associations (75), chemical–protein interactions (52) and more. Effective relation classification algorithms leverage machine learning techniques, often utilizing annotated datasets to train models capable of accurately predicting the relationship types between entities. Classical rule-based approaches (7) in biomedical relation extraction have gradually been surpassed by more sophisticated deep learning methodologies. The advent of transformers and models like Bidirectional Encoder Representations from Transformers (BERT) (18) has led to a paradigm shift, with the majority of challenges in NLP, including relation extraction, now being predominantly addressed using transformer-based architectures. Notably, transformer-based models have demonstrated state-of-the-art performance in various relation extraction tasks (78). Additionally, there has been notable research on jointly performing NER and relation extraction (1, 6, 19).
While traditional approaches often treated relation extraction and classification as distinct tasks, the advent of deep learning has facilitated the integration of these tasks into a unified relation classification framework. This integration is achieved by introducing a negative class into the relation classifier and eliminating the need for a separate relation extractor (3). By doing so, the model is trained to not only detect relations but also classify them.
Research on novel relation classification in biomedical text mining is relevant for biomedical researchers to stay updated about new discoveries (27, 44). Some approaches aim to perform relation classification and novelty detection simultaneously, integrating these tasks into a single step (38, 48). In contrast, others focus on classifying predefined relations while explicitly seeking to identify and classify novel relationships between entities (13, 34, 39, 49, 54, 59, 62).
Methodology
In this section, we describe the dataset, the evaluation metrics used for this task and all the details regarding the end-to-end relation extraction and novelty detection model.
Dataset
The BioRED dataset (44) spans six distinct entity classes: Genes, Diseases, Chemicals, Variants (mutations), Species and Cell Lines, with the goal of revealing previously undiscovered interactions between these entities. The dataset was curated from PubMed documents selected via specific queries. A dedicated team of three annotators with a biomedical informatics background undertook the initial annotation of entities and relations, which was conducted using PubTator3 before publication in 2022. Moreover, the task of discerning novelty among these relationships was entrusted to two biologists, ensuring the validity and significance of the associations identified. Initially, the dataset consisted of 600 documents, which were divided into sets for training (400), validation (100) and testing (100). For the competition phase (BioCreative VIII Track 1), an additional 400 documents were introduced as a blind test set (27), with the expanded dataset’s annotation responsibilities being carried out by eight biocurators from the National Library of Medicine. Within the BioRED dataset, each of the six entity classes is linked to its respective standard vocabulary: Genes to NCBI Gene, Diseases to MeSH/OMIM, Chemicals to MeSH, Variants to dbSNP (or the tmVar notation), Species to NCBI Taxonomy and Cell Lines to Cellosaurus.
Additionally, the authors identified a total of eight possible relationships between the entities, namely, Positive Correlation, Negative Correlation, Association, Binding, Drug Interaction, Cotreatment, Comparison and Conversion. The relationships predominantly feature interactions among Diseases, Genes, Variants and Chemicals, mirroring their frequent occurrence in the biomedical literature. Table 1 provides a detailed breakdown of the distribution of entity mentions throughout the BioRED-BioCreative VIII (BCVIII) dataset, as well as the interactions between entities.
Table 1. Distribution of entity mentions and relations in the BioRED-BCVIII dataset.

| Annotations | Train | Test |
| --- | --- | --- |
| Documents | 600 | 400 |
| Gene | 6697 (1643) | 5728 (1278) |
| Disease | 5545 (778) | 3641 (644) |
| Chemical | 4429 (651) | 2592 (618) |
| Variant | 1381 (678) | 1774 (974) |
| Species | 2192 (47) | 1525 (33) |
| Cell Line | 175 (72) | 140 (50) |
| Total | 20,419 (3869) | 15,400 (3597) |
| Disease–Gene | 1633 | 1610 |
| Chemical–Gene | 923 | 1121 |
| Disease–Variant | 893 | 975 |
| Gene–Gene | 1227 | 936 |
| Chemical–Disease | 1237 | 779 |
| Chemical–Chemical | 488 | 412 |
| Chemical–Variant | 76 | 199 |
| Variant–Variant | 25 | 2 |
| Total | 6502 | 6034 |
| Novel Relations | 4532 | 3683 |
Evaluation metrics
The official evaluation metrics used in this work are micro-average Precision, Recall and F1-score (the main evaluation metric). These metrics take into account the number of True Positives (correct predictions), False Negatives (incorrect negative predictions) and False Positives (incorrect positive predictions). The BioCreative VIII BioRED challenge was organized in two subtasks. In Subtask 1, participants were given PubMed abstracts, with entities annotated by human experts, and were asked (i) to extract relation pairs, (ii) to identify their semantic type and (iii) to determine whether the relation is novel. The annotated entities included the text mention (span with start- and end-character offsets), the entity type (gene, disease or other) and an identifier code linked to a specific terminology (NCBI Gene, MeSH or others). In Subtask 2, participants were solely given the PubMed abstracts and were challenged to build an end-to-end system for the same relation extraction task (identification of relation pairs, relation classification and their novelty factor).
The organizers considered four evaluation results for relation extraction:
Relation pair identification: Whether a pair of entities was identified as constituting a relationship;
Relation classification: Whether an entity pair relationship was categorized with the correct relation class (association, drug interaction, positive correlation or others);
Relation novelty factor: Whether an entity pair relationship is considered novel given the article context.
All: This combines the previous three scenarios; a relation is considered correctly predicted only if the entity pair relationship exists and is classified with the correct relation type and the correct novelty factor.
These four scenarios were evaluated, and teams were ranked according to each of these relation extraction results. In Subtask 2, participants needed to build their own NER and entity linking systems, and these tasks were also evaluated with the micro-average F1-score. During the development of our end-to-end model presented in this work, we also calculated the NER and entity linking results for each entity class to inspect which entity classes needed more refinement and attention from our model.
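As a small worked example of the micro-averaged metrics, the sketch below computes Precision, Recall and F1 from pooled counts of true positives, false positives and false negatives; the counts are invented purely for illustration.

```python
# Invented counts pooled over all relation classes (micro-averaging sums the
# per-class counts before computing the ratios).
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)                        # 0.800
recall = tp / (tp + fn)                           # 0.667
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```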
End-to-end system
As previously mentioned, our system is structured around three core modules—“Tagger,” “Linker” and “Extractor”—which operate in a cascading pipeline; a brief architectural overview can be seen in Figure 1. Each module is designed to address, in an isolated manner, a distinct task within the broader process.
Tagger (NER): The “Tagger” module’s objective is to identify biomedical entities within a document, classifying them into one of several categories: Gene, Disease, Chemical, Sequence Variant, Species (Organism) or Cell Line.
Linker (Entity Linking): Following entity identification, the “Linker” module takes over to normalize the identified entities to their corresponding entries in standard knowledge bases, thus ensuring consistency and accuracy in entity representation.
Extractor (Relation Detection and Classification): The final module, “Extractor,” is tasked with discerning the relationships that exist among the various entities within a document. It classifies these relations into one of the eight predefined categories and identifies which of these relations are novel, i.e. not previously described in the literature.
Our system is developed entirely in Python, using the PyTorch and Hugging Face libraries for the implementation of the various models (more information is provided in the GitHub repository1). All the code was run on a machine with an Intel(R) Xeon(R) Gold 5218R CPU, 128 GB of RAM and an NVIDIA Quadro RTX 8000 GPU.
Tagger (NER)
For the “Tagger” module, our approach aligns with methodologies described in other works (3, 4, 44), which can be seen in Figure 2, employing the BIO-tagging schema for data encoding. This schema, widely adopted in the field as highlighted by Lample et al. (35), assigns each token to a category—beginning (B), inside (I) or outside (O)—to mark entity boundaries within the text. Subsequently, data encoded in this manner are processed by a transformer-based model, which is enhanced with a Masked-CRF classifier (2, 76), to accurately identify entity types.
Given the variety of entities in the BioRED dataset—ranging from Genes and Diseases to Chemicals and Cell Lines—the need for a multiclass approach becomes evident, which differs from previous works that primarily focused on single-entity identification. To accommodate this, we extend the BIO tagging schema to multiple classes. As a result, our label set expands to include specific tags for each entity class, such as B-Gene, I-Gene, through to B-Diseases and I-Diseases, framing this as a 13-class sequence classification problem.
Formally, let us consider |$x=\{x_1,x_2,{\ldots},x_N\}$| as a sequence of text tokens, where xi represents the i-th token in the text and N denotes the total number of tokens; |$y=\{y_1,y_2,{\ldots},y_N\}$| as the corresponding sequence of labels, where each yi from a set of predefined labels such as |$\{O, {\rm B-Gene}, {\rm I-Gene}, {\ldots}, {\rm B-Diseases}, {\rm I-Diseases}\}$| is assigned to token xi. To estimate |$P(y|x)$|, the probability of assigning a label sequence y to a given token sequence x, traditional methods might assume label independence, calculating |$P(y|x)$| as a product of individual label probabilities given the entire sequence, |$P(\textbf{y}|\textbf{x}) = \prod^N_{i=1} P(y_i|\textbf{x})$|. However, as pointed out in (44), this overlooks the dependencies between labels, which is especially critical in BIO tagging where, for example, an “I” (inside) tag must always follow a “B” (beginning) tag. To account for label dependencies, we modify the approach to include the probability of each label not just given the entire sequence x but also considering the previous label, thus incorporating sequential context. In practice, this context-aware estimation is achievable by using a linear-chain CRF to model |$P(y|x)$|,
|$P(\textbf{y}|\textbf{x};\theta) = \frac{1}{Z(\textbf{x})}\exp\Big(\sum_{i=1}^{N} f_u(y_i,\textbf{x};\theta_u) + \sum_{i=2}^{N} f_t(y_{i-1},y_i;\theta_t)\Big),$|
where θ represents the trainable parameters, fu is the unary function and ft is the transition function. The unary function computes the unary potentials, which essentially score each label being assigned to token xi while considering the whole sequence. The transition function ft simply corresponds to a transition matrix, parameterized by θt, with its score obtained by looking up the corresponding entry in the matrix. Lastly, |$Z(\textbf{x})$| is known as the partition function and acts as a normalizing factor to obtain a probability distribution over all label sequences.
Additionally, to further enforce the restrictions of the BIO tagging schema, we follow the ideas of (3, 44, 76) and mask our CRF by applying a large negative weight to transitions that are impossible under the BIO schema, such as predicting an “I” right after an “O.”
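A minimal sketch of such a transition mask is shown below, using a truncated label set and an arbitrary masking value; it illustrates the idea rather than reproducing our exact implementation.

```python
import torch

# Truncated label set for illustration; the full model uses 13 labels.
labels = ["O", "B-Gene", "I-Gene", "B-Disease", "I-Disease"]
NEG = -1e4  # large negative score assigned to forbidden transitions


def bio_transition_mask(labels):
    """Additive mask where entry [i, j] is 0 if the transition from label i to
    label j is allowed under BIO, and a large negative value otherwise
    (e.g. O -> I-Gene or B-Gene -> I-Disease are forbidden)."""
    n = len(labels)
    mask = torch.zeros(n, n)
    for i, prev in enumerate(labels):
        for j, curr in enumerate(labels):
            if curr.startswith("I-") and prev not in (f"B-{curr[2:]}", f"I-{curr[2:]}"):
                mask[i, j] = NEG
    return mask


print(bio_transition_mask(labels))
```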
We relied on BERT-based (18) encoders as part of the unary function of the Masked CRF. Given the limited context size of this architecture (512 tokens), we also adopted a sliding window strategy to split the document into more manageable sizes. More precisely, we consider a window of 512 tokens, with k tokens of left and right context. These context tokens are not taken into consideration for the final label prediction, but rather serve to contextualize the actual predictions.
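The sliding-window splitting can be sketched as follows; the window and context sizes, and the representation of tokens as plain integers, are simplifications for illustration.

```python
def sliding_windows(token_ids, window=512, context=32):
    """Split a long token sequence into overlapping windows. Each window carries
    `context` tokens of left/right context that are fed to the model but whose
    predictions are discarded; only the central tokens contribute labels."""
    stride = window - 2 * context
    windows = []
    for start in range(0, len(token_ids), stride):
        left = max(0, start - context)
        right = min(len(token_ids), start + stride + context)
        windows.append({
            "input_ids": token_ids[left:right],
            # window-relative positions whose predictions are kept
            "keep_from": start - left,
            "keep_to": start - left + min(stride, len(token_ids) - start),
        })
    return windows


# Example: a 1200-token document split with a 512-token window and 32-token context.
for w in sliding_windows(list(range(1200))):
    print(len(w["input_ids"]), w["keep_from"], w["keep_to"])
```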
During training, we also adopted the “Random Token Replacement with Unknown” augmentation technique presented in (3). This involves randomly replacing entity tokens in the input sequence with the special unknown token “[UNK].” The intuition is to force the model to not only rely on the entity text but also use the context tokens.
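A hedged sketch of this augmentation is shown below: tokens belonging to an entity (any non-O label) are replaced with the unknown token with some probability; the probability value and token string are illustrative choices.

```python
import random


def unk_augment(tokens, labels, unk_token="[UNK]", p=0.15, seed=None):
    """Randomly replace entity tokens (B-*/I-* labels) with the unknown token,
    forcing the model to rely on the surrounding context."""
    rng = random.Random(seed)
    return [
        unk_token if lab != "O" and rng.random() < p else tok
        for tok, lab in zip(tokens, labels)
    ]


tokens = ["BRCA1", "mutations", "increase", "breast", "cancer", "risk"]
labels = ["B-Gene", "O", "O", "B-Disease", "I-Disease", "O"]
print(unk_augment(tokens, labels, p=0.5, seed=0))
```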
Finally, we employ two “postprocessing” steps, namely, decoding and ensembling. The decoding phase takes the label outputs from the various sequences of each document and extracts the corresponding entity classes and spans. The ensemble consists of combining the outputs of various models at the entity level, taking advantage of the knowledge learnt by multiple models.
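The decoding step, which turns per-token BIO labels back into typed entity spans, can be sketched as follows; token-index spans are used here for simplicity, whereas the real system works with character offsets.

```python
def decode_bio(labels):
    """Convert a BIO label sequence into (start, end, entity_type) spans,
    with token positions and an exclusive end index."""
    spans, start, ent_type = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = i, lab[2:]
        elif lab.startswith("I-") and ent_type == lab[2:]:
            continue  # extend the currently open entity
        else:  # "O" or an inconsistent I- tag closes any open entity
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
    if start is not None:
        spans.append((start, len(labels), ent_type))
    return spans


print(decode_bio(["B-Gene", "O", "B-Disease", "I-Disease", "O"]))
# [(0, 1, 'Gene'), (2, 4, 'Disease')]
```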
Linker
To perform entity linking on the entities identified by NER, we employ a multistage pipeline in an attempt to maximize the number of hits for the entities detected in the previous stage. Although we attempt to apply the same linking methodology to every entity type, in practice most of the knowledge bases differ significantly from one another and hence require different processing pipelines. Furthermore, for each entry in a knowledge base, there may be many ways to find a code (by looking at concepts, synonyms or descriptions), and so a single text term can correspond to many identifiers, which creates a disambiguation problem.
The general idea behind our entity linking is as follows (illustrated in Figure 3):
Direct match over training data: The initial step in our pipeline is to create a dictionary of the training entities and their corresponding codes. It is assumed that any entity that exists in the training data can be assigned its corresponding code.
Direct match over corpus: The next step is to perform direct matching over the respective knowledge corpus. This step presents challenges such as the large scale of the various corpora, as well as the number of sources over which direct lookup can be performed. Details are provided in the explanation of each knowledge base below.
Semantic search over the corpus: The next step is to perform a semantic search using text embeddings from SapBERT (large) (43). When using the embeddings, we perform cosine similarity between the corpus and the entity text. In the cases where we have multiple text knowledge bases, we select the code that has the largest cosine similarity value above a certain threshold.
Disambiguation: The final step is to resolve ambiguities in terms that have multiple codes assigned to them. For this, we propose a naive algorithm that selects the most frequent code that is shared between the maximum number of entities. This approach is based on the premise that documents are likely to maintain consistency in their annotations, suggesting that the most shared code among entities is the most likely to be the correct one.
When performing linking, unless otherwise stated, it is assumed that all the entity terms are converted to lowercase in order to maximize matching between entity mentions and our vocabulary entries.
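The overall cascade can be summarized in the following sketch; the helper `embed` function, the aligned `corpus_codes`/`corpus_embeddings` structures and the similarity threshold are illustrative placeholders rather than our exact implementation.

```python
import numpy as np


def link_entity(mention, train_dict, corpus_dict, corpus_codes, corpus_embeddings,
                embed, threshold=0.8):
    """Cascade: (i) direct match over training annotations, (ii) direct match over
    knowledge-base terms, (iii) semantic search with a cosine-similarity threshold.
    Returns an identifier code, or None if the mention remains unresolved."""
    text = mention.lower()
    if text in train_dict:        # (i) mention seen in the training annotations
        return train_dict[text]
    if text in corpus_dict:       # (ii) exact match against vocabulary terms
        return corpus_dict[text]
    # (iii) nearest concept by cosine similarity (vectors assumed L2-normalized)
    sims = corpus_embeddings @ embed(text)
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return corpus_codes[best]
    return None                   # left for disambiguation or other fallbacks
```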
Species/organism
To normalize species, we use the NCBI-Taxonomy knowledge base. In this corpus, there are 2,564,321 codes containing a total of 3,998,949 terms, which we use to build the lookup dictionary. These values come from the “name_txt” field of the names.dmp file, which can be obtained here.2 For the linking of species, we only relied on direct match over the training data, direct match over the corpus and disambiguation. We did not employ any semantic search here, due to the high matching rate that we were already obtaining with direct match (97%).
Chemicals
For the linking of chemicals, we utilized MeSH codes specifically associated with chemicals, incorporating both the official codes and those from the Supplementary Concept Records. To elaborate, we filtered the codes starting with “D*,” which are explicitly designated for chemical substances within the MeSH hierarchy. For our search, we considered only three fields, namely, concepts (25,465), synonyms (93,091) and definitions (10,812), corresponding to 10,541 different entities. The supplementary data contain an additional 323,495 entities and 216,429 definitions. After performing the exact match on the training data, we perform the semantic search over the embeddings of concepts, synonyms and definitions combined, selecting the highest cosine score above a threshold.
Diseases
Diseases follow a similar linking pipeline to chemicals. However, as knowledge bases we used the CTD disease vocabulary, which combines MeSH and the Online Mendelian Inheritance in Man (OMIM) database. Again, we use three fields, namely, concepts (13,298), synonyms (77,319) and definitions (10,812). There are 13,298 unique codes in the MeSH corpus for diseases, which were filtered by descriptors starting with “C*.”
Cell lines
In the case of Cell line linking, we utilized the Cellosaurus knowledge base, which contains 152,231 unique codes, each associated with a distinct concept. In terms of linking, we follow the standard pipeline, performing direct match over the training data and the corpus and then using semantic search to find semantically similar matches for the remaining terms.
Genes
For gene linking, we relied on the NCBI-Gene knowledge base, which assigns unique codes to genes associated with specific organisms. A crucial aspect of this process is recognizing that a gene’s identity is contingent upon its organism, given that identical genes may exist across different species. To address this, our preliminary step involves identifying the relevant organism for each gene mention in the document. For this, we implemented a straightforward algorithm that looks for the organism mention closest to the gene term directly in the text. Here, our assumption is that the closest organism to a gene mention is the organism that the gene refers to. Additionally, if no organism is found, we consider the organism to be human (code 9606).
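The nearest-organism heuristic can be sketched as below; the representation of species mentions as (offset, code) tuples is illustrative, and the human fallback follows the description above.

```python
def nearest_organism(gene_offset, species_mentions, default_code="9606"):
    """Return the taxonomy code of the species mention closest (in character
    offset) to a gene mention; fall back to human (9606) if none is present."""
    if not species_mentions:
        return default_code
    return min(species_mentions, key=lambda m: abs(m[0] - gene_offset))[1]


# Example: a gene at offset 120, with mouse mentioned at 40 and human at 300.
print(nearest_organism(120, [(40, "10090"), (300, "9606")]))  # -> "10090"
```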
Following this organism identification, we continued with our previously described linking methodology, but now only considering gene codes that belong to the identified organism. With respect to the NCBI-Gene knowledge base, there are 48,880 organisms with genes, totaling 50,941,500 unique gene codes. For each gene, we only consider the fields symbols, synonyms, descriptions and other designations as descriptors, totaling 101,928,344 entries.
Finally, due to computational requirements, our semantic search with embeddings is limited to the seven organisms most prevalent in the corpus: house mouse (10090), Norway rat (10116), human immunodeficiency virus (11676), respiratory syncytial virus (12814), thale cress (3702), zebrafish (7955) and human (9606). This selection was necessary due to the substantial memory requirements for holding the embeddings, which would exceed 300 GB for the full knowledge base.
Sequence Variants
Sequence variants present a unique challenge within our pipeline, primarily because most do not have a unique identifier and instead utilize the tmVar (70) code notation. This notation is a manually crafted format that standardizes the description of sequence variants. For example, “Arg-114” and “-3170G>A” are standardized to “p|Allele|R|114” and “c|SUB|G|-3170|A,” respectively. Additionally, similar to gene linking, sequence variants can have different codes depending on the gene in which they occur. Given these circumstances, we divided the linking task into three stages. First, we identify the gene each sequence variant refers to; for that, we use the same algorithm used in gene linking to find the nearest gene mention to the sequence variant. Next, we conduct a direct lookup for the sequence variant and gene within the dbSNP database. If this search is unsuccessful, we then proceed to generate the corresponding tmVar code notation.
Regarding the direct lookup, our first approach was to download the entire dbSNP database, which catalogs genetic variation codes across various species. However, given the database’s substantial size (over 200 GB of raw data), we opted instead to utilize LitVar2, a public Application Programming Interface (API) capable of performing sequence variant lookups.3 Note that this API represents the only third-party dependency in our pipeline although theoretically, we could develop an in-house solution for sequence variant lookup.
With regard to the tmVar notation generation, we initially tried to use tmVar itself; however, we were unable to deploy the tool successfully. We therefore framed this task as a translation problem: translating a given gene and sequence variant mention into the corresponding tmVar notation. Specifically, we investigated training a translation model and utilizing an LLM for few-shot code translation. For the translation model, we utilized the plain T5-base model trained on pairs of sequence variants and tmVar notations from the training dataset. For the LLM approach, we conducted a semantic search to identify up to 25 similar translation examples from the training data. These examples were then used to instruct the LLM to generate the next code, following the patterns observed. The detailed prompt used is presented in Prompt 1. Additionally, we observed that amino acids and their respective codons are typically normalized to their corresponding single-letter codes. To accommodate this, we manually constructed a translation table that converts both codons and amino acids into their single-letter representations.
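As an illustration of the amino-acid normalization step, the sketch below maps standard three-letter amino-acid codes to their single-letter forms; the table described above also covers codons, which are omitted in this sketch for brevity.

```python
# Standard IUPAC three-letter to one-letter amino-acid codes.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def normalize_residue(token):
    """Map a three-letter residue such as 'Arg' to its single-letter code 'R'."""
    return AA3_TO_1.get(token.capitalize(), token)


print(normalize_residue("Arg"))  # -> "R", as in "Arg-114" -> "p|Allele|R|114"
```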
Extractor
As mentioned, the objective of the “Extractor” is to identify relations between the normalized entities, classify them and determine which of these relations are novel. Most of our “Extractor” module follows the system already presented in the BioCreative VIII BioRED track (3), with a small set of additions and corrections. To make this work self-contained, we briefly describe our previous “Extractor” model and then discuss the changes that we made.
First, let us define the task of relation extraction as assigning potential relations |$r_k\in R$| to pairs of entities |$(e_i,e_j)$| within a document D containing E unique linked entities, culminating in the triplet |$(e_i,r_k,e_j)$|. Additionally, we can also frame the novelty task as a binary classification over the previous triplet, |$(e_i,r_k,e_j)\rightarrow \{0,1\}$|. However, while a document may theoretically contain up to |$E^2$| entity pairs, in reality only a smaller subset of these pairs holds relevant relations. To address this, we introduced an additional “negative” class alongside the original set of eight classes, specifically to identify instances where the entity pair does not exhibit a relation. Based on this joint definition, we can now develop a model capable of simultaneously predicting relations while also assessing their novelty for any given document annotated with a set of |$(e_i,e_j)$| pairs.
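To make this formulation concrete, the sketch below enumerates candidate entity pairs for a document and assigns the negative class (index 8) to pairs without an annotated relation; the identifier strings and data structures are illustrative.

```python
from itertools import combinations

NEGATIVE = 8  # index of the added "no relation" class; 0-7 are the relation types


def build_examples(entity_ids, gold_relations):
    """Enumerate unordered entity pairs of a document and label each with the
    annotated relation class and novelty flag, or with the negative class.
    `gold_relations` maps frozenset({e_i, e_j}) -> (relation_class, is_novel)."""
    examples = []
    for e_i, e_j in combinations(sorted(entity_ids), 2):
        rel, novel = gold_relations.get(frozenset((e_i, e_j)), (NEGATIVE, 0))
        examples.append({"pair": (e_i, e_j), "relation": rel, "novel": novel})
    return examples


# Illustrative document with three linked entities and one annotated relation.
gold = {frozenset(("MESH:D001943", "4524")): (2, 1)}
print(build_examples(["MESH:D001943", "4524", "9606"], gold))
```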
In terms of architecture, as depicted in Figure 4, the model leverages a transformer-based architecture to produce contextualized representations for each entity. From these, a multihead attention layer produces a joint representation that we use to perform both relation classification and novelty detection. Furthermore, to accurately encode contextual entity information as input for the model, we introduce new tokens “[s1],” “[e1],” “[s2]” and “[e2],” which correspond to the start and end of the two entities in the text. These tokens are directly inserted into the text. For example, in the sentence “(…) high-grade [s1]glioma[e1] (…),” “glioma” corresponds to the first entity. In order to jointly train this model on both tasks, we propose a masked combined loss defined in Equation 1,
|$L = L_{\rm r} + \mathbb{1}[y_{\rm r} \neq 8] \cdot L_{\rm n},$| (1)
where we sum the cross-entropy losses for the relation (|$L_{\rm r}$|) and novelty (|$L_{\rm n}$|) tasks. Notably, the novelty loss |$L_{\rm n}$| is considered only when the entity pair is deemed valid, i.e. its relation yr does not correspond to the negative class |$(y_{\rm r} \neq 8)$|.
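A minimal PyTorch sketch of this masked combined loss is given below, assuming logits over 9 relation classes (index 8 being the negative class) and 2 novelty classes; it mirrors the description above rather than our exact training code.

```python
import torch
import torch.nn.functional as F

NEGATIVE = 8  # index of the "no relation" class


def masked_joint_loss(rel_logits, nov_logits, rel_labels, nov_labels):
    """Relation cross-entropy plus novelty cross-entropy, where the novelty term
    only counts entity pairs whose gold relation is not the negative class."""
    loss_rel = F.cross_entropy(rel_logits, rel_labels)
    mask = rel_labels != NEGATIVE
    if mask.any():
        loss_nov = F.cross_entropy(nov_logits[mask], nov_labels[mask])
    else:
        loss_nov = rel_logits.new_zeros(())  # no valid pair in this batch
    return loss_rel + loss_nov


# Toy batch of four entity pairs: two related (one of them novel), two negative.
rel_logits = torch.randn(4, 9)
nov_logits = torch.randn(4, 2)
rel_labels = torch.tensor([2, 0, 8, 8])
nov_labels = torch.tensor([1, 0, 0, 0])
print(masked_joint_loss(rel_logits, nov_logits, rel_labels, nov_labels))
```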
Building on our initial model, we have introduced some postchallenge enhancements that will be discussed later.
Dynamic negative sampling: In our previous approach, we randomly selected negative examples from the training data, where a negative example corresponds to an entity pair without a valid relation. Due to the vast number of possible pair combinations, many of these negative pairs represented easy cases that offered minimal contribution to model training. To address this issue, we introduced a strategy for dynamically sampling negative examples using a previously trained model. In more detail, we initially train a model using random negative samples, which we call M0. Next, we generate a new dataset of negative samples by applying M0 to the training data, selecting pairs for which M0 showed low confidence in the negative classification or which it incorrectly predicted as positive. The rationale is that these examples correspond to “harder” negative pairs that, when prioritized over easy negative samples, should yield better training performance; a sketch of this selection is shown after this paragraph. This curated set of “harder” negative examples is then used to train a new model, referred to as M1. It is worth noting that this process can be iteratively repeated to generate further model iterations, although computational costs increase significantly with each cycle. Exploring the benefits of continued iterations is left for future work.
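The hard-negative selection can be sketched as follows; the model call signature, batch layout and confidence threshold are assumptions for illustration and do not reproduce our exact code.

```python
import torch

NEGATIVE = 8  # index of the "no relation" class


@torch.no_grad()
def select_hard_negatives(model, negative_batches, confidence_threshold=0.9):
    """Run the previously trained model (M0) over candidate negative pairs and
    keep those where the probability of the negative class is low (hard
    negatives) or where another class is predicted (false positives)."""
    hard = []
    for batch in negative_batches:
        rel_logits, _ = model(**batch["inputs"])   # assumed (relation, novelty) outputs
        probs = rel_logits.softmax(dim=-1)
        keep = (probs[:, NEGATIVE] < confidence_threshold) | (probs.argmax(-1) != NEGATIVE)
        hard.extend(pair for pair, k in zip(batch["pairs"], keep.tolist()) if k)
    return hard  # used as the negative sample set when training M1
```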
Correction of an assumption: In our previous work, we assumed that the relation triples were directional, in that |$(e_i,r_k,e_j) \nRightarrow (e_j,r_k,e_i)$|. However, we later discovered that this statement is incorrect and that |$(e_i,r_k,e_j) \Rightarrow (e_j,r_k,e_i)$|, which also reduced the number of negative samples present in the dataset.
Results and discussion
In this section, we focus on evaluating and discussing the outcomes achieved by our individual modules and the integrated end-to-end system. Initially, we present the results obtained on the validation set. Subsequently, we detail the performance of our model on the test set and offer a comparative analysis with our submissions to BioCreative VIII Track 1 Subtask 2.
It is important to note that no official evaluation script was provided and the primary mode of evaluation is through a CodaLab4 competition set up by the event organizers. However, this CodaLab competition does not offer metrics for NER or Linking as it primarily focuses on Subtask 1. Consequently, the results we report for NER and Linking were derived using our own evaluation scripts. Due to this discrepancy, we refrain from comparing our NER and Linking outcomes directly with those from BioCreative, as we cannot guarantee the consistency of our metrics with those used by the organizers. Instead, we benchmark our NER and Linking performances against PubTator 3 (68), which represents the current state of the art for both tasks.
Validation results
Here, we first discuss the performance of each module individually and then conclude with the cumulative results of all modules combined in an end-to-end fashion. All the measures are reported over the validation set, containing a total of 100 documents.
Tagger
For the “Tagger,” we are mainly concerned with evaluating the NER performance of our models. In terms of configuration, we adopted the BioLinkBERT-large model, mainly due to the superior performance showcased during the challenge versus other pretrained models. Furthermore, we adopted a context size of 32, while not noticing any difference compared to other context sizes. We believe that this is mainly explained by the documents being abstracts only, which in most cases fit inside the initial window size and hence do not need to be split. Regarding training, we mainly stick with the default hyperparameters of the Hugging Face (77) trainer, with the addition of the random token replacement augmentation technique previously described.
Regarding the results, we present in Table 2 the performance of our NER model versus the state-of-the-art PubTator 3, in terms of micro F1-score. In more detail, we trained and evaluated five NER models with different random seeds and report the results both as the average and as a single run produced by an entity-level ensemble over our five runs. Compared to PubTator 3, our model demonstrates significantly superior performance, outscoring it by 12.46 points. This is an interesting result considering that PubTator 3 was trained on larger and more diverse datasets, including the BioRED dataset, and it shows the importance of fine-tuning on domain-specific data: varying annotation guidelines across datasets can lead to inconsistencies in entity recognition, which we believe is the main reason for these differences. Additionally, it is compelling to note that our entity-level ensemble method managed to produce a combined run that exceeds the average scores of all individual runs. This suggests that leveraging multiple models in an ensemble can effectively enhance overall performance.
Table 2. NER performance (micro F1-score) of BioNExt versus PubTator 3 on the validation set.

| Entity | BioNExt (average) | BioNExt (entity ensemble) | PubTator 3 |
| --- | --- | --- | --- |
| Gene | 92.54 ± 0.56 | **93.16** | 68.42 |
| Disease | 84.80 ± 0.46 | **85.97** | 79.74 |
| Chemical | 91.87 ± 0.41 | **92.33** | 86.04 |
| Variant | 85.49 ± 0.73 | **85.88** | 82.28 |
| Species | 89.91 ± 0.54 | **90.21** | 80.61 |
| Cell Line | 91.40 ± 1.80 | **91.84** | 80.77 |
| Total | 89.57 ± 0.21 | **90.24** | 77.68 |
Results are presented in terms of F1 score. The best scores are highlighted in bold.
Linker
Before addressing the main results, let us discuss our methodology for generating tmVar codes. As previously mentioned, we propose two strategies: (i) training translation models and (ii) utilizing an LLM with a few-shot prompt. For the translation models, we trained both T5-small and T5-large models (57) on tmVar codes from the training data and the tmVar 3.0 corpus (70). For the LLM strategy, we employed the hermes-2-mixtral (67) model. In terms of performance on the validation set, the best translation model achieved an accuracy of 44.15%, while the few-shot LLM approach reached 69.37%. Based on these outcomes, we opted for the LLM approach when predicting tmVar codes.
Regarding the main results, it is important to note that the linking results are obtained over the previous NER runs. The main reason for this is that we were not able to run PubTator 3 over the gold-standard entities due to the unavailability of the API. It should therefore be mentioned that, given our superior NER results, we expected to have an advantage in this linking stage. We would further have liked to test our NER model with PubTator’s linking; however, the API service was not functioning at the time.
However, as observable from the results presented in Table 3, our entity linking performance falls short of PubTator 3, trailing by almost 5 points (77.05 versus 81.96). A closer examination of the results reveals that the discrepancy is most pronounced in the Gene class, with a nearly 10-point gap. This poor performance on the gene class was somewhat anticipated, given that gene linking depends on first finding the correct species to which the gene belongs. Consequently, any error in the linking of species will directly impact the linking of genes, which aligns with our comparatively lower species performance. In light of these unexpected results, a considerable part of the Error Analysis section is dedicated to a more thorough examination of these findings.
Table 3. Entity linking performance (F1-score) of BioNExt versus PubTator 3 on the validation set.

| Entity | BioNExt | PubTator 3 |
| --- | --- | --- |
| Gene | 74.85 | **84.84** |
| Cell Line | 72.22 | **80.85** |
| Variant | 57.34 | **60.08** |
| Chemical | 83.64 | **84.69** |
| Disease | 78.86 | **80.28** |
| Species | 93.27 | **97.76** |
| Total | 77.05 | **81.96** |
Results are presented in terms of F1 score. The best scores are highlighted in bold.
Nevertheless, it is also important to consider that the PubTator 3 system applies different specialized state-of-the-art tools for normalizing each of the entity types (68), namely, GNorm2 (74) for genes and species, TaggerOne (37) for diseases and cell lines, the NLM-Chem tagger (28) for chemicals and tmVar 3.0 (70) for variants. In contrast, we focused on having a generic methodology that could be applied to any entity type, which eases maintenance, since it is a single implementation.
Lastly, given the poor gene linking performance, we anticipated a significant impact on the “Extractor” module’s performance, as genes are involved in half of the relations according to Table 1.
Extractor
The evaluation of the “Extractor” module was conducted in the context of the BioCreative VIII Track 1 challenge. Here, we discuss some validation results obtained during the challenge, as well as present new validation results.
Regarding the type of transformer model to adopt, we mainly considered the two state-of-the-art BERT-based models BiomedBERT5 (23) and BioLinkBERT (79), as well as the newer decoder-only BioGPT model (46). Note that our proposed model operates at the contextualized representation level, enabling compatibility with any type of transformer model. Table 4 presents the final entity pair and novelty scores that we obtained for the validation set when using different pretrained transformer models. As can be observed, our best results were obtained with the BioLinkBERT (large) model, which aligns with the literature (79).
Table 4. Extractor validation results (F1-score) with different pretrained transformer models.

| Pretrained model | Entity pair | + Novel |
| --- | --- | --- |
| BioLinkBERT (large) (79) | **75.99** | **53.43** |
| BioGPT (46) | 61.64 | 40.59 |
| BiomedBERT (23) | 72.38 | 49.34 |
Results are presented in terms of F1 score. The best scores are highlighted in bold.
Regarding the postchallenge enhancements, we mainly propose the dynamic sampling strategy, which we now evaluate. Table 5 compares dynamic sampling with random sampling. As can be observed, by leveraging dynamic sampling we were able to gain more than 2 points in terms of entity pair score, which further translates into gains in novelty. This result aligns with our intuition, since the main purpose of dynamic sampling is to force the model to train on “harder” negative entity pairs. Note that we only conducted experiments with a single iteration (M1) of dynamic sampling.
Table 5. Extractor validation results (F1-score) with random versus dynamic negative sampling.

| Sampling strategy | Entity pair | + Novel |
| --- | --- | --- |
| Random sampling | 75.99 | 53.43 |
| Dynamic sampling (M1) | **77.76** | **55.37** |
Results are presented in terms of F1 score. The best scores are highlighted in bold.
End-to-end
Lastly, we present in Table 6 the validation results for our complete pipeline. In this comparison, we assess the performance of our end-to-end system against the combined output of PubTator 3 and our “Extractor” model.
Table 6. End-to-end validation results (F1-score): PubTator 3 combined with our Extractor versus the full BioNExt pipeline.

| Configuration | PubTator 3 + Extractor (BioNExt) | BioNExt |
| --- | --- | --- |
| Entity pair | **52.49** | 43.95 |
| Entity pair + Relation | **44.80** | 37.60 |
| Entity pair + Novelty | **43.10** | 36.59 |
| All | **36.69** | 31.25 |
Results are presented in terms of F1 score. The best scores are highlighted in bold.
As anticipated, our complete pipeline does not perform as well as the configuration based on PubTator 3 annotations. This outcome is primarily attributed to the subpar performance of our “Linker” in comparison with that of PubTator 3. Additionally, we believe that further exploring the integration of the PubTator linker with our NER could be beneficial; at the time of writing, we have attempted to use the PubTator linker exclusively with our NER outputs, but without success.
Submission results
In this section, we detail the performance of our systems on the final test set. As mentioned, given that the test set gold standard is not available, all evaluations were conducted using the CodaLab platform provided by the event organizers, limiting our metrics to the relation extraction task. Regarding the results, we begin by outlining our performance during the challenge, followed by a comparison with the postchallenge enhancements. Subsequently, we evaluate the performance of our end-to-end system in the relation extraction task, benchmarking it against PubTator 3.
Extractor
Table 7 shows the performance of our “Extractor” model as evaluated during the challenge. We submitted five runs, the first two being single models that utilized BioLinkBERT; the primary distinction between them was the seed used for negative random sampling. The remaining runs were ensembles of our top 8, 5 and 3 runs, respectively. Notably, Run 1 emerged as our best-performing run, closely followed by Run 4. The significant performance disparity between Runs 0 and 1 shows the influence of our negative random sampling approach, suggesting that Run 0 suffered from a less advantageous pool of negative examples, which likely contributed to its suboptimal results. This observation reinforces our rationale for adopting a dynamic negative sampling method, aiming to mitigate such effects.
Configuration | Entity pair (P/R/F%) | + Relation (P/R/F%) | + Novel (P/R/F%)
---|---|---|---
run0 | 66.06 / 78.33 / 71.67 | 46.82 / 57.05 / 51.43 | 36.19 / 44.71 / 40.00
run1 | 63.91 / 85.72 / 73.22 | 47.23 / 65.98 / **55.05** | 36.88 / 53.00 / **43.50**
run2 | 59.75 / 88.96 / 71.48 | 43.67 / 68.79 / 53.42 | 33.68 / 54.77 / 41.71
run3 | 64.52 / 86.19 / 73.79 | 47.28 / 65.40 / 54.88 | 36.68 / 51.87 / 42.97
run4 | 66.18 / 84.63 / **74.27** | 48.26 / 63.27 / 54.76 | 37.76 / 50.37 / 43.16
Median | 77.93 / 69.65 / 73.56 | 51.64 / 54.79 / 53.17 | 41.61 / 39.88 / 40.73
Average | 69.22 / 68.60 / 67.03 | 49.01 / 48.39 / 47.74 | 36.15 / 35.73 / 35.22

Results are presented in terms of precision (P), recall (R) and F1 score (F). The best F1 scores are highlighted in bold.
Next, Table 8 compares our best-performing result achieved during the challenge with our postchallenge enhancement, namely the addition of dynamic negative sampling. Despite our expectations, dynamic sampling did not enhance the novelty score on the test set as it did during the validation phase. Nevertheless, the postchallenge model demonstrated improved performance in terms of entity pair scores (75.00% versus 73.22%), supporting the idea that dynamic sampling effectively focuses training on more challenging examples, thereby improving the model’s ability to identify correct pairs. Yet, this improvement in entity pair scoring did not translate into a higher novelty score on the test set. Another interesting observation arises from comparing precision and recall scores: with dynamic sampling, our model achieves more balanced scores, which are often associated with peak F1 performance. This suggests that while dynamic sampling enhances certain aspects of our model’s performance, its impact on the novelty score requires further investigation.
Configuration | Entity pair (P/R/F%) | + Relation (P/R/F%) | + Novel (P/R/F%)
---|---|---|---
Our best submission | 63.91 / 85.72 / 73.22 | 47.23 / 65.98 / 55.05 | 36.88 / 53.00 / 43.50
Extractor (+ Dynamic Sampling) | 73.99 / 76.04 / 75.00 | 54.01 / 55.82 / 54.90 | 42.47 / 44.01 / 43.23
Competition best | – / – / **77.07** | – / – / **58.88** | – / – / **44.55**

Results are presented in terms of precision (P), recall (R) and F1 score (F). The best F1 scores are highlighted in bold.
In Table 8, we also include the performance metrics of the highest-scoring system from the challenge, which surpasses our best result by a margin of 1.05 points in novelty F1-score. Considering this narrow difference, we are optimistic that minor adjustments or a robust ensemble of runs could elevate our system to a comparable level of performance.
End-to-end
Lastly, we present in Table 9 the results of our end-to-end system. Similar to the validation results, our end-to-end system underperformed with respect to PubTator 3, suggesting that our linking results are likely subpar. It is important to note that we are unable to directly assess the NER and linking scores; therefore, we use the relation extraction score as an indirect measure of the performance of the preceding modules. Moreover, in comparison to the top-performing entry in the competition, our system, when using PubTator 3 annotations as input, remains competitive, trailing by a narrow margin of 1.23 points.
Configuration | Entity pair (P/R/F%) | + Relation (P/R/F%) | + Novel (P/R/F%)
---|---|---|---
Competition best | – / – / 55.84 | – / – / 43.03 | – / – / 32.75
PubTator 3 + Extractor (BioNExt) | 56.64 / 53.11 / 54.82 | 42.27 / 39.65 / 40.91 | 32.56 / 30.55 / 31.52
BioNExt (end-to-end) | 45.89 / 40.63 / 43.10 | 34.56 / 30.60 / 32.46 | 26.18 / 23.18 / 24.59
Below, we present the computational resources required to operate our end-to-end system. First, Table 10 shows the total storage needed for all the knowledge bases and the corresponding embedding representations on disk. Then, Table 11 shows the approximate training and inference times. More specifically, it takes ≈1.72 seconds on average to process a single document, provided that all necessary models and embeddings are loaded into memory.
Knowledge base | Raw size | Embedding size
---|---|---
NCBI Gene^a (10) | 3.9 GB | 5.5 GB
CTD diseases (16, 17) | 6 MB | 376 MB
MeSH (42) | 46 MB | 2.6 GB
dbSNP^b (64) | – | –
NCBI Taxonomy (63) | 317 MB | 16 GB
Cellosaurus (5) | 6.3 MB | 595 MB
Total | 4.28 GB | 25 GB

^a We only embedded the genes for the most frequent species.
^b As mentioned, we use LitVar2 to perform lookups on dbSNP.
Module | Train | Seconds/doc (inference) | Total (inference on the test set)
---|---|---|---
Tagger | 00:30:00 | 0.048 | 00:08:00
Linker | – | 0.6 | 01:40:00
Extractor | 08:00:00 | 1.08 | 03:00:00
Total | 08:30:00 | 1.728 | 04:48:00
Error analysis
A significant source of inaccuracies within our end-to-end relation and novelty detection model stems from the cumulative effect of errors across its components. Specifically, the success of relation extraction hinges on the accurate identification and linking of entities: if an entity goes unrecognized, it precludes the possibility of correctly predicting any relation involving that entity. This domino effect of errors explains the significant differences observed between the performance of our integrated end-to-end system and that of the standalone relation extraction model. Even when we consider the “Extractor” model on its own, the same cascading effect is visible when comparing the entity pair, relation and novelty scores (Tables 7 and 8), further harming the final novelty score.
In particular, the “Linker” module stands out as a primary contributor to these errors, falling short of our expectations. To gain a deeper understanding of where it might have faltered, we devote the remainder of this section to a detailed examination of the most prevalent errors introduced by the “Linker” module.
One error we identified pertains to the dynamic nature of knowledge bases, which are subject to continuous updates and revisions. Through our analysis, we encountered a discrepancy between the codes found in the BioRED dataset and those in our current versions of the knowledge bases. This discrepancy arises because certain codes are absent from our knowledge bases; they may have been updated to newer versions, merged with other codes or deprecated. Our versions of the knowledge bases are from February/March 2024, while the ones used in the original BioRED dataset predate 2022. Table 12 reports the number of unique codes present in the validation set of the BioRED dataset to which we do not have access. As an example, we are missing the species code 11103; upon lookup, this code appears to have been updated to 3052230. Furthermore, we verified that PubTator 3 does not suffer from this problem, as it returns these older codes.
Entity | Unpredictable | Total
---|---|---
Gene | 3 | 397
Cell Line | 0 | 21
Chemical | 3 | 173
Disease | 0 | 245
Species | 1 | 11
Total | 7 | 847
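One possible mitigation would be to resolve up-to-date identifiers back to the older codes through the knowledge base’s own merge history. As a hedged sketch, assuming the NCBI Taxonomy dump and its merged.dmp file (which lists old-to-new identifier mappings), such a lookup could look as follows; the file path and the direction of the mapping are assumptions, not part of our current system.

```python
def load_merged_taxids(path="taxdump/merged.dmp"):
    """Build a new_id -> old_id map from merged.dmp ('old_tax_id | new_tax_id |' per line)."""
    new_to_old = {}
    with open(path) as fh:
        for line in fh:
            old_id, new_id = [field.strip() for field in line.split("|")[:2]]
            new_to_old[new_id] = old_id
    return new_to_old

# Hypothetical usage: map an up-to-date prediction back to the identifier used in BioRED.
# new_to_old = load_merged_taxids()
# normalized = new_to_old.get("3052230", "3052230")  # -> "11103" if the code was merged
```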
Another source of error that we identified stems from the interconnected nature of the entity linking process. Effective linking for certain entities is contingent upon the successful linking of dependent entities. For example, accurately linking genes in a document requires prior identification of the species those genes are associated with. Similarly, linking sequence variants is dependent on identifying the specific gene they reference.
This interdependency introduces two layers of complexity to the error landscape. First, we must ensure the accurate prediction and linking of the prerequisite entities, such as species for genes. Second, we must determine the precise relationship between these entities, such as identifying the specific species a gene pertains to or the exact gene a sequence variant is associated with. As mentioned earlier, our approach employs a straightforward algorithm that deduces the species or gene based on the nearest mention. Nevertheless, we have observed instances where this method proves to be inadequate, indicating the need for a more advanced strategy.
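For illustration, the nearest-mention heuristic described above can be sketched as follows; the mention data structure, names and offsets are purely illustrative and not taken from our actual implementation.

```python
def nearest_mention(target, candidates):
    """Return the candidate mention whose character span is closest to `target`."""
    def distance(a, b):
        return min(abs(a["start"] - b["end"]), abs(b["start"] - a["end"]))
    return min(candidates, key=lambda c: distance(target, c)) if candidates else None

# Attach each gene mention to the closest species mention in the document.
genes = [{"text": "BRCA2", "start": 120, "end": 125}]
species = [{"text": "mice", "start": 40, "end": 44, "id": "10090"},
           {"text": "human", "start": 300, "end": 305, "id": "9606"}]

for gene in genes:
    closest = nearest_mention(gene, species)
    gene["species_id"] = closest["id"] if closest else None
```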
For example, in the validation document “Doc510” (PubMed ID: 26847345), our system accurately identifies two references to mice (code: 10090) within the text, inferring that all gene mentions refer to mice. However, all gene mentions in this document actually refer to human (code: 9606), a species not explicitly mentioned in the text.
Lastly, we identified a recurring error in generating tmVar codes with our zero-shot LLM. According to the tmVar coding standards, a code should begin with one of the letters c, r, g, p or m, representing DNA, RNA, genome, protein and mitochondrial sequences, respectively. A significant portion of the model’s errors stemmed from incorrectly predicting the initial letter of the code. For example, the mention “203G > A” associated with the gene “BRCA2” was incorrectly predicted as “c|SUB|G|203|A” instead of the correct “g|SUB|G|203|A”. We believe that these errors could be mitigated by either enriching the model with additional contextual information or by first determining the appropriate initial letter for the code and then conditioning the code generation on that letter.
Another notable error involved the model incorrectly predicting “c|SUB|C|1188|” instead of the correct “c|Allele|C|1188”. The guidelines specify that “Allele” should be used instead of “SUB” in such contexts. This particular error could be easily rectified with a simple substitution regex, suggesting a straightforward fix for enhancing accuracy in tmVar code generation.
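As a hedged illustration of such a fix, the following sketch rewrites generated codes that use “SUB” without an alternate allele into the “Allele” form; the exact triggering condition and regular expression are assumptions based on the example above, not a rule we have validated against the full tmVar specification.

```python
import re

# Rewrite "SUB" to "Allele" when no alternate allele follows the position,
# e.g. "c|SUB|C|1188|" -> "c|Allele|C|1188". Full substitutions are left untouched.
_ALLELE_FIX = re.compile(r"^([crgpm])\|SUB\|([A-Za-z]+)\|(\d+)\|?$")

def fix_allele_code(code: str) -> str:
    return _ALLELE_FIX.sub(r"\1|Allele|\2|\3", code)

assert fix_allele_code("c|SUB|C|1188|") == "c|Allele|C|1188"
assert fix_allele_code("g|SUB|G|203|A") == "g|SUB|G|203|A"  # unchanged
```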
Conclusions
In this work, we propose an end-to-end biomedical relation extraction model capable of classifying the novelty of identified relations. This innovative model builds upon our system developed for the BioCreative VIII competition, integrating it into a cascading pipeline alongside NER and linking models.
While we encountered challenges, especially in matching the linking accuracy of established systems like PubTator, our system achieved strong results. Notably, it reached state-of-the-art performance in NER and remained competitive in relation extraction and novelty detection.
Looking ahead, we identify potential areas for enhancement within our model. Namely, refining the linking component to bridge the performance gap with established systems like PubTator is a valuable direction for future work. Additionally, enhancing the relation extraction capabilities, particularly through advancements in our multihead attention mechanism for creating joint representations, presents a promising avenue for further development.
Funding
This work was funded by the Foundation for Science and Technology in the context of the project UIDB/00127/2020. T.A. was funded by the grant 2020.05784.BD. R.A. was funded under the project UIDB/00127/2020. R.A.A.J. was funded by the grant PRT/BD/154792/2023.