Abstract

Large amounts of important medical information are captured in free-text documents in biomedical research and within healthcare systems; natural language processing (NLP) can make this information accessible. A key component in most biomedical NLP pipelines is entity linking, i.e. grounding textual mentions of named entities in a reference set of medical concepts, usually derived from a terminology system such as the Systematized Nomenclature of Medicine Clinical Terms. However, complex entity mentions, spanning multiple tokens, are notoriously hard to normalize due to the difficulty of finding appropriate candidate concepts. In this work, we propose an approach to preprocess such mentions for candidate generation, building upon recent advances in text simplification with generative large language models. We evaluate the feasibility of our method in the context of the entity linking track of the BioCreative VIII SympTEMIST shared task. We find that instructing the latest Generative Pre-trained Transformer model with a few-shot prompt for text simplification results in mention spans that are easier to normalize. Thus, we can improve recall during candidate generation by 2.9 percentage points compared to our baseline system, which achieved the best score in the original shared task evaluation. Furthermore, we show that this improvement in recall can be fully translated into top-1 accuracy through careful initialization of a subsequent reranking model. Our best system achieves an accuracy of 63.6% on the SympTEMIST test set. The proposed approach has been integrated into the open-source xMEN toolkit, which is available online at https://github.com/hpi-dhc/xmen.

Introduction

Extraction and normalization of information captured in medical free-text documents are prerequisites for their use in research, clinical decision support, personalized medicine, and other downstream applications [1, 2]. An important task within biomedical information extraction pipelines is entity linking (EL), i.e. mapping mentions of named entities to their corresponding concepts in a structured knowledge base (KB), which is typically based on a controlled vocabulary or other medical terminology system [3].

The canonical architecture for EL comprises two main steps: candidate generation and candidate ranking [4]. Candidate generation aims to produce a list of potential concept matches for a given mention, while the ranking step prioritizes these candidates based on relevance and context. Within the candidate generation phase, achieving high recall is essential: the goal is to ensure that the correct entity is included among the candidates generated for subsequent reranking. Failure to generate candidates accounts for a large fraction of errors in current biomedical EL systems [5].

Candidate generation becomes particularly challenging when there is a mismatch between KB aliases and entity mentions. Various strategies have been proposed for improving performance in such cases, e.g. by obtaining additional (multi-lingual) KB aliases [6] or using coreference information across documents [7]. An area that has received comparatively little attention is the handling of complex entity mentions, i.e. mentions spanning multiple tokens. Several recently released biomedical corpora feature a large proportion of such long mention spans, resulting from their specific annotation policies. Notable examples include not only the DisTEMIST [8] or SympTEMIST [9] corpora for the Spanish language but also GGPONC for German [10]. Complex entity mentions pose significant challenges for candidate generation algorithms that rely on similarity-based lookups (utilizing either dense or shallow representations) between the mention and potential candidates.

Recently, generative large language models (LLMs) have demonstrated their potential across a wide range of biomedical natural language processing (NLP) tasks, including text simplification and summarization [11]. However, for information extraction problems, such as named entity recognition and EL, the zero-shot and few-shot performance of general-purpose LLMs still falls short of the (supervised) state-of-the-art performance by a substantial margin [12]. More successful generative approaches for biomedical EL are instead based on sequence-to-sequence models, which perform well on English-language benchmark datasets but rely on language resources, such as pretrained encoder–decoder models and KB aliases, which might not be available for other languages [13, 14]. In contrast, generative biomedical EL approaches for non-English languages have received comparatively little attention. Zhu et al. [15] propose a prompt-based contrastive generation framework for biomedical EL in multiple languages. However, we note that the only gold-standard datasets used in their evaluation are two subsets of the Mantra GSC [16], with reported performance substantially below the state of the art for this dataset, which is achieved by more classical generate-and-rank approaches [17].

In this work, we investigate how LLMs can improve candidate generation performance for biomedical EL within a generate-and-rank framework. In particular, we employ LLMs to simplify mention spans, i.e. to rephrase them in a way that is more suitable for similarity search over concept aliases. Such simplification can take various forms: on the level of surface forms, it usually means that long mention spans are shortened. Syntactically and semantically, it might also involve converting adjectives to noun phrases, or choosing more general terms instead of specific ones. Our work builds upon our successful participation in the BioCreative VIII SympTEMIST shared task [9, 18]. The shared task provided a benchmark dataset of clinical case reports in Spanish with gold-standard annotations of symptom mentions and corresponding Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) codes [19]. Our winning submission employed the default EL pipeline implemented in the xMEN toolkit for cross-lingual medical entity normalization [17]. In this paper, we share how the candidate generation phase of the system can be enhanced further through a reusable LLM-based entity simplification approach. The approach shares some similarities with Roller et al. [20], who translate non-English biomedical entity mentions to English, thus facilitating lookup in an English-language KB. As the xMEN system already accounts for the language mismatch through dense, cross-lingual candidate retrieval, this work investigates the additional benefit of paraphrasing the input mentions in the target language. An overview of our EL pipeline as a flow chart, along with example mentions, is depicted in Fig. 1. The figure shows two examples of mentions: a simple one, which can be easily normalized (“fiebre”), and a more complex one that benefits from simplification (“aumento de densidad en lóbulo inferior” is simplified to the more general term “lesión de pulmón”). In addition to improved candidate generation, we also investigate how the initialization of a cross-encoder-based reranker from different BERT models affects EL performance and to what extent reranking is able to leverage the additionally retrieved candidates.

Figure 1. (a) Examples of entities and their grounding against SNOMED CT, as well as (b) an overview of our EL pipeline as a flow chart; the components investigated in this work are highlighted.

In summary, the contributions of this work are:

  • an easily reusable, LLM-based entity simplification approach,

  • an analysis of its feasibility and its effect on recall during candidate generation, as well as

  • an evaluation of the impact of the initialization of cross-encoders for reranking on final entity linking accuracy.

The remainder of this work is structured as follows: we introduce the SympTEMIST benchmark dataset and our EL pipeline, including the novel entity simplification component. Afterward, we share our experimental results, followed by a discussion and error analysis. Finally, we summarize our findings and conclude with an outlook.

Materials and methods

In this section, we describe our methods. The source code to reproduce our experimental results is available online (https://github.com/hpi-dhc/symptemist/). Unless otherwise stated, we use the default configurations of the xMEN toolkit [17].

Dataset

The SympTEMIST shared task dataset encompasses n = 1000 manually annotated case reports in Spanish. Twenty-five percent (n = 250) of the documents have been used as a held-out test set in the shared task and were made available thereafter. The remaining documents (n = 750) are used as training data, out of which we sample 20% (n = 150) as an internal validation set for model selection. For integration with the xMEN framework, we implement a data loader for the dataset within the BigBIO [21] library. In total, the dataset contains more than 12 000 annotations of symptoms, signs, and findings. However, for the shared task, composite entity mentions, i.e. with multiple concepts per span, were excluded from the evaluation, leaving us with 11 124 annotations.
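The split can be reproduced with a few lines of Python. The following is a minimal sketch, assuming BigBIO-style dataset and configuration names (the exact loader name and the random seed are illustrative; the loader shipped with our code may differ):

```python
# Minimal sketch: loading SympTEMIST through the BigBIO schema and
# reproducing our 80/20 train/validation split. Dataset and config names
# follow BigBIO naming conventions and are assumptions; the seed is illustrative.
from datasets import load_dataset

dataset = load_dataset("bigbio/symptemist", name="symptemist_linking_bigbio_kb")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_docs, validation_docs = split["train"], split["test"]  # 600 / 150 documents
```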

Mention spans in SympTEMIST tend to be long in comparison to other corpora, as shown in Table 1.

Table 1. Dataset statistics of SympTEMIST (version as used in the entity linking track of the shared task, without composite mentions) and comparable biomedical corpora (largest values highlighted in bold).

Corpus       | Documents | Mentions | Mention length (mean) | (median) | (max)
SympTEMIST   | 1000      | 11 K     | 3.56                  | 2        | 36
DisTEMIST    | 1000      | 11 K     | 3.27                  | 2        | 28
Quaero       | 2508      | 16 K     | 1.40                  | 1        | 15
MedMentions  | 4392      | 2327 K   | 1.40                  | 1        | 19

Measured as the number of whitespace-separated tokens, the average mention in SympTEMIST has a length of 3.56 tokens, with a median of 2 tokens and a maximum length of 36 tokens. This is comparable to the DisTEMIST dataset (mean: 3.27, median: 2, max: 28) and much longer than in other biomedical EL benchmark datasets, such as Quaero (mean: 1.40, median: 1, max: 15) or MedMentions (mean: 1.40, median: 1, max: 19) [22, 23].
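These statistics can be computed directly from the annotated mention strings. The following minimal sketch illustrates the whitespace-based tokenization underlying Table 1 (the example mentions are taken from Fig. 1):

```python
# Sketch of the whitespace-based mention length statistics from Table 1.
import statistics

def mention_length_stats(mentions):
    lengths = [len(m.split()) for m in mentions]  # whitespace-separated tokens
    return {
        "mean": round(sum(lengths) / len(lengths), 2),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }

# e.g. mention_length_stats(["fiebre", "aumento de densidad en lóbulo inferior"])
# -> {'mean': 3.5, 'median': 3.5, 'max': 6}
```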

Target knowledge base

The set of target concepts for EL in SympTEMIST consists of 121 760 SNOMED CT codes, as defined by a gazetteer of concepts related to a task-relevant subset of SNOMED CT [9]. Using the xMEN command-line interface, we extend the limited number of aliases for this defined set of concepts with all Spanish and English aliases from the Unified Medical Language System (version 2023AB), resulting in ∼1.08 million terms [24].

Candidate generation

For candidate generation, we use the default pipeline from the xMEN toolkit, which consists of an ensemble of (i) a term frequency–inverse document frequency (TF-IDF) vectorizer over character 3-grams and (ii) a cross-lingual SapBERT model [25]. Candidate concepts for each mention are retrieved via k-nearest-neighbor search, using cosine similarity over mention and concept vectors. We retrieve k = 64 candidates, which serve as input to the subsequent reranking step. As only concepts retrieved at this point can possibly be ranked correctly, achieving high recall@k is essential during this phase.
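To illustrate the sparse half of this ensemble, the following sketch builds a TF-IDF character 3-gram index with cosine k-nearest-neighbor lookup using scikit-learn. The alias list and query are illustrative, and xMEN's actual implementation (and its dense SapBERT counterpart) differs in details:

```python
# Sketch of the TF-IDF component of the candidate-generation ensemble;
# the SapBERT component works analogously, with dense transformer
# embeddings instead of sparse character n-gram vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

aliases = ["fiebre", "disnea", "lesión de pulmón"]  # illustrative KB aliases
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
alias_vectors = vectorizer.fit_transform(aliases)

# k-nearest-neighbor index with cosine distance (k = 64 in our pipeline)
index = NearestNeighbors(n_neighbors=min(64, len(aliases)), metric="cosine")
index.fit(alias_vectors)

mention_vector = vectorizer.transform(["fiebre alta"])
distances, candidate_ids = index.kneighbors(mention_vector)
scores = 1.0 - distances  # cosine similarity = 1 - cosine distance
```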

LLM-based entity simplification

To increase recall for complex entity mentions, which tend to have little lexical overlap with aliases in the target terminology, we simplify mentions using a generative LLM. In this work, we employ the Generative Pre-trained Transformer model GPT-4 (gpt-4-0125-preview) through the OpenAI API [26] as a proof of concept. However, our approach can also be used with any other generative LLM.

We call the LLM using a few-shot prompt with four fixed in-context examples. Each example consists of a mention from the training set and the canonical name of the ground-truth concept in SNOMED CT. Figure 2 shows the prompt template. Three few-shot examples were randomly sampled from the subset of mentions for which the candidate generator fails to retrieve the ground-truth concept. This broad notion of simplification may also involve rewriting an adjective as a noun phrase (“afebril” as “temperatura corporal normal”). One additional example was chosen manually to demonstrate that easily linkable mentions (“disnea”) should not be simplified further.
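The following sketch illustrates such a few-shot call against the OpenAI API. The instruction text and in-context pairs are illustrative stand-ins for the actual prompt template shown in Fig. 2:

```python
# Minimal sketch of the few-shot simplification call; instruction text and
# in-context pairs below are illustrative, not our exact prompt (see Fig. 2).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
TICKS = "`" * 3    # inputs and outputs are delimited by triple backticks

FEW_SHOT = [  # (mention, simplified form), e.g. adjective -> noun phrase
    ("afebril", "temperatura corporal normal"),
    ("disnea", "disnea"),  # already simple: returned unchanged
]

def simplify(mention: str, model: str = "gpt-4-0125-preview") -> str:
    examples = "\n".join(
        f"{TICKS}{m}{TICKS} -> {TICKS}{s}{TICKS}" for m, s in FEW_SHOT
    )
    prompt = (
        "Simplify the following symptom mention so that it can be matched "
        f"to a SNOMED CT concept name:\n{examples}\n{TICKS}{mention}{TICKS} -> "
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().strip("`")
```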

Figure 2. Few-shot prompt template for entity simplification: few-shot examples are underlined and the actual input mention is highlighted in bold; inputs and outputs are denoted by triple backticks (```).

However, it is not desirable to simplify all mentions, for instance, when candidates can already be generated with high confidence for the original mention span. We therefore consider a confidence threshold and apply mention simplification only when the top candidate concept for a mention has a score, i.e. cosine similarity, below this value, as shown in Fig. 1. For this subset of mentions, we apply text simplification and generate candidates again. The confidence threshold is a hyperparameter in the range [0, 1] that can be optimized simply by observing its impact on recall on the training set.
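Putting these pieces together, the confidence-gated simplification step from Fig. 1 can be sketched as follows; candidate_generator and simplify are placeholders for the components described above, and the candidate objects are assumed to carry a cosine-similarity score:

```python
# Sketch of the confidence-gated simplification step from Fig. 1.
THRESHOLD = 0.85  # tuned by observing recall on the training set

def generate_candidates(mention, candidate_generator, simplify, k=64):
    candidates = candidate_generator(mention, k=k)  # sorted by score, descending
    if candidates and candidates[0].score >= THRESHOLD:
        return candidates                # confident enough: keep original span
    simplified = simplify(mention)       # LLM-based rewrite of the mention
    return candidate_generator(simplified, k=k)  # second lookup with new span
```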

Candidate ranking

The list of generated candidates is used as input to a cross-encoder model for reranking [27]. The cross-encoder is trained for 20 epochs, using batches of 64 candidates with ground-truth labels and a softmax loss with rank regularization [17]. We use the default hyperparameters of the xMEN library and keep the checkpoint that maximizes accuracy on the validation set. One training run takes ∼30 h on a single Nvidia A40 GPU. Since the main evaluation metric for the SympTEMIST shared task is accuracy (equivalent to recall@1), we do not consider a NIL (not-in-list) option during training and inference. This means that a prediction is made for every mention, so recall equals precision and accuracy.

The cross-encoder model can be initialized from any pretrained BERT encoder. In our experiments, we investigate initialization from the base-sized versions of the following models (a minimal initialization sketch follows the list):

  • Multi-lingual, general-domain: XLM-RoBERTa (FacebookAI/xlm-roberta-base), a multi-lingual model pretrained on general-domain text in 100 languages [28].

  • English, biomedical: PubMedBERT (BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), pretrained on English biomedical text [29]. We include this state-of-the-art English model in our evaluation, as prior work has shown that such models can also outperform models for the specific target language for some tasks [30].

  • Spanish, general-domain: BNE (roberta-base-bne), a RoBERTa model pretrained on a Spanish corpus from the National Library of Spain (Biblioteca Nacional de España) [31].

  • Spanish, biomedical: A RoBERTa model (PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) pretrained on Spanish biomedical text, which we refer to as PlanTL [32].
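As referenced above, the following sketch shows how a reranking cross-encoder can be initialized from these checkpoints. xMEN wraps its own training loop, so the sentence-transformers CrossEncoder class is used here purely for illustration, and the input pair format (mention in context, candidate alias) is an assumption:

```python
# Sketch of initializing the reranking cross-encoder from different
# pretrained BERT checkpoints (illustrative; xMEN uses its own trainer).
from sentence_transformers import CrossEncoder

CHECKPOINTS = {
    "xlm-roberta": "FacebookAI/xlm-roberta-base",
    "pubmedbert": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "bne": "PlanTL-GOB-ES/roberta-base-bne",
    "plantl": "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es",
}

reranker = CrossEncoder(CHECKPOINTS["plantl"], num_labels=1, max_length=256)
# One relevance score per (mention-in-context, candidate alias) pair:
scores = reranker.predict([("presenta fiebre alta", "fiebre (hallazgo)")])
```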

Results

In the following, we share details about our experiments and the obtained results.

Candidate generation with entity simplification

Figure 3 shows the impact of entity simplification on candidate generation on the training set for different confidence thresholds. The lowest score yielded by the baseline candidate generator is 0.55; for thresholds below this value, no simplification takes place and the results are identical to the baseline. However, even for a threshold of 1.0, i.e. when all mentions are subject to simplification, some mentions remain unchanged, as the LLM may simply return the original input when it cannot be simplified further.

Figure 3. Impact of different confidence thresholds for entity simplification, shown as the proportion of affected mentions out of all mentions, and the resulting improvements in recall@1 and recall@64 over the baseline candidate generator with the original mention spans (all results on the training set); we only simplify mentions when the baseline candidate generator produces a score below the threshold.

The improvement in recall@64 over the baseline candidate generator increases up to a threshold of 0.85, then plateaus at +2.9 pp. The difference in recall@1 also increases up to a threshold of 0.85, but then decreases and even becomes negative for overly high thresholds. This highlights that the original candidate generator is highly confident about its top-1 choice for a subset of mentions, typically those with high lexical overlap between mention and candidate alias. Altering such mentions by simplification is usually not helpful and may indeed decrease recall@1 (while leaving recall@64 largely unaffected). As performance improvements are maximized at a threshold of 0.85, we use this threshold for all further experiments and test set evaluations.

Table 2 provides details on the impact of entity simplification during candidate generation (on the test set). Mentions are binned according to the maximum confidence score of the candidate generator. A confidence of 1.0 occurs when a direct lookup is possible, i.e. there is at least one exact match for an alias in the target KB. Results for simplification without a threshold, i.e. when all mentions are simplified, are reported for comparison (last column). As expected, the mean mention length is reduced from 3.63 to 2.61 tokens. Comparing the confidence scores assigned by the candidate generator before and after simplification illustrates its effect: the number of mentions in the higher confidence brackets (>0.7) is reduced; at the same time, the number of mentions with lower candidate confidence scores increases. We interpret this change as the effect of generating more general terms for specific mentions, which have lower similarity with single (but potentially wrong) concepts in the candidate space. This way, relatively large increases in both recall@1 and recall@64 can be observed in the confidence range (0.6, 0.9].

Table 2. Impact of entity simplification with a threshold of t = 0.85, comparing test set performance before and after entity simplification, as well as the difference (Δ); the last column reports simplification without a threshold (t > 1.0), i.e. with all mentions simplified.

                 |            | No Simplification | Simplification (t = 0.85) | Δ      | Simplification (t > 1.0)
Mention length   | Max.       | 23                | 13                        | −10    | 13
                 | Mean       | 3.63              | 2.61                      | −1.02  | 2.83
Test set         | Mentions   | 2847              | 2847                      |        | 2847
                 | Recall@1   | 0.445             | 0.460                     | +0.015 | 0.431
                 | Recall@64  | 0.764             | 0.793                     | +0.029 | 0.792
Confidence score:
(0.5, 0.6]       | Mentions   | 37                | 85                        | +48    | 85
                 | Recall@1   | 0.027             | 0.048                     | +0.021 | 0.048
                 | Recall@64  | 0.243             | 0.241                     | −0.002 | 0.241
(0.6, 0.7]       | Mentions   | 444               | 616                       | +172   | 639
                 | Recall@1   | 0.065             | 0.096                     | +0.031 | 0.095
                 | Recall@64  | 0.338             | 0.481                     | +0.143 | 0.499
(0.7, 0.8]       | Mentions   | 653               | 577                       | −76    | 742
                 | Recall@1   | 0.219             | 0.328                     | +0.109 | 0.340
                 | Recall@64  | 0.629             | 0.744                     | +0.114 | 0.782
(0.8, 0.9]       | Mentions   | 555               | 411                       | −144   | 507
                 | Recall@1   | 0.438             | 0.494                     | +0.056 | 0.523
                 | Recall@64  | 0.845             | 0.910                     | +0.065 | 0.933
(0.9, 1.0]       | Mentions   | 433               | 433                       | ±0     | 217
                 | Recall@1   | 0.718             | 0.718                     | ±0     | 0.659
                 | Recall@64  | 0.977             | 0.977                     | ±0     | 0.982
1.0              | Mentions   | 725               | 725                       | ±0     | 657
                 | Recall@1   | 0.756             | 0.756                     | ±0     | 0.764
                 | Recall@64  | 0.986             | 0.986                     | ±0     | 0.992

At the same time, there is no increase in the number of mentions linkable with very high confidence (>0.9) through simplification, i.e. none of the generated terms is more suitable for a direct lexical lookup. Due to the applied threshold of 0.85, no reductions are visible in the higher confidence ranges either. However, as shown in the last column of Table 2, and consistent with Fig. 3, simplifying entities with already high confidence scores would negatively impact the overall recall@1, highlighting the importance of selecting appropriate mentions for simplification.

Impact of cross-encoder initialization on reranking performance

Table 3 shows the performance improvements on the test set obtained through entity simplification during candidate generation (column “Simplification”) and reranking with different cross-encoder models (rows “Candidates + Reranking”). We report recall@1, which is equivalent to accuracy, the main evaluation metric of the SympTEMIST shared task, given that no mention has multiple ground-truth codes. Recall@1 after candidate generation is improved by 1.5 pp via simplification on the test set, consistent with the training set performance shown in Fig. 3.

Table 3. Test set performance before and after entity simplification, with a threshold of t = 0.85 (highest values highlighted in bold, difference denoted by Δ).

Accuracy / Recall@1 (test set)

                        |                                      | No Simplification | Simplification (t = 0.85) | Δ
Candidates              | xMEN Ensemble (TF-IDF + SapBERT)     | 0.445             | 0.460                     | +0.015
Candidates + Reranking  | XLM-RoBERTa (multi-lingual, general) | 0.590             | 0.602                     | +0.012
(Cross-encoder)         | BNE (Spanish, general)               | 0.581             | 0.595                     | +0.014
                        | PubMedBERT (English, biomedical)     | 0.579             | 0.585                     | +0.006
                        | PlanTL (Spanish, biomedical)         | 0.607             | 0.636                     | +0.029
Shared task             | 2nd Best Teams (BIT.UA, Fusion@SU)   | 0.589             |                           |

Subsequent training of a reranker with the Spanish biomedical model by PlanTL [32] benefits most from the additionally generated candidates, achieving an improvement of 2.9 pp in recall@1 over the baseline without simplification, which was the top-performing system in the original SympTEMIST shared task [18]. Strikingly, the model initialized from the PlanTL checkpoint recovers from nearly all ranking errors for the newly retrieved candidates: the improvement in recall@1 of 2.9 pp equals the improvement in recall@64 before reranking. Considering the close performance of the top-performing teams in the shared task [33, 34], the magnitude of this effect is even more remarkable.

While the other initializations of the reranking model also benefit from improved candidate generation through entity simplification, their overall performance falls short of the Spanish biomedical model. Moreover, for these models, all improvements through simplification are already achieved after candidate generation, with no additional gains from reranking.

Discussion and evaluation

Error analysis

Figure 4 shows a detailed analysis of the impact of entity simplification by mention length. The frequency of mentions of increasing length is shown in Fig. 4a. As shown in Fig. 4b, recall@64 is generally improved for mentions of all lengths, even for ∼3% of mentions of length 1. The analysis also reveals that simplification can lead to worse candidate generation performance, i.e. cases where the correct candidate is not among the top-64 candidates after simplification although it was before. However, the increase in recall for many mentions outweighs the decrease for others, which ultimately leads to the improved overall recall@64 shown in Fig. 3.

Figure 4. Impact of entity simplification for mentions of increasing length (in tokens, x-axis); all results refer to the candidate generation phase (without reranking) on the training set: (a) total number of mentions by token length, as context for the following relative measures; (b) impact on overall recall@64, i.e. the fraction of mentions for which the correct candidate is retrieved after entity simplification versus the fraction of mentions where simplification is detrimental; (c) fraction of mentions where simplification resulted in a higher or lower ranking of the correct concept.

In contrast, Fig. 4c shows the effect on the ranking of the correct concept for cases where it was already retrieved among the top-64 candidates. We can see that, on average, ranking performance is not affected in these instances. We note that there is a relatively large decrease for mentions of length 10 or more. However, the effect is negligible, as such mentions are very rare (see Fig. 4a).

The comparison of Fig. 4b and c shows that the improvement in recall@1 due to the simplification of the entities is based exclusively on the retrieval of additional, previously missed concepts and not on an improved ranking of candidates already among the top-64. We further conclude from Fig. 4b that the baseline candidate generation score is a helpful heuristic but not an exact criterion for identifying mentions that benefit most from entity simplification. A more accurate way to identify such mentions could potentially retain the achieved increase in recall, while avoiding most of the decrease.

Limitations

Due to resource constraints, such as the daily token limits imposed by the OpenAI API, we were unable to perform ablations on the prompt. A more refined prompt, e.g. with few-shot examples selected based on the respective input, might therefore yield improved results. To enable more exhaustive experiments in this regard, we plan to investigate the suitability of open-source LLMs in future work; however, hosting the most capable ones currently also requires substantial computational resources. In preliminary experiments, we briefly evaluated the open-source model BLOOMZ-7B [35] for entity simplification, but its output proved unreliable in comparison to GPT-4.

The English-language PubMedBERT model exhibits the worst performance across all reranking models, in particular in comparison to the Spanish-language, domain-specific model. Including PubMedBERT in our evaluation without any adaptation of the input text seemed plausible, due to its successful application to other non-English biomedical benchmark tasks and the similarity of medical terminology across languages [30]. However, the reranking step in our EL pipeline incorporates a high degree of contextual information; therefore, models specifically trained to represent such context in the target language appear to be at an advantage.

In addition, composite mentions (usually involving coordination with “and”/“or”) were excluded from the shared task evaluation. Not only do these tend to be complex, but they also present the additional difficulty of being mapped to multiple KB codes. While identifying and resolving such composite mentions might also be achievable with generative LLMs, we have not investigated this issue further in this work.

Conclusion and outlook

In this work, we have described a novel technique for improving biomedical EL performance through text simplification in the context of the SympTEMIST shared task of the BioCreative VIII challenge. While the default EL pipeline of the open-source xMEN framework already achieved the best performance in the original shared task, we have shown that candidate generation and subsequent ranking performance for complex entity mentions can be further improved via LLM-based entity simplification with GPT-4 and a simple few-shot prompt, without elaborate prompt engineering.

We have integrated our proposed entity simplification approach as a new component into the modular architecture of the xMEN toolkit. Thus, it can be used with any EL benchmark dataset supported by the framework [17]. However, the complexity of manually or automatically annotated entity mention spans, which are subject to EL, strongly depends on the text genre and annotation policy of each dataset. An investigation into the applicability of our approach to other datasets, both with and without frequent complex mentions, is warranted. Similarly, the extension of our approach to composite mentions with multiple linked concepts remains open for future research.

The proposed approach can be extended along multiple dimensions. First, more extensive (automated) prompt engineering is likely to increase performance—for instance, it could provide benefits by dynamically selecting few-shot examples and including parts of the document context within the prompt. In the future, a more direct approach for using LLMs for biomedical EL could be achieved by enabling the generative LLM to access the target KB, for instance, via retrieval-augmented generation or function calling.

Conflict of interest

None declared.

Funding

This work was supported by a grant from the German Federal Ministry of Research and Education (01ZZ2314N).

Data availability

The SympTEMIST dataset underlying this article is available on Zenodo at https://zenodo.org/doi/10.5281/zenodo.8223653. The source code for our experiments is available on GitHub at https://github.com/hpi-dhc/symptemist.

References

1. Koleck TA, Dreisbach C, Bourne PE et al. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J Am Med Inform Assoc 2019;26:364–79.
2. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009;42:760–72.
3. French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Inform 2023;137:104252.
4. Sevgili Ö, Shelmanov A, Arkhipov M et al. Neural entity linking: a survey of models based on deep learning. Semant Web J 2022;13:527–70.
5. Kartchner D, Deng J, Lohiya S et al. A comprehensive evaluation of biomedical entity linking models. In: Bouamor H, Pino J, Bali K (eds), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14462–78. Singapore: Association for Computational Linguistics, 2023.
6. Zhou S, Rijhwani S, Wieting J et al. Improving candidate generation for low-resource cross-lingual entity linking. Trans Assoc Comput Linguist 2020;8:109–24.
7. Agarwal D, Angell R, Monath N et al. Entity linking via explicit mention-mention coreference modeling. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4644–58. Seattle, United States: Association for Computational Linguistics, 2022.
8. Miranda-Escalada A, Gascó L, Lima-López S et al. Overview of DisTEMIST at BioASQ: automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings, Bologna, Italy, pp. 179–203, 2022.
9. Lima-López S, Farré-Maduell E, Gasco-Sánchez L et al. Overview of the SympTEMIST shared task at BioCreative VIII: detection and normalization of symptoms, signs and findings. In: Islamaj R, Arighi C, Campbell I et al. (eds), Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models. New Orleans, LA, 2023.
10. Borchert F, Lohr C, Modersohn L et al. GGPONC 2.0 - the German clinical guideline corpus for oncology: curation workflow, annotation policy, baseline NER taggers. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 3650–60. Marseille, France: European Language Resources Association, 2022.
11. Tian S, Jin Q, Yeganova L et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2023;25:bbad493.
12. Jahan I, Laskar MTR, Peng C et al. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Comput Biol Med 2024;171:108189.
13. Yan X, Möller C, Usbeck R. Biomedical entity linking with triple-aware pre-training. arXiv [cs.CL], 2023.
14. Yuan H, Yuan Z, Yu S. Generative biomedical entity linking via knowledge base-guided pre-training and synonyms-aware fine-tuning. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4038–48. Seattle, United States: Association for Computational Linguistics, 2022.
15. Zhu T, Qin Y, Chen Q et al. Controllable contrastive generation for multilingual biomedical entity linking. In: Bouamor H, Pino J, Bali K (eds), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5742–53. Singapore: Association for Computational Linguistics, 2023.
16. Kors JA, Clematide S, Akhondi SA et al. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. J Am Med Inform Assoc 2015;22:948–56.
17. Borchert F, Llorca I, Roller R et al. xMEN: a modular toolkit for cross-lingual medical entity normalization. arXiv preprint arXiv:2310.11275, 2023.
18. Borchert F, Llorca I, Schapranow M-P. HPI-DHC @ BC8 SympTEMIST track: detection and normalization of symptom mentions with SpanMarker and xMEN. In: Islamaj R, Arighi C, Campbell I et al. (eds), Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models. New Orleans, LA, 2023.
19. Donnelly K. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud Health Technol Inform 2006;121:279–90.
20. Roller R, Kittner M, Weissenborn D et al. Cross-lingual candidate search for biomedical concept normalization. In: Proceedings of the LREC 2018 Workshop “MultilingualBIO: Multilingual Biomedical Text Processing”, pp. 16–20. Miyazaki, Japan, 2018.
21. Fries J, Weber L, Seelam N et al. BigBIO: a framework for data-centric biomedical natural language processing. Adv Neural Inf Process Syst 2022;35:25792–806.
22. Névéol A, Cohen KB, Grouin C et al. Clinical information extraction at the CLEF eHealth evaluation lab 2016. CEUR Workshop Proc 2016;1609:28–42.
23. Mohan S, Li D. MedMentions: a large biomedical corpus annotated with UMLS concepts. In: Automated Knowledge Base Construction (AKBC), Amherst, Massachusetts, 2019.
24. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.
25. Liu F, Shareghi E, Meng Z et al. Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4228–38. Association for Computational Linguistics, Online, 2021.
26. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
27. Wu L, Petroni F, Josifoski M et al. Scalable zero-shot entity linking with dense entity retrieval. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6397–407. Association for Computational Linguistics, Online, 2020.
28. Conneau A, Khandelwal K, Goyal N et al. Unsupervised cross-lingual representation learning at scale. In: Jurafsky D, Chai J, Schluter N et al. (eds), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–51. Association for Computational Linguistics, Online, 2020.
29. Gu Y, Tinn R, Cheng H et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare 2021;3:1–23.
30. Sun C, Yang Z, Wang L et al. Deep learning with language models improves named entity recognition for PharmaCoNER. BMC Bioinf 2021;22:602.
31. Gutiérrez-Fandiño A, Armengol-Estapé J, Pàmies M et al. MarIA: Spanish language models. Proces Leng Nat 2022;68:39–60.
32. Carrino CP, Llop J, Pàmies M et al. Pretrained biomedical language models for clinical NLP in Spanish. In: Proceedings of the 21st Workshop on Biomedical Language Processing, pp. 193–99. Dublin, Ireland: Association for Computational Linguistics, 2022.
33. Grazhdanski G, Vassileva S, Koychev I et al. Team Fusion@SU @ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking. In: Islamaj R, Arighi C, Campbell I et al. (eds), Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models. New Orleans, LA, 2023.
34. Jonker RAA, Almeida T, Matos S et al. Team BIT.UA @ BC8 SympTEMIST track: a two-step pipeline for discovering and normalizing clinical symptoms in Spanish. In: Islamaj R, Arighi C, Campbell I et al. (eds), Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models. New Orleans, LA, 2023.
35. Muennighoff N, Wang T, Sutawika L et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.