Abstract

We present here an overview of the BioCreative VIII Task 3 competition, which called for the development of state-of-the-art approaches to automatic normalization of observations noted by physicians in dysmorphology physical examinations to the Human Phenotype Ontology (HPO). We made available for the task 3136 deidentified and manually annotated observations extracted from electronic health records of 1652 paediatric patients at the Children’s Hospital of Philadelphia. This task is challenging due to the discontinuous, overlapping, and descriptive mentions of the observations corresponding to HPO terms, severely limiting the performance of straightforward strict matching approaches. Ultimately, an effective automated solution to the task will facilitate computational analysis that could uncover novel correlations and patterns of observations in patients with rare genetic diseases, enhance our understanding of known genetic conditions, and even identify previously unrecognized conditions. A total of 20 teams registered, and 5 teams submitted their predictions. We summarize the corpus, the competing systems’ approaches, and their results. The top system used a pre-trained large language model and achieved a 0.82 F1 score, which is close to human performance, confirming the impact that recent advances in natural language processing can have on tasks such as this. The post-evaluation period of the challenge, at https://codalab.lisn.upsaclay.fr/competitions/11351, will be open for submissions for at least 18 months past the end of the competition. Database URL: https://codalab.lisn.upsaclay.fr/competitions/11351

Motivation

The field of clinical genetics focuses on identifying the genetic underpinnings of an individual’s medical presentation through the correlation of that individual’s genetic variants with their observable characteristics. These characteristics, which encompass an individual’s physical traits, physiological markers, and functional attributes, are known collectively as their ‘phenotype’. The dysmorphology physical examination is an essential part of clinical phenotyping and diagnostic assessment in clinical genetics. This examination involves documenting often subtle morphological variations in the patient’s facial features or body. It may also detect broader medical indicators, such as neurological dysfunction. The findings enable correlation of the patient with known rare genetic diseases. The dysmorphology assessments are routinely captured as unstructured free text in a listing of organ system observations in the electronic health record (EHR), necessitating standardization for meaningful comparisons and downstream computational analysis. Genetics professionals often turn to the Human Phenotype Ontology (HPO), a specialized ontology designed for human genetics, for this purpose [1]. However, the manual coding of findings into HPO terms is labour-intensive and demands advanced genetics training to ensure coding accuracy. Recognizing the need for scalability and efficiency, there is a growing interest in automating this annotation process with natural language processing (NLP).

Advanced NLP methods can be used to retrieve and standardize phenotypic information from the unstructured free text of the dysmorphology physical exam at a large scale, thereby reducing the overhead associated with manual coding. However, several challenges must be addressed both for the extraction from the EHR and for the normalization to HPO terms. Challenges in the extraction step stem from the descriptive style, the polarity of the findings (findings during an examination may be normal or abnormal), and the disjoint or overlapping nature of these findings. The normalization process is difficult due to the vast scale and inconsistent levels of detail within the HPO, resulting in the need to go beyond typical string matching strategies to efficiently handle variations in term detail and hierarchies. The BioCreative VIII Task 3 competition provided participants with an expertly annotated gold standard corpus from actual (de-identified) patient encounters, and challenged them to develop novel methods to overcome difficulties in extracting phenotypic descriptions from clinical notes and normalizing them to the most relevant HPO term using the latest advancements in NLP.

Advancements in NLP methods have been used to improve the performance of existing phenotype normalization systems over time. Two such systems, Doc2HPO [2] and Txt2hpo (https://github.com/GeneDx/txt2hpo/), use different approaches based on string matching. Txt2hpo matches stemmed terms in the input text to stemmed HPO terms. Doc2HPO extends simple string matching by allowing users to choose from an ensemble of methods, including MetaMap and the NCBO annotator; it creates a union of the results and presents them to the user for acceptance, editing, or rejection. Expanding beyond systems that use string matching, ontology or dictionary lookup, or rule-based engineering, Arbabi et al. [3] encoded the input text to vector representations using a convolutional neural network (CNN). Their system, Neural Concept Recognition, compared these representations to the embeddings from the ontology to identify relevant phrases. This model had some limitations, such as the inability to assess overlapping concepts. PhenoTagger [4] sought to address the limitations of Neural Concept Recognition by developing a system that combined dictionary matching with a BioBERT [5] model to identify HPO concepts, including overlapping terms, in text. To increase system speed, Feng et al. [6] used a two-level hierarchical CNN (TLH-CNN) before implementing a BERT model. This pipeline, named PhenoBERT, extracts relevant text segments using deep learning. These segments are first matched to HPO terms using dictionary-based matching; those not matched are processed by the TLH-CNN module to compute potential HPO terms, which are then input to the pretrained BioBERT model for final classification.

The BioCreative (Critical Assessment of Information Extraction Systems in Biology) shared task series is a community-driven initiative aimed at evaluating information extraction systems in the biological and medical domains. BioCreative was established to address the need for publicly available benchmark datasets and standardized evaluation criteria, ensuring fair comparisons across different natural language processing approaches [7]. Developed by biological database curators and domain experts, these datasets are publicly released to encourage the development of new applications and enhance existing ones. Since its inception in 2004, BioCreative Challenges have attracted an average of ~100 participants per event. BioCreative VIII, held in conjunction with the AMIA Annual Symposium in New Orleans, LA, USA, in 2023, featured four shared tasks. This paper presents the results of Track 3, which focused on evaluating current approaches for extracting and normalizing abnormal findings identified during dysmorphology physical examinations. Competing systems were required to extract spans referring to the findings and normalize them to HPO term IDs. This task is a particular instance of named entity recognition (NER), a fundamental challenge in NLP that requires systems to identify the main objects of discourse before extracting their relationships within specific events described in a document. By structuring unstructured data, NER is a crucial step in the automatic computation of the meaning of written documents, enabling fundamental processes such as literature mining, clinical decision support, and other applications [8].

Task description and corpus

The goal of BioCreative VIII Task 3 was to identify, extract, and normalize findings noted in dysmorphology physical examination reports from clinical notes documented at the Children’s Hospital of Philadelphia (CHOP). These reports are structured as a series of observations, each beginning with a clear reference to the organ system being evaluated. Observations summarize notable findings documented by physicians during the examination in a few concise sentences. Findings have a polarity: key findings indicate abnormal phenotypes potentially suggestive of an underlying genetic condition, while normal findings describe observations that would be expected of a patient without a genetic condition. Normal findings are often explicitly noted to rule out specific conditions, confirming the health status of an organ system and facilitating communication with other healthcare providers. The task presented several challenges for automated information extraction systems: extracting overlapping or disjointed findings, assessing their polarity (with normal findings to be ignored), and normalizing key findings to HPO terms, a difficult process due to the ontology’s extensive size (16 948 terms at the time of the competition).

Clinicians authored the dysmorphology physical examination reports using a specialized data entry form integrated into the EHR system, Epic Systems Inc. [9]. We structured the form into subsections, each dedicated to a specific organ system. Within each subsection, clinicians can input automatically calculated body measurements, select common key findings using predefined buttons, and supplement the report with free-text entries in the individual text box of the subsection. Once completed, the report is generated as a text document and can be incorporated into any clinical note. In April 2022, we collected all free-text entries, i.e. the observations, generated through this form. A total of 34 distinct clinicians contributed, each having authored at least 10 reports.

Our corpus includes 3136 free-text organ system observations from the dysmorphology physical examinations of 1652 patients evaluated at CHOP. Some observations occurred verbatim in multiple patients’ records, but these were combined into a single data point. We automatically de-identified the text of all observations using NLM Scrubber (https://lhncbc.nlm.nih.gov/scrubber/) and manually reviewed the text during the annotation process to preserve patient privacy. We sped up the annotation by pre-processing the observations with a baseline system, PhenoTagger [4]. Using a custom annotator interface (https://github.com/Ian-Campbell-Lab/HPO-Annotator), the annotators followed guidelines designed specifically for the purpose, identifying unambiguously documented key findings in the text as well as specifically mentioned normal findings. The identified findings were normalized to the closest unambiguous term included in the 2022-06-11 release of the HPO (https://github.com/obophenotype/human-phenotype-ontology/releases?page=2). Given the absence of terms for normal findings in the HPO, and to ensure their inclusion in the gold standard, annotators were instructed to normalize normal findings to the closest corresponding abnormal HPO term and mark them as negated. Four physicians and one medical student annotated the corpus. Our annotators double-annotated 890 observations, with an inter-annotator agreement of 0.844, the average F1 score computed over all permutations of annotators. We resolved discrepancies by selecting the annotations produced by the most senior clinician. Full details of the annotation process are reported in [10]. The corpus is freely available at https://github.com/Ian-Campbell-Lab/Clinical-Genetics-Training-Data/.

Our corpus contains 934 HPO terms with a total of 4694 mentions, encompassing both key and normal findings. After removing normal findings, the training/validation set contains 702 HPO terms (2840 mentions), while the test set includes 498 terms (1258 mentions). The most frequent HPO term in both the training/validation and test sets is HP:0100699—scarring, with 53 mentions in the former and 24 in the latter. It is followed by HP:0000218—high palate, which has 43 mentions in the training and validation sets and 23 in the test set. Figure 1 illustrates the frequency distribution of the terms mentioned in the training/validation set and test set. Both distributions exhibit a pattern resembling Zipf’s distribution, with a small number of terms mentioned frequently and the majority infrequently. Specifically, only 11% (77 terms) occur more than 10 times in the training/validation set, and just 4% (18 terms) in the test set. We consider our test set particularly challenging. It contains 135 HPO terms not found in either the training or validation sets, accounting for a total of 148 mentions, with the most frequent of these terms appearing only three times. Since 27% of the terms mentioned in the test set are absent from the training sets, supervised learning systems are required to generalize from their training to achieve perfect performance [11]. Furthermore, 55% of key findings in the test set—686 out of 1258—are expressed using phrases that differ from all preferred terms or their synonyms in the HPO, rendering strict exact-match approaches ineffective.

Figure 1. Frequency distributions of term mentions from the Human Phenotype Ontology in the BioCreative VIII Task 3 corpus.

During the competition, our corpus was only released to registered participants; it is now publicly available. Each observation in the corpus contained (1) an observation ID uniquely identifying the observation; (2) the text of the observation starting with the organ system evaluated; (3) a term ID from the HPO ontology; (4) the start and stop positions of the span or spans of text denoting the finding; and (5) the polarity of the finding, ‘Negated’ if the finding is normal or empty if the finding is abnormal. Examples of the annotated data are shown in Table 1.

Table 1.

Examples of annotated observations with HPO terms, spans, and polarity—indicating a key finding if left empty, or a normal finding if marked as ‘Negated’.

| ID | Text | HPO term | Spans | Polarity |
|----|------|----------|-------|----------|
| 1 | EYES: partial synophrys, long lashes, horizontal slant | HP:0000664 | 14–23 | |
| 1 | EYES: partial synophrys, long lashes, horizontal slant | HP:0000527 | 25–36 | |
| 2 | MOUTH: normal lips, tongue, high palate | HP:0000218 | 28–39 | |
| 2 | MOUTH: normal lips, tongue, high palate | HP:0000159 | 7–18 | Negated |
| 2 | MOUTH: normal lips, tongue, high palate | HP:0000157 | 7–13, 20–26 | Negated |
| 3 | NEUROLOGIC: very active | NA | NA | NA |

Green highlights denote spans of key findings, while yellow highlights normal findings.

All special circumstances within the data and annotations were documented for the participants. In particular, when observations mentioned two or more findings, the observation was repeated for each finding, with one finding annotated per repetition, as illustrated in Table 1 for observations 1 and 2. Additionally, mentions of findings could span discontinuous segments of text. For these findings, we reported the start and end positions of each segment, separated by commas, in the order the segments appeared in the text, as shown for the discontinuous mention of the finding normal tongue in the third repetition of observation 2 in Table 1. The HPO ontology does not have terms to denote normal findings. To work around this limitation, when possible, we normalized normal findings to the corresponding most specific abnormal term in the ontology and annotated them as negated; when this was not possible, i.e. no related terms were available to normalize the findings, we annotated the finding as ‘not available’. Of note, phenotypes encoded in the HPO that cannot be observed during a physical examination in a genetic encounter need not be considered by the normalizer. A list of these ‘non-observable’ terms in the HPO, deemed irrelevant to the task, was shared with the registered participants.
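
For illustration, the helper below is a minimal sketch (our own, not part of the official release) of how the comma-separated span format could be parsed and used to reassemble a discontinuous mention from the observation text:

```python
def parse_spans(span_field: str) -> list[tuple[int, int]]:
    """Parse a span annotation such as '7-13, 20-26' into (start, end) offset pairs."""
    pairs = []
    for segment in span_field.replace("\u2013", "-").split(","):
        start, end = segment.strip().split("-")
        pairs.append((int(start), int(end)))
    return pairs


def mention_text(observation: str, span_field: str) -> str:
    """Reassemble a (possibly discontinuous) mention from its character offsets."""
    return " ".join(observation[s:e] for s, e in parse_spans(span_field))


# Example from Table 1 (observation 2, normal finding spanning two segments):
obs = "MOUTH: normal lips, tongue, high palate"
print(mention_text(obs, "7-13, 20-26"))  # -> "normal tongue"
```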

We randomly split the annotated corpus into three subsets: a training set (55%, 1716 observations), a validation set (15%, 454 observations), and a test set (30%, 966 observations). In addition, we added 2427 decoy observations to the test set, consisting of unannotated clinical observations collected from the EHR at the time we were preparing the datasets for the competition. The test set released to the participants contained only the observation IDs and the text of the observations.
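
As a rough illustration of the proportions described above (the actual split and decoy-selection scripts were not released, so the details below are assumptions), such a split could be produced as follows:

```python
import random


def split_corpus(observation_ids: list[str], seed: int = 0):
    """Randomly split observation IDs into ~55/15/30% train/validation/test subsets."""
    ids = list(observation_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.55 * len(ids))
    n_val = int(0.15 * len(ids))
    train = ids[:n_train]
    validation = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, validation, test


# The released test set additionally mixed in 2427 unannotated decoy observations,
# i.e. the released file combined the annotated test observations with the decoys.
```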

In April 2023, the organizers released the training and validation sets to the registered participants. We released the test set on 15 September 2023. We chose CodaLab Competitions [12], a free and open-source web-based platform, to host the competition. Participants were allotted 3 days to automatically predict the extraction and normalization of spans in the test set. While we cannot entirely rule out the possibility of participants manually correcting their predictions before submission, if any corrections were attempted, the large number of decoy observations in the test set and the strict 3-day submission window should have limited their number and their impact on the overall performance of the systems. Participants were required to submit their predictions by 18 September 2023, online to the Task 3 competition site on CodaLab (https://codalab.lisn.upsaclay.fr/competitions/11351). Each submission triggered an automated script that evaluated the submitted system’s predictions against our labelled test set. The test set was also uploaded to CodaLab but hidden from participants. Thus, participants could see their results instantaneously, but not how they compared to those of others: we hid the leaderboard in CodaLab until the official release of the results during the BioCreative VIII workshop (https://www.ncbi.nlm.nih.gov/research/bionlp/biocreative#bioc-8). We limited each participating team to a maximum of three different system prediction submissions.

The use of CodaLab not only eliminated the need for the manual management of participants’ submissions and results but also allowed us to keep the competition open beyond the BioCreative event. This means that any researcher can register on CodaLab for access to the data—i.e. the labelled training and validation sets and the unlabelled test set, including the decoy observations—and submit their system’s predictions on the test set to the competition for evaluation for at least 18 months after the competition. Their results will be evaluated using the same evaluation script, against the same test data, thus ensuring an equitable comparison against all previously evaluated systems.

Evaluation

Metrics

We evaluated each competing system on two subtasks of the overall task. For subtask A, we evaluated the ability of the competing systems to normalize to HPO terms all mentions of key findings in an observation (Normalization-only), regardless of whether they could detect the spans of the mentions. We selected the standard precision, recall, and F1 scores to measure their performance on subtask A. A true positive (TP) is a correctly predicted HPO term for a key finding in an observation, that is, a predicted HPO term for an observation that exactly matches one of the annotated HPO terms for that observation in the gold standard. A false positive (FP) is a predicted HPO term that does not exactly match one of the HPO terms in the gold standard for that observation. A false negative (FN) is an HPO term that was present for an observation in the gold standard but was missed by the system. The precision (P) is the ratio of all correct HPO term predictions (TP) to all HPO terms predicted by the system (TP + FP), equation (1). The recall (R) is the ratio of all correct HPO term predictions (TP) to all HPO terms in the gold standard (TP + FN), equation (2). The F1 score (equation (3)), which was used to summarize the overall performance of the system, is the harmonic mean of P and R.

\( P = \frac{TP}{TP + FP} \)  (1)
\( R = \frac{TP}{TP + FN} \)  (2)
\( F_1 = \frac{2 \times P \times R}{P + R} \)  (3)

In subtask B, we evaluated the systems’ ability to both detect the spans of mentions of key findings and normalize them, as a supplementary evaluation (Overlapping Extraction and Normalization). We selected the overlapping precision, recall, and F1 scores as the metrics for subtask B. For overlapping extractions, the system was rewarded when it extracted the spans, or a part of the spans, of a labelled key finding mention and correctly assigned the labelled HPO term ID to the mention. For example, in Table 1, a key finding is ‘high palate (span 28–39)—HP:0000218’. If the system predicted ‘palate (span 33–39)—HP:0000218’, it would be scored as a TP because ‘palate’ (span 33–39) is a substring of ‘high palate’ (span 28–39). Note, however, that the predicted HPO terms were required to match exactly the HPO terms in the gold standard for the mentions of the key findings; if the system predicted a different HPO term on the overlapping span, the prediction would be scored as an FP. In the competition, we selected as the best system the one that achieved the best F1 score on the Normalization-only evaluation. This subtask and metric were chosen based on the real-world application of the system, where only the normalization of key findings in observations is medically relevant to physicians. We computed all metrics at the level of individual observations.
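
The scoring logic described above can be sketched as follows (our own reading of the metrics, not the official CodaLab evaluation script; the input data structures are assumptions):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 as defined in equations (1)-(3)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def score_normalization(gold: dict[str, set[str]], pred: dict[str, set[str]]):
    """Subtask A: exact match of predicted HPO IDs, computed per observation.

    gold and pred map an observation ID to the set of HPO IDs for its key findings."""
    tp = fp = fn = 0
    for obs_id in gold.keys() | pred.keys():
        g, p = gold.get(obs_id, set()), pred.get(obs_id, set())
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    return prf(tp, fp, fn)


def spans_overlap(pred_spans, gold_spans) -> bool:
    """Subtask B credit: any predicted segment overlapping any gold segment
    (the predicted HPO ID must still match the gold ID exactly)."""
    return any(ps < ge and gs < pe for ps, pe in pred_spans for gs, ge in gold_spans)
```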

Baseline systems

Several baseline systems, freely available and open source, could be used off the shelf to perform our task. We evaluated txt2hpo (https://github.com/GeneDx/txt2hpo/), Doc2HPO [2], NeuralCR [3], PhenoBERT [6], and PhenoTagger [4]. A brief overview of these systems is given in the ‘Motivation’ section. We ran and evaluated these systems on our gold standard test set without further training. PhenoTagger was the best-performing system, with an F1 score of 0.633 on the normalization-only subtask, as reported in Table 2. The authors of PhenoTagger approached the problem with a method that combines dictionary matching with machine learning. The authors compiled a dictionary from the list of all observable terms and their synonyms in the HPO. They used this dictionary to build a distantly supervised training dataset. They trained a BioBERT model to classify each n-gram into an HPO term ID or the special tag ‘None’, retaining predictions that surpassed a predetermined threshold. The outcomes of the dictionary matching and the classifier were subsequently combined to generate the final predictions.
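
The dictionary-plus-classifier strategy underlying this baseline can be sketched as follows (a toy illustration of the idea, not PhenoTagger’s actual implementation; the dictionary, classifier, and threshold are placeholders):

```python
# Toy dictionary mapping lower-cased HPO labels/synonyms to term IDs.
HPO_DICT = {"high palate": "HP:0000218", "synophrys": "HP:0000664"}


def ngrams(tokens: list[str], max_n: int = 4):
    """Yield all n-grams of the token list up to length max_n."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_n, len(tokens)) + 1):
            yield " ".join(tokens[i:j])


def tag(text: str, classifier=None, threshold: float = 0.9) -> set[str]:
    """Dictionary lookup over n-grams, with an optional classifier fallback that
    maps an n-gram to an HPO ID (or 'None') together with a confidence score."""
    tokens = text.lower().replace(",", " ").split()
    hits = {HPO_DICT[g] for g in ngrams(tokens) if g in HPO_DICT}
    if classifier is not None:
        for g in ngrams(tokens):
            term_id, prob = classifier(g)
            if term_id != "None" and prob >= threshold:
                hits.add(term_id)
    return hits


print(tag("MOUTH: normal lips, tongue, high palate"))  # {'HP:0000218'}
```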

Table 2.

Systems performance (F1: F1 score; P: precision; R: recall) and system summaries (TG: term generation; TE: term extraction; TN: term normalization). The highest scores are highlighted in bold.

| Team | Normalization-only (F1 / P / R) | Overlapping extraction and normalization (F1 / P / R) | System synopsis |
|------|------|------|------|
| Soysal and Roberts [13] | **0.820** / **0.842** / **0.799** | **0.817** / **0.841** / **0.794** | TG: ChatGPT + TN: exact matching on stems |
| Qi et al. [14] | 0.763 / 0.831 / 0.706 | 0.762 / 0.830 / 0.704 | TE: multiple W2NER instances relying on various BERT models + TN: ensemble of Bioformer and dictionary matching |
| Kim et al. [15] | 0.745 / 0.735 / 0.755 | 0.743 / 0.734 / 0.752 | TG: fine-tuned ChatGPT + TN: synonym marginalization (BioSyn) |
| Alhassan et al. [16] | 0.723 / 0.718 / 0.728 | 0.721 / 0.717 / 0.726 | TG: FLAN-T5-XL fine-tuned with LoRA + TN: distance similarity (RoBERTa) and candidate re-ranking with sentence transformer cross-encoder |
| Lin et al. [17] | 0.644 / 0.762 / 0.557 | 0.642 / 0.761 / 0.556 | TN: ensemble (PhenoTagger and PhenoBERT) + TE: BioLinkBERT |
| Baseline [4] | 0.633 / 0.587 / 0.687 | 0.632 / 0.586 / 0.685 | Ensemble of a BioBERT multi-class classifier and dictionary matching |

Systems

Results

Of the 20 teams that registered for the shared task, 5 ultimately participated, each submitting 3 prediction files (the maximum number of submissions authorized). We kept the best predictions for each team. We present the results of each team and a brief synopsis of the architecture of their best approach in Table 2. All systems outperformed the best-performing baseline system, PhenoTagger. Based on a large generative model for the extraction and a combination of generation and dictionary matching for the normalization, the best system [13] achieved an F1 score of 0.82, only 2 points under human performance on this task (i.e. the inter-annotator agreement of 0.844 average F1 score). This confirms the recent technical improvements made possible by large language models (LLMs).

Individual system descriptions

The five teams who submitted predictions of their systems for evaluation against the gold standard were invited to submit a technical summary describing in detail their approach to creating their system(s) for the BioCreative VIII Task 3 competition. Table 3 presents a summary of the teams that participated. The individual system descriptions are presented next in rank order of their results in the competition, summarized from their submitted technical abstracts [13, 14, 15, 16, 17].

Table 3.

Summary of participating teams.

| Team | Institution | Country | System description paper |
|------|-------------|---------|---------------------------|
| 1 | The University of Texas Health Science Center at Houston | USA | Soysal and Roberts [13] |
| 2 | Dalian University of Technology | China | Qi et al. [14] |
| 3 | Korea University, Imperial College, AIGEN Sciences, University of Edinburgh, University of Nottingham | Korea, UK | Kim et al. [15] |
| 4 | University of Manchester, King Saud University, ASUS | United Kingdom, Saudi Arabia, Singapore | Alhassan et al. [16] |
| 5 | National Cheng Kung University | Taiwan | Lin et al. [17] |

Team 1: The University of Texas Health Science Center at Houston

Summary: Term extraction was performed using the generative model ChatGPT with few-shot learning to identify the spans of key findings. Term normalization was first attempted through dictionary matching on the extracted spans. If no match was found, dictionary matching was then applied to the corresponding HPO-preferred terms generated by ChatGPT.

We used, at the time of writing, the latest version of OpenAI’s LLM, GPT-4 [18], to solve the problem of extracting key findings from given observations and normalizing them to concepts in the HPO. Prior to running the model, we performed several preprocessing steps. First, we reviewed the annotations in the training and validation sets and corrected any inconsistencies. We corrected inconsistencies in the spans selected for HPO terms, since improving consistency in the spans would improve performance in span selection. To increase performance in HPO concept selection for a given entity, we also corrected inconsistencies in the HPO terms selected in the annotations, which may arise when several similar terms all closely match a selected entity. Our second preprocessing step was to remove all normal finding concept annotations from the training and validation sets.

Using OpenAI’s ChatCompletion API (https://platform.openai.com/docs/api-reference/chat) with the GPT-4 model, we prompted the model to extract key findings from a given text and return an answer with two elements: (i) the HPO term and (ii) the original text marked with brackets around the words associated with the HPO term. We used the bracketed text to identify the character offsets for each entity, allowing for the extraction of continuous or disjoint entity spans. We used this approach to overcome the shortcomings in GPT-4’s ability to identify character offsets. Next, we experimented with a few-shot learning approach where we provided examples with each request to better convey our intent to GPT-4. For each request, we added 25 examples generated specifically for that request. The first 15 examples were selected from the training data as those most similar to the observation text, each including at least one annotation. Similarity was scored using spaCy document similarity [19], which computes cosine similarity over an average of word vectors. Five manually selected ‘tricky’ examples were added to guide GPT-4 through task requirements that were not always followed as described in the prompt, and the final five were negative examples that contained no key finding, to convey to GPT-4 that empty responses were allowable.
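
The bracket-based offset recovery can be illustrated with the short sketch below (our reconstruction from the description above; the real prompts, the similarity-based example selection, and the API call itself are not reproduced, and the bracket convention is assumed to be non-nested):

```python
def spans_from_brackets(bracketed: str) -> list[tuple[int, int]]:
    """Recover character offsets in the original text from a copy in which the
    model has wrapped each extracted finding in square brackets."""
    spans, start, pos = [], None, 0  # pos tracks the position in the bracket-free text
    for ch in bracketed:
        if ch == "[":
            start = pos
        elif ch == "]":
            spans.append((start, pos))
        else:
            pos += 1
    return spans


marked = "EYES: partial [synophrys], [long lashes], horizontal slant"
print(spans_from_brackets(marked))  # [(14, 23), (25, 36)] -- matches Table 1
```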

For entity normalization, we built a dictionary of HPO terms, including the preferred terms, synonyms, and the labelled spans in the annotated observations, that map to HPO IDs. All terms in the dictionary were stemmed to facilitate exact matching. We used two different matching algorithms to normalize the extracted named entities. In the first matching algorithm, the first part of the answer, the HPO term identified by GPT-4 for an observation, is ignored. The second part of the answer, the marked text, is stemmed and matched against the entries of the dictionary. The second matching algorithm follows the steps of the first; however, if no match is found in the dictionary, the HPO term identified by GPT-4 is stemmed and that term is then searched for in the dictionary.
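
A minimal sketch of the stem-based lookup, with the GPT-generated HPO term as a fallback (our illustration; the team’s dictionary also included labelled spans from the training annotations, omitted here):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_key(phrase: str) -> str:
    """Stem every token of a phrase to build a matching key."""
    return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())


# Toy dictionary: stemmed HPO preferred terms and synonyms -> HPO ID.
hpo_index = {stem_key("high palate"): "HP:0000218",
             stem_key("long eyelashes"): "HP:0000527"}


def normalize(extracted_span: str, gpt_term: str | None = None) -> str | None:
    """Algorithm 1 uses only the extracted span; algorithm 2 falls back to the
    HPO term suggested by GPT-4 when the span itself does not match."""
    hit = hpo_index.get(stem_key(extracted_span))
    if hit is None and gpt_term is not None:
        hit = hpo_index.get(stem_key(gpt_term))
    return hit


print(normalize("high palate"))                    # HP:0000218
print(normalize("long lashes", "Long eyelashes"))  # HP:0000527 (via the fallback)
```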

We submitted two runs for evaluation: the first run used GPT-4 with few-shot learning and our first matching algorithm, and the second run used GPT-4 with few-shot learning and our second matching algorithm. Our second run, using both the extracted entity and the GPT-4-identified HPO term for normalization, achieved the highest F1 score of 0.8197 for normalization-only and 0.8168 for overlapping extraction and normalization. We found that our few-shot learning approach, in which GPT-4 was shown examples of expected responses, significantly improved performance over trying to elicit the same response behaviours from prompting alone. The system is available at https://github.com/esoysal/phenormgpt.

Team 2: Dalian University of Technology

Summary: Term extraction was performed using the W2NER architecture instantiated with various BERT-based classifiers working in parallel to identify spans of all findings. All extracted candidates were normalized using the classifier Bioformer, with additional candidates retrieved through dictionary matching. Only the most likely candidates were passed to a voting ensemble to select their final HPO IDs. Postprocessing rules were applied to remove overlapping candidates and normal findings.

To automatically extract and normalize key findings from organ system observations, we employed a deep learning-based pipeline approach that divided the process into two subtasks: NER and named entity normalization (NEN). This approach was based on our prior work, PhenoTagger [4]. For the phenotype entity recognition part, we sought to extend the baseline system PhenoTagger, which demonstrates competency in the recognition of contiguous phenotype entities but has limitations in effectively identifying discontinuous entities. We addressed this challenge by leveraging the W2NER system [20], which reframes the NER task into predicting relationship categories between word pairs. We tried Bioformer [21], BioBERT [5], BioLinkBERT-large [22], BioM-ELECTRA-large [23], Clinical BERT [24], PubMedBERT [25], and Clinical PubMedBERT [26] models for W2NER, including the large versions of both BioBERT and PubMedBERT. The models were trained using the training set from the competition. For entity normalization, we used the deep learning-based classification method from PhenoTagger to classify the NER results to a specific HPO term. We experimented with Bioformer, BioBERT, and PubMedBERT, choosing Bioformer as the final classification model based on the results on the validation set (F1 scores of 0.7165, 0.7142, and 0.7140, respectively). After passing through Bioformer, candidate entities are classified by the softmax layer, which outputs a probability score. We manually set a threshold, keeping only the results with a probability above this threshold. We experimented with two thresholds to generate the NEN results: 0.8 and 0.95. On the validation set, the model had a higher recall rate at a threshold of 0.8 and higher accuracy at a threshold of 0.95. Overall, when evaluating the performance of our system on the validation set across identical indicators, a threshold of 0.8 yielded a better F1 score.
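
The probability-threshold filtering can be sketched as follows (illustrative only; the Bioformer classification head and the HPO label inventory are assumed, not shown):

```python
import torch


def filter_by_threshold(logits: torch.Tensor, id2term: dict[int, str],
                        threshold: float = 0.8) -> list[tuple[str, float]]:
    """Keep candidate HPO terms whose softmax probability exceeds the threshold.

    logits: a (num_candidates, num_hpo_labels) score matrix from the classifier."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = probs.max(dim=-1)
    kept = []
    for prob, idx in zip(top_probs.tolist(), top_idx.tolist()):
        if prob >= threshold:
            kept.append((id2term[idx], prob))
    return kept
```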

An additional voting ensemble method for entity recognition was created using the nine distinct models: a threshold, m, was set, and an entity extracted by more than m models was selected as part of the ensemble result. We ran experiments with m = 2, 3, 4, and 5, achieving the best results with m = 2. We applied post-processing rules to the ensemble results to remove overlapping recognition results and normal findings. Finally, entities identified using the dictionary-based part of PhenoTagger were added as supplements to the results. On the validation set, the ensemble achieved the best result on the normalization-only task with an F1 score of 0.7975. The highest-scoring single model on normalization-only was W2NER (PubMedBERT-large) combined with PhenoTagger, achieving an F1 score of 0.7642, a 4.45% increase in performance over PhenoTagger alone. For the entity extraction and normalization task, this model had a 6.63% performance increase over PhenoTagger (F1: 0.7317 vs. 0.6654), although, as before, the ensemble method achieved the highest overall score with an F1 score of 0.7502.
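
The voting scheme can be sketched as follows (our reading of "extracted by more than m models"; the prediction data structure is illustrative):

```python
from collections import Counter


def vote(model_outputs: list[set[tuple[str, tuple[int, int]]]], m: int = 2):
    """Ensemble NER predictions from several models.

    Each model output is a set of (hpo_id, span) pairs; an entity is kept in the
    ensemble result when more than m of the models predicted it."""
    counts = Counter()
    for output in model_outputs:
        counts.update(output)
    return {entity for entity, n in counts.items() if n > m}
```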

For the final submission, the highest scoring run for normalization was the model trained on the training and validation sets, with the NEN threshold set to 0.95 to generate the final results. The model achieved an F1 score of 0.7632 on the official test set.

Team 3: Korea University/Imperial College/AIGEN Sciences/University of Edinburgh/University of Nottingham

Summary: Term extraction was performed using a generative fine-tuned ChatGPT model to identify spans of all findings and subsequently classify them as normal or key findings. Term normalization was conducted with the classifier BioSyn, trained using synonym marginalization on the shared task training set.

We developed a pipeline consisting of two parts. In the first part, we aimed to identify HPO entities present in the input text using NER. We first identified four edge cases in the training data: observations with discontinuous findings, observations with only continuous findings, observations with normal findings, and observations with no findings. We split the set into 70/30 training/validation sets, keeping even proportions of edge cases in each set. We assessed multiple models that have shown state-of-the-art performance in both continuous and discontinuous NER. We chose ChatGPT and W2NER, which outperformed other competing systems. We constructed our first NER model using the ChatGPT Finetuning API (https://platform.openai.com/docs/guides/model-optimization). The data were preprocessed to expand abbreviations and translate symbols into text format. We performed the extraction in two steps using the fine-tuned ChatGPT. In the first step, we extracted all findings, whether key or normal. In the second step, we classified the extracted findings into the two categories. Our second NER model used the W2NER architecture, which we optimized by tuning its hyperparameters. We evaluated several BERT models to use as the first layer of the system, including BioBERT [5], SciBERT [27], PubMedBERT [25], and ClinicalBERT [28]. We found that ClinicalBERT performed best. We trained W2NER to extract only key findings.

For the second part of our pipeline, the NEN, we used a combination of methods. To incorporate more synonyms and create a generalizable dictionary, we flattened the HPO dictionary and removed unused terms. We used SapBERT [29] for embeddings. As it was pre-trained on UMLS data, we undertook pre-fine-tuning of SapBERT using our dictionary. We call this pre-fine-tuned model PhenoSapBERT. To enhance normalization, we used the synonym marginalization method BioSyn [30], which uses an iterative candidate retrieval method and additive synonym incorporation during marginalization.
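
Dense candidate retrieval with a SapBERT-style encoder can be sketched as follows (the checkpoint name, [CLS] pooling, and the toy ontology slice are assumptions; the BioSyn training loop is omitted):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)


@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """[CLS] embeddings for a batch of phrases."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]


hpo_names = ["High palate", "Long eyelashes", "Synophrys"]  # toy ontology slice
hpo_vectors = torch.nn.functional.normalize(embed(hpo_names), dim=-1)


def top_candidates(mention: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank HPO names by cosine similarity to an extracted mention."""
    query = torch.nn.functional.normalize(embed([mention]), dim=-1)
    scores = (query @ hpo_vectors.T).squeeze(0)
    best = scores.topk(k)
    return [(hpo_names[i], float(s)) for s, i in zip(best.values, best.indices)]
```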

For our submissions, we used models with the fine-tuned ChatGPT, W2NER, and an ensemble of the two for NER. All models used BioSyn for NEN. The best model was the fine-tuned ChatGPT model, which achieved an F1 score of 0.7448 for normalization-only and an F1 score of 0.7428 for overlapping extraction and normalization.

Team 4: University of Manchester/King Saud University/ASUS

Summary: Term extraction was performed using a generative fine-tuned T5 model to identify the spans of key findings. For normalization, RoBERTa was used to embed the identified key findings alongside all HPO terms. The 30 most similar HPO terms were then scored using a sentence transformer cross-encoder, and the top-ranked terms were selected as their corresponding HPO terms.

For BioCreative VIII Task 3, we developed DiscHPO, a two-component pipeline. The first component detects continuous and discontinuous named entity spans, and the second normalizes the extracted spans to associated HPO identifiers. We framed the NER problem as a sequence-to-sequence problem [31]. In our preprocessing, we converted the numeric span offsets in the training data to the actual word span prefixed with the entity type: key finding or normal finding.

For the NER component in our pipeline, we investigated several variants of the Text-to-Text Transfer Transformer (T5) [32] to create a sentence-level NER module based on fine-tuning a pre-trained sequence-to-sequence encoder–decoder language model. We examined several T5 architectures, including the original T5, Flan-T5 [33], and SCIFIVE [34]. For optimizing the Flan-T5-XL architecture, we used low-rank adaptation (LoRA) [35], a parameter-efficient fine-tuning approach.

For the normalization component, we fine-tuned the all-roberta-large-v1 model (https://huggingface.co/sentence-transformers/all-roberta-large-v1) to create embeddings for both the HPO terms in the ontology and the entity spans extracted from the observations. We excluded the HPO terms on the exclusion list provided by the task organizers. To train on semantic similarity, pairs of spans from the training set and their HPO terms were used for fine-tuning. Then, to identify relevant candidates, we compared the embeddings representing the spans with those representing the HPO terms, performing semantic matching with cosine similarity. The top 30 matches based on semantic similarity were then passed to a sentence transformer cross-encoder model, ms-marco-electra-base (https://huggingface.co/cross-encoder/ms-marco-electra-base), which re-ranks the candidate HPO terms by calculating a score for each span-candidate term combination. After the results were sorted, we selected the top-scoring match as the final output.
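
A compact sketch of this retrieve-then-re-rank design using the two public checkpoints named above (the fine-tuning on span/HPO pairs and the exclusion list are omitted; the ontology slice is a toy example):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("sentence-transformers/all-roberta-large-v1")
re_ranker = CrossEncoder("cross-encoder/ms-marco-electra-base")

hpo_terms = ["High palate", "Long eyelashes", "Synophrys"]  # toy ontology slice
hpo_embeddings = bi_encoder.encode(hpo_terms, convert_to_tensor=True)


def normalize(span: str, top_k: int = 30) -> str:
    """Retrieve the most similar HPO terms, then re-rank them with the cross-encoder."""
    span_embedding = bi_encoder.encode(span, convert_to_tensor=True)
    hits = util.semantic_search(span_embedding, hpo_embeddings,
                                top_k=min(top_k, len(hpo_terms)))[0]
    candidates = [hpo_terms[hit["corpus_id"]] for hit in hits]
    scores = re_ranker.predict([(span, candidate) for candidate in candidates])
    return candidates[int(scores.argmax())]


print(normalize("long lashes"))
```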

We evaluated the different T5 models, with the normalization module, on the validation set using the evaluation script provided by the task organizers. We obtained the best results on this dataset with Flan-T5-XL fine-tuned with LoRA, with an F1 score of 0.742 for normalization-only and 0.738 for overlapping extraction and normalization. For our submitted final runs, we used the model with this architecture, varying the alpha hyperparameter of LoRA to either 512 or 1024 and training our model either on the training dataset alone or on the training set combined with the validation set. The normalization component was unchanged in all runs. We achieved our highest score on the official test set using Flan-T5-XL with LoRA α = 512, trained solely on the training set, resulting in an F1 score of 0.7220.

Team 5: National Cheng Kung University

Summary: The focus was on improving term extraction from existing baselines, PhenoTagger and PhenoBERT, which cannot process discontinuous findings. These baselines first extracted and normalized findings. A fine-tuned sequence labeller, BioLinkBERT, was then used to discard normal findings and retrieve spans of key findings from the observations.

We concentrated on the extraction and normalization subtask due to the limitation of existing methods, which can only identify consecutive spans of HPO terms. In the training and validation sets, 14.4% and 14% of findings, respectively, are annotated with discontinuous spans, implying that these methods would be likely to miss these observations. To improve upon the performance of the baseline systems, we proposed a sequence tagging framework. We preprocessed the observations, first by tokenizing them using the ‘word_tokenize’ function of NLTK [36]. Each token was then labelled for sequence tagging as either key finding, normal finding, or other. For sequence-level training, we leveraged the HPO dictionary provided by the organizers. We then appended the associated HPO text (also tokenized by NLTK) to the observation, inserting a [SEP] token to separate the two types of text. We fine-tuned BERT for sequence tagging, testing our approach with PubMedBERT [25], BioLinkBERT [22], and Bioformer [21].
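
A sketch of how an observation and a candidate HPO term could be assembled into a tagging instance (our illustration of the described input format, with a simplified key-finding/other label set):

```python
from nltk.tokenize import word_tokenize


def build_instance(observation: str, hpo_text: str, key_spans: list[tuple[int, int]]):
    """Tokenize the observation and the candidate HPO term, join them with [SEP],
    and label observation tokens as key finding (KEY) or other (O)."""
    tokens, labels, offset = [], [], 0
    for token in word_tokenize(observation):
        start = observation.index(token, offset)
        end = start + len(token)
        offset = end
        inside = any(start >= s and end <= e for s, e in key_spans)
        tokens.append(token)
        labels.append("KEY" if inside else "O")
    appended = ["[SEP]"] + word_tokenize(hpo_text)
    return tokens + appended, labels + ["O"] * len(appended)


obs = "MOUTH: normal lips, tongue, high palate"
print(build_instance(obs, "High palate", [(28, 39)]))
```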

To obtain the predictions for Task 3, we structured our pipeline in three main steps: (1) we employed base models, such as PhenoTagger [4] or PhenoBERT [6], to generate a preliminary prediction set of HPO terms from the observations in the evaluation set; (2) we performed span localization based on the preliminary prediction set, appending each HPO term of the preliminary prediction set to an observation with a [SEP] token and then using our trained sequence tagging model to predict the labels of the input tokens; and (3) we aggregated the subword tokens from the BERT output and obtained the HPO positions corresponding to the tokens predicted as key findings. HPO terms predicted as ‘normal’ by our sequence tagging model were filtered out before obtaining the final prediction. For the normalization-only task, our system achieved an F1 score of 0.644 on the official test set. Our model did improve exact span localization over the base models, implying that these methods could be used on top of other approaches to enhance span localization capabilities.

Discussion

All systems proposed a common, yet effective, pipeline approach that divides the process into two subtasks: extraction followed by normalization [37]. The teams adopted two different strategies to handle discontinuous and overlapping terms, which accounted for 16.9% (213/1258) of the terms in our test set. The first strategy unifies the extraction of all terms (continuous and discontinuous) by identifying relations between the tokens. This strategy was successfully implemented by Qi et al. [14] with an ensemble of W2NER [20]. The second strategy leverages recent advancements in generative models to reframe the standard NLP task of entity extraction as a straightforward question-answering task [38]. Teams adopting this strategy prompted an LLM to identify and list all HPO terms mentioned in an observation, requesting the LLM to format its response according to a predefined structure, such as a tabular or JSON format, thus enabling the automatic retrieval of spans corresponding to the mentions [8]. This strategy was the most popular in our task, with three of the five competing systems implementing it [13, 15, 16]. All systems explicitly handled the detection of normal findings, either by detecting negations in the context of the extracted terms or by training dedicated classifiers. We previously found that explicitly handling normal findings in this way is very effective, since our transformer-based model [10], trained in a similar way, identified the normal findings in the test set almost perfectly.

The participants of the shared task employed all currently known approaches to normalize the terms, i.e. extraction-based, retrieval-based, and generative-based approaches, making their methodologies representative of the state-of-the-art in the field [39]. The intuitive exact matching approach selects the HPO term that matches exactly a key finding extracted in an observation. It was the default approach for three out of five systems; however, due to its obvious limitation of normalizing only matching candidates, it was combined by all three with machine learning to process the remaining cases.

All but one system normalized those cases not handled by exact matching using standard retrieval-based approaches [40], either by predicting the most likely HPO term with a multi-class classifier [14, 17] or by selecting the HPO term most similar to the candidate within an embedding space [15, 16]. Both approaches have well-known limitations. First, multi-class classifiers struggle with limited generalization and scalability: they rely heavily on their training data, which makes them ineffective at handling HPO terms that are rare or unseen during training. This limitation becomes even more pronounced as the number of possible target terms increases, as when normalizing to large ontologies like the HPO, making the training of such systems with multi-class classification increasingly difficult, as more data would need to be annotated to improve their performance [40]. On the other hand, clustering similar HPO term-candidate pairs faces challenges due to impoverished semantic representations: these models rely on computing meaningful representations of both the candidate phrase and its corresponding HPO term to assess similarity accurately. However, this computation may be difficult for several reasons: the candidate may be described rather than explicitly named, or contain out-of-vocabulary tokens, such as misspellings; HPO terms, while typically defined in the ontology, lack real-world usage examples; and contextual understanding remains limited, as incorporating all relevant information from the surrounding sentence remains a challenge—one that only the largest language models have recently begun to address properly [41].

The most innovative and successful system in the task [13] leveraged the capabilities of LLMs for normalization. The authors employed prompt engineering to generate HPO terms corresponding to extracted candidates using the most advanced LLM available at the time of the competition. While this approach was conceptually simple, it proved highly effective by avoiding common pitfalls and maximizing the strengths of LLMs. To minimize hallucinations—such as mapping an HPO term to a phrase not found in the observation text—the system combines extraction and normalization within a single prompt, reducing the risk of generating inaccurate or fabricated information. This approach forced the LLM to first tag the candidate to normalize within the text of the observation before generating the associated HPO term, effectively anchoring its decision in context. Additionally, instead of prompting the LLM to directly generate HPO term IDs—which LLMs empirically struggle with [42]—the authors instructed the model to generate the preferred HPO term, for which the LLM was more likely to have an internal representation. It is uncertain whether the HPO (or a part of it, such as term definitions) was explicitly included in the pretraining corpus of the proprietary LLM used by this team (GPT-4). However, the model had clearly acquired knowledge of most, if not all, HPO terms—either directly or indirectly—through public sources such as the scientific literature or online discussions. Growing evidence suggests that LLMs develop internal representations of real-world entities and their relationships [43], allowing them to move beyond mere memorization of training data. Instead, they generalize by inferring new facts and integrating them into their reasoning. We hypothesize that LLMs maintain an internal representation of most HPO terms, which they utilize to compute a meaningful similarity score between the labelled HPO term and its candidate mention, understanding the latter in the context of the observation. This ability likely contributes to the effectiveness of LLM-based normalization approaches to solve the task.

Despite the advantages of LLM-based normalization, the best-performing system [13] is not flawless, and certain observations remain challenging. We analysed 105 errors across 83 randomly selected observations. Among these, 36 were FNs, where the system failed to detect HPO terms explicitly mentioned in the text. Twenty-four were FPs, in which the system mistakenly identified normal findings as key findings. The remaining 45 errors involved cases where the system correctly identified the relevant spans—or portions of them—but assigned incorrect HPO terms.

We summarize the error categories and provide examples in Table 4. Through manual analysis, we identified 10 non-exclusive error categories. Notably, two categories (lines a and b in Table 4) account for a third of all errors. The most frequent errors (a. n = 21) are cases where the model retrieves hypernyms of the annotated HPO terms. While these terms are more general than those selected by clinicians, they remain accurate and closely related, making them acceptable in certain use cases. The next most common errors (b. n = 18) arise from descriptive observations, where clinicians describe findings rather than using specific medical terms present in the HPO. These descriptions often lack lexical overlap with the preferred terms or their synonyms, making them challenging for the model to normalize. Despite the generalization capabilities of LLMs, they still appear to rely, at least partially, on keyword matching to retrieve the appropriate HPO terms.

Table 4.

Error categories for the winning system.

| Error type | Count (% [n]) | Example |
|------------|---------------|---------|
| a. Hypernym predicted | 20% [21] | HANDS FEET: hypoplastic 3rd toe digits bilaterally [Short 3rd toe—HP:0005643 labelled, short toe—HP:0001831 predicted] |
| b. Descriptive gold true span | 17.1% [18] | EXTREMITIES: wind swept hands [Ulnar deviation of finger—HP:0009465 labelled, missed prediction] |
| c. Misplaced attention | 8.6% [9] | NEUROLOGIC: abnormal gait, wide based [Broad-based gait—HP:0002136 labelled, Gait disturbance—HP:0001288 predicted] |
| d. Complex mention requiring inference | 6.7% [7] | EYES: subjectively narrow palpebral fissures, horizontal eyebrows [Horizontal eyebrows—HP:0011228 labelled, missed prediction] |
| e. Imperfect ontology | 4.8% [5] | GENITALIA: enlarged scrotum; no hernia palpated [Abnormal scrotum morphology—HP:0000045 labelled, missed prediction] |
| f. Misspelling | 2.9% [3] | HANDS FEET: clinodactyly, index fingers curived [Clinodactyly of the 2nd finger—HP:0040022 labelled, missed prediction] |
| g. Negation unsolved | 2.9% [3] | EXTREMITIES: unable to hyperextend elbows and knees beyone 10 degrees [Limited elbow extension—HP:0001377 predicted] |
| h. Unknown reason | 6.7% [7] | CHEST: thorax asymmetry. Pectus carinatum [Asymmetry of the thorax—HP:0001555 labelled, missed prediction] |
| i. Annotation error | 19% [20] | |
| j. Contentious annotation | 11.4% [12] | |
| Total | 105 | |

Less frequent errors (line c, n = 9) result from a shift in the model’s attention, causing it to focus on phrases outside the correct annotation span and to overemphasize unrelated words. For instance, in the observation abnormal gait, wide based, the model focused on the word abnormal, leading it to incorrectly predict Gait disturbance (HP:0001288) instead of Broad-based gait (HP:0002136). Other infrequent errors (line d, n = 7) occur when term normalization requires logical inference. For example, while horizontal eyebrows may not typically indicate an abnormality, within a physical exam context the phrase signals a key finding corresponding to Horizontal eyebrows (HP:0011228). Most of the remaining error categories each account for less than 5% of errors; the exceptions are the errors with no identifiable cause (line h) and the annotation issues discussed below (lines i and j).

Despite our best efforts, we discovered several annotation problems while reviewing discrepancies between the model’s output and the gold standard. Specifically, 12 contentious annotations (line j in Table 4) involved terms such as Scarring (HP:0100699) or Tube feeding (HP:0033454), which were inconsistently annotated. Since these terms may not be indicative of genetic conditions in some contexts, they could arguably be disregarded. The 20 errors in line i were outright annotation mistakes, highlighting the inherent complexity of this task. We will upload an updated version of the test set to CodaLab as soon as it becomes available.

Conclusion

In this paper, we presented the results of BioCreative VIII Task 3, which challenged participants to extract and normalize key findings in 3136 observations from dysmorphology examinations. Given an observation, the task consisted of detecting the spans of all key findings mentioned and returning the list of their corresponding IDs in the HPO. All five competing systems addressed the task with a pipeline architecture, in which key findings were extracted first and then normalized independently. Most systems (Teams 1, 3, and 4) prompted LLMs to list the spans of key findings in observations. Participants used a wider range of methods for normalization. All started with a default exact-matching step, which paired obvious candidates with their corresponding HPO terms. For more complex candidates, machine learning was employed, ranging from prompting LLMs to suggest the corresponding HPO preferred terms (Team 1), to identifying the closest HPO term in an embedding space (Teams 3 and 4), to using conventional multi-class classifiers (Teams 2 and 5). The top-performing system (Team 1) achieved an F1 score of 0.82 when normalizing the terms, which closely matches human performance. However, this high performance was obtained on a relatively straightforward dataset of dysmorphology examination observations, characterized by well-structured, short sentences with clearly labelled organ systems, which we manually extracted from the EHR. The broader task of detecting and normalizing HPO terms in general EHR notes is significantly more challenging, as indicated by our preliminary experiments, which show a notable drop in performance on these less structured and more complex mentions.
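
To make the two-stage normalization strategy concrete, the sketch below pairs exact lexical matching with an embedding-space fallback. It assumes the sentence-transformers library, a toy three-term HPO dictionary, and a general-purpose encoder; these choices are illustrative only and do not reflect the configuration of any submitted system.

```python
# Sketch of the exact-match-then-embedding normalization strategy described above.
# The HPO dictionary and the model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

# Toy subset of HPO: lower-cased preferred label -> HPO ID
hpo_terms = {
    "broad-based gait": "HP:0002136",
    "gait disturbance": "HP:0001288",
    "short toe": "HP:0001831",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder
labels = list(hpo_terms)
label_embeddings = model.encode(labels, convert_to_tensor=True, normalize_embeddings=True)

def normalize(candidate: str) -> str:
    """Map an extracted key-finding span to an HPO ID."""
    key = candidate.lower().strip()
    if key in hpo_terms:  # stage 1: exact lexical match
        return hpo_terms[key]
    # Stage 2: nearest HPO label by cosine similarity in the embedding space
    candidate_embedding = model.encode(key, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(candidate_embedding, label_embeddings)
    return hpo_terms[labels[int(scores.argmax())]]

print(normalize("gait disturbance"))           # exact match -> HP:0001288
print(normalize("abnormal gait, wide based"))  # falls through to the embedding stage
```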

Conflict of interest

None declared.

Funding

I.M.C. was supported by grant K08-HD111688 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development. G.G.H. and D.W. were partially supported by grant R01LM011176 from the National Library of Medicine, and by grant R01AI164481 from the National Institute of Allergy and Infectious Diseases.

References

1. Köhler S, Gargano M, Matentzoglu N et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 2021;49:D1207–17.

2. Liu C, Peres Kury FS, Li Z et al. Doc2Hpo: a web application for efficient and accurate HPO concept curation. Nucleic Acids Res. 2019;47:W566–70.

3. Arbabi A, Adams DR, Fidler S et al. Identifying clinical terms in medical text using ontology-guided machine learning. JMIR Med Inform. 2019;7:e12596.

4. Luo L, Yan S, Lai PT et al. PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology. Bioinformatics. 2021;37:1884–90.

5. Lee J, Yoon W, Kim S et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–40.

6. Feng Y, Qi L, Tian W et al. PhenoBERT: a combined deep learning method for automated recognition of Human Phenotype Ontology. IEEE/ACM Trans Comput Biol and Bioinf. 2023;20:1269–77.

7. US National Library of Medicine—National Center for Biotechnology Information. BioCreative. https://www.ncbi.nlm.nih.gov/research/bionlp/biocreative (5 February 2025, date last accessed).

8. Xu C, Chen W, Peng W et al. Large language models for generative information extraction: a survey. Front Comput Sci. 2024;18:186357.

9. https://www.epic.com/about/ (25 February 2025, date last accessed).

10. Weissenbacher D, Rawal S, Zhao X et al. PhenoID, a language model normalizer of physical examinations from genetics clinical notes. 2023. http://medrxiv.org/lookup/doi/10.1101/2023.10.16.23296894 (13 May 2024, date last accessed).

11. Yuan S, Yang D, Liang J et al. Generative entity typing with curriculum learning. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022, 3061–73.

12. Pavao A, Guyon I, Letournel AC et al. CodaLab Competitions: an open source platform to organize scientific challenges. J Mach Learn Res. 2023;24:1–6.

13. Soysal E, Roberts K. UTH-Olympia@BC8 Track 3: Adapting GPT-4 for Entity Extraction and Normalizing Responses to Detect Key Findings in Dysmorphology Physical Examination Observations. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 12 November 2023. https://zenodo.org/records/10104725 (13 May 2024, date last accessed).

14. Qi J, Luo L, Yang Z et al. DUTIR-BioNLP@BC8 Track 3: Genetic Phenotype Extraction and Normalization with Biomedical Pre-trained Language Models. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 12 November 2023. https://zenodo.org/records/10104756 (13 May 2024, date last accessed).

15. Kim H, Kim C, Sohn J et al. KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity Recognition and Normalization for Dysmorphology Physical Examination Reports. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 12 November 2023. https://zenodo.org/records/10104804 (13 May 2024, date last accessed).

16. Alhassan A, Schlegel V, Aloud M et al. DiscHPO@BC8 Track 3: Recognising and Normalising Continuous and Discontinuous Genetic Phenotypes Using T5 Variants and Sentence-Transformers Models. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 12 November 2023. https://zenodo.org/records/10104869 (13 May 2024, date last accessed).

17. Lin YJ, Feng ZQ, Kao HY. IKMLab@BC8 Track 3: Sequence Tagging for Position-Aware Human Phenotype Extraction with Pre-trained Language Models. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 12 November 2023. https://zenodo.org/records/10104936 (13 May 2024, date last accessed).

18. OpenAI, Achiam J, Adler S, Agarwal S et al. GPT-4 Technical Report. 2024. http://arxiv.org/abs/2303.08774 (13 May 2024, date last accessed).

19. Honnibal M, Montani I. spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Sentometrics Research. 2017. https://sentometrics-research.com/publication/72/ (13 May 2024, date last accessed).

20. Li J, Fei H, Liu J et al. Unified named entity recognition as word-word relation classification. AAAI. 2022;36:10965–73.

21. Fang L, Chen Q, Wei CH et al. Bioformer: an efficient transformer language model for biomedical text mining. 2023. http://arxiv.org/abs/2302.01588 (13 May 2024, date last accessed).

22. Yasunaga M, Leskovec J, Liang P. LinkBERT: Pretraining Language Models with Document Links. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin: Association for Computational Linguistics. 2022, 8003–16. https://aclanthology.org/2022.acl-long.551 (13 May 2024, date last accessed).

23. Alrowili S, Shanker V. BioM-Transformers: building large biomedical language models with BERT, ALBERT and ELECTRA. In: Proceedings of the 20th Workshop on Biomedical Language Processing. Association for Computational Linguistics. 2021, 221–27. https://www.aclweb.org/anthology/2021.bionlp-1.24 (13 May 2024, date last accessed).

24. Alsentzer E, Murphy J, Boag W et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, MN: Association for Computational Linguistics. 2019, 72–78. http://aclweb.org/anthology/W19-1909 (13 May 2024, date last accessed).

25. Gu Y, Tinn R, Cheng H et al. Domain-specific language model pretraining for biomedical Natural Language Processing. ACM Trans Comput Healthcare. 2022;3:1–23.

26. Taylor N, Zhang Y, Joyce DW et al. Clinical prompt learning with frozen language models. IEEE Trans Neural Netw Learning Syst. 2024, 1–11.

27. Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. In: Inui K, Jiang J, Ng V, Wan X (eds), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics. 2019, 3615–20. https://aclanthology.org/D19-1371 (13 May 2024, date last accessed).

28. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. 2020. http://arxiv.org/abs/1904.05342 (13 May 2024, date last accessed).

29. Liu F, Shareghi E, Meng Z et al. Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics. 2021, 4228–38. https://aclanthology.org/2021.naacl-main.334 (13 May 2024, date last accessed).

30. Sung M, Jeon H, Lee J et al. Biomedical entity representations with synonym marginalization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2020, 3641–50. https://www.aclweb.org/anthology/2020.acl-main.335 (13 May 2024, date last accessed).

31. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Ghahramani, Welling, Cortes, Lawrence, Weinberger (eds), Proceedings of the Advances in Neural Information Processing Systems, 27. Curran Associates, Inc., 2014.

32. Raffel C, Shazeer N, Roberts A et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21:1–67.

33. Chung HW, Hou L, Longpre S et al. Scaling instruction-finetuned language models. J Mach Learn Res. 2024;25:1–53.

34. Phan LN, Anibal JT, Tran H et al. SciFive: a text-to-text transformer model for biomedical literature. 2021. http://arxiv.org/abs/2106.03598 (13 May 2024, date last accessed).

35. Hu EJ, Shen Y, Wallis P et al. LoRA: low-rank adaptation of large language models. 2021. http://arxiv.org/abs/2106.09685 (13 May 2024, date last accessed).

36. Bird S, Klein E, Loper E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, USA: O'Reilly Media, Inc., 2009, p. 506.

37. Magge A, Tutubalina E, Miftahutdinov Z et al. DeepADEMiner: a deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter. J Am Med Inform Assoc. 2021;28:2184–92.

38. Raffel C, Shazeer N, Roberts A et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21:5485–5551.

39. Wang W, Fang T, Shi H et al. On the role of entity and event level conceptualization in generalizable reasoning: a survey of tasks, methods, applications, and future directions. 2024. https://arxiv.org/abs/2406.10885 (16 June 2024, date last accessed).

40. Feng Y, Pratapa A, Mortensen D et al. Calibrated seq2seq models for efficient and generalizable ultra-fine entity typing. In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023, 15550–60 (25 July 2025, date last accessed).

41. Kashyap AR, Nguyen T, Schlegel V et al. A comprehensive survey of sentence representations: from the BERT epoch to the ChatGPT era and beyond. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, 1738–51.

42. Alberts M, Gabrieli G, Espejo Morales I. Interleaving text and number embeddings to solve mathematics problems. In: Proceedings of The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24. 2024.

43. Gurnee W, Tegmark M. Language models represent space and time. In: Proceedings of The Twelfth International Conference on Learning Representations. 2024.

Author notes

Ian M. Campbell and Graciela Gonzalez-Hernandez contributed equally as senior authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.