PheNormGPT: a framework for extraction and normalization of key medical findings Open Access

Dataset example

Observation ID	Text	HPO ID	Spans	Polarity
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000664	14–23	–
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000527	25–36	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000218	28–39	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000159	7–18	X
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000157	7–13, 20–26	X
879246677902DE5	NEUROLOGIC: very active	NA	NA	NA

Observation ID	Text	HPO ID	Spans	Polarity
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000664	14–23	–
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000527	25–36	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000218	28–39	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000159	7–18	X
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000157	7–13, 20–26	X
879246677902DE5	NEUROLOGIC: very active	NA	NA	NA

Collected from BioCreative—Track 3: Genetic Phenotype Extraction and Normalization from Dysmorphology Physical Examination Entries (genetic conditions in pediatric patients). https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-3/. “NA” indicates data are not available.

Table 1.

Dataset example

Observation ID	Text	HPO ID	Spans	Polarity
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000664	14–23	–
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000527	25–36	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000218	28–39	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000159	7–18	X
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000157	7–13, 20–26	X
879246677902DE5	NEUROLOGIC: very active	NA	NA	NA

Observation ID	Text	HPO ID	Spans	Polarity
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000664	14–23	–
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000527	25–36	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000218	28–39	–
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000159	7–18	X
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000157	7–13, 20–26	X
879246677902DE5	NEUROLOGIC: very active	NA	NA	NA

The task organizers split this dataset into training, validation, and test sets. The task participants were provided with only the training and the validation sets. During the development of PheNormGPT, the training set was used to fine-tune the GPT models and to sample few-shot examples. Consecutively, the validation set was used to evaluate PheNormGPT, perform error analysis, and make associated improvements to its approaches. Once the development of PheNormGPT was finalized, PheNormGPT was used for inference on the test set using both training and validation sets as training data. The training and validation datasets comprised 2170 observations and 3247 key findings mapped to HPO. A summary of further statistics can be found in Table 2. Since the gold-standard annotations for the test set were not shared with the participants, corresponding annotation-related statistics are not available.

Table 2.

Dataset statistics

Dataset statistic	Training dataset	Validation dataset	Test dataset
Number of observations	1716	454	966 + 2427 decoys
Number of empty observations	205	49	–
Empty observation rate	12%	11%	–
Number of concepts	2562	685	–
Concepts per observation	1.49	1.51	–
Number of distinct concepts	708	359	–
Observations per distinct concept	2.42	1.26	–
Number of joint spans	2193	589	–
Number of disjoint spans	369	96	–
Disjoint span rate	14%	14%	–
Number of organs/body systems	20	19	20

Dataset statistic	Training dataset	Validation dataset	Test dataset
Number of observations	1716	454	966 + 2427 decoys
Number of empty observations	205	49	–
Empty observation rate	12%	11%	–
Number of concepts	2562	685	–
Concepts per observation	1.49	1.51	–
Number of distinct concepts	708	359	–
Observations per distinct concept	2.42	1.26	–
Number of joint spans	2193	589	–
Number of disjoint spans	369	96	–
Disjoint span rate	14%	14%	–
Number of organs/body systems	20	19	20

Table 2.

Dataset statistics

Dataset statistic	Training dataset	Validation dataset	Test dataset
Number of observations	1716	454	966 + 2427 decoys
Number of empty observations	205	49	–
Empty observation rate	12%	11%	–
Number of concepts	2562	685	–
Concepts per observation	1.49	1.51	–
Number of distinct concepts	708	359	–
Observations per distinct concept	2.42	1.26	–
Number of joint spans	2193	589	–
Number of disjoint spans	369	96	–
Disjoint span rate	14%	14%	–
Number of organs/body systems	20	19	20

Dataset statistic	Training dataset	Validation dataset	Test dataset
Number of observations	1716	454	966 + 2427 decoys
Number of empty observations	205	49	–
Empty observation rate	12%	11%	–
Number of concepts	2562	685	–
Concepts per observation	1.49	1.51	–
Number of distinct concepts	708	359	–
Observations per distinct concept	2.42	1.26	–
Number of joint spans	2193	589	–
Number of disjoint spans	369	96	–
Disjoint span rate	14%	14%	–
Number of organs/body systems	20	19	20

Updated dataset

Several updates were applied to the original dataset during the development of PheNormGPT to promote consistency across different observation annotations. There are two primary reasons for making such updates to the original dataset. Firstly, this allowed for increasing consistency for spans annotated for a given concept, which would enhance the performance for predicting spans. For example, there were occasional variances between the inclusion of modifiers such as “mild,” “slightly,” and “prominent.” The first and second rows of Table 3 show one such example, where the annotated entity boundaries are marked with square brackets. These cases were corrected to consistently exclude said modifiers from any entity boundary, as outlined in the task annotation guidelines.

Table 3.

Dataset inconsistency examples

Inconsistency type	Observation ID	Exact annotated observation	HPO ID
Span	8bb2b415c3263e47b2303bf45efec958	EYES: Prominent [infraorbital creases]	HP:0100876
Span	001d6c29f4e6ab6d37e2a4b0b84db25c	EYES: [Prominent infraorbital creases]	HP:0100876
HPO Concept	5d1c983da5c12cd593edc75ee9548f76	NECK: [Excess nuchal skin]	HP:0005989
HPO Concept	a23e72ed8d70c59e91ad638f4329ff14	NECK: [excess nuchal skin] and	HP:0000474

Inconsistency type	Observation ID	Exact annotated observation	HPO ID
Span	8bb2b415c3263e47b2303bf45efec958	EYES: Prominent [infraorbital creases]	HP:0100876
Span	001d6c29f4e6ab6d37e2a4b0b84db25c	EYES: [Prominent infraorbital creases]	HP:0100876
HPO Concept	5d1c983da5c12cd593edc75ee9548f76	NECK: [Excess nuchal skin]	HP:0005989
HPO Concept	a23e72ed8d70c59e91ad638f4329ff14	NECK: [excess nuchal skin] and	HP:0000474

Table 3.

Dataset inconsistency examples

Inconsistency type	Observation ID	Exact annotated observation	HPO ID
Span	8bb2b415c3263e47b2303bf45efec958	EYES: Prominent [infraorbital creases]	HP:0100876
Span	001d6c29f4e6ab6d37e2a4b0b84db25c	EYES: [Prominent infraorbital creases]	HP:0100876
HPO Concept	5d1c983da5c12cd593edc75ee9548f76	NECK: [Excess nuchal skin]	HP:0005989
HPO Concept	a23e72ed8d70c59e91ad638f4329ff14	NECK: [excess nuchal skin] and	HP:0000474

Inconsistency type	Observation ID	Exact annotated observation	HPO ID
Span	8bb2b415c3263e47b2303bf45efec958	EYES: Prominent [infraorbital creases]	HP:0100876
Span	001d6c29f4e6ab6d37e2a4b0b84db25c	EYES: [Prominent infraorbital creases]	HP:0100876
HPO Concept	5d1c983da5c12cd593edc75ee9548f76	NECK: [Excess nuchal skin]	HP:0005989
HPO Concept	a23e72ed8d70c59e91ad638f4329ff14	NECK: [excess nuchal skin] and	HP:0000474

Secondly, the updates also increase consistency for concepts mapped to a given entity. Such annotation errors may arise from multiple concepts existing in HPO that are close matches for a single entity. These cases were resolved by looking at the entire dataset, picking the most consistent concept assigned to a given entity, and updating all annotations of said entity in the entire dataset. Resolving such inconsistencies also reduces conflicts with PheNormGPT’s dictionary matching approaches, which will be discussed in the following sections. The third and fourth rows of Table 3 show such an example. In this example, HPO ID HP:0005989 corresponds to a concept with the preferred term “redundant neck skin” with a synonym of “excess neck skin,” and HPO ID HP:0000474 corresponds to a concept with the preferred term “thickened nuchal skin fold” with a synonym of “excess nuchal skin.”

Both types of inconsistencies detailed above were detected using automatic methods. Afterward, they were manually reviewed and updated based on similar annotations in the rest of the dataset and annotation guidelines. Such updates allow PheNormGPT to rely better on example-based approaches with LLMs, reducing the possibility of data suggesting inconsistencies across desired annotation behavior.

In the end, 24 spans and 19 HPO concept IDs out of 2770 concept annotations in the training dataset, and 8 spans and 6 HPO concept IDs out of 734 concept annotations in the validation dataset were updated. In addition to this, four concept annotation additions and one concept annotation merge in the training set, and one concept annotation addition and one concept annotation merge in the validation set were performed, where a concept merge entails combining two co-existing concepts HP:0000369 (low-set ears) and HP:0000358 (posteriorly rotated ears) into a single concept HP:0000368 (low-set and posteriorly rotated ears).

Furthermore, 329 concept annotations from the training set and 78 concept annotations from the validation set that recorded normal findings were removed from the dataset to maximize performance for finding abnormal findings. Despite their presence in the dataset, normal findings are not a part of the evaluation process in this shared task, and therefore, they were not utilized to avoid providing undesired information to the models during training.

Prompt engineering strategy

PheNormGPT relies on LLMs, namely interactions with GPT-3.5 and GPT-4, to extract key findings to normalize from the observation texts. In particular, it uses the snapshot gpt-3.5-turbo-0613 for fine-tuning and the snapshots gpt-3.5-turbo-16k-0613 and gpt-4-0613 for few-shot learning. The prompting strategies used to do so rely on effective and practical expression of information through GPT. PheNormGPT utilizes OpenAI’s ChatCompletion application programming interface (API) with the GPT models to extract named HPO entities from given observations. This setup allows for using three types of messages within a prompt: system message, user message, and assistant message. PheNormGPT utilizes each for a different purpose to best represent the task fitting the standards for LLMs and clearly distinguish each message’s roles. Firstly, it uses the system message to present task definitions and related instructions. Next, it uses the user message to input data for GPT to process, such as an observation text. Finally, it uses the assistant message containing the GPT responses as a list of extracted and normalized entities. Accordingly, PheNormGPT only submits system and user messages in an inference setting and receives outputs from models as an assistant message. A prompt template representation of this strategy is presented in Table 4.

Table 4.

System message

Extract Human Phenotype Ontology (HPO) concepts from the given text of a physical examination note into a table containing standard HPO preferred term and marked original text with square brackets surrounding the words associated with the HPO term.

User message

FACE: elongated face; flattened midface

Assistant message

In the prompt template used, the system message is a fixed, short set of instructions. It is the first message in each prompt submitted to GPT. The system message mainly specifies the task’s context and setting rather than providing the complete set of instructions or defining how GPT should behave by itself. Most instructions and specifics about how GPT should behave when extracting entities are conveyed through examples instead. These examples are supplied to GPT using fine-tuning or few-shotting strategies, which will be discussed in the following sections.

Consecutively, the user message consists of a single unannotated observation text from which GPT should extract and normalize key medical findings. In return, the assistant message corresponding to the user message includes relevant information that PheNormGPT would need to facilitate concept extraction and normalization. As a result, GPT has to extract a list of entities, and for each entity, it has to identify a corresponding HPO concept and span of that entity in the observation text. Using this setup, the model would be able to extract a list of entities, along with their corresponding preferred HPO concepts from the observation text.

To satisfy these requirements, PheNormGPT makes GPT return the list of extracted and normalized entities following the format provided in the table within the assistant message. This table includes headers and a row for each entity found in the text given to GPT. It consists of two columns: preferred HPO term and marked observation text. An example of such a table can be seen in Table 4. In the example, the given observation has two extracted entities and two corresponding concepts. For the first concept, the preferred HPO term is “long face,” and the text “elongated face,” the observed term, is marked from the observation via square brackets. For the second concept, the preferred HPO term is “midface retrusion,” and the observed term “flattened midface” is marked in the text accordingly.

The first column, which is the preferred HPO term, is utilized for normalization purposes. This value is a substitute for HPO concept IDs, which can uniquely identify concepts within HPO. However, GPT models are inconsistent when identifying HPO IDs for a given entity. While they correctly identify HPO concept IDs for commonly observed concepts such as hypertension when directly asked, this behavior does not generalize across most other HPO concepts, and they often hallucinate an incorrect HPO concept ID instead. This behavior is shown to be generalized across other types of unique identifiers [26]. Despite this, GPT models demonstrate an understanding of what ontologies are, what HPO is, and approximately what kinds of concepts would be present in HPO. Therefore, instead of requesting a unique identifier code to identify an HPO concept, PheNormGPT requests the preferred HPO term for each entity. This is a roundabout approach for GPT to produce information that can lead to identifying an HPO concept. How PheNormGPT uses the preferred HPO term in normalization will be discussed in the following sections.

The second column, marked text, is utilized primarily for span identification. This column includes the original text from which GPT should extract entities, with one difference: marking of the entity. For each row, this column marks the entity in the original text with square brackets, as demonstrated in Table 4. The spans for each entity can be parsed using this column’s value, where the beginning of the entity span is where the first square bracket starts, and the end of the entity span is where the last square bracket ends. Additionally, this format supports both continuous and discontinuous spans, as disjoint sections can be surrounded with square brackets separately. To further support this format, PheNormGPT pre-processes the input text by replacing all pre-existing square bracket characters with round brackets before sending requests to GPT so that square bracket characters are exclusively used for identifying entity spans.

The marked text approach is an alternative to directly retrieving character offset numbers from the GPT models. However, as a generative model, they are unreliable in returning this information and prone to hallucinating false span offsets. More research in this domain has come to similar conclusions and uses alternative methods to mark text in similar settings, such as marking text with hashtags [27].

Learning strategies

PheNormGPT heavily relies on examples to guide the GPT model in comprehending the desired output effectively. This is achieved by implementing two different learning strategies for LLMs: fine-tuning and few-shot learning. Fine-tuning involves adjusting the parameters of a pre-trained model specifically for a new task, allowing the model to adapt its knowledge to the nuances of the task at hand. This process requires a dataset related to the target task, through which the model learns the specifics of the task by adjusting its weights accordingly. On the other hand, few-shot learning leverages the model’s existing knowledge and capabilities by providing it with a small number of examples at inference time, demonstrating what is expected of it. This method relies on the model’s ability to generalize from these examples to perform the task without the need for extensive retraining.

Fine-tuning strategy

PheNormGPT’s fine-tuning approach was only applied to GPT-3.5-Turbo, as fine-tuning GPT-4 was not available during the development of this framework. Figure 1 presents a visualization of the fine-tuning approach.

Figure 1.

Fine-tuning workflow.

The fine-tuning approach consists of creating a prompt dataset and fine-tuning a model via OpenAI API. The prompt dataset consists of one prompt per every observation in the dataset, where each prompt includes three messages, as described in the previous section. Fine-tuning is performed by uploading the prompt dataset and initiating a fine-tuning job through the API. Once a model is fine-tuned, inference is performed on it by submitting prompts consisting of a system and a user message, and receiving an assistant message back that includes a table of entities as previously described, which will be later post-processed for normalization.

Few-shot learning strategy

PheNormGPT also experiments with a few-shot learning strategy as an alternative to the fine-tuning strategy. This strategy uses semantically relevant examples to best characterize the behavior desired for each specific example. This is done by utilizing a custom few-shot example selection algorithm, visualized in Fig. 2.

Figure 2.

Few-shot learning workflow.

The max token limit for GPT-4 during the development of PheNormGPT allowed for 8k tokens per request. In consideration of this limit, 25 few-shot examples are included per request. These few-shot examples are inserted between the system message and the final user message, where the final user message is the text on which GPT is asked to extract and normalize entities. Each few-shot example consists of a user message, which includes an observation text similar to the final user message, and an assistant message, which includes the table of entities in the corresponding observation text. These 25 few-shot examples are selected in a custom-tailored manner per observation text to perform inference.

PheNormGPT generates a list of few-shot examples using three steps for each request. In the first step, PheNormGPT selects observations from the training set that are most similar to the final observation text of the request and includes at least one concept annotation. spaCy’s document similarity is used as the similarity function, which performs a cosine similarity using an average of word vectors [28]. A total of 15 out of 25 examples are selected using this method.

For the second step, PheNormGPT appends a manually selected list of five examples per body part to include “tricky” examples to the few-shot examples list. These examples mainly aim to guide GPT through BioCreative VIII Track 3’s specific task requirements that are not always followed when described in the system prompt. Such requirements include correctly determining span offsets for examples that require tagging the header body part in each observation or excluding modifiers such as “slightly” or “mild” per annotation guidelines. Table 5 shows examples that would be eligible for this category.

Table 5.

Examples demonstrating task-specific behavior

Behavior description	Example annotated observation
Disjoint span	EARS: [Low-set] left [ear]
Modifier exclusion	EYES: Mildly [low-set ear]
Organ header annotation	[EAR]: [crumpled]
Organ header annotation	EAR: [crumpled ear]
All of the above	[NOSE]: slightly [bulbous]

Behavior description	Example annotated observation
Disjoint span	EARS: [Low-set] left [ear]
Modifier exclusion	EYES: Mildly [low-set ear]
Organ header annotation	[EAR]: [crumpled]
Organ header annotation	EAR: [crumpled ear]
All of the above	[NOSE]: slightly [bulbous]

Table 5.

Examples demonstrating task-specific behavior

Behavior description	Example annotated observation
Disjoint span	EARS: [Low-set] left [ear]
Modifier exclusion	EYES: Mildly [low-set ear]
Organ header annotation	[EAR]: [crumpled]
Organ header annotation	EAR: [crumpled ear]
All of the above	[NOSE]: slightly [bulbous]

Behavior description	Example annotated observation
Disjoint span	EARS: [Low-set] left [ear]
Modifier exclusion	EYES: Mildly [low-set ear]
Organ header annotation	[EAR]: [crumpled]
Organ header annotation	EAR: [crumpled ear]
All of the above	[NOSE]: slightly [bulbous]

In the third step, PheNormGPT adds the five most similar negative examples from the training set to the final observation text of the request to the few-shot examples list. A negative example refers to an observation that does not include any key findings and, therefore, has no entities to extract. This step ensures that GPT understands it has the option to extract no entities if appropriate and shows it the format to use when returning a response with no entities for parsability.

Once PheNormGPT compiles the list of 25 few-shot examples through the three steps, it shuffles the list to randomize the distribution of the examples. Inference is performed by submitting the combined prompt to GPT via OpenAI API and receiving an assistant message back that includes a table of entities, as previously described.

Concept normalization

PheNormGPT follows a two-step normalization process for determining HPO concepts matching the entities identified by GPT in the table it returns, visualized in Fig. 3.

Figure 3.

Normalization workflow.

PheNormGPT utilizes a dictionary that maps term names to HPO concept IDs for both steps. This dictionary is constructed from the terms and the synonyms from HPO and the annotated terms mapped to HPO concept IDs in the training set. For BioCreative VIII Track 3, 5219 HPO concepts, such as modifier concepts and a list of unobservable concepts shared by the task organizers, are excluded from this dictionary. Additionally, stemming and case normalization are applied when attempting a match with this dictionary to increase dictionary matching performance.

As mentioned in the previous sections, the table returned by GPT includes two terms to potentially normalize per entity. The first one is the returned preferred HPO term. The second one is the observed term, which is the form in which the entity is present in within the given text. In other words, it is the mention of that entity in text, or the term marked using the square brackets. An example illustrating the two terms is presented in Fig. 4. In the first step of the normalization process, PheNormGPT tries to normalize the observed term as it appears in the text via dictionary matching. If there is a match, the entity is assigned the matching HPO concept, and the process is complete. Otherwise, PheNormGPT moves to the second step of the normalization process. In the second step, it tries to normalize the preferred HPO term returned by GPT, using dictionary matching again. If there is a match, the entity is assigned the matching HPO concept, and the process is complete. Otherwise, PheNormGPT fails to normalize that entity and does not record any HPO concepts for it.

Figure 4.

Illustration of terms extracted from the GPT response.

In summary, the first step involves simple dictionary matching using observed terms through the dataset and known terms through HPO, while the second step employs dictionary matching using preferred terms recommended by GPT.

Results

In the BioCreative VIII Track 3 evaluation step, PheNormGPT was evaluated on the test corpus of the shared task. For this step, task organizers only shared the observation texts with the participants and did not include the gold-standard annotations. The evaluation was conducted across three settings. The normalization only setting checked the performance of HPO identifier prediction. The strict extraction and overlapping extraction settings checked both the performance of HPO identifier prediction and entity boundary overlaps via strict and overlapping matches, respectively. Across multiple submissions, PheNormGPT received a maximum F1 score of 0.82 in the normalization only setting, and a maximum F1 score of 0.72 in the strict extraction and normalization setting.

Submissions to the test set evaluation and their scores are listed in Table 6. These submissions combine various techniques described in the “Methods” section and experiment with different source models, including GPT-3.5 Turbo and GPT-4, as well as learning techniques such as fine-tuning and few-shot learning. A combination of GPT-4 and fine-tuning is not present due to the lack of GPT-4 fine-tuning availability via OpenAI API during the development. The role of the normalization targets is also explored, where each option is attempted with normalization of both the observed term and the predicted preferred HPO term, and with normalization of only the observed term.

Table 6.

Performance on test set

			Normalization only			Overlapping extraction and normalization			Strict extraction and normalization
Model	Learning technique	Normalization target	P	R	F1	P	R	F1	P	R	F1
GPT-3.5 Turbo	Fine-tuning	Observed term and predicted preferred term	0.8142	0.7488	0.7801	0.8134	0.7448	0.7776	0.7909	0.6463	0.7113
GPT-3.5 Turbo	Fine-tuning	Only observed term	0.9129	0.5747	0.7054	0.9127	0.5731	0.7041	0.9078	0.5397	0.6770
GPT-3.5 Turbo	Few-shot Learning	Observed term and predicted preferred term	0.8432	0.6884	0.7580	0.8426	0.6852	0.7558	0.8041	0.5254	0.6356
GPT-3.5 Turbo	Few-shot Learning	Only observed term	0.9103	0.4921	0.6388	0.9100	0.4905	0.6374	0.9016	0.4444	0.5953
GPT-4	Few-shot Learning	Observed term and predicted preferred term	0.8417	0.7989	0.8197	0.8409	0.7941	0.8168	0.8104	0.6423	0.7166
GPT-4	Few-shot Learning	Only observed term	0.9363	0.5604	0.7011	0.9361	0.5588	0.6999	0.9313	0.5175	0.6653

			Normalization only			Overlapping extraction and normalization			Strict extraction and normalization
Model	Learning technique	Normalization target	P	R	F1	P	R	F1	P	R	F1
GPT-3.5 Turbo	Fine-tuning	Observed term and predicted preferred term	0.8142	0.7488	0.7801	0.8134	0.7448	0.7776	0.7909	0.6463	0.7113
GPT-3.5 Turbo	Fine-tuning	Only observed term	0.9129	0.5747	0.7054	0.9127	0.5731	0.7041	0.9078	0.5397	0.6770
GPT-3.5 Turbo	Few-shot Learning	Observed term and predicted preferred term	0.8432	0.6884	0.7580	0.8426	0.6852	0.7558	0.8041	0.5254	0.6356
GPT-3.5 Turbo	Few-shot Learning	Only observed term	0.9103	0.4921	0.6388	0.9100	0.4905	0.6374	0.9016	0.4444	0.5953
GPT-4	Few-shot Learning	Observed term and predicted preferred term	0.8417	0.7989	0.8197	0.8409	0.7941	0.8168	0.8104	0.6423	0.7166
GPT-4	Few-shot Learning	Only observed term	0.9363	0.5604	0.7011	0.9361	0.5588	0.6999	0.9313	0.5175	0.6653

Table 6.

10.1136/amiajnl-2013-002428

Performance on test set

			Normalization only			Overlapping extraction and normalization			Strict extraction and normalization
Model	Learning technique	Normalization target	P	R	F1	P	R	F1	P	R	F1
GPT-3.5 Turbo	Fine-tuning	Observed term and predicted preferred term	0.8142	0.7488	0.7801	0.8134	0.7448	0.7776	0.7909	0.6463	0.7113
GPT-3.5 Turbo	Fine-tuning	Only observed term	0.9129	0.5747	0.7054	0.9127	0.5731	0.7041	0.9078	0.5397	0.6770
GPT-3.5 Turbo	Few-shot Learning	Observed term and predicted preferred term	0.8432	0.6884	0.7580	0.8426	0.6852	0.7558	0.8041	0.5254	0.6356
GPT-3.5 Turbo	Few-shot Learning	Only observed term	0.9103	0.4921	0.6388	0.9100	0.4905	0.6374	0.9016	0.4444	0.5953
GPT-4	Few-shot Learning	Observed term and predicted preferred term	0.8417	0.7989	0.8197	0.8409	0.7941	0.8168	0.8104	0.6423	0.7166
GPT-4	Few-shot Learning	Only observed term	0.9363	0.5604	0.7011	0.9361	0.5588	0.6999	0.9313	0.5175	0.6653

			Normalization only			Overlapping extraction and normalization			Strict extraction and normalization
Model	Learning technique	Normalization target	P	R	F1	P	R	F1	P	R	F1
GPT-3.5 Turbo	Fine-tuning	Observed term and predicted preferred term	0.8142	0.7488	0.7801	0.8134	0.7448	0.7776	0.7909	0.6463	0.7113
GPT-3.5 Turbo	Fine-tuning	Only observed term	0.9129	0.5747	0.7054	0.9127	0.5731	0.7041	0.9078	0.5397	0.6770
GPT-3.5 Turbo	Few-shot Learning	Observed term and predicted preferred term	0.8432	0.6884	0.7580	0.8426	0.6852	0.7558	0.8041	0.5254	0.6356
GPT-3.5 Turbo	Few-shot Learning	Only observed term	0.9103	0.4921	0.6388	0.9100	0.4905	0.6374	0.9016	0.4444	0.5953
GPT-4	Few-shot Learning	Observed term and predicted preferred term	0.8417	0.7989	0.8197	0.8409	0.7941	0.8168	0.8104	0.6423	0.7166
GPT-4	Few-shot Learning	Only observed term	0.9363	0.5604	0.7011	0.9361	0.5588	0.6999	0.9313	0.5175	0.6653

Overall, few-shot learning with GPT-4 has achieved the highest F1 score, relying on both the normalization of the observed term and the normalization of the preferred term returned by GPT. Additionally, using the observed term alone results in higher precision at the cost of lower recall, which is expected as such strategies do not utilize GPT models’ preferred term prediction capabilities, which can introduce false predictions.

Discussion

As observed in the dataset updates section and the few-shot learning approach, normalization with GPT models can lead to various extraction errors and inconsistencies for multiple types of behaviors. These errors and inconsistencies were observed while evaluating the model using the validation set during the preliminary stages of PheNormGPT’s development. They encompass inaccurate span boundaries and imprecise normalization, appearing in various forms across multiple examples within all subsets of the original dataset.

First, boundary and normalization errors were observed for entities present in the training set. Such errors occasionally originate from the dataset itself, as similar behavior is demonstrated in the original dataset with variations. To remedy this behavior and ensure the model primarily follows the annotation guidelines specified in the shared task, previously described updates were applied to the provided dataset.

Another source of boundary errors corresponded to GPT models failing to follow task-specific behavior, particularly concerning disjoint spans and boundaries, including organ headers. Corrections to such behaviors are enforced via fine-tuning or few-shot learning. To ensure these behaviors are well-presented in the few-shot learning setting, a manually selected list of examples that demonstrate such behaviors are consistently included in the few-shot examples generated for inference, as discussed in the earlier sections.

One other source of errors in PheNormGPT concerns the marked text approach. Occasionally, GPT models return a slightly different text in the marked-text field than the one provided in the request. These differences are primarily the result of GPT fixing typos in the original text, such as misspelled words or missing spaces. Additionally, the presence of special characters such as “+” can cause such errors. For example, GPT models were observed to modify the input “HEAD: +cephalohematoma” to “HEAD: +cephalohematoma+” during earlier development iterations. This behavior causes complexities when parsing the character offsets and reduces the accuracy of this method for such rare cases. On occasions where an extracted concept has span offsets that would be out of bounds for the original observation text, PheNormGPT discards the extracted concept to avoid boundary errors. Future approaches can overcome this issue by expanding post-processing strategies or experimenting with prompt engineering to prevent it.

A potential area of improvement when using LLMs in this context is to ask LLMs to double-check their responses and self-correct. This was shown to increase performance in other biomedical informatics settings [29]. However, this approach doubles the time and the cost required for inference due to each request being made at least twice. Therefore, it was not a part of the experimentations for PheNormGPT.

Finally, PheNormGPT is built around OpenAI’s GPT models, which pose their own considerations. OpenAI’s API-based consumption model means that the models are deployed on OpenAI servers, and input and output data must also go through their servers. This can create concerns about data privacy and means that the approach presented in this paper is not directly applicable to non-de-identified data. This creates future research opportunities for extending PheNormGPT’s approaches to other open-source LLMs, such as LLaMA [22] and Falcon [23].

Conclusion

This manuscript introduces PheNormGPT, a framework for extracting and normalizing medical key findings to HPO, utilizing OpenAI’s GPT models in two separate approaches, combining rule-based methods, LLMs, and prompt engineering strategies. PheNormGPT is developed and evaluated on the dataset provided by BioCreative VIII Track 3 and has achieved the highest performance among the shared-task participants, highlighting the potential of LLMs in phenotypic data normalization. PheNormGPT’s approaches include a normalization prompting strategy harnessing the strengths of LLMs and requesting relevant information in a roundabout way to reduce hallucinations, and a few-shot learning algorithm that generates a specialized prompt with examples that demonstrate desired behavior in a custom and focused manner for each input. While the full potential of LLMs is not yet explored, and there may be higher performing prompting strategies in this area, PheNormGPT shows that LLMs can play a significant role in future research in this domain.

Acknowledgements

The authors would like to thank the organizers of the BioCreative VIII Track 3 challenge for their efforts in organizing the task and providing the dataset. PheNormGPT was presented at the BioCreative VIII Workshop during the AMIA Annual Symposium on 12 November 2023.

Conflict of interest

E.S. is employed as a Senior Software Engineer at EBSCO Information Services.

Funding

This work was supported by the National Institutes of Health [R21EB029575].

Data Availability

The training and validation data underlying this article are available on GitHub at https://github.com/Ian-Campbell-Lab/Clinical-Genetics-Training-Data/.

References

Pathak

Kho

Denny

Electronic health records-driven phenotyping: challenges, recent advances, and perspectives

J Am Med Inf Assoc

2013

;

e206

–

. doi:

10.1136/amiajnl-2012-000896

Newton

Peissig

Kho

et al.

Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network

J Am Med Inf Assoc

2013

;

e147

–

. doi:

10.1186/s12911-019-0786-z

Sharma

Mao

Zhang

et al.

Developing a portable natural language processing based phenotyping system

BMC Med Inform Decis Mak

2019

;

–

. doi:

Zeng

Deng

et al.

Natural language processing for EHR-based computational phenotyping

IEEE/ACM Trans Comput Biol Bioinform

2018

;

139

–

. doi:

10.1109/TCBB.2018.2849968

Bodenreider

The unified medical language system (UMLS): integrating biomedical terminology

Nucleic Acids Res

2004

;

D267

–

. doi:

Köhler

Gargano

Matentzoglu

et al.

The human phenotype ontology in 2021

Nucleic Acids Res

2021

;

D1207

–

. doi:

Bashyam

Divita

Bennett

et al.

A normalized lexical lookup approach to identifying UMLS concepts in free text

Stud Health Technol Inform

2007

;

129

:545.

Adamusiak

Shimoyama

Next generation phenotyping using the unified medical language system

JMIR Med Inform

2014

;

:e3172. doi:

10.2196/medinform.3172

10.1016/j.jbi.2019.103132

Winnenburg

Bodenreider

Coverage of phenotypes in standard terminologies

Joint Bio-Ontologies and BioLINK ISMB

2014

–

10.

Soysal

Wang

Jiang

et al.

CLAMP–a toolkit for efficiently building customized clinical natural language processing pipelines

J Am Med Inf Assoc

2018

;

331

–

. doi:

11.

Luo

Sun

Rumshisky

MCN: a comprehensive corpus for medical concept normalization

J Biomed Informat

2019

;

:103132. doi:

12.

Zhu

Nguyen

Sid

et al.

Leveraging the UMLS as a data standard for rare disease data normalization and harmonization

Methods Inf Med

2020

;

131

–

. doi:

10.1055/s-0040-1718940

13.

Campillos-Llanos

Valverde-Mateos

Capllonch-Carrión

et al.

A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

BMC Med Inform Decis Mak

2021

;

–

PubMed

14.

Newbury

Liu

Idnay

et al.

The suitability of UMLS and SNOMED-CT for encoding outcome concepts

J Am Med Inf Assoc

2023

;

1895

–

903

. doi:

10.1093/jamia/ocad161

10.1038/s41436-018-0381-1

15.

Aronson

Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program

. In: Proceedings of the AMIA Symposium, p. 17.

Washington, DC, USA

American Medical Informatics Association

2001

16.

Liu

Peres Kury

et al.

Doc2Hpo: a web application for efficient and accurate HPO concept curation

Nucleic Acids Res

2019

;

W566

–

. doi:

17.

Deisseroth

Birgmeier

Bodle

et al.

ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis

Genet Med

2019

;

1585

–

. doi:

18.

Arbabi

Adams

Fidler

et al.

Identifying clinical terms in medical text using ontology-guided machine learning

JMIR Med Inform

2019

;

:e12596. doi:

10.2196/12596

10.1093/bioinformatics/btab019

19.

Luo

Yan

Lai

et al.

PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology

Bioinformatics

2021

;

1884

–

. doi:

20.

Feng

Tian

PhenoBERT: a combined deep learning method for automated recognition of human phenotype ontology

IEEE/ACM Trans Comput Biol Bioinform

2022

;

1269

–

. doi:

10.1109/TCBB.2022.3170301

10.1371/journal.pdig.0000198

21.

Radford

Narasimhan

Salimans

et al.

Improving language understanding by generative pre-training

2018

. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (30 July 2024, date last accessed).

22.

Touvron

Martin

Stone

et al.

Llama 2: open foundation and fine-tuned chat models

arXiv preprint

. arXiv:2307.09288,

2023

23.

Almazrouei

Alobeidli

Alshamsi

et al.

The falcon series of language models: towards open frontier models

Hugging Face Repository

2023

24.

Kung

Cheatham

Medenilla

et al.

Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models

PLoS Digital Health

2023

;

:e0000198. doi:

10.1371/journal.pcbi.1011319

25.

Achiam

Adler

Agarwal

et al.

Gpt-4 technical report

arXiv preprint

. arXiv:2303.08774,

2023

26.

Lubiana

Lopes

Medeiros

et al.

Ten quick tips for harnessing the power of ChatGPT in computational biology

PLoS Comput Biol

2023

;

:e1011319. doi:

10.1016/j.ebiom.2023.104770

27.

Wang

Sun

et al.

Gpt-ner: named entity recognition via large language models

arXiv preprint

. arXiv:2304.10428,

2023

28.

Honnibal

, and

Montani

spaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing

. In: Proceedings of the Association for Computational Linguistics (ACL), pp.

688

–

Vancouver, Canada

Neural Machine Translation

2017

29.

Lim

Pushpanathan

Yew

SME

et al.

Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

EBioMedicine

2023

;

:104770. doi: