Pipeline to explore information on genome editing using large language models and genome editing meta-database

The sections marked with {} indicate that different text is inserted for each LLM processing with an article.

——————————————————
Extract key genome editing related data from the provided ABSTRACT and TITLE, referencing the predicted genes and species ({GENES from GEM} by {SPECIES from GEM}) involved in this research. Analyze the ABSTRACT to:
1. categorize the genes into ‘targeted gene of genome editing’ or ‘differentially expressed gene by genome editing’.
2. confirm the species or organisms studied.
3. describe the genome editing event (e.g. knockout, knockin, knockdown, frameshift, SNP, expression modulation etc) and specify the genome editing tools used (e.g. CRISPR-Cas9, TALEN, Prime editor, etc).
If information on any of the above points is not provided in the text, state ‘Not mentioned’.
Summarize findings in this JSON format:
{{
‘targeted_genes’: [], // List of targeted genes of genome editing
‘differentially_expressed_genes’: [], // List of genes altered expression by genome editing
‘species’: [], // List of species or organisms studied with genome editing
‘genome_editing_tools’: [], // List of genome editing tools used
‘genome_editing_event’: [], // List of editing events described
‘study_context’: ““, // Study context in short one sentence
‘key_findings’: ““, // Key findings in short one sentence
‘implications’: ““, // Implications in short one sentence
}}
Please fill out the JSON structure based on the information from the research ABSTRACT and TITLE provided below:
Abstract:
{ABSTRACT from PubMed Central}
Title:
{TITLE from PubMed Central}
——————————————————

Table 1.

Table 2.

The sections marked with {} indicate that different text is inserted for each LLM processing with an article.

——————————————————
Extract key genome editing related information from the provided texts (ABSTRACT, TITLE, METHODS, and RESULTS of the research article), referencing the predicted genes, species, and genome editing tools (GENES: {GENES from GEM} by {SPECIES from GEM}, genome editing tools:{genome editing tools from GEM}) involved in this research.
Analyze the provided texts to:
1. identify the targeted genes of genome editing
2. identify the differentially expressed genes by genome editing
3. confirm the species or organisms studied using genome editing
4. identify the genome editing events (e.g. knockout, knockin, knockdown, frameshift, SNP, expression modulation etc)
5. specify the genome editing tools used (e.g. CRISPR-Cas9, TALEN, Prime editor, etc).
6. identify phenotypes observed using genome editing in the study
If information on any of the above points is not provided in the text, state ‘Not mentioned’.
If genome editing is not mentioned in the text, state ‘Not mentioned’.
Summarize findings in this JSON format:
{{
‘targeted_genes’: [], // List of targeted genes of genome editing
‘differentially_expressed_genes’: [], // List of differentially expressed genes by genome editing
‘species’: [], // List of species or organisms studied using genome editing
‘genome_editing_tools’: [], // List of genome editing tools used
‘genome_editing_event’: [], // List of editing events identified
‘phenotypes’: [] // List of phenotypes observed using genome editing
}}
Please fill out the JSON structure based on the information from the texts (ABSTRACT, TITLE, METHODS, and RESULTS of the research article) provided below:
TITLE:
{TITLE from PubMed Central}
ABSTRACT:
{ABSTRACT from PubMed Central}
METHODS:
{METHODS from PubMed Central}
RESULTS:
{RESULTS from PubMed Central}
——————————————————

In the second round of experiments, we refined the prompt as follows and extracted six types of metadata: “targeted genes of GE,” “genes reported as altered expression due to GE of other target genes,” “Species studied using GE,” “GE tools used,” “GE events induced,” and “Phenotypes resulting from GE.”

In the first round of experiments, we utilized the following information: the genes and species linked to the GEM entries as well as the titles and abstracts of the articles. In the second round of experiments, we incorporated additional information, including the GE tools registered in the GEM entries and the textual content from the Materials and methods and Results sections of the article.

The outputs from the LLM were standardized into a unified JavaScript Object Notation (JSON) format using the “output_parser” function from the LangChain library. Four LLMs were tested: GPT-4, GPT-4o, GPT-4o-mini, and Llama3-70b. In the first round of experiments, each information extraction task was performed once using the GPT-4, GPT-4o, and Llama3-70b models. In the second round, each task was performed once using the GPT-4o and GPT-4o-mini models. Based on its accuracy, GPT-4o was selected for further use. The primary libraries and tools used to query the LLM Application Programming Interface (API) were LangChain (https://www.langchain.com/) and Groq (https://pypi.org/project/groq/).
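To illustrate what standardizing an LLM response into a fixed JSON structure can involve, the following is a minimal sketch using only the Python standard library. It is not the LangChain output parser used in the study; the function name `normalize_llm_output`, the handling of ‘Not mentioned’ as an empty list, and the treatment of missing keys are assumptions made for illustration.

```python
import json

# Keys of the second-round JSON format described in the prompt (Table 2).
EXPECTED_LIST_KEYS = [
    "targeted_genes",
    "differentially_expressed_genes",
    "species",
    "genome_editing_tools",
    "genome_editing_event",
    "phenotypes",
]


def normalize_llm_output(raw: str) -> dict:
    """Extract the JSON object from a raw LLM reply and coerce it to a fixed schema.

    Hypothetical helper: missing keys become empty lists, bare strings become
    one-element lists, and 'Not mentioned' entries are dropped (an assumption).
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM output")
    parsed = json.loads(raw[start:end + 1])

    record = {}
    for key in EXPECTED_LIST_KEYS:
        value = parsed.get(key, [])
        if isinstance(value, str):
            value = [value]
        cleaned = [str(v).strip() for v in value]
        record[key] = [v for v in cleaned if v and v.lower() != "not mentioned"]
    return record


if __name__ == "__main__":
    reply = 'Result:\n{"targeted_genes": ["TXNIP"], "species": "Not mentioned"}'
    print(normalize_llm_output(reply))
```

A fixed key set makes downstream counting simpler, because every record can be read without checking which fields the model chose to return.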

Step 3: Visualize and analyze results

The information extraction results using LLM and GEM were visualized locally as table_vis1 (Fig. 2a) using the custom-developed tool “visualize_geinfo” (https://github.com/szktkyk/visualize_geinfo), which automatically calculates GE-related metrics and makes the results easy to search. During visualization, the number of GE-targeted cases (GE_target_count) and the number of articles reporting gene expression changes due to GE of other genes (GE_degs_count) for each of the queried genes (158 genes in this case) were automatically calculated and displayed as table_vis2 (Fig. 2b). The “visualize_geinfo” tool has built-in functionality to incorporate additional custom metrics in table_vis2 (Fig. 2b). In this study, we added two metrics: scores from a transcriptomics meta-analysis (meta-analysis score) and the number of articles reporting an association with Parkinson’s disease for each gene (PD_count) [26] (Fig. 2b). The meta-analysis score was calculated in the previous study [26]. This score represents the number of RNA-seq data pairs showing expression changes out of the data pairs derived from 10 research projects. A higher absolute value of the score indicates more consistent differential expression across multiple studies, suggesting the gene’s association with OS. We also applied max–min normalization to scale each value between 0 and 1 (equation 1) and calculated a cumulative score, which is presented in table_vis2 (Fig. 2b). In total, 158 genes were ranked in descending order based on cumulative scores. For metrics in which lower values indicated stronger target candidates (i.e. GE_target_count and PD_count), the scoring system inverted the scale by subtracting each normalized value from the maximum normalized value of 1 (equation 2). The score is calculated automatically by executing “weight_score.py” in the visualize_geinfo repository.
Two important settings, the addition of custom scores to table_vis2 and the indication of specific metrics where lower values indicate stronger target candidates, can be configured by editing the “config.py” file in the repository. The top 40 ranked genes were visualized using a bar plot (Fig. 2c). The visualize_geinfo tool can be executed in a containerized environment by loading a Docker file from the repository, ensuring that it can run on any computer platform.

$$\text{Equation 1:}\quad X^{\prime} = \frac{X - \min(\text{column})}{\max(\text{column}) - \min(\text{column})},$$

$$\text{Equation 2:}\quad X^{\prime} = 1 - \frac{X - \min(\text{column})}{\max(\text{column}) - \min(\text{column})}.$$
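As a worked illustration of equations 1 and 2, the sketch below applies both to a small column of values. The function names are hypothetical and are not taken from “weight_score.py”; returning 0 for a column in which all values are equal is an assumption, since the equations are undefined in that case.

```python
def min_max(values):
    """Equation 1: scale each value in a column to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def inverted_min_max(values):
    """Equation 2: 1 - equation 1, for metrics where lower values are better."""
    return [1.0 - x for x in min_max(values)]


if __name__ == "__main__":
    counts = [0, 5, 10]              # e.g. toy GE_target_count values
    print(min_max(counts))           # [0.0, 0.5, 1.0]
    print(inverted_min_max(counts))  # [1.0, 0.5, 0.0]
```

With equation 2, a gene with the lowest count in the column receives the highest normalized value, which is the behavior described for GE_target_count and PD_count.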

Preparation of evaluation data

After querying 146 NCBI Gene IDs against the GEM dataset of 7 May 2024, 266 annotation pairs between the NCBI Gene ID and PubMed ID were retrieved from 259 articles. We manually curated the role of each gene from 266 annotation pairs by reviewing the contents of the annotated articles.

The curation results are summarized in Table S2 of the Supplementary data. If the gene was identified as a target of GE in the article, “1” was entered in the “curation_gene” column of the curation file. If the gene was not a target, “0” was recorded. For genes targeted in nonhuman species, “2” was entered, with the correct species noted in the memo column.

If the gene was reported in the article to have altered expression due to GE of other genes, “1” was entered in the “deg” column of the curation file, and “0” if not. If a gene was already labeled as a GE target, “2” was entered in the deg column to exclude it from LLM performance evaluation. This exclusion was necessary because LLMs might infer expression changes in genome-edited genes as GE_deg, even without explicit mention in the literature.

Performance evaluation of information extraction by the LLM

The evaluation of LLM-based information extraction was conducted for two types of extracted metadata: genes targeted by GE (GE_target) and genes reported as having altered expression due to the GE of other genes (GE_deg).

For GE_target, the evaluation followed these criteria: if the evaluation data (curated CSV file) were labeled “1” and the LLM output included the gene, it was classified as a true positive (TP); if the gene was missing from the LLM output, it was a false negative (FN). If the evaluation data were labeled “0” and the LLM output did not include the gene, it was considered a true negative (TN), while inclusion of the gene was treated as a false positive (FP). For cases labeled “2,” it was considered a TP if the gene was present in the LLM output and the species mentioned in the LLM results matched the curated species in the memo column.

For GE_deg, 163 annotation pairs were evaluated, excluding the 103 annotation pairs curated as GE_target. The same criteria used for GE_target applied: if the evaluation data were labeled “1” and the gene was included in the LLM output, it was classified as a TP, while omission of the gene was an FN. For cases labeled “0,” if the gene was absent from the LLM output, it was considered a TN; otherwise, it was classified as an FP.
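The labeling rules above can be summarized as a small decision function. This is a sketch under stated assumptions rather than the logic of “evaluate_targetedgenes.py”: the function name and arguments are hypothetical, and a label “2” case that fails the gene or species check is treated here as an FN, which the text does not specify.

```python
def classify_pair(label: int, gene_in_output: bool, species_match: bool = False) -> str:
    """Map one curated annotation pair and the LLM result to TP/FN/TN/FP.

    label: 1 = gene is a GE target, 0 = not a target,
           2 = target in a nonhuman species (GE_target evaluation only).
    """
    if label == 1:
        return "TP" if gene_in_output else "FN"
    if label == 0:
        return "FP" if gene_in_output else "TN"
    if label == 2:
        # Assumption: anything short of gene + species agreement counts as FN.
        return "TP" if (gene_in_output and species_match) else "FN"
    raise ValueError(f"unexpected curation label: {label}")


if __name__ == "__main__":
    print(classify_pair(1, True))          # TP
    print(classify_pair(0, True))          # FP
    print(classify_pair(2, True, True))    # TP
```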

The accuracy, precision, recall, and F1 scores were calculated using the following formulas:

Accuracy = (TP + TN)/(TP + TN + FP + FN),

Precision = TP/(TP + FP),

Recall = TP/(TP + FN),

F1 score = 2 × (Precision × Recall)/(Precision + Recall).

The calculation was executed by running “evaluate_targetedgenes.py” and “evaluate_deg.py” from the extract_geinfo repository (https://github.com/szktkyk/extract_geinfo).
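The four formulas can be checked against the confusion matrix reported for GE_target with GPT-4o in the second round (Table 5: TP = 99, FN = 4, FP = 9, TN = 154). The sketch below is illustrative only; the function name is hypothetical and the code is not taken from the evaluation scripts.

```python
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
    }


if __name__ == "__main__":
    # Counts from Table 5 (GE_target, GPT-4o, second round).
    print(evaluation_metrics(tp=99, tn=154, fp=9, fn=4))
    # {'accuracy': 0.9511, 'precision': 0.9167, 'recall': 0.9612, 'f1': 0.9384}
```

The same function applied to the Table 6 counts (TP = 59, TN = 80, FP = 18, FN = 6) reproduces the GE_deg values reported in Table 4.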

Results

Proportion of GE-targeted genes in GEM

As mentioned in the Introduction section, the GEM dataset includes various types of genes linked to GE-related literature owing to the nature of its data collection system. To investigate the types of genes linked to GE-related articles, we manually curated a subset of the GEM dataset. Following Step 1 of the pipeline (Fig. 1, Method: Step 1), we queried 146 NCBI Gene IDs against the GEM dataset to retrieve GE-related articles. These 146 NCBI Gene IDs were converted from 168 Ensembl Gene IDs that were previously identified through data-driven analyses as responsive to OS in Parkinson’s disease [26]. Consequently, we retrieved 266 unique pairs of annotations, where each pair contained a gene ID (NCBI gene ID) matched with its corresponding article ID (PubMed ID). As some articles were linked to multiple genes, the total number of GE-related articles retrieved was 259 (referred to as 259_GE_articles).

Manual curation of the 266_GE_annotation_pairs using the method outlined in “Preparation of evaluation data” revealed four major categories of genes: genes targeted by GE (38.72%, 103 of 266_GE_annotation_pairs), genes reported as having altered expression due to GE of other genes (24.44%, 65 of 266_GE_annotation_pairs), genes studied in the article but not related to GE (10.53%, 28 of 266_GE_annotation_pairs), and genes collected due to extraction errors (8.27%, 22 of 266_GE_annotation_pairs).

If these results reflect a similar trend across the entire GEM dataset, which contains 92 182 entries as of 18 September 2024, only ∼39% of the search results would represent the intended entries when a user searches the GEM to investigate whether a gene has been targeted by GE. Additionally, ∼19% of the search hits would consist of unrelated studies or extraction errors. This indicates that the GEM dataset, in its current form, may be unreliable for such purposes.

Proportion of GE-targeted genes after information extraction by LLMs

To address the challenge of identifying the role of each gene in a GE study, we explored the application of LLMs. In Step 2 of the pipeline, we utilized information on genes, species, and GE tools from the GEM, along with textual data from the articles, combined with custom prompts, as input to the LLMs to systematically extract GE metadata.

In May 2024, as part of the first round of experiments, we used genes and species data from the GEM and the titles and abstracts of relevant articles. Several LLMs were tested to extract “targeted genes by GE,” and their extraction performance, cost, and processing time were compared. As outlined in the “Materials and Methods: Step 2: Extract information from the article,” 259_GE_articles were processed using LLMs, and the extracted information was compared with manually curated evaluation data (see Table S2 in the Supplementary data). The results show that GPT-4 achieved an accuracy of 84.96%, a cost of $9.9, and a processing time of 3000 s; GPT-4o achieved an accuracy of 90.23%, a cost of $1.4, and a processing time of 960 s; and Llama3-70b achieved an accuracy of 83.83%, a cost of $1.5 (the cost of using GPT-4 with LangChain’s output_parser library), and a processing time of 4440 s (Table 3). Llama3-70b was tested using the demo version of Groq’s API. Based on these results, GPT-4o, which had the highest accuracy and a relatively low cost, was selected.

Table 3.

Evaluation for extraction of GE_targeted (genes targeted by GE) using LLMs for first attempt (20240510–16)

Evaluation for extraction of GE_targeted (genes targeted by GE) using LLMs
	GPT-4	GPT-4o	Llama3-70b
Number of articles processed	259	259	259
Accuracy for 259 articles	0.8496	0.9023	0.8383
Precision for 259 articles	0.7890	0.9326	0.7679
Recall for 259 articles	0.8349	0.8058	0.8350
F1 for 259 articles	0.8113	0.8646	0.8000
Price	$9.90	$1.40	$1.50
Time	3000 (s)	960 (s)	4440 (s)

In August 2024, during the second round of experiments, the prompt was refined to improve accuracy, and with the release of GPT-4o-mini, we conducted a comparison between GPT-4o and GPT-4o-mini. In the second round of experiments, our custom prompts for LLMs included information about genes, species, and GE tools from the GEM and textual data from the title, abstract, methods, and results sections of the article. Metadata extraction was performed using GPT-4o and GPT-4o-mini to extract six types of metadata: targeted_genes (genes targeted by GE, referred to as GE_target), differentially_expressed_genes (genes reported as having altered expression due to GE of other genes, referred to as GE_deg), species (organisms studied using GE), genome_editing_tools (tools used for GE in the study), genome_editing_event (events induced by GE), and phenotypes (phenotypes observed as a result of GE). As listed in Table 4, GPT-4o outperformed GPT-4o-mini and the models from the first round of experiments in terms of extraction accuracy. Therefore, the GPT-4o results from the second round were used for further analysis.

Table 4.

Evaluation for extraction of GE_target and GE_deg using GPT-4o for second attempt (20240802)

Evaluation for extraction using LLMs
	GPT-4o-mini (GE_targeted)	GPT-4o (GE_targeted)	GPT-4o (GE_deg)
Number of articles processed	742	742	742
Accuracy for 259 articles	0.8797	0.9511	0.8528
Precision for 259 articles	0.7840	0.9167	0.7662
Recall for 259 articles	0.9615	0.9612	0.9077
F1 for 259 articles	0.8596	0.9384	0.8310
Price	$0.76	$26.55	$26.55
Time	2050 (s)	4339 (s)	4339 (s)

For GE_target, the results from the second round of experiments demonstrated that the LLM was able to interpret the context of GE-related genes described in the article, with an F1 score of 0.9384. For GE_deg, the F1 score was 0.8310.

In the task of extracting GE_target from the article and GEM data using LLM, precision had the lowest value among the measured metrics. Of the 266_GE_annotation_pairs, 13 genes were incorrectly extracted, including nine FPs and four FNs (Table 5). Six of the nine FP cases resulted from the LLM incorrectly annotating GE_deg as GE_target. Similarly, precision was the lowest metric in the GE_deg extraction task, with 18 FP and six FN cases (Table 6). In the evaluation of the two types of GE-related metadata in this study, precision was consistently lower than accuracy and recall (Tables 5 and 6). These results suggest that LLMs tend to overinterpret contextual information rather than overlook it, particularly when distinguishing between closely related concepts such as GE_target and GE_deg. This tendency might be attributed to the LLM’s inherent behavior of making inferences beyond explicitly stated information.

Table 5.

Confusion matrix for the GE_target of second attempt with GPT-4o (20240802)

Evaluation for extraction of GE_targeted (genes targeted by GE) using LLMs
		LLM result
		Positive	Negative	Total
Evaluation data	Positive	99	4	103
Evaluation data	Negative	9	154	163
	Total	108	158	266

Table 6.

Confusion matrix for the GE_deg results of second attempt with GPT-4o (20240802)

Evaluation for extraction of GE_deg (genes reported as altered expression due to GE of other genes) using LLMs
		LLM result
		Positive	Negative	Total
Evaluation data	Positive	59	6	65
Evaluation data	Negative	18	80	98
	Total	77	86	163

The output files include the results of GPT-4o and GPT-4o-mini for the second round of experiments (see Tables S3 and S4 in the Supplementary data). The evaluation results for GE_target and GE_deg for the second round of experiments using GPT-4o are available in Tables S5 and S6 in the Supplementary data.

Calculation of GE-related metrics for genes

By utilizing an LLM-based information extraction process, we were able to systematically obtain novel GE metadata that the current GEM cannot provide. Although the evaluation was conducted on only 259_GE_articles, we expanded the dataset to include 742 articles retrieved by querying the GEM with 158 gene symbols rather than NCBI Gene IDs. Searching the GEM by gene symbol enables the inclusion of articles linked to orthologous genes in nonhuman species, resulting in an increased number of hits. We processed these 742 articles following Step 2 of the pipeline to collect GE metadata. From the LLM results for these 742 articles, two GE-related metrics were calculated for each gene: (I) the number of cases in which the gene was targeted by the GE tool (GE_target_count) and (II) the number of articles reporting altered gene expression due to GE of other genes (GE_deg_count). These metrics were automatically calculated by running the custom-developed visualize_geinfo tool (https://github.com/szktkyk/visualize_geinfo), which calls search_gem_llm.py to compute how frequently each gene was identified as GE_target or GE_deg based on the LLM output.
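The counting step can be pictured as follows. This is a minimal sketch, not the code of search_gem_llm.py: the record layout, the function name `count_ge_metrics`, the toy PubMed IDs, and the case-insensitive matching of gene symbols are all assumptions made for illustration.

```python
def count_ge_metrics(records, query_genes):
    """Count, per queried gene, how many article records list it as a GE target
    (GE_target_count) or as differentially expressed (GE_deg_count).

    records: list of dicts with 'targeted_genes' and
             'differentially_expressed_genes' lists (one dict per article).
    """
    target_count = {gene: 0 for gene in query_genes}
    deg_count = {gene: 0 for gene in query_genes}
    for rec in records:
        # Assumption: gene symbols are compared case-insensitively.
        targets = {g.upper() for g in rec.get("targeted_genes", [])}
        degs = {g.upper() for g in rec.get("differentially_expressed_genes", [])}
        for gene in query_genes:
            if gene.upper() in targets:
                target_count[gene] += 1
            if gene.upper() in degs:
                deg_count[gene] += 1
    return target_count, deg_count


if __name__ == "__main__":
    toy_records = [  # hypothetical LLM outputs, one per article
        {"pmid": "0001", "targeted_genes": ["PKD1"], "differentially_expressed_genes": ["FOS"]},
        {"pmid": "0002", "targeted_genes": ["Fos"], "differentially_expressed_genes": []},
    ]
    print(count_ge_metrics(toy_records, ["PKD1", "FOS", "HSPA6"]))
```

Genes that never appear in any record keep a count of zero, which matches how genes without GE examples are reported in the results.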

The top 10 genes with the highest GE_target_count values among the 158 genes were: PKD1, FOS, SLC7A5, TXNIP, STAT6, SPP1, SLC3A2, TUBB3, SLC26A4, and CARD11. These genes are considered to have a higher likelihood of being well-studied for effective GE within the 158 gene set. In contrast, 103 of the 158 genes had a GE_target_count of zero according to the GEM and LLM, suggesting that they have not yet been extensively studied using GE or have limited examples. These are challenging candidate genes, potentially leading to novel discoveries through GE experiments.

The top 10 genes with the highest GE_deg_count values among the 158 genes were FOS, SPP1, TXNIP, MPO, CDKN1C, CRYAB, CD74, STAT6, CPT1A, and PKD1. These genes have been suggested to have potential responsiveness or exhibit certain phenotypes in response to GE.

Utilization of GE-related metrics

As an experimental approach for utilizing the two GE-related metrics, we scored and ranked the 158 genes. In ranking the 158 genes, we incorporated two custom metrics: a meta-analysis score derived from gene expression data (meta-analysis_score) and the number of studies reporting an association with Parkinson’s disease (PD_score), along with two GE-related metrics: GE_target_count and GE_deg_count. The custom-developed visualize_geinfo tool allows the addition of custom scores (meta-analysis_score and PD_score) to the default metrics (GE_target_count and GE_deg_count) in the table. All four metrics were normalized to a 0–1 scale using max–min normalization with weights assigned based on importance. These weights are configurable through the config.py file, allowing users to adjust them according to their priorities when running Step 3 of the pipeline. In this study, we assigned a weight to each metric as follows: GE_target_count:0.35, GE_deg_count:0.15, meta-analysis_score:0.45, and PD_score: 0.05. We assigned the highest weight (0.45) to the meta-analysis_score as it represents results from transcriptome meta-analysis, which indicates the functional significance of genes in terms of expression changes. The second-highest weight was given to GE_target_count, as we aimed to prioritize genes that had not been previously studied using GE. GE_deg_count received the third-highest weight, as it potentially indicates gene functionality from an expression perspective. While PD_score, which reflects the research attention a gene has received in Parkinson’s disease studies, may be significant in certain contexts, it was given the lowest priority in this study.
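A weighted cumulative score of this kind can be sketched as below, using the weights stated above. The sketch is not the implementation in “weight_score.py”: the function name, the toy genes and values, and the handling of constant columns are assumptions, and the meta-analysis score is used as given rather than as an absolute value.

```python
WEIGHTS = {
    "GE_target_count": 0.35,
    "GE_deg_count": 0.15,
    "meta_analysis_score": 0.45,
    "PD_score": 0.05,
}
# Metrics for which lower values indicate stronger candidates.
LOWER_IS_BETTER = {"GE_target_count", "PD_score"}


def rank_genes(table, weights=WEIGHTS, lower_is_better=LOWER_IS_BETTER):
    """Return (gene, score) pairs sorted by weighted, max-min normalized score."""
    genes = list(table)
    totals = {gene: 0.0 for gene in genes}
    for metric, weight in weights.items():
        column = [table[gene][metric] for gene in genes]
        lo, hi = min(column), max(column)
        for gene, value in zip(genes, column):
            # Assumption: a constant column contributes 0 before inversion.
            x = 0.0 if hi == lo else (value - lo) / (hi - lo)
            if metric in lower_is_better:
                x = 1.0 - x
            totals[gene] += weight * x
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_table = {  # hypothetical genes and values
        "GENE_A": {"GE_target_count": 0, "GE_deg_count": 2, "meta_analysis_score": 50, "PD_score": 0},
        "GENE_B": {"GE_target_count": 10, "GE_deg_count": 0, "meta_analysis_score": 10, "PD_score": 5},
        "GENE_C": {"GE_target_count": 5, "GE_deg_count": 1, "meta_analysis_score": 30, "PD_score": 5},
    }
    for gene, score in rank_genes(toy_table):
        print(gene, round(score, 3))
```

Passing a different `weights` dictionary allows the ranking to be recomputed under alternative priorities, in the same spirit as adjusting the weights in config.py.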

After calculating the total score for each gene, the top 40 genes were ranked and visualized in a bar plot (Fig. 2c). The table shown in Fig. 2b displays all 158 genes sorted by score. The top 10 ranked genes were MEGF10, HSPA6, LINC01588, CNBD1, GPR37L1, PRC1-AS1, PILRA, BTLA, TMPRSS4, and ADAMTS2. The scores for the 158 genes are listed in Table S7 of the Supplementary data. We also examined the top-ranked genes when the weights were slightly changed as follows: GE_target_count:0.25, GE_deg_count:0.10, meta-analysis_score:0.55, and PD_score:0.10. We observed that the composition of the top 10 genes remained consistent except for a swap between the first and second positions. The scores for the 158 genes with these modified weights are listed in Table S8 of the Supplementary data.

These top-ranked genes can be hypothesized as potential targets for future research, as they scored highly in the meta-analysis_score based on gene expression meta-analysis and the GE_deg_count based on literature interpretation by the LLM. In addition, they have low GE_target_count and PD_score values, indicating that they are understudied in the GE and PD research fields. This suggests that in-depth investigation of these genes may lead to new discoveries. However, these top-ranked genes have not yet been experimentally validated.

Discussion

This study identified a key issue with the existing GEM, in which genes linked to GE-related literature in the GEM can be classified into four major categories; however, the current GEM does not allow for determining which category each gene belongs to. To address this limitation, we employed LLMs. In a case study involving 259_GE_articles associated with the test 168 genes (including 146 NCBI Gene IDs and 158 Gene Symbols), we extracted information on GE contexts, including targeted genes of GE (GE_target) and genes reported as having altered expression due to GE of other genes (GE_deg). The results showed that GE_target was identified with an F1 score of 0.9384 and GE_deg with an F1 score of 0.8310. This demonstrates that combining LLMs with existing GEM can provide novel GE metadata that could not be systematically collected previously. By leveraging this enhanced GE information, we calculated two new metrics for each gene: “number of GE targeted cases (GE_target_count)” and “number of articles reporting differential expression due to GE of other genes (GE_deg_count).” These metrics were then used in an experimental approach to score and prioritize the 158 genes.

After normalizing the scores, we ranked the 158 genes and examined the characteristics of some of the top-ranked genes. For example, LINC01588, ranked third, had a score of zero for GE_target_count, GE_deg_count, and PD_score. Its high ranking was driven by its relatively strong OS meta-analysis_score. A PubMed search for “LINC01588” returned five articles, one of which reported that knocking down LINC01588 enhances OS through its interaction with HNRNPL [27]. LINC01588 encodes a noncoding RNA, which may explain the limited research on this gene. The second-ranked gene, HSPA6, encodes the molecular chaperone heat shock 70-kDa protein 6, which responds to stress. It had three cases of GE, with one study linked to PD. With a GE_deg_count of 2 and a meta-analysis_score of 50, HSPA6 ranked second. As one of the GE articles also described HSPA6 as an upregulated gene in PD [28], HSPA6 is an already established target gene for PD and can be considered a positive control for this ranking. CNBD1, ranked fourth, was downregulated in OS conditions. The protein encoded by CNBD1 (cyclic nucleotide-binding domain-containing 1) remains relatively understudied, with limited available information. Research on CNBD1 in PD or OS is sparse, no GE cases were found using our method, and no PubMed hits were identified when querying with “CNBD1,” making CNBD1 a potential new research target for future PD or OS studies.

A high score on these metrics does not necessarily imply that the genes are functionally linked to OS or PD. However, by incorporating GE information, this method can be seen as a way to prioritize candidate genes for further experimental functional analysis, particularly genes with limited prior research but notable gene expression profiles. There is a growing concern regarding research bias toward well-studied human genes [29, 30]. In response, approaches aimed at promoting the investigation of understudied genes and fostering new discoveries, such as the omics-based approach “Unknomics,” have gained attention [31–33]. Unknomics seeks to advance research on novel genes. The scoring method used in this study, which integrates GE_target_count and PD_score, can be viewed as a form of unknomics.

Finally, we outline four major limitations of the pipeline developed in this study. First, the GEM dataset was collected based only on the literature from PubMed. As of 18 September 2024, the GEM contained 46 039 GE-related articles retrieved from PubMed using a custom search query (https://github.com/szktkyk/gem/blob/main/config.py). However, literature unrelated to GE may have been included, and relevant studies may have been missed. Articles that were not indexed in PubMed were excluded from the GEM dataset. Second, the number of articles processed using LLMs was limited in this study. As a feasibility study, only 742 of the 46 039 articles in the GEM dataset were processed using the pipeline. The evaluation data were limited to 259 of the 46 039 articles, indicating that the performance of the LLM was assessed only on a subset of the data registered in the GEM. The LLM performance on a large scale remains untested. Third, our evaluation of GE metadata was limited to GE_target and GE_deg due to insufficient labeled data for other metadata types. Additionally, we have not assessed how variations in extracted metadata types between the first and second rounds might affect the extraction performance of GE_target and GE_deg. Fourth, there is always a risk of error in the LLM output. Information extraction for GE_target and GE_deg using LLM achieved accuracies of 95.11% and 85.28%, respectively, corresponding to error rates of 4.89% and 14.72%. Thus, potential errors in the LLM output cannot be entirely ruled out. Although this pipeline shows the potential for systematically collecting a large amount of GE metadata for future research, the results generated by the LLM should not be regarded as definitive.

Conclusion

In this study, we explored a method to address the metadata issues in GEM using LLMs. Through this approach, we were able to systematically collect metadata that could not be obtained through the conventional GEM. Furthermore, we developed a method to rank genes based on the newly collected data by leveraging the concept of unknomics. The findings of this study are expected to contribute to the efficient design of research using GE. However, because the information extracted by the LLM and the gene rankings may contain errors, users must manually verify the data as a final step.

Acknowledgements

This research was supported by the Center of Innovation for Bio-Digital Transformation (BioDX), the open innovation platform for industry-academia co-creation (COI-NEXT), the Japan Science and Technology Agency (JST; Grant Number JPMJPF2010). This work was also supported by the JST, which established university fellowships for the creation of science and technology innovation (Grant Number JPMJFS2129). Computations were performed on the computers at the Hiroshima University Genome Editing Innovation Center. We also would like to thank all laboratory members at Hiroshima University and the Database Center of Life Science for their valuable comments.

Author contributions

T.S. was responsible for data curation, software development, pipeline analysis, and drafting of the original manuscript. T.S. and H.B. were responsible for the study design, conceptualization, methodology, and manuscript review and editing. H.B. was responsible for project administration and funding acquisition. Both authors have read and approved the final version of the manuscript.

Conflict of interest

None declared.

Funding

This study was supported by the Center of Innovation for Bio-Digital Transformation, an open innovation platform for industry-academia co-creation of JST (COI-NEXT, grant number JPMJPF2010).

Data availability

The Supplementary File and the source code for this research are available at https://github.com/szktkyk/extract_geinfo, https://github.com/szktkyk/visualize_geinfo, and in the figshare repository (https://doi.org/10.6084/m9.figshare.c.7497327).

References

1. Gaj, Sirk, Shui et al. Genome-editing technologies: principles and applications. Cold Spring Harb Perspect Biol 2016:a023754.
2. Kim, Cha, Chandrasegaran. Hybrid restriction enzymes: zinc finger fusions to Fok I cleavage domain. Proc Natl Acad Sci USA 1996:1156.
3. Christian, Cermak, Doyle et al. Targeting DNA double-strand breaks with TAL effector nucleases. Genetics 2010;186:757.
4. Jinek, Chylinski, Fonfara et al. A programmable Dual-RNA–guided DNA endonuclease in adaptive bacterial immunity. Science 2012;337:816.
5. Komor, Kim, Packer et al. Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 2016;533:420.
6. Anzalone, Randolph, Davis et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 2019;576:149.
7. Gilbert, Horlbeck, Adamson et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 2014;159:647.
8. Hartenian, Doench. Genetic screens and functional genomics using CRISPR/Cas9 technology. FEBS J 2015;282:1383.
9. Nakamae, Bono. Genome editing and bioinformatics. Gene Genome Edit 2022:100018. doi: 10.1016/j.ggedit.2022.100018
10. Suzuki, Bono. GEM: genome editing meta-database, a dataset of genome editing related metadata systematically extracted from PubMed literatures. Gene Genome Edit 2023:100024.
11. Liu, Homma, Sayadi et al. Sequence features associated with the cleavage efficiency of CRISPR/Cas9 system. Sci Rep 2016:19675.
12. Scott, Kriz et al. Genome-wide binding of the CRISPR endonuclease Cas9 in mammalian cells. Nat Biotechnol 2014:670.
13. Kim, Kweon, Kim. Recent advances in CRISPR-based functional genomics for the study of disease-associated genetic variants. Exp Mol Med 2024:861.
14. Pacesa, Pelea, Jinek. Past, present, and future of CRISPR genome editing technologies. Cell 2024;187:1076–100.
15. da Silva, Meyenberg, Loizou. Tissue specificity of DNA repair: the CRISPR compass. Trends Genet 2021:958.
16. Mikkelsen, Bak. Enrichment strategies to enhance genome editing. J Biomed Sci 2023:51.
17. Doench, Hartenian, Graham et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nat Biotechnol 2014:1262.
18. Zheng, Zhang, Martin et al. Plant genome editing database (PGED): a call for submission of information about genome-edited plant mutants. Mol Plant 2019:127.
19. Wei, Allot, Lai et al. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Research. https://academic.oup.com/nar/article/52/W1/W540/7640526 (10 February 2025, date last accessed).
20. Pafilis, Bērziņš, Jensen. EXTRACT 2.0: text-mining-assisted interactive annotation of biomedical named entities and ontology terms. 111088. Preprint (2017).
21. Pafilis, Buttigieg, Ferrell et al. EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. Database 2016;2016:baw005.
22. Zhao et al. A survey of large language models. Preprint at http://arxiv.org/abs/2303.18223 (2024).
23. Dagdelen, Dunn, Lee et al. Structured information extraction from scientific text with large language models. Nat Commun 2024:1418.
24. Gupta, Mahmood, Shetty et al. Data extraction from polymer literature using large language models. Commun Mater 2024.
25. Polak, Morgan. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nat Commun 2024:1569.
26. Suzuki, Bono. A systematic exploration of unexploited genes for oxidative stress in Parkinson’s disease. Npj Parkinsons Dis 2024.
27. Song, Ren, Gao et al. LINC01588 regulates WWP2-mediated cardiomyocyte injury by interacting with HNRNPL. Environ Toxicol 2022:1629.
28. Jiao, Bai et al. Identification and functional analysis of the regulatory elements in the pHSPA6 promoter. Genes 2022:189.
29. Stoeger, Gerlach, Morimoto et al. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol 2018:e2006643.
30. Kustatscher, Collins, Gingras A-C et al. Understudied proteins: opportunities and challenges for functional proteomics. Nat Methods 2022:774.
31. Rocha, Jayaram, Stevens et al. Functional unknomics: systematic screening of conserved genes of unknown function. PLoS Biol 2023:e3002222.
32. Rappsilber. A dive into the unknome. Trends Genet 2024.
33. Richardson, Tejedor Navarro, Amaral LAN et al. Meta-research: understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results. eLife 2024:RP93429.