Abstract

MicroRNAs (miRNAs) play important roles in post-transcriptional processes and regulate major cellular functions. The abnormal regulation of miRNA expression has been linked to numerous human diseases such as respiratory diseases, cancer, and neurodegenerative diseases. The latest miRNA–disease associations are predominantly found in unstructured biomedical literature. Retrieving these associations manually can be cumbersome and time-consuming due to the continuously expanding number of publications. We propose a deep learning-based text mining approach that extracts normalized miRNA–disease associations from biomedical literature. To train the deep learning models, we build a new training corpus that is extended by distant supervision utilizing multiple external databases. A quantitative evaluation shows that the workflow achieves an area under the receiver operating characteristic curve of 98% on a holdout test set for the detection of miRNA–disease associations. We demonstrate the applicability of the approach by extracting new miRNA–disease associations from biomedical literature (PubMed and PubMed Central). Through quantitative analysis and evaluation on three different neurodegenerative diseases, we show that our approach can effectively extract miRNA–disease associations not yet available in public databases.

Database URL: https://zenodo.org/records/10523046

Introduction

Short RNA molecules such as microRNAs (miRNAs) that bind to target messenger RNAs (mRNAs) play important roles in post-transcriptional processes and regulate major cellular functions [1]. Deregulation of miRNA expression, which impacts gene expression patterns and disrupts cellular processes, has been associated with several human diseases such as respiratory diseases [2–4], cancer [1], and Alzheimer’s disease (AD) [5–7]. Targeting disease-associated mRNAs through selected miRNAs makes these molecules interesting candidates for therapy, and their significance is growing with further clinical advances in miRNA delivery technologies [1]. However, this requires a thorough knowledge of the involvement of specific miRNAs in normal biological processes and in diseases, which is obtained through in vivo and in vitro experiments and published in the research literature.

Extraction of such miRNA–disease associations from the literature can be performed through text mining techniques. In the past, Bagewadi et al. [8] proposed the extraction of miRNA, species, genes/proteins, and disease annotations and their relations by creating new corpora and utilizing rule-based methods (such as regular expressions) and machine learning methods (such as support vector machines). They reached an F1-score of up to 76% for miRNA relations. In addition, Li et al. [9] created a rule-based text mining system called miRTex that focused on extracting just miRNA–gene and gene–miRNA regulation relations from scientific literature. Their final system achieved an F1-score of 88% on a test set of 150 PubMed abstracts; however, the recall (81%) was significantly lower than precision (96%), which is a common characteristic of a rule-based system. Gupta et al. [10] proposed the miRiaD text mining tool, which reached an F1-score of 89.4% on a set of 200 sentences, to extract miRNA–disease relations from the entire Medline identifying 8301 abstracts containing such relations. The tool BeFree, proposed by Bravo et al. [11], that exploits morphosyntactic information of the text reached an F1-score of 85% for the extraction of gene–disease associations that also includes miRNAs. The results of BeFree are integrated in the DisGeNET database, a platform for disease genomics [12].

In the meantime, transformer-based general language models such as Bidirectional Encoder Representations from Transformers (BERT) [13] or Generative Pre-trained Transformer (GPT) [14] have revolutionized the field of natural language processing (NLP), as they can effectively represent long-term interactions in text using the built-in attention mechanism [15]. These models are pretrained on large text corpora to model the English language. Furthermore, various biomedical domain-specific models such as BioBERT [16], BioMegatron [17], and ClinicalBERT [18] have been created by additional pretraining on PubMed abstracts, PMC full-text documents, and clinical notes. These biomedical language models have been proposed for various biomedical NLP (bioNLP) tasks, such as named entity recognition (NER), relation extraction (RE), and document classification. In the past, bioNLP research has mostly focused on extracting protein–protein interactions [19], drug–drug interactions [20], adverse effects detection [21], clinical entity extraction [22, 23], molecular event extraction [24], and more [25, 26].

In this paper, we introduce a deep learning-based text mining workflow that extracts miRNA–disease associations from the literature. The text mining workflow defines three different tasks: (I) detection of miRNA and disease entities (NER), (II) linking of miRNA and disease entities to specific database identifiers [entity linking (EL)], and (III) detection of their associations (RE). We also create a new training dataset containing miRNA–disease associations using distant supervision from multiple databases, which is used to train the relation extraction model. After evaluating the prediction performance of our workflow, we use it to extract miRNA–disease associations from PubMed and PubMed Central documents published between 2020 and 2023. We further discuss the predicted associations in the context of three diseases of interest. For reuse, we publish the new corpus, the predicted associations, and the source code of our workflow.

Materials and methods

First, we describe all datasets that are required for the three tasks NER, EL, and RE. Next, the training, evaluation, and application of the machine learning modeling approach are described in detail.

Datasets

Collection of miRNA and disease entity recognition datasets

We used the openly available NCBI Disease corpus published by the National Center for Biotechnology Information (NCBI) [27] and the BioCreative V Chemical Disease Relation (BC5CDR Disease) corpus [28], both of which contain disease mention annotations. These annotations also include entity links to concept identifiers from the Medical Subject Headings (MeSH) database. miRNA mentions are included in the miRNA [8] and miRTex [9] corpora. For all datasets, we used the so-called Beginning-Inside-Outside (BIO) format [29], also referred to as IOB tagging, for labeling, where ‘O’ is assigned to every token that does not represent an entity, ‘B’ corresponds to the first token of an entity, and ‘I’ is assigned to the following tokens of an entity.
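The labeling scheme can be illustrated with a minimal sketch. The function name, the example sentence, and its tokenization are illustrative assumptions and are not taken from the corpora.

```python
def bio_tags(tokens, entity_spans):
    """Assign B/I/O labels to a tokenized sentence.

    tokens: list of token strings.
    entity_spans: list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)          # default: token is outside any entity
    for start, end in entity_spans:
        tags[start] = "B"               # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"               # following tokens of the same entity
    return tags


# Hypothetical example: one disease mention spanning three tokens.
tokens = ["Elevated", "miR-133b", "in", "Parkinson", "'s", "disease"]
disease_spans = [(3, 6)]
tags = bio_tags(tokens, disease_spans)
print(list(zip(tokens, tags)))
```

In practice, the tags are usually combined with the entity class (for example B-Disease), which this sketch omits for brevity.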

Building a corpus of miRNA–disease relations using distant supervision

Distant or weak supervision aims to create a training dataset (or corpus) by extracting instances from a single or multiple existing knowledge bases, in order to reduce the amount of manual curation effort [30]. To create a suitable training corpus containing miRNA–disease relations, we used two different databases, namely Human microRNA Disease Database 3 [31, 32] and miR2Disease [33]. We first applied rule-based approaches to extract miRNA and disease entities from PubMed abstracts using MiRNADetector [8] and JProMiner, a re-engineered NER algorithm based on the ProMiner software developed by Hanisch et al. [34]. In a postprocessing step, we filtered out sentences with no miRNA or disease annotations. Furthermore, sentences containing multiple miRNA or disease annotations were manually curated. We further extended our corpus with miRNA–disease relations published by [8]. An overview of all datasets used for training can be seen in Table 1, including some descriptive statistics on the number of mentions and relations for each individual dataset.
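The postprocessing step described above can be sketched as a simple triage of annotated sentences. This is a minimal sketch under assumptions: the dictionary keys, the function name, and the reading that a sentence is dropped when either entity type is missing are illustrative and do not reproduce the actual implementation.

```python
def triage(sentences):
    """Split annotated sentences into auto-accepted and to-be-curated sets."""
    accepted, to_curate = [], []
    for sentence in sentences:
        n_mirna = len(sentence["mirnas"])
        n_disease = len(sentence["diseases"])
        if n_mirna == 0 or n_disease == 0:
            continue                      # no candidate pair: filtered out
        if n_mirna > 1 or n_disease > 1:
            to_curate.append(sentence)    # ambiguous pairing: manual curation
        else:
            accepted.append(sentence)     # exactly one miRNA and one disease
    return accepted, to_curate


# Hypothetical annotated sentences (entity mentions only, text omitted).
sentences = [
    {"mirnas": ["miR-21"], "diseases": ["glioma"]},
    {"mirnas": ["miR-21", "miR-155"], "diseases": ["lymphoma"]},
    {"mirnas": [], "diseases": ["asthma"]},
]
accepted, to_curate = triage(sentences)
print(len(accepted), len(to_curate))
```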

Table 1.

Overview of the training and test datasets, including the number of sentences, mentions, and relations in each dataset

| NER class | Dataset name | Training sentences (%) | Training mentions (%) | Test sentences (%) | Test mentions (%) |
| --- | --- | --- | --- | --- | --- |
| Disease | NCBI Disease [27] | 6224 (87) | 5920 (86) | 907 (13) | 960 (14) |
| Disease | BC5CDR Disease [28] | 9278 (65) | 8427 (66) | 4950 (35) | 4424 (34) |
| miRNA | miRNA corpus [8] | 1864 (70) | 528 (58) | 780 (30) | 375 (42) |
| miRNA | miRTex corpus [9] | 2063 (57) | 1540 (56) | 1556 (43) | 1217 (44) |

| RE class | Dataset name | Training positive (%) | Training negative (%) | Test positive (%) | Test negative (%) |
| --- | --- | --- | --- | --- | --- |
| gene–disease | GAD corpus [11] | 2520 (90) | 2276 (90) | 281 (10) | 253 (10) |
| gene–disease | EU-ADR [53] | 235 (90) | 83 (90) | 27 (10) | 10 (10) |
| miRNA–disease | SCAI-MDC (ours) | 1468 (76) | 1032 (78) | 460 (24) | 290 (22) |

The numbers in brackets represent the proportions in the training and test sets. The proportions of all external datasets are kept as defined in the original studies. In the case of relation datasets, the number of sentences is identical to the number of relations.

Training and application of the miRNA–disease detection pipeline

General workflow

The miRNA–disease association detection workflow consists of two pipelines, which are illustrated in Fig. 1. The training and evaluation pipeline is used to train models that detect miRNA and disease entities and the associations between them. The inferencing pipeline is used to apply the trained models to detect miRNA–disease associations in large text collections.

Figure 1.

Training, evaluation, and inferencing pipelines for extraction of miRNA and disease entities (NER) and their associations (RE).

In the training and evaluation pipeline (Fig. 1), the first step consists of reading and preprocessing the NER and RE corpora. In the next step, we split the whole corpus into training, validation, and test sets. For NER, we performed tokenization of sentences and prepared the entities and resulting tokens for IOB-tagging. For RE, we also tokenized the sentences and masked the miRNA and disease entities with predefined tokens for further processing. As each model has its own specific wordpiece tokenization scheme, we utilized the model-specific tokenizer that converts the instances into fixed-size vectors. In the next stage, these instances are used to fine-tune and optimize the pretrained models for both NER and RE tasks. Model evaluation and selection then reveal the best models that can be used for prediction. The inferencing pipeline (Fig. 1) is designed to predict associations from text. First, documents from databases [PubMed and PubMed Central (PMC)] are prepared for inferencing. Subsequently, the models for NER and RE are applied to detect miRNA entities, disease entities, and their associations. In a normalization step, the miRNA and disease entities are normalized to the specific database concepts, namely to miRBase and MeSH identifiers.
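The entity masking applied to RE instances can be sketched as follows. The placeholder strings `@MIRNA$` and `@DISEASE$`, the function name, and the example sentence are assumptions chosen for illustration, since the predefined tokens are not named here.

```python
def mask_entities(text, mirna_span, disease_span,
                  mirna_token="@MIRNA$", disease_token="@DISEASE$"):
    """Replace one miRNA and one disease mention (character spans) by placeholders."""
    replacements = sorted(
        [(mirna_span, mirna_token), (disease_span, disease_token)],
        key=lambda item: item[0][0],
        reverse=True,  # replace right-to-left so earlier offsets stay valid
    )
    for (start, end), token in replacements:
        text = text[:start] + token + text[end:]
    return text


# Hypothetical example sentence with one miRNA and one disease mention.
sentence = "Elevated miR-133b distinguished PD from controls."
m_start = sentence.index("miR-133b")
d_start = sentence.index("PD")
masked = mask_entities(
    sentence,
    (m_start, m_start + len("miR-133b")),
    (d_start, d_start + len("PD")),
)
print(masked)
```

Masking both mentions lets the classifier focus on the sentence context rather than on the surface form of a particular miRNA or disease name.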

Fine-tuning of BERT-based models

We used the BioBERT [16] and BioMegatron [17] models for our experiments. Both are based on the BERT model published by Google [13], which is trained in a self-supervised manner on large amounts of text obtained from BooksCorpus and English Wikipedia. BioBERT and BioMegatron used the pretrained BERT model and its wordpiece tokenizer. Both were trained further using both PubMed and PMC articles to obtain a domain-specific model for biomedicine. BERT, BioBERT, and BioMegatron are so-called general purpose language models that can be used for various text mining tasks such as NER, RE, document classification, or question answering. To use them for these tasks, they need to be further fine-tuned in a supervised manner on datasets that are specific to the underlying tasks.

For RE, we experimented with two different training modes, namely single-task mode (STM) and multi-task mode (MTM). In the STM, the models were fine-tuned on a single dataset, whereas in MTM, related datasets were used for fine-tuning the various classification heads of the BioBERT model. In MTM, we apply the paradigm of multi-task learning, where a single model is trained to accomplish multiple closely related tasks simultaneously by using a shared representation [35]. Previous studies have shown that multi-task learning can be beneficial as it improves generalization by focusing on the commonalities of the tasks and learning relevant features contained in the training data of different tasks [35]. The architecture of the final model that is used for fine-tuning BioBERT and BioMegatron is depicted in Fig. 2. We also experimented with different variants of the classification head (such as multiple linear layers or a bottleneck architecture). However, the experiments revealed that a simple linear layer works best in all cases. Therefore, our final model contains a single linear layer on top of the pretrained BioBERT and BioMegatron models.
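The difference between STM and MTM can be sketched structurally as one shared encoder with either one or several single-layer linear heads. This is a toy illustration under assumptions: the class name, the stand-in encoder, the task names, and the weights are hypothetical, and no training is performed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class SharedEncoderClassifier:
    """A shared encoder with one linear classification head per task."""

    def __init__(self, encoder, heads):
        self.encoder = encoder  # shared by all tasks (stand-in for BioBERT/BioMegatron)
        self.heads = heads      # task name -> (weights, bias) of a single linear layer

    def predict_proba(self, text, task):
        features = self.encoder(text)
        weights, bias = self.heads[task]
        logit = sum(w * f for w, f in zip(weights, features)) + bias
        return sigmoid(logit)


def toy_encoder(text):
    # Hypothetical stand-in for the pooled output of a pretrained language model.
    return [len(text) / 100.0, float(text.count("miR"))]


# STM: a single head. MTM: one additional head per auxiliary task or corpus.
stm = SharedEncoderClassifier(toy_encoder, {"mirna_disease": ([0.5, 1.0], -0.2)})
mtm = SharedEncoderClassifier(toy_encoder, {
    "mirna_disease": ([0.5, 1.0], -0.2),
    "gene_disease": ([1.5, -0.3], 0.1),  # auxiliary task head
})
text = "miR-21 is upregulated in glioma"
p = mtm.predict_proba(text, "mirna_disease")
print(round(p, 3))
```

In a real multi-task setup, the shared encoder is updated by the losses of all heads during fine-tuning, which is where the shared representation comes from.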

Figure 2.

A general architecture of the model for task-specific fine-tuning of domain-specific language models (such as BioBERT and BioMegatron). The STM contains just one head. MTM contains additional heads for each auxiliary task or corpus.

Linking of miRNA and disease entities

We implemented a rule-based system to link miRNA entities to miRBase identifiers. miRBase [36] is a database that includes published miRNA sequences and annotations, and furthermore, it provides a registry with unique names for miRNAs. To link the recognized disease entities to MeSH identifiers, we used the software NormCo [37].

Evaluation of NER and RE models

For NER, we used precision, recall, and F1-measure to determine the performance of the models. For RE, which is defined as a binary classification, we used the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPR) to evaluate performance. We also provide a confusion matrix report for tasks where it is appropriate, which includes true positive, false positive, false negative, and true negative cases.
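These measures can be computed as in the following minimal sketch. The function names and toy inputs are illustrative, and the AUROC is computed here through its rank interpretation (the probability that a randomly chosen positive instance is scored above a randomly chosen negative one), which is one of several equivalent formulations.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy inputs for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
area = auroc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.6, 0.1])
print(precision, recall, f1, area)
```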

In an initial stage, we prepared training and test splits for each dataset. It is important to note that the proportions of the splits are kept as defined in their original studies. Furthermore, we applied five-fold cross-validation (n = 5) to choose the best models. For each iteration, we created a stratified split of the training dataset into training (n − 1 folds) and validation (1 fold) datasets. We then trained on the training dataset and evaluated (and optimized the hyperparameters) on the validation dataset for n iterations. The results of the n evaluations are aggregated and the standard deviation is reported. The final evaluation of the best models was performed on the withheld independent test set.
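A stratified n-fold split of this kind can be sketched in plain Python. The function names, the seed, and the toy labels are assumptions for illustration; established libraries offer equivalent functionality.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Return n_folds lists of indices with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def cross_validation_splits(labels, n_folds=5, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        validation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation


# Toy labels: 10 positive and 5 negative instances.
labels = [1] * 10 + [0] * 5
splits = list(cross_validation_splits(labels))
for training, validation in splits:
    print(len(training), len(validation))
```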

Hyperparameter optimization

We performed a Bayesian hyperparameter optimization [38] using the Optuna [39] framework for all our models with the appropriate training data. We assessed the intermediate and final performances of each experimental trial using the F1-measure (NER) and AUROC (RE). The results were captured in an SQL database for later analyses, such as identifying the best experimental trials. The captured trial data were also used by the Optuna pruner to identify and halt unpromising trials already at an early stage.

Comparison of predicted associations with DisGeNET

In a subsequent analysis, we compare our predicted associations with data from DisGeNET, where we focus on three different diseases, namely epilepsy, AD, and Parkinson’s disease (PD). To compare the associations, it was necessary to retrieve MeSH and UMLS concept identifiers for the disease terms, as our workflow normalizes to MeSH whereas DisGeNET uses UMLS identifiers. To retrieve the MeSH and UMLS classes for these diseases, we first gathered all subclasses of the disease from the MONDO ontology [40] and then retrieved their associated MeSH and UMLS identifiers. Both tasks were performed using the OLS4 API (https://www.ebi.ac.uk/ols4). After gathering the associations, we filtered them using the disease identifiers.

Results

Detection of miRNA and disease entities

To detect miRNA and disease mentions, we used the pretrained BioBERT and BioMegatron models and fine-tuned them on various datasets. To identify the best possible model variant based on the training data, we employed a 5-fold cross-validation during training. Based on the performance assessed through cross-validation, we used the optimized hyperparameters to train the final model on the whole training dataset. The generalization performance of the final models was assessed on the held-out test set. Table 2 presents the classification scores for each dataset in the specific test set.

Table 2.

Evaluation results of NER task models trained and tested on various datasets.

| Entity class | Dataset | BioBERT Prec. | BioBERT Recall | BioBERT F1 | BioMegatron Prec. | BioMegatron Recall | BioMegatron F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Disease | NCBI Disease | 84.62 | 90.09 | 87.27 | 88.22 | 91.25 | 89.71 |
| Disease | BC5CDR | 82.07 | 85.39 | 83.70 | 85.49 | 87.75 | 86.60 |
| Disease | NCBI Disease + BC5CDR | – | – | – | 86.26 | 87.83 | 87.04 |
| miRNA | miRNA | 91.32 | 98.13 | 94.60 | 91.75 | 97.87 | 94.71 |
| miRNA | miRTex | 93.93 | 95.79 | 94.85 | 96.59 | 97.62 | 97.10 |
| miRNA | miRNA + miRTex | – | – | – | 94.51 | 96.23 | 95.36 |

The confusion matrix of the BioMegatron model is included in Supplementary Table S7. – indicates data are not available. Bold entries represent the best results.

For the NCBI dataset, we achieved the highest performance with an F1-score of 89.71%, precision of 88.22%, and recall of 91.25%. For the BC5CDR dataset, the best F1-score was 86.60% with precision of 85.49% and recall of 87.75%. We also trained a model with the combination of both datasets, where a micro F1-score of 87.04% was reached on the combined test set. Overall, BioMegatron performed better than BioBERT, which is probably due to the large parameter size of the BioMegatron model.

In the case of miRNA entity detection, the best F1-measure for the miRNA dataset was 94.70%, with a precision of 91.75% and a recall of 97.86%, and the best performance for the miRTex dataset was achieved with an F1-score of 97.10% and a precision of 96.59%. The training on the combined dataset reached a micro F1-score of 95.36%. Similar to the disease category, the BioMegatron model performed better than BioBERT overall, while the BioBERT model delivered the best recall of 98.13% on the miRNA dataset. The confusion matrices of the best NER models are provided in Supplementary Table S7. The optimized hyperparameters of the best NER models are included in Supplementary Tables S1–S4 and S6.

We also experimented with MTM; however, the results were not significantly better in comparison to the STM. Although the NER datasets and tasks share certain similarities, the significant differences in the annotation guidelines and the varying complexity of the mentions likely reduced the effectiveness of the multi-task approach. The shared representations in the model might have led to negative transfer, resulting in a drop in model performance. Similar observations have also been made by Crichton et al. [41]. Hence, for further analysis we focused only on STM.

It is important to note that the BC5CDR corpus is a sub-corpus of the CTD-Pfizer corpus [42]. The creators of the corpus aimed to investigate the potential involvement of pharmaceutical drugs in cardiovascular, neurological, renal, and hepatic toxicity. Therefore, the BC5CDR corpus is focused on drugs and their role in toxicity [42]. In contrast, the NCBI Disease corpus is intended to represent the entire PubMed. In their analysis of both corpora, Kühnel and Fluck [43] revealed that the BC5CDR corpus contains more complex contexts, including disease abbreviations as well as mentions of several gene names, such as BRCA1, that resemble abbreviations in structure. This could explain why the model performances for the NCBI Disease dataset are slightly better.

Detection of miRNA–disease associations

We trained an association detection model using the BioMegatron model on our own training dataset (80% of the whole dataset). As BioMegatron, in comparison to BioBERT, delivered the best results in almost all cases, we focused only on the BioMegatron model in further experiments. The model selection was based on five-fold cross-validation. After choosing the best hyperparameters, we evaluated the final model performance using measures such as AUROC and AUPR on an independent test set (20% of the whole dataset). Table 3 shows the evaluation results. We reached an AUROC of 97.58% and an AUPR of 97.55% with the STM. Even higher scores are reached with the MTM, amounting to 98.02% and 98.66% for AUROC and AUPR, respectively. The receiver operating characteristic and precision–recall curves of the best model are depicted in Supplementary Figure S1. The optimized hyperparameters of the best RE model are included in Supplementary Tables S5 and S6.

Table 3.

Evaluation results of RE task on test dataset for STM and MTM training modes based on BioMegatron model.

| Dataset | Mode | AUROC (in %) | AUPR (in %) |
| --- | --- | --- | --- |
| miRNA–disease | STM | 97.58 | 97.55 |
| miRNA–disease | MTM | 98.02 | 98.66 |

Bold entries represent the best results.

Prediction of miRNA–disease associations from PubMed

We applied our miRNA–disease association extraction workflow to around 6.1 million PubMed abstracts and 1.98 million PMC full-text documents published between 2020 and 2023. Overall, the workflow predicted 727 009 (unique: 75 887) normalized positive associations found in 69 816 PMC and PubMed documents. These associations include 2730 disease and 2427 miRNA concepts. Of these, 374 029 positive associations (unique: 52 624; found in 59 187 PMC documents) have a high confidence score above 90% (obtained through a sigmoid function). Figure 3 provides an overview of the total predicted miRNA–disease associations for PubMed abstracts, PMC full-text documents, and both corpora combined.
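The confidence filtering can be sketched as follows: the raw model output (logit) is mapped to a score between 0 and 1 by a sigmoid function, and only associations above the threshold are kept. The tuple layout, the identifiers, and the logit values are hypothetical placeholders.

```python
import math


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def high_confidence(predictions, threshold=0.9):
    """Keep predictions whose sigmoid score exceeds the threshold.

    predictions: iterable of (mirna_id, disease_id, document_id, logit).
    Returns the kept associations and the set of unique (miRNA, disease) pairs.
    """
    kept = [
        (mirna, disease, document)
        for mirna, disease, document, logit in predictions
        if sigmoid(logit) > threshold
    ]
    unique_pairs = {(mirna, disease) for mirna, disease, _ in kept}
    return kept, unique_pairs


# Hypothetical predictions; the same pair may be found in several documents.
predictions = [
    ("MIRNA_A", "DISEASE_X", "doc1", 3.0),  # score ~0.95, kept
    ("MIRNA_A", "DISEASE_X", "doc2", 4.0),  # score ~0.98, kept
    ("MIRNA_B", "DISEASE_X", "doc3", 2.0),  # score ~0.88, dropped
]
kept, unique_pairs = high_confidence(predictions)
print(len(kept), len(unique_pairs))
```

The distinction between total and unique counts reported above corresponds to `kept` and `unique_pairs` in this sketch.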

Figure 3.

An overview of the total predicted unique associations between miRNA and diseases in comparison to the DisGeNET database. The three subfigures represent the results extracted between 2020 and 2023 from PubMed abstracts, PMC full-text documents, and both combined. Furthermore, it provides an overview over the miRNA–disease associations of three diseases (epilepsy, AD, and PD).

In a subsequent analysis, we filtered for associations of three different diseases, namely epilepsy, AD, and PD (see Fig. 3). For epilepsy, AD, and PD, the workflow detected 2226 (unique: 211), 6306 (unique: 438), and 3159 (unique: 287) miRNA–disease associations, respectively. In a first step, we compared the extracted miRNA associations with those in the existing database DisGeNET, which contains curated miRNA–disease associations extracted from different resources before 2020. In all cases, we could significantly increase the number of miRNA–disease associations and found a high number of new relations not contained in DisGeNET (see Fig. 3). Since we focus on new findings, only a small number of the relations overlap with the relations contained in DisGeNET, while others are only available in DisGeNET.
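Such a comparison amounts to set operations on normalized (miRNA, disease) pairs, assuming both sources have been mapped to a common identifier scheme beforehand. The function name and the identifiers below are hypothetical placeholders.

```python
def compare_with_reference(predicted, reference):
    """Partition associations into new, overlapping, and reference-only pairs."""
    predicted, reference = set(predicted), set(reference)
    return {
        "new": predicted - reference,             # found only by the workflow
        "overlap": predicted & reference,         # found by both
        "reference_only": reference - predicted,  # contained only in the database
    }


# Hypothetical normalized (miRNA, disease) pairs.
predicted = [("MIRNA_A", "DISEASE_X"), ("MIRNA_B", "DISEASE_X"), ("MIRNA_A", "DISEASE_X")]
reference = [("MIRNA_B", "DISEASE_X"), ("MIRNA_C", "DISEASE_X")]
result = compare_with_reference(predicted, reference)
print({name: len(pairs) for name, pairs in result.items()})
```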

We also performed an analysis of the missed DisGeNET associations for the year 2020 for the three diseases. In total, DisGeNET contains four unique miRNA–disease associations for epilepsy, eight for AD, and eight for PD from publications published in the year 2020. Only two associations (one for AD and one for PD) were missed by our workflow. In these cases, the workflow predicted wrong association labels (no association). All the other missed associations were from publications published before 2020, which we have not included in our workflow. To expand this analysis, we randomly selected and analyzed additional unique associations that were missed by our pipeline. In some cases, the association was detected, but with a lower score (<0.9). In other cases, the disease and miRNA normalizers were not able to properly normalize the mentions. We provide some examples of these cases in the Supplementary Section ‘Examples of Workflow Issues’.

Evaluation of newly detected miRNA–disease associations

For all three diseases, AD, PD, and epilepsy, we randomly chose associations from the predicted results of the PubMed corpora that had a high score (>0.9). Examples of extracted associations with their corresponding sentences are shown in Table 4. For AD, our workflow detected three miRNA–disease associations from a study by Kumar et al. [5]. In this study, by analyzing postmortem brains of AD and control samples using a miRNA microarray platform, the authors addressed the question of whether synaptosomal miRNAs affect AD synapse activity. They found that three specific miRNAs are potentially associated with AD Braak stages. In the case of PD, our workflow detected two miRNA–disease associations from the study published by Chen et al. [44]. The authors investigated blood circulating miRNAs that are proposed to be promising biomarkers for neurodegenerative diseases such as PD. They analyzed the plasma of PD patients, multiple system atrophy patients, and healthy controls. For epilepsy, our workflow detected two associations from the study [45], where the authors studied the role of let-7b miRNAs in temporal lobe epilepsy (TLE). They found a novel noncoding RNA-mediated mechanism involving the miRNA let-7b and H19 [a long noncoding RNA (lncRNA)] in seizure-induced glial cell activation.

Table 4.

Examples of predicted miRNA–disease associations for AD, PD, and epilepsy with their corresponding sentences.

| Disease | miRNA | Sentence | PMID |
| --- | --- | --- | --- |
| AD | hsa-miR-501-3p (MIRBASE:MIMAT0004774) | The miR-501-3p, miR-502-3p, and miR-877-5p were identified as potential synaptosomal miRNAs upregulated with disease progression based on AD Braak stages. | 36454178 |
| AD | hsa-miR-502-3p (MIRBASE:MIMAT0004775) | The miR-501-3p, miR-502-3p, and miR-877-5p were identified as potential synaptosomal miRNAs upregulated with disease progression based on AD Braak stages. | 36454178 |
| AD | hsa-miR-877-3p (MIRBASE:MIMAT0004949) | The miR-501-3p, miR-502-3p, and miR-877-5p were identified as potential synaptosomal miRNAs upregulated with disease progression based on AD Braak stages. | 36454178 |
| PD | hsa-miR-133b (MIRBASE:MIMAT0000770) | Elevated miR-133b and miR-221-3p distinguished PD from controls with 84.8% sensitivity and 88.9% specificity. | 34315950 |
| PD | hsa-miR-221-3p (MIRBASE:MIMAT0000278) | Elevated miR-133b and miR-221-3p distinguished PD from controls with 84.8% sensitivity and 88.9% specificity. | 34315950 |
| Epilepsy | hsa-let-7b-5p (MIRBASE:MIMAT0000063) | Overexpression of let-7b inhibited hippocampal glial cell activation, inflammatory response and epileptic seizures by targeting Stat3. | 32648622 |
| Epilepsy | hsa-let-7b-5p (MIRBASE:MIMAT0000063) | LncRNA H19 could competitively bind to let-7b to promote hippocampal glial cell activation and epileptic seizures by targeting Stat3 in a rat model of TLE. | 32648622 |

The normalized miRNA names in the second column correspond to the bold miRNA names in the Sentence column.

For a systematic analysis of the newly found associations, we analyzed the precision and recall for the PD–miRNA associations. To check the overall precision of the newly predicted associations, 30 sentences were examined. Only two of these extractions were incorrect. In the sentence ‘PD was associated with postoperative expression of GFAP; ePOCD was associated with postoperative expression of microRNA-21-5p and GFAP as well as intraoperative expression of NSP’ (PMID: 34300256) [46], the abbreviation ‘PD’ means postoperative delirium, and hence the association with the disease PD is incorrect. The second error occurred in an extraction from the text fragment ‘[…] Mitochondrial complex I deficiency and functional abnormalities are implicated in the development of PD. MicroRNA-29a […]’ (PMID: 36174668) [47]. This fragment spans more than one sentence and could not be verified as the correct source for the extracted association, although the relation was mentioned later in the abstract. In summary, 28 of the 30 examined extractions were correct, corresponding to a precision of over 93%.

To analyze the recall, we used a systematic review of PD–miRNA associations published by Guévremont et al. [48]. All referenced associations from publications from 2020 to 2023 were compared with our automatically extracted set. Out of 23 associations, 15 were extracted from the same abstract that the review references, whereas 8 were not found in the abstracts. These eight associations were reviewed further. Two associations (from Wu et al. [49]) could not appear in our result set because the corresponding journal, ‘Acta Medica Mediterranea’, is not indexed in Medline. Furthermore, in the publication by Ravanidis et al. [50], two associations were not mentioned in the abstract, but only in the full text. In the publication by Cressatti et al. [51], miR-153 and miR-223 were mentioned in the abstract and these associations were correctly recognized, but the review lists them as miR-153-3p and miR-223-5p. Only two associations were not found by the automated extraction system although they were mentioned in the abstract; they were missed because the corresponding miRNAs were not recognized. In summary, 19 associations were described in Medline abstracts, of which our system recognized 17. This corresponds to a recall of 89%.
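The precision and recall figures follow directly from the counts reported in the two preceding paragraphs. A minimal sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# Counts taken from the evaluation described in the text.
examined = 30        # manually examined extractions
incorrect = 2        # extractions judged incorrect
precision = (examined - incorrect) / examined

review_total = 23    # associations listed in the review for 2020 to 2023
not_in_medline = 2   # journal not indexed in Medline
full_text_only = 2   # mentioned only in the full text, not the abstract
in_abstracts = review_total - not_in_medline - full_text_only
missed = 2           # miRNA mention not recognized by the system
recognized = in_abstracts - missed
recall = recognized / in_abstracts

print(f"precision = {precision:.1%}")
print(f"recall    = {recall:.1%}")
```

The two associations from Cressatti et al. count as recognized here, because the system found them under the names used in the abstract.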

This evaluation shows that the automated extraction system performs well even after the sequential execution of automated NER, entity linking, and association recognition, each of which has its own error rate that accumulates in the overall result. The system is therefore well suited to support systematic reviews such as the one published for PD by Guévremont et al. [48].
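To illustrate why errors accumulate across sequential stages, the following sketch multiplies per-stage recall values under the simplifying assumption that the stages fail independently. The stage values are hypothetical placeholders, not measurements from this work, and the function name is ours.

```python
from math import prod

# Hypothetical per-stage recall values (illustrative only, not measured here).
stage_recall = {
    "named entity recognition": 0.95,
    "entity linking": 0.95,
    "association detection": 0.95,
}

def end_to_end_recall(stages):
    """Recall of a sequential pipeline, assuming independent stage errors."""
    return prod(stages.values())

overall = end_to_end_recall(stage_recall)
print(f"end-to-end recall = {overall:.3f}")
```

Under this assumption, three stages that each retain 95% of the true items would retain roughly 86% end to end, which gives a sense of scale for the recall measured on the PD associations.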

Discussion

In this work, we presented a workflow for automatically extracting miRNA–disease associations from the vast body of unstructured literature. The workflow is based on a large language model fine-tuned on a new corpus generated using a distant supervision technique. Because large language models (e.g. BioMegatron) are pretrained on huge corpora and include a self-attention mechanism, the model can exploit semantic and syntactic aspects of sentences and incorporate local contextual features of the included entities to extract relations with high accuracy. We used the workflow to extract miRNA–disease associations from Medline abstracts and analyzed the extracted set for AD, PD, and epilepsy. Compared with the curated database DisGeNet, which provides miRNA–disease associations up to 2020, we extracted a large number of new associations from Medline abstracts for the years 2020–23. An independent evaluation of the newly extracted PD–miRNA associations showed that this extraction workflow achieves high precision and high recall.

A current limitation of the corpus, and thus of the extraction workflow, is that the associations are encoded and recognized at sentence level. As authors may describe miRNA–disease relations across sentence boundaries in their publications, the workflow may miss these relations. Nevertheless, the evaluation showed that, at least for PD, the PD–miRNA relations are usually expressed within a single sentence in abstracts. We missed associations because of false-negative disease and miRNA recognition or because the relations were only expressed in full-text tables. Strategies such as active learning might significantly reduce the curation effort needed to extend the corpus for training a model that can extract relations from tables in full-text documents.
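The sentence-level limitation can be made concrete with a small sketch. It pairs miRNA and disease mentions only when they co-occur in one sentence. The regular expressions and the function name are simplistic stand-ins invented for this illustration (the actual workflow uses trained NER models), and the second example text is made up.

```python
import re

# Naive patterns for illustration only; not the NER used in the workflow.
MIRNA = re.compile(r"\b(?:miR|microRNA|let)-\d+[a-z]?(?:-[35]p)?\b", re.IGNORECASE)
DISEASE = re.compile(r"\b(?:PD|AD|epilepsy)\b")

def sentence_level_pairs(text):
    """Return (miRNA, disease) pairs that co-occur within one sentence."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        mirnas = MIRNA.findall(sentence)
        diseases = DISEASE.findall(sentence)
        pairs.extend((m, d) for m in mirnas for d in diseases)
    return pairs

# Shortened version of a sentence quoted in Table 4.
same_sentence = "Elevated miR-133b and miR-221-3p distinguished PD from controls."
# Made-up text in which the two mentions fall into different sentences.
across_sentences = "Complex I deficiency is implicated in PD. MicroRNA-29a was reduced."

print(sentence_level_pairs(same_sentence))
print(sentence_level_pairs(across_sentences))
```

The first call yields two candidate pairs, whereas the second yields none, even though a reader would connect the miRNA to the disease across the sentence boundary.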

Although the large language models that are specifically designed for the biomedical domain produced very good results in our work, incorporating the extensive prior knowledge on miRNAs and diseases directly into large language models might improve the results even further. Studies have shown that knowledge fusion can overcome the limitations of individual sources by drawing on diverse knowledge. Information such as miRNA sequences and disease embeddings obtained from ontologies (such as Disease Ontology and MONDO) can be merged with the embeddings from large language models. Embeddings obtained by training graph neural networks on sources such as DisGeNet can also be employed to improve the models. In addition, it might be interesting to combine the literature-based models with new prediction models that learn feature embeddings for miRNAs and diseases through graph machine learning [52].
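One simple way such a fusion could look is plain concatenation of feature vectors before a classification layer. The sketch below is an assumption for illustration, not a method evaluated in this work: the function names, the toy vectors, the toy RNA sequence, and the choice of 2-mer counts as a sequence feature are all hypothetical.

```python
from itertools import product

def kmer_counts(sequence, k=2, alphabet="ACGU"):
    """Count overlapping k-mers of an RNA sequence in a fixed order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return [float(counts[kmer]) for kmer in kmers]

def fuse(text_embedding, disease_embedding, sequence):
    """Concatenate the three feature vectors into one input vector."""
    return list(text_embedding) + list(disease_embedding) + kmer_counts(sequence)

# Toy inputs, all hypothetical placeholders.
text_vec = [0.1, -0.3, 0.7]   # stands in for a language model sentence embedding
disease_vec = [0.5, 0.2]      # stands in for an ontology-derived disease embedding
fused = fuse(text_vec, disease_vec, "UGAGGUAG")
print(len(fused))
```

Concatenation keeps each knowledge source separable in the input, so a downstream classifier can learn how much weight to give to each; other fusion schemes, such as attention over the sources, are equally conceivable.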

In summary, by automatically generating a training corpus using distant supervision and training a model based on a state-of-the-art large language model, we have demonstrated the promising performance of our trained workflow. Our evaluation results based on PD–miRNA associations strongly suggest that our workflow can provide useful support for extracting miRNA–disease relations.

Conclusion

In this work, we proposed a well-performing large language model approach for the identification of miRNA–disease relations from biomedical literature. The approach consists of modules that detect miRNA and disease mentions and identify their relationship. To extend the miRNA–disease training corpora, we applied distant supervision using multiple publicly available databases. In our experiments with multiple state-of-the-art large language models, BioMegatron performed best for the extraction of miRNA–disease associations. When the whole workflow was applied to biomedical literature published between 2020 and 2023, a large number of new associations could be identified with high precision and recall.

The creation and use of dedicated databases that can contain many types of relations is considered best practice in biomedical research, and up-to-date information is in high demand. However, keeping these databases up to date with current scientific advancements is a major challenge. The solution is often to establish collaborations with researchers and institutions to provide regular updates, but this requires a huge amount of human effort. This creates a demand for automated data mining techniques that extract relevant information from scientific literature so that the databases can be updated accordingly. With the three case studies on neurodegenerative diseases such as AD, for which we identified and discussed novel relations that are still missing in databases such as DisGeNet, we demonstrated the applicability and feasibility of our workflow for retrieving novel, hidden relations from literature.

Automated techniques for information extraction need to be regularly revised to keep up with the pace of development in NLP. Recent large language models such as ChatGPT and BARD, some of which are unfortunately not yet available for scientific experimentation, open up new avenues for solving challenges. Future studies are required to find out exactly how these models can be used not only to extract a single type of relation but also to solve many complex bioNLP challenges at once.

Acknowledgements

We thank André Gemünd for their support regarding the computational infrastructure of SCAI. We thank Jürgen Klein for their support in preprocessing the PubMed Central documents.

Author contributions

S.M. and J.F. were involved in conceptualization; S.M. was involved in methodology; S.M. and L.K. were involved in data curation, formal analysis, visualization, investigation, and validation; S.M. and J.F. were involved in supervision; S.M. and L.K. were involved in writing the original draft; S.M., L.K., H.F., J.F., and M.H. were involved in writing, review, and editing.

Supplementary data

Supplementary data is available at Database online.

Conflict of interest

None declared.

Funding

This work was funded through the project Integrative Data Semantics for Neurodegenerative research (IDSN), which was supported by the German Federal Ministry of Education and Research (BMBF) as part of the program ‘i:DSem–Integrative Data Semantics in the Systems Medicine’, project number 031L0029 [A-C].

Data availability

We provide our code at https://github.com/SCAI-BIO/mirna-disease-association-detection. Our database is located at https://zenodo.org/records/10523046

References

1. Rupaimoole R, Slack FJ. MicroRNA therapeutics: towards a new era for the management of cancer and other diseases. Nat Rev Drug Discov 2017;16:203–22. doi: https://doi.org/10.1038/nrd.2016.246
2. Takamizawa J, Konishi H, Yanagisawa K et al. Reduced expression of the let-7 MicroRNAs in human lung cancers in association with shortened postoperative survival. Cancer Res 2004;64:3753–56. doi: https://doi.org/10.1158/0008-5472.CAN-04-0637
3. Lin CW, Chang YL, Chang YC et al. MicroRNA-135b promotes lung cancer metastasis by regulating multiple targets in the Hippo pathway and LZTS1. Nat Commun 2013;4:1877. doi: https://doi.org/10.1038/ncomms2876
4. Rupani H, Sanchez-Elsner T, Howarth P. MicroRNAs and respiratory diseases. Eur Respir J 2013;41:695–705. doi: https://doi.org/10.1183/09031936.00212011
5. Kumar S, Orlov E, Gowda P et al. Synaptosome microRNAs regulate synapse functions in Alzheimer’s disease. NPJ Genom Med 2022;7:47. doi: https://doi.org/10.1038/s41525-022-00319-8
6. Takousis P, Sadlon A, Schulz J et al. Differential expression of microRNAs in Alzheimer’s disease brain, blood, and cerebrospinal fluid. Alzheimers Dement 2019;15:1468–77. doi: https://doi.org/10.1016/j.jalz.2019.06.4952
7. Hébert SS, Delay C. MicroRNAs and Alzheimer’s disease mouse models: current insights and future research avenues. Int J Alzheimer’s Dis 2011;2011:894938. doi: https://doi.org/10.4061/2011/894938
8. Bagewadi S, Bobić T, Hofmann-Apitius M et al. Detecting miRNA mentions and relations in biomedical literature. F1000Res 2015;3:205. doi: https://doi.org/10.12688/f1000research.4591.3
9. Li G, Ross KE, Arighi CN et al. miRTex: a text mining system for miRNA-gene relation extraction. PLoS Comput Biol 2015;11:1–24. doi: https://doi.org/10.1371/journal.pcbi.1004391
10. Gupta S, Ross KE, Tudor CO et al. miRiaD: a text mining tool for detecting associations of microRNAs with diseases. J Biomed Semant 2016;7:9. doi: https://doi.org/10.1186/s13326-015-0044-y
11. Bravo À, Piñero J, Queralt-Rosinach N et al. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinf 2015;16:55. doi: https://doi.org/10.1186/s12859-015-0472-9
12. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res 2020;48:D845–55. doi: https://doi.org/10.1093/nar/gkz1021
13. Devlin J, Chang MW, Lee K et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 4171–86. Association for Computational Linguistics, 2019. doi: https://doi.org/10.18653/v1/N19-1423
14. Brown TB, Mann B, Ryder N et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
15. Vaswani A, Shazeer N, Parmar N et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–10. Red Hook, NY, USA: Curran Associates Inc., 2017. NIPS’17.
16. Lee J, Yoon W, Kim S et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234–40. doi: https://doi.org/10.1093/bioinformatics/btz682
17. Shin HC, Zhang Y, Bakhturina E et al. BioMegatron: larger biomedical domain language model. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4700–06. Association for Computational Linguistics, 2020. doi: https://doi.org/10.18653/v1/2020.emnlp-main.379
18. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv:1904.05342 [cs]. 2019.
19. Jiang Z, Shuang L, Huang D. A general protein-protein interaction extraction architecture based on word representation and feature selection. Int J Data Min Bioinform 2016;14:276–91. doi: https://doi.org/10.1504/IJDMB.2016.074878
20. Zhu Y, Li L, Lu H et al. Extracting drug-drug interactions from texts with BioBERT and multiple entity-aware attentions. J Biomed Informat 2020;106:103451. doi: https://doi.org/10.1016/j.jbi.2020.103451
21. Gurulingappa H, Klinger R, Hofmann-Apitius M et al. An empirical evaluation of resources for the identification of disease and adverse effects in biomedical literature. In: The seventh international conference on Language Resources and Evaluation (LREC), 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining, Valletta, Malta, May 2010, pp. 15–22, 2010. http://www.lrec-conf.org/proceedings/lrec2010/workshops/W8.pdf.
22. Li J, Zhou Y, Jiang X et al. Are synthetic clinical notes useful for real natural language processing tasks: a case study on clinical entity recognition. J Am Med Inf Assoc 2021;28:2193–201. doi: https://doi.org/10.1093/jamia/ocab112
23. Lentzen M, Madan S, Lage-Rupprecht V et al. Critical assessment of transformer-based AI models for German clinical notes. JAMIA Open 2022;5:ooac087. doi: https://doi.org/10.1093/jamiaopen/ooac087
24. Pattankar VV, Priyanga P. Review on event extraction for BioNLP with a survey. In: 2023 International Conference for Advancement in Technology (ICONAT), Goa, India, 24–26 January 2023, pp. 1–5. Goa, India: IEEE, 2023. doi: https://doi.org/10.1109/ICONAT57137.2023.10080428
25. Shang Y, Li Y, Lin H et al. Enhancing biomedical text summarization using semantic relation extraction. PLoS One 2011;6:e23862. doi: https://doi.org/10.1371/journal.pone.0023862
26. Bressem KK, Adams LC, Gaudin RA et al. Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports. Bioinformatics 2020;36:5255–61. doi: https://doi.org/10.1093/bioinformatics/btaa668
27. Doǧan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Informat 2014;47:1–10. doi: https://doi.org/10.1016/j.jbi.2013.12.006
28. Li J, Sun Y, Johnson RJ et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford) 2016;2016:baw068. doi: https://doi.org/10.1093/database/baw068
29. Ramshaw LA, Marcus MP. Text chunking using transformation-based learning. In: Armstrong S, Church K, Isabelle P, Manzi S, Tzoukermann E, Yarowsky D (eds.), Natural Language Processing Using Very Large Corpora, 1st edn. Dordrecht, Netherlands: Springer, 1999, 157–76. doi: https://doi.org/10.1007/978-94-017-2390-9_10
30. Smirnova A, Cudré-Mauroux P. Relation extraction using distant supervision: a survey. ACM Comput Surv 2018;51:106:1–106:35. doi: https://doi.org/10.1145/3241741
31. Li Y, Qiu C, Tu J et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res 2014;42:1–5. doi: https://doi.org/10.1093/nar/gkt1023
32. Huang Z, Shi J, Gao Y et al. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res 2019;47:D1013–7. doi: https://doi.org/10.1093/nar/gky1010
33. Jiang Q, Wang Y, Hao Y et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 2009;37:D98–104. doi: https://doi.org/10.1093/nar/gkn714
34. Hanisch D, Fundel K, Mevissen H et al. ProMiner: rule-based protein and gene entity recognition. BMC Bioinf 2005;6:S14. doi: https://doi.org/10.1186/1471-2105-6-S1-S14
35. Caruana R. Multitask learning. In: Thrun S, Pratt L (eds.), Learning to Learn. Boston, MA: Springer US, 1998, 95–133.
36. Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res 2019;47:D155–62. doi: https://doi.org/10.1093/nar/gky1141
37. Wright D, Katsis Y, Mehta R et al. NormCo: Deep Disease Normalization for Biomedical Knowledge Base Construction. https://openreview.net/forum?id=BJerQWcp6Q (1 June 2024, date last accessed).
38. Bergstra J, Yamins D, Cox D. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning. PMLR, Atlanta, Georgia, USA: Machine Learning Research Press, pp. 115–23. 2013.
39. Akiba T, Sano S, Yanase T et al. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, August 4–8, 2019, pp. 2623–31. New York, NY, United States: Association for Computing Machinery, 2019.
40. Vasilevsky NA, Matentzoglu NA, Toro S et al. Mondo: unifying diseases for the world, by the world. medRxiv 2022;2022.04.13.22273750.
41. Crichton G, Pyysalo S, Chiu B et al. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinf 2017;18:368. doi: https://doi.org/10.1186/s12859-017-1776-8
42. Davis AP, Wiegers TC, Roberts PM et al. A CTD-Pfizer collaboration: manual curation of 88 000 scientific articles text mined for drug-disease and drug-phenotype interactions. Database 2013;2013:bat080. doi: https://doi.org/10.1093/database/bat080
43. Kühnel L, Fluck J. We are not ready yet: limitations of state-of-the-art disease named entity recognizers. J Biomed Semant 2022;13:26. doi: https://doi.org/10.1186/s13326-022-00280-6
44. Chen Q, Deng N, Lu K et al. Elevated plasma miR-133b and miR-221-3p as biomarkers for early Parkinson’s disease. Sci Rep 2021;11:15268. doi: https://doi.org/10.1038/s41598-021-94734-z
45. Han CL, Liu YP, Guo CJ et al. The lncRNA H19 binding to let-7b promotes hippocampal glial cell activation and epileptic seizures by targeting Stat3 in a rat model of temporal lobe epilepsy. Cell Prolif 2020;53:e12856. doi: https://doi.org/10.1111/cpr.12856
46. Szwed K, Szwed M, Kozakiewicz M et al. Circulating microRNAs and novel proteins as potential biomarkers of neurological complications after heart bypass surgery. J Clin Med 2021;10:3091. doi: https://doi.org/10.3390/jcm10143091
47. Yang YL, Lin TK, Huang YH. MiR-29a inhibits MPP+-induced cell death and inflammation in Parkinson’s disease model in vitro by potential targeting of MAVS. Eur J Pharmacol 2022;934:175302. doi: https://doi.org/10.1016/j.ejphar.2022.175302
48. Guévremont D, Roy J, Cutfield NJ et al. MicroRNAs in Parkinson’s disease: a systematic review and diagnostic accuracy meta-analysis. Sci Rep 2023;13:16272. doi: https://doi.org/10.1038/s41598-023-43096-9
49. Wu L, Zhao W, Kong F et al. Serum miR-9a and miR-133b, diagnostic markers for Parkinson’s disease, are up-regulated after Levodopa treatment. Acta Med Mediterr 2020;36:1857–63. doi: https://doi.org/10.19193/0393-6384_2020_3_291
50. Ravanidis S et al. Circulating Brain-enriched MicroRNAs for detection and discrimination of idiopathic and genetic Parkinson’s disease. Mov Disord 2020;35:457–67. doi: https://doi.org/10.1002/mds.27928
51. Cressatti M, Juwara L, Galindez JM et al. Salivary microR-153 and microR-223 Levels as Potential Diagnostic Biomarkers of Idiopathic Parkinson’s Disease. Mov Disord 2020;35:468–77. doi: https://doi.org/10.1002/mds.27935
52. Peng W, Che Z, Dai W et al. Predicting miRNA-disease associations from miRNA-gene-disease heterogeneous network with multi-relational graph convolutional network model. IEEE/ACM Trans Comput Biol Bioinform 2023;20:3363–75. doi: https://doi.org/10.1109/TCBB.2022.3187739
53. van Mulligen EM, Fourrier-Reglat A, Gurwitz D et al. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J Biomed Informat 2012;45:879–84. doi: https://doi.org/10.1016/j.jbi.2012.04.004

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
