Abstract

In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly those focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and supports noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome’s potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems for tackling biomedical tasks. Finally, running the trained RE system on all PubMed abstracts and PMC Open Access full-text documents yielded >18 million relations extracted from the entire biomedical literature.

Introduction

In the rapidly evolving field of Biomedical Natural Language Processing (BioNLP) and text mining, the development of novel, highly accurate deep learning–based methodologies [1] allows researchers to discover relations between biomedical entities. Relation extraction (RE) is a critical task that enables the identification of relations among named entities (NEs) such as genes, chemicals, and diseases. This process is essential for transforming unstructured text into structured data that can be used in both biological [2] and medical [3] applications.

The effectiveness of modern RE methodologies, particularly those leveraging the capabilities of pretrained transformer models tailored for the biomedical domain [4, 5], hinges on the size, quality, and scope of manually annotated corpora used for model fine-tuning. These corpora serve as training and evaluation resources, guiding the development of methods capable of accurate information extraction. However, a majority of currently available corpora for RE are constrained by focusing on relations at the sentence level [6–9] and/or relations between two types of entities only (e.g. gene–disease) [6, 7, 10, 11]. Such constraints limit the number of relations that can be effectively extracted from the literature.

Recognizing these limitations, the BioNLP community has begun to shift its focus toward the development of more comprehensive corpora that extend beyond the sentence level to encompass document-level annotations [11–14]. Standing out among them, the recent BioRED corpus [13] also tackles the issue of constrained scope by covering eight different relation types among disease, gene, variant, and chemical entities. While there are event annotation corpora that primarily concentrate on proteins and related entities and offer many document-level event annotations [15–17], an RE corpus with the same properties is still notably absent.

In this work, we introduce RegulaTome, a corpus comprising 2521 documents with 16 961 document-level annotations, encompassing >40 types of relations—aligning with Gene Ontology (GO) [18, 19]—between 54 951 entities belonging to four different entity types. This corpus is specifically designed to illuminate the complex web of interactions between proteins and protein-containing entities, providing an invaluable resource for advancing the state of RE in the biomedical field. Using this corpus, we have trained a transformer-based model with commendable results on RE (F1-score = 66.6%) for such a difficult task. To achieve this, we have developed an RE system capable of multi-label extraction of these directed, typed, and signed relations from the entire biomedical literature. This work fills a critical gap in biomedical RE, offering a corpus and a system that allows the investigation of the complex interplay between proteins, protein-containing entities, and chemicals, which is foundational to understanding biological processes and disease mechanisms.

Materials and methods

The RegulaTome corpus

Targeted relation types

As mentioned earlier, the aim of this work was to allow the extraction of directed, typed, and signed relations for proteins, chemicals, protein-containing complexes, and protein families from the literature. As many relation types between biomedical entities can fulfill these criteria, in this section we provide a list of the relation types that we have decided to annotate. We have mapped and structured the relation type space onto the “Biological Process” sub-ontology of GO [18, 19], a community-standard framework. The full list of targeted relation types, the GO term corresponding to each of them, and their direct parent in our sub-ontology of relations are given in Supplementary Section 1. Figure 1a shows an overview of the relationship tree, while Fig. 1b shows the relation representations within RegulaTome (for more details on the latter, please refer to the section “NE and relation annotation”).

Figure 1.

(a) Targeted relation types in RegulaTome and their relationships to each other. There are 43 relation types annotated in RegulaTome, mapped to the biological process sub-ontology of GO. Since GO lacks terms to collectively catalog all catalysis of small molecule conjugation or removal processes, we have decided to group catalysis of phosphoryl group conjugation or removal relations (i.e. catalysis of phosphorylation and catalysis of dephosphorylation) separately from the other catalysis of small molecule conjugation/removal relations, both because of their biological significance and, most importantly, because we observed that these are discussed differently in the biomedical literature. (b) Illustration of relation representations in RegulaTome. Multiple relations between a Protein (“Sir2”) and another Protein (“Foxo1”) participant are shown in the first sentence: an undirected Complex formation relation; a directed catalysis of deacetylation relation, denoted with a left-to-right arrow, originating from “Sir2” with “Foxo1” as the target; and a directed relation that also has “Foxo1” as the target, this time originating from another Protein participant (“cAMP-response element-binding protein-binding protein”) and in the opposite direction, denoted by a right-to-left arrow. Relationships can arise between all annotated entity types, e.g. in the second sentence two directed relations (regulation of gene expression and positive regulation) originate from a Family (“protein kinase C”) participant and target a Protein participant and a Complex (“nuclear factor-kappa B”).

Document selection for corpus annotation

The document selection process for the RegulaTome corpus consists of four steps:

  1. ComplexTome corpus [20]: this corpus consists mostly of documents focusing on physical protein interactions. It includes 137 abstracts with complex formation events from the BioNLP ST 2009 datasets [15] and 450 abstracts and 400 paragraphs from full-text articles used as evidence to support interactions in the BioGRID [21], IntAct [22], and MINT [23] interaction databases. It also contains 300 abstracts used for pathway annotation in the Reactome pathway knowledgebase [24], where regulatory relations are expected to be found in high prevalence. More details on the document selection of this corpus can be found in Mehryary et al. [20]. We reannotated all documents of ComplexTome to include all 43 relation types listed in Supplementary Section 1.

  2. Posttranslational modification event extraction corpus: out of the 388 abstracts in this corpus, originally annotated for the BioNLP 2010 workshop [16], we selected the 234 that contain at least one post-translational modification (PTM) event. We ignored the existing event annotations and completely reannotated the documents with relevant relations.

  3. PTM triage set: a pool of 5548 publications from the Reactome database was generated by selecting those used by the database curators to annotate pathways with at least one modification enzyme participant [24]. We then screened the abstracts of these publications and selected 500 that contained at least one catalysis of PTM relation of interest. The selection was done incrementally, in sets of 100 documents at a time, focusing each round on PTM relations with a lower number of total annotations in order to increase support for more relation types. Supplementary Section 2 provides more details on the selection process.

  4. Reactome full-text excerpts set: paragraphs from full-text articles used as evidence for pathway annotation in Reactome were selected if (i) they contained between 50 and 500 words (thus excluding documents with only titles or excessively lengthy paragraphs), (ii) the number of uniquely tagged NEs within the paragraph, disregarding case sensitivity, exceeded three entities, and (iii) at least 30% of the entity mentions in each selected paragraph were forms not previously encountered in the documents of the corpus, to increase diversity (these criteria are sketched in the example after this list). If a paragraph was chosen from a document whose abstract was already included in our dataset, both the paragraph and the abstract were later assigned to the same subset (training, development, or test). There were 61 973 paragraphs from 21 941 papers fulfilling these criteria, out of which we selected 500 for annotation. As with the “PTM triage set”, selection was done in batches of 100 documents at a time. After initial observations, we tried to focus our annotations on specific sections of scientific papers, where most relations were expected to be mentioned. Supplementary Section 3 provides more details on the process.
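
The three filtering criteria above lend themselves to a short script. The sketch below only illustrates criteria (i)–(iii); it assumes a hypothetical candidate_paragraphs iterable of (text, entity mention list) pairs and is not the pipeline actually used for document selection.

    def passes_filters(text, entity_mentions, seen_forms):
        """Illustrative check of selection criteria (i)-(iii); thresholds follow the text."""
        words = text.split()
        # (i) paragraph length between 50 and 500 words
        if not 50 <= len(words) <= 500:
            return False
        # (ii) more than three uniquely tagged NEs, ignoring case
        if len({m.lower() for m in entity_mentions}) <= 3:
            return False
        # (iii) at least 30% of entity mentions are forms not yet seen in selected documents
        new = sum(1 for m in entity_mentions if m.lower() not in seen_forms)
        return new / len(entity_mentions) >= 0.30

    selected, seen_forms = [], set()
    for text, mentions in candidate_paragraphs:  # hypothetical input iterable
        if passes_filters(text, mentions, seen_forms):
            selected.append(text)
            seen_forms.update(m.lower() for m in mentions)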

NE and relation annotation

There are four NE types in this corpus: genes or gene products (Protein hereafter), chemicals (Chemical hereafter), protein-containing complexes (Complex hereafter), and protein families (Family hereafter). To annotate Complex entities, we have used the definition of the homonymous term from GO (GO:0032991). As for Family entities, we have only annotated families whose members are evolutionarily related, using InterPro [25] as the main reference resource. Equivalent names of the same entities are systematically annotated to ensure evaluation accuracy [15].

In RegulaTome, we identified explicit mentions of >40 different relation types (Supplementary Section 1) and annotated those as either undirected (Complex formation) or directed (all other types) relations. Each candidate entity pair could receive multiple labels without any restrictions, and directed relations between the same entities could be bi-directional. Two examples of relation representations are shown in Fig. 1b.

Two experts in the field carried out the relation annotations for RegulaTome. An Inter-Annotator Agreement (IAA) analysis was performed to set uniform annotation standards and preserve the quality of annotations. This involved independently annotating the same collections of abstracts in seven rounds. Four rounds of independent annotation were conducted on 80 documents from “ComplexTome” to establish the original annotation guidelines, which ensured that the annotators shared a common understanding of the task and helped maintain high-quality annotations. Three additional rounds of IAA were conducted on 90 documents from the “Post-translational modification event extraction corpus.” This resulted in a set of updated guidelines for the entire corpus and a reannotation of all documents based on the updated set of rules. After each round, we measured the F1-score for IAA to evaluate the consistency of the annotations and the quality of the corpus. For detailed information on the annotation guidelines used to annotate NEs and relations in RegulaTome, we direct readers to the annotation documentation (available via Zenodo). The BRAT Rapid Annotation Tool [26] was used for the annotation of all documents in RegulaTome.
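
For readers unfamiliar with brat, annotations are stored in its standoff format, one line per entity (T) and relation (R), with tab-separated fields. The toy excerpt below shows how the first-sentence example of Fig. 1b might look; the character offsets and relation type strings are illustrative rather than copied from the released corpus files.

    T1    Protein 0 4    Sir2
    T2    Protein 24 29    Foxo1
    R1    Complex_formation Arg1:T1 Arg2:T2
    R2    Catalysis_of_deacetylation Arg1:T1 Arg2:T2

Here R2 is directed, with Arg1 as the source (“Sir2”) and Arg2 as the target (“Foxo1”), whereas the argument order of the undirected R1 carries no meaning.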

RE system

We have extended the transformer-based RE system previously developed for binary RE [20] and created a system capable of extracting the relation types presented in Supplementary Section 1 between all NE types mentioned earlier. The task of RE is cast as a multi-label classification problem, where the goal is to predict whether a pair of candidate NEs in the input text has one, several, or no stated relations. For the undirected Complex formation relation, there is only one dimension in the decision layer of the neural network, whereas each directed relation type has two dimensions: one from the first occurring entity to the second (i.e. left-to-right) and one from the second occurring entity to the first (i.e. right-to-left).
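
To make this label layout concrete, the sketch below derives the number of output units from a list of relation types and decodes a thresholded sigmoid output for one candidate pair back into typed, directed relations. The relation-type strings are placeholders, not the exact labels used in the corpus files.

    UNDIRECTED = ["Complex_formation"]                                           # one output unit
    DIRECTED = ["Regulation", "Positive_regulation", "Catalysis_of_phosphorylation"]  # placeholder names

    # one unit per undirected type, two units (left-to-right, right-to-left) per directed type
    LABELS = list(UNDIRECTED) + [f"{t}:{d}" for t in DIRECTED for d in ("l2r", "r2l")]
    N_OUTPUTS = len(LABELS)                                                      # 1 + 2 * len(DIRECTED)

    def decode(scores, e1, e2, threshold=0.5):
        """Turn one sigmoid output vector for the ordered pair (e1, e2) into (source, target, type) tuples."""
        relations = []
        for label, score in zip(LABELS, scores):
            if score < threshold:
                continue
            if label in UNDIRECTED:
                relations.append((e1, e2, label))            # undirected: argument order is irrelevant
            else:
                rel_type, direction = label.rsplit(":", 1)
                src, tgt = (e1, e2) if direction == "l2r" else (e2, e1)
                relations.append((src, tgt, rel_type))
        return relations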

Similarly to the binary classification system upon which we built, our current system uses an architecture featuring a pretrained transformer encoder and a decision layer with a sigmoid activation function. The system can use pretrained language models available in the Hugging Face repository, accepts training, validation, and prediction data in BRAT standoff and a custom JSON format, and supports an extensive set of hyper-parameters, including maximum sequence length (MSL), learning rate, number of training epochs, and batch size. Evaluation metrics are calculated after each training epoch for hyper-parameter tuning. The system does not use an early stopping rule; instead, it is trained for a specified number of epochs and the model weights that have yielded the highest F1-score are kept.
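
A minimal PyTorch sketch of such an architecture is shown below, assuming the Hugging Face transformers library; pooling the [CLS] position and the exact label count are simplifying assumptions, and the released implementation is described in Mehryary et al. [20].

    from torch import nn
    from transformers import AutoModel

    class PairRelationClassifier(nn.Module):
        """Sketch: pretrained encoder plus a multi-label decision layer (one logit per label)."""
        def __init__(self, model_name, n_labels=85):          # e.g. 1 undirected + 2 x 42 directed labels
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)   # any Hugging Face encoder checkpoint
            self.decision = nn.Linear(self.encoder.config.hidden_size, n_labels)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            return self.decision(out.last_hidden_state[:, 0, :])   # logits taken at the [CLS] position

    # multi-label training: the sigmoid is folded into the binary cross-entropy loss
    loss_fn = nn.BCEWithLogitsLoss()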

The documents in our corpus, as is typical for biomedical documents, contain multiple candidate NE pairs and can be lengthy, often exceeding the maximum token capacity of transformer models. To clarify for the classifier which two candidate NEs are being considered for label prediction at a time, we encode these entities within the document using a masking approach, employing the model’s “unused” tokens for this purpose. We then tokenize the text, consider a window (a text snippet) around and including the two NEs (based on the MSL), and insert a [CLS] token at the beginning to signify the start of the snippet and a [SEP] token at the end of the input. For each candidate NE pair, we verify that the masked and tokenized snippet representing the pair does not exceed the specified MSL. If it meets this criterion, we create a machine-learning example for that pair, which can be assigned one, multiple, or no labels for training, or remain unlabeled for prediction. Since we do not employ any sentence boundary detection, we can train on and predict cross-sentence relations at the document level. Moreover, relying on a window that can always be fed to the transformer encoder allows us to deal with long texts effortlessly. If a candidate NE pair (i.e. a machine-learning example) does not fit into the specified window size, it is excluded from training and prediction and penalized in the evaluation on the development and test sets (if the two NEs have any relations between them).
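
The sketch below illustrates this masking and windowing idea using a BERT-style vocabulary as a stand-in (the reserved token names differ between models); it is a simplification of the actual preprocessing, which is described in Mehryary et al. [20].

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")    # placeholder vocabulary with [unusedN] tokens

    def build_example(text, e1_span, e2_span, max_seq_len=128):
        """Mask the two candidate NEs with reserved tokens and keep a token window around them."""
        (s1, t1), (s2, t2) = sorted([e1_span, e2_span])             # character offsets of the two NEs
        left = tokenizer.tokenize(text[:s1])
        middle = tokenizer.tokenize(text[t1:s2])
        right = tokenizer.tokenize(text[t2:])
        tokens = left + ["[unused1]"] + middle + ["[unused2]"] + right
        first, last = len(left), len(left) + 1 + len(middle)        # positions of the two mask tokens
        budget = max_seq_len - 2                                    # room for [CLS] and [SEP]
        if last - first + 1 > budget:
            return None                                             # pair does not fit: skipped, penalized in evaluation
        pad = (budget - (last - first + 1)) // 2
        window = tokens[max(0, first - pad): last + 1 + pad]
        return tokenizer.convert_tokens_to_ids(["[CLS]"] + window + ["[SEP]"])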

For more details on the implementation and the strategy for preprocessing, input representation, and example generation, refer to Mehryary et al. [20].

Experimental setup

We performed a document-based split of RegulaTome into separate training, development, and test sets for our experiments. We use grid search to find the optimal values of the hyper-parameters. To minimize the impact of initial random weights on evaluation metrics in neural network models [27], we repeat each “experiment” four times and compare different experiments based on the average and standard deviation of the F1-scores. Each experiment consists of training an RE system (i.e. a neural network model) with the same set of hyper-parameters but different initial random weights on the training set and evaluating the model on the development set. The hyper-parameter set that yields the highest average F1-score is chosen as optimal, and the model with the highest F1-score in that experiment is selected for predicting the held-out test set and for large-scale execution of the RE system on the biomedical literature. Therefore, the test set is used only once, for evaluating our best model.
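
Schematically, this selection procedure looks as follows; train_and_evaluate is a hypothetical stand-in for training one model with a given hyper-parameter setting and random seed and returning its development-set F1-score together with the trained model, and the grid values shown are partly illustrative.

    import itertools, statistics

    grid = {
        "max_seq_len": [128, 256],
        "learning_rate": [2e-6, 4e-6],
        "epochs": [20, 26],
        "batch_size": [16, 32],
    }

    best_avg, best_setting, best_model = -1.0, None, None
    for values in itertools.product(*grid.values()):
        setting = dict(zip(grid, values))
        runs = [train_and_evaluate(setting, seed=s) for s in range(4)]   # four random initializations
        scores = [f1 for f1, _ in runs]
        avg, sd = statistics.mean(scores), statistics.stdev(scores)
        print(setting, f"F1 = {avg:.1f} (SD {sd:.2f})")
        if avg > best_avg:
            best_avg, best_setting = avg, setting
            best_model = max(runs, key=lambda r: r[0])[1]                # best single model of the best experiment
    # best_model is evaluated once on the held-out test set and used for the large-scale run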

Results and discussion

Corpus statistics

RegulaTome is a high-quality corpus that contains 2521 single-paragraph documents (1621 abstracts and 900 paragraphs from full-text articles) comprising 611 999 words. In terms of word count, RegulaTome is comparable to named entity recognition corpora, such as BC2GM [28] (569 912 words), and is much larger than other RE and event extraction corpora, such as BC5CDR [11] (360 373 words), BioRED [13] (143 246 words), and the BioNLP Shared Task 2011 REL (267 229 words) and EPI (253 628 words) corpora [17]. The corpus quality was assessed through seven rounds of IAA, which resulted in a final IAA F1-score of 91% across all relation types. RegulaTome includes a total of 16 961 relations, 6463 of which are Complex formation (∼38%), followed by 2294 regulation, 2131 positive regulation, and 1920 negative regulation relations. Supplementary Section 4 has annotation statistics for all relation types, and Supplementary Section 5 has the distribution of relations in the training, development, and test sets. Since there are four different biomedical NE types annotated in the corpus, the number of relations grouped by these types is presented in Table 1. RegulaTome offers a vast and varied set of relations for training neural network models for multi-label RE. More than 95% of these relations occur within sentences, while the remaining relations span across sentences. The corpus also features a significant number of NEs: 38 931 Protein, 4703 Chemical, 3839 Complex, and 7478 Family entities, summing to 54 951 entities across all types.

Table 1.

Number of annotated relations between the different NE types in RegulaTome

NE types              Relation count
Protein–protein       10 593
Protein–family        2320
Protein–complex       1703
Protein–chemical      1310
Family–family         339
Family–chemical       228
Family–complex        219
Complex–chemical      146
Complex–complex       86
Chemical–chemical     17

RE system evaluation

We used an extended grid search to find the optimal values of hyper-parameters on the development set of the RegulaTome corpus. Our best result was achieved using the RoBERTa-large-PM-M3-Voc model [5] and the following set of hyper-parameters: MSL = 128, learning rate = 4e-6, training epochs = 26, and batch size = 16.

Our best experiment achieved an average precision of 68.9%, an average recall of 67.0%, and an average F1-score of 67.9% on the RegulaTome development set. The four models used in this experiment and the evaluation scores measured on the development set are shown in Table 2.

Table 2.

Performance of the best experiment on the RegulaTome development set

           Precision    Recall    F1-score
Model-1    69.1         67.4      68.3
Model-2    68.3         66.3      67.3
Model-3    69.7         66.8      68.2
Model-4    68.6         67.3      67.9
Average    68.9         67.0      67.9
SD         0.61         0.51      0.45

The best model (Model-1) is used to perform a run on the held-out test set and for a large-scale run on the entire biomedical scientific literature.


The best model presented in Table 2 (Model-1) achieved an F1-score of 66.6% (precision 67.7%, recall 65.5%) on the RegulaTome held-out test set.

In Supplementary Section 6, evaluation metrics on the test set are presented on a per-relation-label basis. Complex formation, the label with the highest level of support, is, unsurprisingly, among the relations where the model achieves its best performance (F1-score = 78.8%). Performance varies significantly for the catalysis of posttranslational modification relations, with F1-scores ranging from 85.7% for catalysis of deubiquitination to 0% for another catalysis of small protein removal. Results in these cases seem to be directly affected by the level of support per label (Supplementary Section 4), with labels with a higher level of support, such as catalysis of ubiquitination, catalysis of phosphorylation, catalysis of dephosphorylation, and catalysis of methylation, having F1-scores of ∼70%. Regulation-related labels seem to be the most difficult to predict, a result consistent with the literature on similar tasks [8]. Relationship sign assignment seems to be easier than the general class prediction, with positive regulation and negative regulation having F1-scores >62%, while regulation, despite its high level of support, achieves an F1-score of only 49.3%. Moreover, regulation of transcription seems easier to predict than regulation of gene expression, but this could again be explained by the fact that the level of support for regulation of transcription is double that of regulation of gene expression (Supplementary Section 4).

In the following sections, we perform a manual error analysis and a semiautomated label confusion analysis, which allow us to look deeper into these results.

Manual error analysis

We have selected 20% of documents in the test set and manually analyzed and categorized the errors generated by the best-performing RE model on these documents. An overview of these errors is shown in Table 3, while a case-by-case analysis is provided in Supplementary Section 7.

Table 3.

Manual error analysis on 20% of documents in the RegulaTome test set

Error type                 FP     FN     Total
Ambiguous keyword          65     35     100
Rare keyword               0      57     57
Co-reference resolution    31     29     60
Convoluted text excerpt    61     55     116
Model error                17     32     49
Annotation error           26     15     41
Total                      200    223    423

From the error categories presented in Table 3, the main sources of errors appear to be “ambiguous keyword” and “convoluted text excerpt,” with over half of the errors being a result of these. The first category encapsulates instances where ambiguous words, such as “target,” can denote either a regulatory relation (e.g. “The promoter of the CD19 gene is a ‘target’ for BSAP”) or a catalytic relation (e.g. “Tea1 is a substrate ‘target’ of Shk1”), resulting in model confusion. The second most common category (“convoluted text excerpt”) encompasses text segments with complex syntax, including intricate sentences and cross-sentence relations, which are inherently difficult to annotate and subsequently predict. A closely related category is “co-reference resolution,” where the syntactic structure makes it especially difficult for the model to determine the subject to which a given relation pertains, resulting in both false positives (FPs) and false negatives (FNs). The “rare keyword” category results only in FNs and is caused by words or phrases rarely found in scientific texts (e.g. “protection from inhibition or non-covalent association”), which are recognized and correctly annotated by biology experts but do not provide enough training examples for the model to have a chance of detecting them during prediction.

There are two more categories, with lower numbers of errors, which are inherently different from the rest of the categories presented earlier. “Model error” refers to cases where there are clear keywords denoting relations but no clear explanation as to why the model did not predict them correctly. On the other hand, “annotation error” refers to cases in which annotators have inaccurately labeled or not labeled relations, frequently as a result of text ambiguity, which would require correction in the corpus.

Label confusion analysis

Next, we categorized the errors based on confusion between relation labels (Supplementary Sections 8 and 9). Overall, the vast majority of all FPs (81%) are cases where a relation was predicted even though there should be no relation of any type according to our manual annotations (Supplementary Section 8, bold and italics). Similarly, 82% of FNs are relations that were completely missed (Supplementary Section 9, bold and italics) and are not a result of confusion between labels predicted by the model. For a full categorization of each FP and FN in the RegulaTome test set in terms of label confusion, refer to the Supplementary Table (“Error analysis full results”) available via Zenodo.

For the remaining errors, some label confusion categories are less severe than others. Specifically, 10% of all errors in the test set (126 out of 1048 FPs and 118 out of 1160 FNs), i.e. half of the remaining errors, are due to confusion among closely related labels (Supplementary Sections 8 and 9). For example, in the regulation of gene expression branch (Fig. 1), either a too-specific label (i.e. regulation of transcription instead of regulation of gene expression) or a too-broad label (i.e. regulation of gene expression instead of regulation of transcription) was predicted. If all confusion within the regulation of gene expression branch were ignored, i.e. if all confusion between the regulation of transcription, regulation of translation, and regulation of gene expression labels were counted as true positives (TPs) instead of FPs and FNs, the average F1-score for the regulation of gene expression branch would increase to 68.8%, which is 9% better than regulation of transcription and 15% better than regulation of gene expression (Supplementary Section 6). Similarly, if all confusion within the catalysis of posttranslational modification branch were ignored, the average F1-score for catalysis of posttranslational modification would increase to 70.6%, which is better than the F1-scores for 18 of the 22 relation types within that branch (Supplementary Section 6).
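
The branch-collapsed scores can be reproduced by mapping each label to its branch before scoring, as sketched below; gold_relations and predicted_relations are assumed to be sets of (entity pair, label) tuples, and the label strings are placeholders.

    BRANCH = {
        "Regulation_of_transcription": "Regulation_of_gene_expression",
        "Regulation_of_translation": "Regulation_of_gene_expression",
        "Regulation_of_gene_expression": "Regulation_of_gene_expression",
    }

    def collapse(relations):
        """Map each (pair, label) tuple to its branch label; labels outside the branch are kept as-is."""
        return {(pair, BRANCH.get(label, label)) for pair, label in relations}

    def micro_f1(gold, pred):
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # confusion between sibling labels inside the branch no longer counts as an FP/FN pair
    branch_f1 = micro_f1(collapse(gold_relations), collapse(predicted_relations))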

Error analysis of direction and sign

The directed relations that can be mined from the literature using our model can provide important information for the analysis of regulatory networks. In this use case, relations are viewed as edges, and what matters most is to have the correct edges, with the right direction, and ideally the right sign (i.e. positive regulation or negative regulation). To evaluate the usefulness of our model’s predictions for this purpose, we categorized label confusion errors into six categories, considering only directed predictions and annotations (i.e. the presence or absence of predicted or annotated complex formation has no impact), namely cases where the model

  1. failed to assign a directed interaction, where there should be one,

  2. assigned a directed interaction, where there should be none,

  3. assigned a directed interaction, but the direction is wrong,

  4. failed to assign a sign (positive or negative), where there should be one,

  5. assigned a sign, where there should be none, and

  6. assigned a sign, but the sign is wrong.

We found 1394 edges with correctly assigned directions and 737, 620, and 5 errors from the first three categories, respectively. While the network that would be produced is somewhat incomplete—missing 737 interactions—its precision would be 70% in terms of connecting the right entities with an edge pointing the right way. It should be noted that in reality, the precision would be even higher since some of the relations counted as FPs are annotation errors in the corpus. Of the correctly detected directed edges, 539 have the correct sign, 31 are missing a sign (Category 4), and 61 have a wrong sign (47 from Category 5 and 14 from Category 6). For the remaining 763 edges, we correctly did not predict a sign. These results further showcase the potential of deep learning–based models trained on RegulaTome for downstream biomedical applications. For details on calculations presented in this section, refer to Supplementary Section 10.
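
A simplified version of the direction part of this categorization is sketched below, treating every directed relation of a candidate pair as an untyped (source, target) edge and ignoring complex formation; the sign categories can be handled analogously, and the full calculation is given in Supplementary Section 10.

    def direction_categories(gold_edges, pred_edges):
        """gold_edges / pred_edges: sets of (source, target) tuples for directed relations."""
        correct = gold_edges & pred_edges                                   # right entities, right direction
        missing = {e for e in gold_edges                                    # category 1
                   if e not in pred_edges and (e[1], e[0]) not in pred_edges}
        spurious = {e for e in pred_edges                                   # category 2
                    if e not in gold_edges and (e[1], e[0]) not in gold_edges}
        reversed_dir = {e for e in pred_edges                               # category 3
                        if e not in gold_edges and (e[1], e[0]) in gold_edges}
        return correct, missing, spurious, reversed_dir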

Large-scale execution for protein relations

We used the best model to extract relations from >36 million PubMed abstracts (as of March 2024) and 6 million articles from the PMC BioC open access collection [29] (as of November 2023). The Jensenlab tagger [30] was used to obtain matches for Protein NEs, normalized to Ensembl [31] identifiers, and the results were filtered to documents containing at least two NEs, and hence at least one candidate pair for prediction. A total of 6 920 139 documents met this criterion (3 157 239 with both abstract and full text and 3 762 900 with abstract only); these were converted to BRAT standoff format and provided to the model for relation prediction. Predictions were produced for >1.2 billion pairs, with ∼1.5% (18.4 million) having at least one “positive” label. A tab-delimited file with the results from the large-scale run is provided through Zenodo.

Conclusions

In this work, we introduced RegulaTome, a corpus aimed at enhancing biomedical RE, with a focus on proteins, protein-containing entities such as complexes and families, and chemicals. This work represents a significant advancement in the field of biomedical text mining, addressing a limitation of several existing RE corpora that mainly focus on single-type relations at the sentence level. RegulaTome distinguishes itself by its breadth, encompassing 2521 documents with 16 961 relations between 54 951 entities. It is meticulously curated to include 43 types of relations, extending well beyond the scope traditionally covered in biomedical RE tasks, thereby establishing a new standard for complexity and depth in the field.

The effectiveness of RegulaTome is further demonstrated through the deployment of a transformer-based model, which has shown remarkable accuracy in RE, achieving an F1-score of 66.6% that underlines the corpus’s utility in accurately identifying and categorizing a diverse range of biological relations. This achievement showcases the corpus’s capacity to broaden the scope of detectable relations and its potential to significantly enhance the development of sophisticated, efficient, and accurate RE systems for biomedical applications. By providing RegulaTome to the scientific community, we aim to facilitate the advancement of biomedical RE systems both through theoretical research and practical applications in the field. Our work sets a new benchmark in biomedical text mining and opens up new avenues for exploring and validating a plethora of complex relations between biomedical entities.

Acknowledgements

We thank the CSC—IT Center for Science, Finland, for generous computational resources.

Supplementary data

Supplementary data is available at Database online.

Conflict of interest

None declared.

Data Availability

Data underlying this article are available in its online supplementary material and are openly accessible via Zenodo (https://zenodo.org/doi/10.5281/zenodo.10808330) and GitHub (https://github.com/farmeh/RegulaTome_extraction).

Funding

This project has received funding from the Novo Nordisk Foundation (Grant no.: NNF14CC0001) and from the Academy of Finland (grant no.: 332844). K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie (grant no.: 101023676).

References

1. Milosevic N, Thielemann W. Comparison of biomedical relationship extraction methods and models for knowledge graph creation. J Web Semant 2023;75:100756.

2. Szklarczyk D, Kirsch R, Koutrouli M et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 2023;51:D638–46.

3. Lee K, Lee S, Park S et al. BRONCO: biomedical entity relation oncology corpus for extracting gene-variant-disease-drug relations. Database 2016;2016:baw043.

4. Lee J, Yoon W, Kim S et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019;36:1234–40.

5. Lewis P, Ott M, Du J et al. Pretrained language models for biomedical and clinical tasks: understanding and extending the state-of-the-art. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Online. pp. 146–57, 2020.

6. Bunescu R, Ge R, Kate RJ et al. Comparative experiments on learning information extractors for proteins and their interactions. Artif Intell Med 2005;33:139–55.

7. Herrero-Zazo M, Segura-Bedmar I, Martínez P et al. The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. J Biomed Informat 2013;46:914–20.

8. Miranda-Escalada A, Mehryary F, Luoma J et al. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations. Database 2023;2023:baad080.

9. Pyysalo S, Ginter F, Heimonen J et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinf 2007;8:1–24.

10. Krallinger M, Leitner F, Rodriguez-Penagos C et al. Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol 2008;9:1–19.

11. Li J, Sun Y, Johnson RJ et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016;2016:1–10.

12. Doughty E, Kertesz-Farkas A, Bodenreider O et al. Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature. Bioinformatics 2011;27:408–15.

13. Luo L, Lai P-T, Wei C-H et al. BioRED: a rich biomedical relation extraction dataset. Brief Bioinf 2022;23:bbac282.

14. Su J, Wu Y, Ting H-F et al. RENET2: high-performance full-text gene–disease relation extraction with iterative training data expansion. NAR Genomics Bioinform 2021;3:lqab062.

15. Kim J-D, Ohta T, Pyysalo S et al. Overview of BioNLP’09 shared task on event extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, Association for Computational Linguistics, Boulder, Colorado. pp. 1–9, 2009.

16. Ohta T, Pyysalo S, Miwa M et al. Event extraction for post-translational modifications. In: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, Uppsala, Sweden. pp. 19–27, 2010.

17. Pyysalo S, Ohta T, Rak R et al. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinf 2012;13:1–26.

18. Aleksander SA, Balhoff J, Carbon S et al. The Gene Ontology knowledgebase in 2023. Genetics 2023;224:iyad031.

19. Ashburner M, Ball CA, Blake JA et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–29.

20. Mehryary F, Nastou K, Ohta T et al. STRING-ing together protein complexes: extracting physical protein interactions from the literature. bioRxiv, 2023.

21. Oughtred R, Rust J, Chang C et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 2021;30:187–200.

22. Orchard S, Ammari M, Aranda B et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014;42:D358–63.

23. Licata L, Briganti L, Peluso D et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 2012;40:D857–61.

24. Gillespie M, Jassal B, Stephan R et al. The Reactome pathway knowledgebase 2022. Nucleic Acids Res 2022;50:D687–92.

25. Paysan-Lafosse T, Blum M, Chuguransky S et al. InterPro in 2022. Nucleic Acids Res 2023;51:D418–27.

26. Stenetorp P, Pyysalo S, Topić G et al. brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Avignon, France. pp. 102–07, 2012.

27. Mehryary F, Björne J, Pyysalo S et al. Deep learning with minimal training data: TurkuNLP entry in the BioNLP Shared Task 2016. In: Proceedings of the 4th BioNLP Shared Task Workshop, Berlin, Germany. pp. 73–81, 2016.

28. Smith L, Tanabe LK, Ando RJN et al. Overview of BioCreative II gene mention recognition. Genome Biol 2008;9:1–19.

29. Comeau DC, Wei C-H, Islamaj Doğan R et al. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics 2019;35:3533–35.

30. Jensen LJ. One tagger, many uses: illustrating the power of ontologies in dictionary-based named entity recognition. bioRxiv 2016:067132.

31. Martin FJ, Amode MR, Aneja A et al. Ensembl 2023. Nucleic Acids Res 2022;51:D933–41.

Author notes

Equal contribution.
