OncoCardioDB: a public and curated database of molecular information in onco-cardiology/cardio-oncology

Table 3.

Top five most frequent types of cancer reported to date in the OncoCardio database

ICD10 code	Name	Frequency
C50	Malignant neoplasm of breast	122
C85	Other specified and unspecified types of non-Hodgkin lymphoma	24
C34	Malignant neoplasm of bronchus and lung	22
C95	Leukaemia of unspecified cell type	12
C90	Multiple myeloma and malignant plasma cell neoplasms	11

Six basic therapies (chemotherapy, hormonal therapy, immunotherapy, radiotherapy, surgery and targeted therapy) have been considered, since no others are mentioned in any study. Any combination of them is in principle possible, which gives 2^6 = 64 possibilities (counting the case in which no therapy is applied). Of these 64, only 7 appear in the studies. The most frequent is chemotherapy alone, which was used 144 times, followed by radiotherapy with 39 occurrences. The most common combined therapy, chemotherapy+radiotherapy, ranks fourth with 15 occurrences.
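The figure of 64 is the number of subsets of six therapies, 2^6, which also counts the empty subset (no therapy at all). A minimal Python sketch of that count is given here; the list and function names are illustrative and are not part of OncoCardioDB.

```python
from itertools import combinations

# The six basic therapies named in the text.
THERAPIES = [
    "chemotherapy", "hormonal therapy", "immunotherapy",
    "radiotherapy", "surgery", "targeted therapy",
]


def therapy_combinations(therapies):
    """Return every subset of the therapies, from the empty one to all of them."""
    return [
        combo
        for size in range(len(therapies) + 1)
        for combo in combinations(therapies, size)
    ]


all_combos = therapy_combinations(THERAPIES)
non_empty = [combo for combo in all_combos if combo]

print(len(all_combos))  # total number of subsets, 2**6
print(len(non_empty))   # subsets that contain at least one therapy
```

Only 7 of these combinations are reported in the studies, so most of the theoretical space is unused.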

The DB stores 6460 drugs with their corresponding ATC codes. The studies contain 259 drug mentions; this number is greater than the number of studies because the treatment prescribed in many studies uses two or more drugs. These mentions correspond to only 26 different drugs. The most frequently used drug is doxorubicin (ATC code: L01DB01) with 58 occurrences, followed by cyclophosphamide (ATC code: L01AA01) with 35 occurrences. Finally, eight drugs are mentioned in only one study each.

The total number of different cardiovascular symptoms reported in the studies is 42. Ordering by the frequency of appearance shows that the decline in left ventricular ejection fraction (LVEF) is the most common with 125 occurrences, followed by heart failure with 20 occurrences. On the other hand, 15 symptoms are mentioned only once each.

Description and examples of predefined queries

The database contains a collection of (currently) seven predefined searches (Queries 1–7) that answer common questions. In each of them, only one field is left free and the user is asked to provide its value. The predefined queries are as follows (values in typewriter font correspond to their denomination in the tables; see Figure 2 and part 3 of Supplementary material):

Query 1. Input: a pathology code (IdPathologyCD10).
Output: all variations (as IdVariation) associated with the input pathology and, for each of these variations, all its associated pathologies.
Query 2. Input: a variation name (IdVariation).
Output: all pathologies (as IdPathologyCD10) associated with that variation and, for each of them, all studies (as DOIReference) and, for each study, all symptoms (if any) associated with it (as IdCardiovascularSymptom).
Query 3. Input: a variation (IdVariation).
Output: all panel kits (as IdPanelKitName) that can test the expression of the variation.
Query 4. Input: a variation (IdVariation).
Output: all drugs (as IdDrugATC) which have been used with patients that exhibit the variation.
Query 5. Input: a pathology code (IdPathologyCD10).
Output: all variations (as IdVariation) detected in any study associated with that pathology and, for each of them, the gene (as IdGeneSymbol) in which that variation occurs and the number of studies that refer to the found association.
Query 6. Input: a gene (IdGeneSymbol).
Output: all variations (as IdVariation) that occur in that gene and, for each of them, all the drugs (as IdDrugATC) that are supposed to modify the variation.
Query 7. Input: a gene (IdGeneSymbol).
Output: all variations (as IdVariation) that occur in that gene and, for each of them, all the panel kits (as IdPanelKitName) that can test the expression of that variation.

For example, the user might want to know which onco-cardiology genes have been identified in lung cancer and how many studies have reported each association. This can be answered using Query 5, giving as input the ICD10 code of the pathology. In this example, we use C34 (malignant neoplasm of bronchus and lung). The result is Table 1, which shows all the reported genes and the number of studies.

Table 1.

Genes related to oncocardiology in malignant neoplasm of bronchus and lung (ICD10 code: C34) and the number of studies that report the association

Entrez gene ID	Symbol	Number of studies
1401	CRP	3
1535	CYBA	1
3569	IL6	1
4689	NCF4	1
4879	NPPB	3
6288	SAA1	1
7137	TNNI3	4
9518	GDF15	1
406885	MIRLET7C	1
406938	MIR146A	1
406967	MIR192	1
406984	MIR200B	1
407040	MIR34A	1
693159	MIR574	1
100126334	MIR885	1
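
A query of this kind can be pictured as a join from pathologies to studies, variations and genes, followed by a count of distinct studies per gene. The sketch below runs such a query with Python's built-in sqlite3 module on an in-memory toy database. The table names patients_affected_by and variation_studied_by and the fields IdPathologyCD10, IdVariation and IdGeneSymbol come from the text; everything else (the Gene and Variation tables, IdGeneEntrez, IdStudy as the join key, the column order) is assumed for illustration, the rows are invented, and the actual SQL behind Query 5 on the MariaDB server may differ.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns needed for the join.
SCHEMA = """
CREATE TABLE Gene (IdGeneEntrez INTEGER PRIMARY KEY, IdGeneSymbol TEXT);
CREATE TABLE Variation (IdVariation TEXT PRIMARY KEY, IdGeneEntrez INTEGER);
CREATE TABLE variation_studied_by (IdVariation TEXT, IdStudy INTEGER);
CREATE TABLE patients_affected_by (IdStudy INTEGER, IdPathologyCD10 TEXT);
"""

# Genes reported for a pathology, with the number of distinct studies each.
QUERY_5_SKETCH = """
SELECT g.IdGeneEntrez, g.IdGeneSymbol, COUNT(DISTINCT s.IdStudy) AS NumStudies
FROM patients_affected_by AS p
JOIN variation_studied_by AS s ON s.IdStudy = p.IdStudy
JOIN Variation AS v ON v.IdVariation = s.IdVariation
JOIN Gene AS g ON g.IdGeneEntrez = v.IdGeneEntrez
WHERE p.IdPathologyCD10 = ?
GROUP BY g.IdGeneEntrez, g.IdGeneSymbol
ORDER BY g.IdGeneEntrez;
"""


def build_toy_db():
    """Create an in-memory database with invented rows (not OncoCardioDB data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Gene VALUES (?, ?)",
                     [(1, "GENE_A"), (2, "GENE_B")])
    conn.executemany("INSERT INTO Variation VALUES (?, ?)",
                     [("var_a1", 1), ("var_a2", 1), ("var_b", 2)])
    conn.executemany("INSERT INTO variation_studied_by VALUES (?, ?)",
                     [("var_a1", 10), ("var_a2", 10), ("var_a1", 11),
                      ("var_b", 11), ("var_b", 12)])
    conn.executemany("INSERT INTO patients_affected_by VALUES (?, ?)",
                     [(10, "C34"), (11, "C34"), (12, "C50")])
    return conn


def genes_for_pathology(conn, icd10_code):
    """Return (Entrez ID, symbol, number of distinct studies) rows for a pathology."""
    return conn.execute(QUERY_5_SKETCH, (icd10_code,)).fetchall()


conn = build_toy_db()
rows = genes_for_pathology(conn, "C34")
print(rows)
```

In the toy data, study 10 reports two variations of the same gene; COUNT(DISTINCT ...) counts that study once for the gene, which is the reason for using it instead of a plain COUNT.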

Compound questions are also supported, such as: which pathologies are associated with a molecular feature, which cardiovascular symptoms are associated with it and which papers report the association? This can be answered with Query 2, whose input is a variation; as an example, we use myeloperoxidase (MPO). MPO was always reported in breast cancer, where it was associated with three cardiovascular symptoms across two different studies, as shown in Table 2.

Table 2.

Results obtained using Query 2, with MPO variation as input

Pathology name	Cardiovascular symptom	DOIReference
Malignant neoplasm of breast	Cancer therapy–related cardiac dysfunction	10.1161/JAHA.119.014708
Malignant neoplasm of breast	Decline in LVEF	10.1373/clinchem.2015.241232
Malignant neoplasm of breast	Heart failure	10.1373/clinchem.2015.241232

The list of predefined queries is not exhaustive and can be extended if users tell the authors what they would like to obtain from the database. To that end, the program provides a specific item where users can write their questions in plain English; these are sent to us, and we will do our best to translate them into SQL and to incorporate them as predefined queries if they prove to be of general interest.

Examples of advanced queries

Often a researcher’s question cannot be answered using any of the seven predefined queries. To allow arbitrary queries from the outset, knowledgeable users can access an SQL console where they can write any unrestricted query. Since all the contents of the database come from public information, and no personal or sensitive data are stored, this does not represent a legal or privacy problem. Moreover, since users access the database in read-only mode, integrity is not a concern either. Note that very few biomedical databases allow this unrestricted access.

Thus, new questions can be asked using advanced queries. For instance, how many types of cancer can be found in the database and how often do they appear? To do this, the input in the advanced query would be:

SELECT patients_affected_by.IdPathologyCD10, Pathology.Name, COUNT(patients_affected_by.IdPathologyCD10)
FROM patients_affected_by, Pathology
WHERE patients_affected_by.IdPathologyCD10 = Pathology.IdPathologyCD10
GROUP BY patients_affected_by.IdPathologyCD10
ORDER BY COUNT(patients_affected_by.IdPathologyCD10) DESC;

The output corresponds to the 23 different types of cancer with their ICD10 codes, names and frequencies with which they are reported. The five most frequent types are shown in Table 3.

Another example of an advanced query is: what is the average percentage of cardiotoxicity reported in all the studies which have measured it?

The answer is 14.4%, which is obtained by entering the following advanced query:

SELECT SUM(ObservedCardiotoxicity*NumSubjects)/ (SUM(NumSubjects)) FROM Study WHERE ObservedCardiotoxicity IS NOT NULL;

Notice that this result has been calculated weighting the cardiotoxicity percentages reported in each study proportionally to the number of subjects in it with respect to the total number of subjects.
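
The effect of the weighting can be seen on a small example. The sketch below runs the query above with Python's built-in sqlite3 module, used here only as a stand-in for the MariaDB server, next to an unweighted AVG for comparison. The column types and the three rows are invented for illustration; only the table and field names come from the query.

```python
import sqlite3

# The weighted mean from the text: each study counts in proportion to its size.
WEIGHTED = """
SELECT SUM(ObservedCardiotoxicity * NumSubjects) / SUM(NumSubjects)
FROM Study
WHERE ObservedCardiotoxicity IS NOT NULL;
"""

# Unweighted mean, shown only for comparison.
UNWEIGHTED = """
SELECT AVG(ObservedCardiotoxicity)
FROM Study
WHERE ObservedCardiotoxicity IS NOT NULL;
"""

conn = sqlite3.connect(":memory:")
# Assumed column types; the real Study table has more fields.
conn.execute(
    "CREATE TABLE Study ("
    "IdStudy INTEGER PRIMARY KEY, NumSubjects INTEGER, ObservedCardiotoxicity REAL)"
)
# Invented rows: two studies that measured cardiotoxicity (in %) and one that did not.
conn.executemany(
    "INSERT INTO Study VALUES (?, ?, ?)",
    [(1, 100, 10.0), (2, 300, 20.0), (3, 50, None)],
)

weighted = conn.execute(WEIGHTED).fetchone()[0]
unweighted = conn.execute(UNWEIGHTED).fetchone()[0]
print(weighted, unweighted)  # (10*100 + 20*300) / 400 versus (10 + 20) / 2
```

With these rows the larger study pulls the weighted mean towards its own value, while the study without a measurement is excluded from both the numerator and the denominator by the WHERE clause.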

Another interesting question is: which are the top three variations in the database? To answer this, the following advanced query needs to be used:

SELECT IdVariation, COUNT(IdVariation) FROM variation_studied_by GROUP BY IdVariation ORDER BY COUNT(IdVariation) DESC LIMIT 3;

The three most frequent variations are natriuretic peptide B (frequency = 20), troponin T (frequency = 15) and troponin I (frequency = 14).

Discussion

The study of the molecular relationship between cancer and CVDs, which is the core of onco-cardiology, is still in its early years, which makes it possible to start collecting all the relevant information from an early stage. The OncoCardioDB designed in this work stores the molecular information obtained from patients in the area of onco-cardiology and incorporates new relevant studies on a regular basis. The global burden of cancer and CVDs continues to increase, and the growing number of patients who survive cancer are at increasing risk of developing CVDs. We therefore think that the OncoCardio database will be an important contribution to identifying molecular targets related to both diseases. Concrete applications lie in precision medicine, e.g. when designing new targeted drugs with fewer cardiotoxic effects, or in its use as a repository of possible molecular biomarkers in onco-cardiology. These applications rest on the fact that the OncoCardio database contains information on all the molecular targets (variations) identified to date that are associated with cancer and cause a cardiovascular effect, whether due to cardiotoxicity or other reasons.

Another important aspect to be discussed is the usability of the system. It will be really useful only as long as the biomedical community finds it valuable and accessible. Although it is not especially aesthetic, the authors have made an effort to provide simple access and allow only one course of action at any time, so the interface is robust and can be extended or adapted very easily according to the user’s requirements. An important point is the addition of predefined queries on demand; due to the modular organization, this will require the alteration of only a restricted part of the GUI, which is a simple task.

The generation of the SQL statements that fill the generic tables of the DB (those whose information comes from predefined lists or standards), especially the gene-related ones, is relatively slow (~10 min on a normal modern computer), and loading them into the database (the call to the generated SQL macro) is even slower (~20 min); however, this only has to be done once. In contrast, generating the statements for the specific tables (namely, the studies and their related values such as observed symptoms and pathologies) and loading them into the database is almost instantaneous, so adding further studies to update the information is not a problem.

Apart from this, probably the main drawback for complete user-friendliness is the need to introduce the entities such as genes, pathologies and drugs using their normalized designations exclusively (Entrez code, ATC and ICD10 codes), instead of common or normally accepted names. The use of fully standardized vocabularies is an almost universally acknowledged need since automatic information processing was introduced in medicine. Nevertheless, this is something that is not always accepted by the medical community enthusiastically. It is our opinion that an effort should be carried out by both parties (doctors and information processing experts) to reduce this gap. To that aim, a proposal for future work is provided in the ‘Future work’ section.

Extendibility and reusability are important aspects to be mentioned, too. The GUI is a Java application in which the model interacts with a MariaDB server to make any query and return any result. This makes it possible to use it in any database application with few changes; e.g., more predefined queries can be added without much effort. On the other hand, the controller and the views have been devised for this application and even though they are a good basis for similar ones, a substantial part of their code would have to be changed.

Biological knowledge is completely contained inside the database and to some extent can be reused. Tables containing general entities (genes, drugs and pathologies) can be exported separately and used as such for other medical databases. Reorganization of the database by adding fields, or even new tables, is possible, too, but may require rewriting of one or more predefined queries.

Finally, the authentication and user-management parts are the most reusable components. They are completely independent of the GUI and the database; indeed, they can work with any other program where access is to be offered through the network with just a browser, even if it is not a user interface or a database.

Future work

The system is functional and awaits the interaction of interested users. Their suggestions and comments will be important to introduce improvements, but some of them are already planned, namely:

If the user does not know the ICD10 code of a pathology but only its name, a search through the ICD10 database by similar words or expressions will be provided. Computing the similarity between character strings with a suitable metric, so that possible matches can be shown ordered by closeness, is a common task in natural language processing, and several libraries for this purpose are available.
The interface to look for pathology codes can be organized, too, based on a rational taxonomy of diseases, but this is already provided by the ICD10 web (see https://www.icd10data.com/ICD10CM/Codes) which has been made accessible as part of the GUI of the system.
A similar strategy can be applied to the ATC codes of drugs. In this case, the rational taxonomy is available at https://www.atccode.com/ and is also available in our system.
A program option is currently implemented which allows authors of papers on onco-cardiology to inform us about their works so that the content can be incorporated into the database. It will also be possible to provide a form where the authors themselves can fill in the values of each of the relevant fields mentioned in their work. Such information will be curated in its form by the syntactic checks mentioned in the section ‘Knowledge extraction and curation’ and in its content by the database maintainers.
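
As an illustration of the first item in this list, a name-based lookup could rank ICD10 names by string similarity to the text typed by the user. The sketch below uses difflib from the Python standard library and the five names of Table 3 as a tiny stand-in for the full ICD10 list. The function name, the similarity metric and the cut-off of three suggestions are assumptions made for this example, not the planned implementation.

```python
import difflib

# Small stand-in for the ICD10 table: the five entries listed in Table 3.
ICD10_NAMES = {
    "C50": "Malignant neoplasm of breast",
    "C85": "Other specified and unspecified types of non-Hodgkin lymphoma",
    "C34": "Malignant neoplasm of bronchus and lung",
    "C95": "Leukaemia of unspecified cell type",
    "C90": "Multiple myeloma and malignant plasma cell neoplasms",
}


def suggest_codes(text, names=ICD10_NAMES, limit=3):
    """Return up to `limit` (code, name, score) tuples, most similar name first."""
    query = text.lower()
    scored = [
        (code, name, difflib.SequenceMatcher(None, query, name.lower()).ratio())
        for code, name in names.items()
    ]
    # Highest similarity ratio first, so the closest names head the list.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:limit]


for code, name, score in suggest_codes("malignant neoplasm of the breast"):
    print(f"{code}  {score:.2f}  {name}")
```

A character-based ratio such as this one tolerates small spelling differences but knows nothing about synonyms (for instance, it does not relate "lung cancer" to "neoplasm of bronchus and lung"), so a real lookup would probably need a synonym list or a different metric.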

With respect to the system’s organization and the experience acquired during its construction, it is likely that they can be used to build similar systems in other biomedical areas. The system is sufficiently modular, and in particular, the organization of the user interface and the orchestration of all the parts under Guacamole are fully reusable.

In a similar way, some of the tables of the database can be used without changes in applications that require these data (drug names/codes or information on genes and variations).

Supplementary data

Supplementary data are available at Database Online.

Data availability

The database can be consulted through a web browser, but for those wishing to reproduce this work or to use the data for other purposes, our software is available under a free GNU licence. This includes an R script to fill in the gene-related tables from data in the Bioconductor package org.Hs.eg.db and Perl scripts to fill in the drug-related tables from the ATC as provided by the WHO Collaborating Centre for Drug Statistics Methodology of Norway (23) and to fill in the pathology-related tables from the ICD10 as provided by the Web’s Free 2022 ICD-10-CM/PCS Medical Coding Reference (24).

The distribution of the complete populated database as files loadable by a database system such as MariaDB or MySQL is possible, too, since the data contained in it are publicly available. The R package org.Hs.eg.db is distributed under Artistic License 2.0. With respect to the ATC and ICD10, the authors acknowledge, in their respective pages, the public availability of the provided information.

To facilitate adaptation and reuse, we installed, and are currently running, the system inside a virtual machine created with QEMU/Kernel Virtual Machine (integrated inside the Linux kernel) which runs a GNU/Linux operating system. GNU/Linux and MariaDB are distributed under the terms of the GNU General Public License (GPL). Guacamole is available under the Apache License. Concerning the software produced by the authors (GUI in Java, Perl macros of data treatment and Perl daemon), this is distributed by us on demand in source form under the GPL, too. Therefore, there is no legal obstacle to making the full virtual machine available. It can be downloaded from https://johnford.uv.es/OncocardioVM and ported to another domain following the directions on the download page. Obviously, it is also possible to install and run each part of the system in a real machine.

Funding

The I+D+i project PGC type B reference PID2020-117114GB-I00 funded by the Spanish Ministry of Science and Education (MCIN/AEI/10.13039/501100011033/); Maria Zambrano (ZA21-063) for the requalification of the Spanish University system—NextGeneration EU; Chilean (ANID/Fondecyt-Postdoctorado no. 3180486); Universidad de La Frontera (DIUFRO DI22-0014 to A.L.R.-C.).

Conflict of interest statement

None declared.

Acknowledgements

The role of Anyelo Norambuena Ruth, an undergraduate student at the Universidad de La Frontera, as a research assistant, is much appreciated.

References

1. Yeh E.T.H. (2011) Onco-cardiology: the time has come. Tex. Heart Inst. J., 246–247.
2. Roth G.A., Mensah G.A., Johnson C.O. et al. (2020) Global burden of cardiovascular diseases and risk factors, 1990–2019: Update from the GBD 2019 study. J. Am. Coll. Cardiol., 2982–3021.
3. Global Burden of Disease Cancer Collaboration (2019) Global, regional, and national cancer incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 29 cancer groups, 1990 to 2017: a systematic analysis for the Global Burden of Disease Study. JAMA Oncology, 1749–1768.
4. Koene R.J., Prizment A.E., Blaes et al. (2016) Shared risk factors in cardiovascular disease and cancer. Circulation, 133, 1104–1114. doi: 10.1161/CIRCULATIONAHA.115.020406
5. Aleman B.M., Moser E.C., Nuver et al. (2014) Cardiovascular disease after cancer therapy. EJC Suppl. doi: 10.1016/j.ejcsup.2014.03.002
6. Moslehi J.J. (2016) Cardiovascular toxic effects of targeted cancer therapies. N. Engl. J. Med., 375, 1457–1467. doi: 10.1056/NEJMra1100265
7. Alexandre, Cautela, Ederhy et al. (2020) Cardiovascular toxicity related to cancer treatment: a pragmatic approach to the American and European Cardio-Oncology Guidelines. J. Am. Heart Assoc., e018403. doi: 10.1161/JAHA.120.018403
8. Stoltzfus K.C., Zhang, Sturgeon et al. (2020) Fatal heart disease among cancer patients. Nat. Commun. doi: 10.1038/s41467-020-15639-5
9. Gernaat S.A.M., P.J., Rijnberg et al. (2017) Risk of death from cardiovascular disease following breast cancer: a systematic review. Breast Cancer Research and Treatment, 164, 537–555. doi: 10.1007/s10549-017-4282-9
10. Strongman, Gadd, Matthews et al. (2019) Medium and long-term risks of specific cardiovascular diseases in survivors of 20 adult cancers: a population-based cohort study using multiple linked UK electronic health records databases. The Lancet, 394, 1041–1054. doi: 10.1016/S0140-6736(19)31674-5
11. Masoudkabir, Sarrafzadegan, Gotay et al. (2017) Cardiovascular disease and cancer: evidence for shared disease pathways and pharmacologic prevention. Atherosclerosis, 263, 343–351. doi: 10.1016/j.atherosclerosis.2017.06.001
12. Hinrichs, Mrotzek S.M., Mincu R.I. et al. (2020) Troponins and natriuretic peptides in cardio-oncology patients–data from the ECoR Registry. Front. Pharmacol. doi: 10.3389/fphar.2020.00740
13. (2022) Opportunities and challenges in cardio-oncology: a bibliometric analysis from 2010 to 2022. Curr. Probl. Cardiol., 101227.
14. Pavlopoulou, Spandidos D.A. and Michalopoulos (2015) Human cancer databases (review). Oncol. Rep.
15. Weinstein J.N., Collisson E.A., Mills G.B. et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 1113–1120.
16. Fernandes, Patel and Husi (2018) C/VDdb: a multi-omics expression profiling database for a knowledge-driven approach in cardiovascular disease (CVD). PLoS One.
17. Alexandar, Nayar P.G., Murugesan et al. (2015) CardioGen base: a literature based multi-omics database for major cardiovascular diseases. PLoS One.
18. Hamosh, Scott A.F., Amberger J.S. et al. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 514–517.
19. Sweeting, Oliver-Williams, Teece et al. (2022) VICORI Collaborative Data Resource Profile: The Virtual Cardio-Oncology Research Initiative (VICORI) linking national English cancer registration and cardiovascular audits. Int. J. Epidemiol., 1768–1779.
20. Francis (2017) Cardio-oncology: the role of big data. Heart Fail. Clin., 403–408.
21. Wishart D.S., Bartok, Oler et al. (2020) MarkerDB: an online database of molecular biomarkers. Nucleic Acids Res., D1259–D1267.
22. Cartas-Espinel, Telechea-Fernández, Manterola Delgado et al. (2022) Novel molecular biomarkers of cancer therapy-induced cardiotoxicity in adult population: a scoping review. ESC Heart Fail., 1651–1665.
23. WHO Collaborating Centre for Drug Statistics Methodology of Norway. (2021) Guidelines for ATC Classification and DDD Assignment, 2021. Oslo, 2020. https://www.whocc.no/atc_ddd_index/ (11 November 2021, date last accessed).
24. ICD10Data. (2021) The Web’s Free 2022 ICD-10-CM/PCS Medical Coding Reference. https://www.icd10data.com/ (21 April 2023, date last accessed).
25. Roth Y.D., Lian, Pochiraju et al. (2020) Datanator: an integrated database of molecular data for quantitatively modeling cellular behavior. Nucleic Acids Res., D516–D522.