CANCROX: a cross-species cancer therapy database

Table 1

Biological databases used in this study

Type of information in the database	Number of entries in database
Cancer genes	477
Oncogenes	355
Tumor suppressor genes	122
Pharmacological data and therapies	11 688
Drugs and substances processed	1308
DrugBank	656
PubChem	451
ChemSpider	104
New drugs and substances	97
Combination of drugs and substances	10 380

Table 2

Metrics of the random forest classifier

Classifier	Precision	Recall	F-score
Random forest	91.8%	85.39%	88.47%

Table 3

Metrics of the NER

Classifier	Precision	Recall	F-score
Max entropy (A)	71.20%	66.39%	68.71%
Max entropy + regular expression (B)	87.12%	86.61%	86.86%

Figure 3

Statistical analysis of the drug Gleevec (imatinib mesylate). (A) Main therapies combined with Gleevec for the treatment of different types of cancer, including interferons, radiation, gemcitabine and nilotinib, among others. (B) Data obtained by accessing the databases of the CHAT project (29) (http://chat.lionproject.net/). According to the literature, Gleevec is classified as ‘evasion and metastasis’. (C) Graphical representation of the 20 types of cancer treated with the drug. (D) A snippet of the relationship network between Gleevec, its therapy combinations and the associated types of cancer.

The data shown in Table 1 were obtained by analyzing more than 338 000 articles. This processing identified 1308 drugs, substances and therapies, and a total of 10 380 drug combinations were obtained and stored in the database. To increase the reliability of the data, the drugs identified by the CANCROX tool were cross-referenced with external databases such as ChemSpider and PubChem. During this process, 1211 drugs and substances were identified and mapped to the external databases, whereas 97 drugs and substances were not located in them. The latter included experimental drugs, drugs synthesized from an analogous drug and new drugs not yet stored in these databases. This finding supports the efficiency of the NER algorithm (precision of ~87%; Table 3) in identifying and extracting drugs from the literature. The Web interface offers a search mechanism covering 477 genes, 10 380 drug combinations and 40 types of cancer.
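The cross-referencing step can be illustrated with a minimal sketch. The lookup order (DrugBank, then PubChem, then ChemSpider), the name normalization and the toy name sets are assumptions made for illustration only; the text does not describe how CANCROX performs the matching.

```python
def normalize(name):
    """Lower-case a drug name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def cross_reference(drug_names, sources):
    """Map each drug to the first external source that lists it.

    `sources` is an ordered list of (label, names) pairs; the order is an
    assumption, not something stated in the article.
    Returns (mapped, unmapped): a dict drug -> label, and a list of the
    drugs found in none of the sources.
    """
    indexed = [(label, {normalize(n) for n in names}) for label, names in sources]
    mapped, unmapped = {}, []
    for drug in drug_names:
        key = normalize(drug)
        for label, names in indexed:
            if key in names:
                mapped[drug] = label
                break
        else:  # no source contained the name
            unmapped.append(drug)
    return mapped, unmapped


# Toy data (hypothetical, not taken from CANCROX).
sources = [
    ("DrugBank", {"Imatinib", "Gemcitabine"}),
    ("PubChem", {"Nilotinib", "Imatinib"}),
    ("ChemSpider", {"Curcumin"}),
]
extracted = ["imatinib", "Nilotinib", "curcumin", "compound-x17"]
mapped, unmapped = cross_reference(extracted, sources)
print(mapped)
print(unmapped)
```

With the counts in Table 1, the mapped entries add up to 656 + 451 + 104 = 1211, which leaves 1308 - 1211 = 97 drugs and substances without an external match.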

Evaluation metrics for the classifiers and NER

Evaluation metrics make it possible to verify the efficiency of the classification algorithm (random forest) and of the algorithm for identifying therapies and drugs (NER). The random forest algorithm was trained using 4000 articles annotated with the value 0 for those reporting treatment failure and the value 1 for those reporting treatment success. This number corresponds to 2.95% of the 338 445 articles available for classification. The algorithm was trained with 10-fold cross-validation (k = 10) and 300 trees in the random forest. The metrics obtained are shown in Table 2.
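As a reminder of how these metrics are defined, the following sketch computes precision, recall and F-score from binary labels and splits a data set into k folds. It uses only the Python standard library and toy labels; the function names are illustrative and this is not the authors' implementation.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_fscore(y_true, y_pred):
    """Return (precision, recall, F-score) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    base, extra = divmod(n_samples, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy labels: 1 = treatment success, 0 = treatment failure.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(precision_recall_fscore(y_true, y_pred))

# 4000 annotated articles split into 10 folds of 400 test articles each.
folds = list(k_fold_indices(4000, k=10))
print(len(folds), len(folds[0][1]))
```

Applied to the precision and recall in Table 2, `f_score(0.918, 0.8539)` gives approximately 0.885, in line with the reported F-score.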

As can be seen in Table 3, two models were implemented for NER. This was necessary because of the low performance obtained with model (A). Its precision of 71.2% is the result of a large number of false positives, i.e. the model correctly identifies drugs and therapies, but a series of other words are erroneously recognized as such. Model (B), which combines the maximum entropy algorithm (34) implemented in the Apache OpenNLP library with regular expressions, improved precision by 22.35% in relative terms (from 71.20% to 87.12%) and eliminated a considerable amount of noise from the final result.
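The idea of cleaning NER output with regular expressions can be sketched as a post-filter over candidate entities. The patterns, the minimum length and the function name below are hypothetical: the article does not list the expressions used, and the authors worked with Apache OpenNLP rather than Python.

```python
import re

# Hypothetical noise patterns; the real expressions are not given in the article.
NOISE_PATTERNS = [
    re.compile(r"^\d+(\.\d+)?$"),        # bare numbers such as "12.5"
    re.compile(r"^[^A-Za-z]*$"),         # tokens without any letter
    re.compile(r"^(patients?|study|treatment|results?)$", re.IGNORECASE),
]


def filter_candidates(candidates, min_length=3):
    """Drop candidate entities that look like noise rather than drug names."""
    kept = []
    for candidate in candidates:
        token = candidate.strip()
        if len(token) < min_length:
            continue
        if any(pattern.match(token) for pattern in NOISE_PATTERNS):
            continue
        kept.append(token)
    return kept


# Candidates as a statistical NER model might emit them (toy example).
raw = ["imatinib", "patients", "12.5", "gemcitabine", "  ", "5-fluorouracil"]
print(filter_candidates(raw))
```

A filter of this kind removes false positives only, so it mainly raises precision; any gain in recall has to come from the model or from additional matching rules.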

Case study: imatinib mesylate

Imatinib mesylate, also known as Gleevec and Glivec, is a tyrosine kinase inhibitor. According to the National Cancer Institute site, this drug is used for the treatment of the following cancers: acute lymphoblastic leukemia, chronic eosinophilic leukemia, chronic myelogenous leukemia, dermatofibrosarcoma protuberans, gastrointestinal stromal tumor, myelodysplastic/myeloproliferative neoplasms and systemic mastocytosis. Thus, seven cancers are listed as treated with this drug. In comparison, the CANCROX tool identified 20 cancer categories and 45 therapy combinations for imatinib mesylate. A consolidated view of these data is shown in Figure 3.

Figure 3 shows the statistical analysis of the drug Gleevec. The tool allows the same analysis for any of the 1308 treatments (drugs and therapies) that make up the database. Using the statistical analysis tool, researchers can verify the main combinations of therapies applied to different types of cancer and access the title and abstract of the articles referring to each therapy combination identified by the tool.
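A per-drug summary of this kind can be approximated by counting records. The sketch below assumes a flat list of (drug, combined therapy, cancer type) tuples; this layout and the records themselves are hypothetical and do not reflect the actual CANCROX schema or data.

```python
from collections import Counter, defaultdict

# Hypothetical records: (drug, combined therapy, cancer type).
records = [
    ("imatinib", "nilotinib", "chronic myelogenous leukemia"),
    ("imatinib", "interferon", "chronic myelogenous leukemia"),
    ("imatinib", "nilotinib", "gastrointestinal stromal tumor"),
    ("imatinib", "radiation", "gastrointestinal stromal tumor"),
    ("imatinib", "nilotinib", "chronic myelogenous leukemia"),
]


def summarize(records, drug):
    """Count combined therapies and collect cancer types for one drug."""
    therapies = Counter()
    cancers_by_therapy = defaultdict(set)
    for record_drug, therapy, cancer in records:
        if record_drug != drug:
            continue
        therapies[therapy] += 1
        cancers_by_therapy[therapy].add(cancer)
    return therapies, cancers_by_therapy


therapies, cancers_by_therapy = summarize(records, "imatinib")
print(therapies.most_common(3))
print(sorted(cancers_by_therapy["nilotinib"]))
```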

Comparison with other databases

To compare and evaluate the CANCROX database in terms of drug combinations, the following databases were selected from the literature: Antifungal Synergistic Drug Combination Database (ASDCD) (30) and Drug Combination Database (DCDB) (31). The number of databases that explore drug combinations is limited. The ASDCD lists a total of 1225 drug combinations and 105 individual drugs obtained from 12 000 references of the medical literature. This database specializes in drug combinations used for the treatment of fungal infections; however, some of these drugs are also used to treat different types of cancer. Comparison between the ASDCD and CANCROX databases identified 35 individual drugs present in both databases, i.e. drugs used both to treat fungal infections and to combat cancer, as well as 92 drug combinations present in both.

Another database that takes the drug combination approach is DCDB, which contains 1363 drug combinations and 904 individual drugs. Its number of individual drugs is 31% lower than that of CANCROX (904 versus 1308). Since the DCDB was last updated in 2014, this difference of 404 individual drugs can be explained by the research and discovery of new compounds in the period after that update. Similar to the approach adopted in the present study, ~14% of the studies on drug combinations in the DCDB report ‘failure’ of the experiments. This number is close to that obtained during the classification phase of the present study, in which ~17% of the articles were identified as reporting ‘failure’ of the treatment employed.
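An overlap comparison of this kind reduces to set intersections once drug names are normalized and each combination is stored independently of the order of its members. The sketch below assumes exact matching on lower-cased names and uses placeholder drug names; how the authors matched entries across databases is not described.

```python
def combo(*drugs):
    """Represent a drug combination as an order-independent set of names."""
    return frozenset(d.strip().lower() for d in drugs)


def compare(drugs_a, combos_a, drugs_b, combos_b):
    """Return the individual drugs and the combinations shared by two databases."""
    shared_drugs = {d.strip().lower() for d in drugs_a} & {d.strip().lower() for d in drugs_b}
    shared_combos = set(combos_a) & set(combos_b)
    return shared_drugs, shared_combos


# Placeholder entries (hypothetical, not taken from ASDCD, DCDB or CANCROX).
db_a_drugs = ["Drug A", "Drug B", "Drug C"]
db_a_combos = [combo("Drug A", "Drug B"), combo("Drug B", "Drug C")]
db_b_drugs = ["drug b", "drug a", "drug d"]
db_b_combos = [combo("drug b", "drug a"), combo("drug a", "drug d")]

shared_drugs, shared_combos = compare(db_a_drugs, db_a_combos, db_b_drugs, db_b_combos)
print(sorted(shared_drugs))
print(shared_combos)

# Arithmetic behind the DCDB comparison quoted in the text.
difference = 1308 - 904
print(difference, round(100 * difference / 1308))
```

Storing each combination as a `frozenset` means that "A + B" and "B + A" count as the same entry, which avoids undercounting shared combinations.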

Conclusions

CANCROX is the first tool focused on implementing a reference database of similar human and canine genes associated with cancer. It gives researchers who use this animal model the opportunity to access and analyze a set of 477 genes associated with more than 40 types of cancer and more than 10 000 combinations of drugs and therapies for this disease. In this version, CANCROX focuses on the canine model; however, the architecture of the tool permits the use of other models as a reference, for example the mouse (32) and zebrafish (33). The CANCROX database contains information about cancer treatment and prevention, the associated genes, and drugs and therapies and their different combinations, thus providing data for groups studying animal models, in this case the dog, as well as groups studying cancer in humans. The database is therefore expected to become a platform of consolidated information that helps the scientific community in cancer research. The initial planning of this work foresees an update of the CANCROX database in 2 years.

Conflict of interest. None declared.

Database URL: http://cancrox.gmb.bio.br/

References

1. Larsen, P.O. and von Ins (2010) The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics, 575–603.
2. Wheeler, D.L., Barrett, Benson, D.A. et al. (2006) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., –D12.
3. Acland, Agarwala, Barrett et al. (2014) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res.
4. McGuire (2016) World cancer report 2014. Geneva, Switzerland: World Health Organization, International Agency for Research on Cancer, WHO Press, 2015. Adv. Nutr., 418–419.
5. Workman, Aboagye, Balkwill et al. (2010) Guidelines for the welfare and use of animals in cancer research. Br. J. Cancer, 102, 1555–1566.
6. Khanna, Lindblad-Toh, Vail et al. (2006) The dog as a cancer model. Nat. Biotechnol., 1065–1066.
7. Lindblad-Toh, Wade, C.M., Mikkelsen, T.S. et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803–816.
8. Boyko, A.R., Boyko, R.H., Boyko, C.M. et al. (2009) Complex population structure in African village dogs and its implications for inferring dog domestication history. Proc. Natl. Acad. Sci. USA, 106, 13903–13908.
9. Ostrander, E.A. and Wayne, R.K. (2005) The canine genome. Genome Res., 1706–1716.
10. Ihaka and Gentleman (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 299–314.
11. Durinck, Spellman, P.T., Birney et al. (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc., 1184.
12. Yates, Akanni, Amode, M.R. et al. (2015) Ensembl 2016. Nucleic Acids Res., 44(D1), D710–D716.
13. Forbes, S.A., Bindal, Bamford et al. (2010) COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res., D945–D950.
14. Maglott, Ostell, Pruitt, K.D. et al. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res., D54–D58.
15. Cunningham, Amode, M.R., Barrell et al. (2014) Ensembl 2015. Nucleic Acids Res., D662–D669.
16. Gray, K.A., Yates, Seal, R.L. et al. (2014) Genenames.org: the HGNC resources in 2015. Nucleic Acids Res., D1079–D1085.
17. Hamosh, Scott, A.F., Amberger, J.S. et al. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., D514–D517.
18. Wishart, D.S., Knox, Guo, A.C. et al. (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res., D668–D672.
19. Bolton, E.E., Wang, Thiessen, P.A. et al. (2008) PubChem: integrated platform of small molecules and biological activities. In Annual Reports in Computational Chemistry. Elsevier, San Francisco, 217–241.
20. Wagner, A.H., Coffman, A.C., Ainscough, B.J. et al. (2015) DGIdb 2.0: mining clinically relevant drug–gene interactions. Nucleic Acids Res., D1036–D1044.
21. Hewett, Oliver, D.E., Rubin, D.L. et al. (2002) PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res., 163–165. doi: 10.1093/nar/30.1.163
22. Pence, H.E. and Williams (2010) ChemSpider: an online chemical information resource. Journal of Chemical Education, 1123–1124. doi: 10.1021/ed100697w
23. Quinlan, J.R. (1986) Induction of decision trees. Mach. Learn., –106.
24. Pedregosa, Varoquaux, Gramfort et al. (2011) Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 2825–2830.
25. Kottmann, Margulies, Ingersoll et al. (2013) Apache OpenNLP. The Apache Software Foundation. https://opennlp.apache.org/ (04 March 2019, date last accessed).
26. Berger, A.L., Pietra, V.J.D. and Pietra, S.A.D. (1996) A maximum entropy approach to natural language processing. Comput. Linguist.
27. DeVita, V.T. and Schein, P.S. (1973) The use of drugs in combination for the treatment of cancer: rationale and results. N. Engl. J. Med., 288, 998–1006.
28. Chauhan, Velankar, Brahmandam et al. (2007) A novel Bcl-2/Bcl-XL/Bcl-w inhibitor ABT-737 as therapy in multiple myeloma. Oncogene, 2374.
29. Baker, Ali, Silins et al. (2017) Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer. Bioinformatics, 3973–3981.
30. Chen, Ren, Chen et al. (2014) ASDCD: Antifungal Synergistic Drug Combination Database. PLoS One.
31. Liu, Wei et al. (2014) DCDB 2.0: a major update of the drug combination database. Database (Oxford), 2014, bau124.
32. Frese, K.K. and Tuveson, D.A. (2007) Maximizing mouse cancer models. Nat. Rev. Cancer, 654.
33. Feitsma and Cuppen (2008) Zebrafish as a cancer model. Mol. Cancer Res., 685–694.
34. Berger, A.L., Della Pietra, V.J. and Della Pietra, S.A. (1996) A maximum entropy approach to natural language processing. Computational Linguistics. MIT Press.