CBGDA: a manually curated resource for gene–disease associations based on genome-wide CRISPR

Table 1.

Gene type distribution in CBGDA

Gene type	Count	Percentage (%)
Genes with protein products	209	86.70
Long noncoding RNAs	7	3
microRNAs	24	10
Pseudogene	1	0.30

Furthermore, we used the GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment functions of the R package “clusterProfiler” to perform enrichment analysis on the 241 genes [25]. These genes were significantly enriched in critical cellular survival pathways and vital biomolecular functions (P ≤ .05) (Table 2 and Fig. 1). The GO enrichment analysis revealed the top five biological processes (BPs) to be histone modification; mitotic cell cycle phase transition; regulation of mitotic cell cycle phase transition; regulation of cell cycle phase transition; and positive regulation of erythrocyte differentiation. The top five cellular components were methyltransferase complex; histone methyltransferase complex; transferase complex, transferring phosphorus-containing groups; serine/threonine protein kinase complex; and RISC complex (RNA-induced silencing complex). The top five molecular functions (MFs) were protein serine/threonine kinase activity; mRNA base-pairing translational repressor activity; transcription corepressor activity; protein serine kinase activity; and translation repressor activity. The KEGG enrichment analysis showed the top five pathways to be cell cycle, microRNAs in cancer, FoxO signaling pathway, cellular senescence, and Polycomb repressive complex. Aberrations in these pathways are closely associated with disease onset and progression. Therefore, investigating these pathways may advance our understanding of disease progression and facilitate the development of targeted therapeutics.
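The over-representation test that underlies this kind of enrichment analysis can be sketched in a few lines. The example below is a minimal Python illustration and not the clusterProfiler code used in this study: it computes a one-sided hypergeometric P-value for a single term and applies a Benjamini-Hochberg adjustment. The gene list size (241) and the overlap (22, the cell cycle gene count in Table 2) come from the text; the background size of 20 000 genes and the term size of 150 genes are hypothetical values chosen only for illustration.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of at least k annotated genes in a list of N genes
    drawn from a background of M genes, n of which carry the annotation."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical background (20000) and term size (150); 241 and 22 are from the text.
p_cell_cycle = hypergeom_sf(k=22, M=20000, n=150, N=241)
print(f"illustrative P-value: {p_cell_cycle:.3g}")
```

Because the background and term size are assumed, the printed value is not expected to reproduce the P-values in Table 2.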

Figure 1.

Chord diagram of GO and KEGG enrichment. The outermost layer represents the adjusted P-value, the inner layer represents the enrichment ratio, and the chords represent the number of genes shared between enriched terms.

Table 2.

Top five terms of KEGG and GO enrichment analysis

Enrichment	Term	Gene count	P-value
KEGG	Cell cycle	22	5.57 × 10⁻¹²
	MicroRNAs in cancer	29	7.42 × 10⁻¹²
	FoxO signaling pathway	19	6.56 × 10⁻¹¹
	Cellular senescence	17	8.4 × 10⁻⁸
	Polycomb repressive complex	12	9.22 × 10⁻⁷
GO
BPs	Histone modification	36	3.61 × 10⁻¹⁵
	Mitotic cell cycle phase transition	32	3.28 × 10⁻¹²
	Regulation of mitotic cell cycle phase transition	26	2.7 × 10⁻¹⁰
	Regulation of cell cycle phase transition	29	2.7 × 10⁻¹⁰
	Positive regulation of erythrocyte differentiation	9	9.79 × 10⁻⁸
Cellular components	Methyltransferase complex	16	2.8 × 10⁻¹²
	Histone methyltransferase complex	13	2.46 × 10⁻¹¹
	Transferase complex, transferring phosphorus-containing groups	19	3.33 × 10⁻⁷
	Serine/threonine protein kinase complex	12	1.37 × 10⁻⁶
	RISC complex	20	4.13 × 10⁻⁶
MFs	Protein serine/threonine kinase activity	23	7.7 × 10⁻⁷
	mRNA base-pairing translational repressor activity	18	1.41 × 10⁻⁶
	Transcription corepressor activity	15	1.67 × 10⁻⁶
	Protein serine kinase activity	20	2.07 × 10⁻⁶
	Translation repressor activity	18	2.71 × 10⁻⁶

Interface description and application

Main page

The homepage provides a brief introduction to the primary functions and potential impact of the CBGDA database. A search bar allows users to enter a gene or disease of interest. When the query matches a gene or disease in the database, a dropdown menu is triggered; upon clicking “search,” users are redirected to the detail page of the respective gene or disease. If the queried entity is not found, a “Not found” note is displayed (Fig. 2a).
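The search behaviour described above can be illustrated with a small sketch. This is not the CBGDA implementation: the function names `suggest` and `resolve` are hypothetical, and case-insensitive substring matching is an assumption, since the matching rule used by the site is not stated here.

```python
def suggest(query, names, limit=10):
    """Return dropdown candidates whose name contains the query (assumed rule)."""
    q = query.strip().lower()
    if not q:
        return []
    return sorted(name for name in names if q in name.lower())[:limit]


def resolve(query, genes, diseases):
    """Map an exact query to ('gene' | 'disease', name), or ('not_found', None)."""
    q = query.strip().lower()
    for kind, names in (("gene", genes), ("disease", diseases)):
        for name in names:
            if name.lower() == q:
                return (kind, name)
    return ("not_found", None)


# Entity names taken from the examples in the text.
GENES = ["KAT7"]
DISEASES = ["Werner syndrome", "HCC"]

print(suggest("wer", GENES + DISEASES))
print(resolve("kat7", GENES, DISEASES))
print(resolve("unknown entity", GENES, DISEASES))
```

A "not_found" result corresponds to the “Not found” note shown on the homepage.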

Figure 2.

Main page of CBGDA: (a) An example of searching “AG”; (b) Diagrams of statistics of different types of genes, diseases, phenotypes, and variants; (c) Interaction diagram of the associations between genes and diseases.

The pie charts visualize the categorization and proportional representation of the core data within the database, including genes, diseases, mutation data, and gene–disease relationship data. Hovering the cursor over each section of the pie chart displays the corresponding category’s name and quantity (Fig. 2b).

The interaction diagram presents the gene–disease relationships in the database as a network. For instance, HCC (hepatocellular carcinoma) was associated with the highest number of genes, with 34 genes affecting its progression, which may reflect the high heterogeneity of HCC. In contrast, 25 diseases had only one related gene (Fig. 2c).
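Summary figures of this kind (the most connected disease, the number of diseases with a single related gene) follow from counting distinct genes per disease in the list of gene–disease pairs. The sketch below shows one way to do that; the placeholder names `GENE_A`, `GENE_B`, and `Disease X` are hypothetical toy data, not CBGDA records.

```python
from collections import defaultdict


def genes_per_disease(associations):
    """Count distinct genes linked to each disease in (gene, disease) pairs."""
    by_disease = defaultdict(set)
    for gene, disease in associations:
        by_disease[disease].add(gene)
    return {disease: len(genes) for disease, genes in by_disease.items()}


# Toy pairs; only the KAT7 / Werner syndrome pair comes from the text.
pairs = [
    ("KAT7", "Werner syndrome"),
    ("GENE_A", "HCC"),
    ("GENE_B", "HCC"),
    ("GENE_B", "HCC"),  # repeated evidence for the same pair is counted once
    ("GENE_A", "Disease X"),
]

counts = genes_per_disease(pairs)
most_connected = max(counts, key=counts.get)
single_gene_diseases = sorted(d for d, c in counts.items() if c == 1)
print(most_connected, counts[most_connected])
print(single_gene_diseases)
```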

The navigation bar permits browsing of the core data information in the database via the “Browse” option. The “HELP” button provides a manual to assist database users in understanding the meaning and usage of each term. The “CONTACT” option enables users to contact the author.

Browse

The core data of the database are presented in a tabulated format, with pagination buttons separating gene data from disease data (Fig. 3). In the gene data section, each row represents a gene–disease relationship identified from a publication. The displayed data include the gene symbol, disease name, phenotype, type of CRISPR screening, evidence, and PubMed Unique Identifier, among other parameters. Each column header is explained in the “HELP” interface.

Figure 3.

Browse page of CBGDA: The core data of the database are presented in a tabulated format, with pagination buttons to separate gene data and disease data.

Clicking the “Disease” tab displays the information for all retrieved diseases, including the disease description and Disease Ontology database ID, among others. This helps users look up the disease across different databases. In both the gene data and disease data sections, clicking on the gene name or disease name in the first column redirects the user to the corresponding detailed page.

For instance, the first row of data shows that researchers identified, through genome-wide CRISPR screening, that the gene KAT7 promotes the onset and progression of Werner syndrome. The specific functional pathway is detailed in the “Function” column. The CRISPR screening employed positive selection, used the GeCKOv2 library, and was conducted in the hMPC cell line. This conclusion was further substantiated through additional experimental validation.
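One row of the gene table can be thought of as a small record. The sketch below models the KAT7 example as a Python dataclass; the class name and field names are hypothetical and are not the actual CBGDA column names, and fields whose values are not given in the text (function, PubMed ID) are left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Association:
    """One gene-disease relationship curated from a publication (illustrative)."""
    gene_symbol: str
    disease: str
    selection: str                  # e.g. positive or negative selection
    library: str                    # CRISPR library used in the screen
    cell_line: str
    function: Optional[str] = None  # free-text mechanism, if recorded
    pmid: Optional[str] = None      # PubMed Unique Identifier, if recorded


kat7_row = Association(
    gene_symbol="KAT7",
    disease="Werner syndrome",
    selection="positive",
    library="GeCKOv2",
    cell_line="hMPC",
)


def by_gene(records, symbol):
    """Return all associations for a gene symbol, ignoring case."""
    return [r for r in records if r.gene_symbol.lower() == symbol.lower()]


print(by_gene([kat7_row], "kat7"))
```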

Gene detail page

The gene detail page is accessible by clicking either “search” in the homepage search bar or the gene symbol in the table on the “Browse” page. We use the KAT7 gene detail page as an example.

The content on this page is organized into several sections. The “Summary” section displays fundamental information about KAT7, such as its HGNC ID, UniProt ID, and approved symbol. Adjacent to this basic information, the interactive 3D protein structure of KAT7 from the AlphaFold database is presented using the Mol* plugin. The “Association” section displays all the gene–disease associations collected from the curated articles. In the “Chemicals” section, chemicals that interact with or directly influence KAT7 expression are listed in a table. The first column contains the names of the chemicals, the second column describes their interaction with or influence on KAT7, and the third column provides the related PubMed ID (Fig. 4). Following the “Chemicals” section, the expression of KAT7 in cancers and noncancer diseases is presented. For cancer, clicking the “show” button reveals box plots of KAT7 expression across various types of cancer cell lines (Supplementary Fig. S1). For noncancer diseases, selecting a disease of interest from the dropdown menu and then clicking “show” displays a volcano plot of KAT7 expression in that disease, with KAT7 marked on the plot (Supplementary Fig. S2). Both types of plots can be downloaded in PDF format. If the gene is not present in the data downloaded from DepMap or GEN, “Failed” is displayed instead. In the “Variants” section, the mutation sites of KAT7 in related diseases are listed, including the specific chromosomal location and mutation type (Fig. 4).

Figure 4.

Detail page of CBGDA: Summary section of KAT7 displays fundamental information and the interactive 3D protein structure, associations between genes and diseases, chemicals for KAT7, variants for KAT7 in related diseases.

Disease detail page

For the disease detail page, we used Werner syndrome as an example.

The “Summary” section introduces Werner syndrome and is followed by “External Resources,” which are direct links to authoritative databases or websites for more detailed information on Werner syndrome. The “Chemicals” section lists, in a tabulated format, chemicals that interact with or directly affect Werner syndrome. In the “Variants” section, gene mutations associated with Werner syndrome are detailed. The information in this section differs slightly from that on the gene page but likewise includes specifics such as chromosomal location. This organization gives an overview of the disease, its associated mutations, and potential chemical interactions.

Download page

The download page provides the first version of the CBGDA data (up to September 2023). Downloadable files containing gene, disease, chemical, and variant information were prepared according to the annotation datasets. Datasets are provided in both CSV and TXT formats, and users can select the desired dataset from the dropdown menu on the page for downloading.
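A downloaded CSV file can be read with standard tooling. The sketch below uses Python's built-in `csv` module on an in-memory sample; the column names `gene_symbol` and `disease` are assumptions made for illustration, so the header of the actual CBGDA file should be checked before reusing it.

```python
import csv
import io


def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


# In-memory stand-in for a downloaded file; the column names are assumed.
sample = io.StringIO(
    "gene_symbol,disease\n"
    "KAT7,Werner syndrome\n"
)

rows = load_rows(sample)
genes = sorted({row["gene_symbol"] for row in rows})
print(len(rows), genes)
```

For a file on disk, the same function can be called as `load_rows(open(path, newline="", encoding="utf-8"))`, where `path` is wherever the download was saved.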

Discussion and conclusion

The relationship between diseases and genes is intricate. Certain genes have been shown to promote one type of disease while inhibiting another [26], highlighting the need to uncover clues within the interactions between genes and their associated diseases. Such interaction relationships can be found in various database resources. CRISPR is widely regarded as one of the most efficient gene-editing technologies available today. By using CRISPR technology for genome-wide screening, disease-relevant genes can be identified with greater effectiveness, precision, and reliability than with many other methods and techniques.

In our approach, we specifically collected articles that employed genome-wide CRISPR screening to uncover the association between genes and diseases. We focused on studies that provided clear and definitive evidence of the relationship between genes and diseases and manually annotated the research findings. Leveraging the advantages of CRISPR, we have developed the CBGDA database, which provides a comprehensive collection of key host genes involved in various disease mechanisms, discovered through genome-wide CRISPR screening, along with their associations. The core data in CBGDA are derived from the manual curation of thousands of genome-wide CRISPR-related publications sourced from PubMed. Additionally, external data from authoritative databases and literature are also incorporated. In future iterations, we plan to continually update the database with the latest data and integrate online analysis tools to provide a robust platform for biologists to explore the intricate relationships between genes and diseases.

Our work also has certain limitations. On one hand, the data we have collected focus solely on gene regions identified through CRISPR screening. However, numerous other functional components within the genome have been found to be associated with certain diseases. In fact, the majority of disease-related mutations are located in noncoding regions, such as silencers, insulators, 5ʹ UTR and 3ʹ UTR, and introns [27], all of which can be targeted using CRISPR. On the other hand, the data we have collected may not encompass all relevant publications that have been published. Many more relevant publications could be retrieved by searching full-text articles in PubMed Central. Considering these two aspects, we will adopt new strategies in the future to further enhance and improve the database, ensuring that it becomes more comprehensive and reliable.

Overall, by harnessing the power of CRISPR technology, our database provides a valuable resource for researchers in the field of computational biology, enabling them to delve deeper into the intricate network of gene–disease relationships. This database serves as a foundation for further research and paves the way for advancements in disease diagnosis, prevention, and treatment.

Acknowledgements

The authors express their sincere gratitude to all individuals who contributed to this research and to the various public repositories that provided open access data, software, and tools.

Author contributions

Q.D. carried out this project and wrote the draft of the main section of the manuscript. N.Z. and Q.D. were responsible for the conception and design of the subject. Data analysis and data collection were primarily carried out by Z.Z. and Q.D. Q.D. and N.Z. took primary responsibility for web development and server setup. W.Y. and X.Z. offered suggestions and helped with the conception and construction of the database. J.B., C.W., and N.Z. reviewed and critically revised the content of the study and approved the final version to be published.

Supplementary data

Supplementary data is available at Database online.

Conflict of interest

None declared.

Funding

This work was supported by the National Natural Science Foundation of China (grant numbers 31971162, U20A20410, and 32071275) and the Guangzhou Medical University Research Enhancement Project (50010724-1158).

Data availability

CBGDA is freely accessible at http://cbgda.zhounan.org/main.

References

1. Wang, Wei, Sabatini et al. Genetic screens in human cells using the CRISPR-Cas9 system. Science 2014;343. doi: https://doi.org/10.1126/science.1246981

2. Karimian, Azizian, Parsian et al. CRISPR/Cas9 technology as a potent molecular tool for gene therapy. J Cell Physiol 2019;234:12267. doi: https://doi.org/10.1002/jcp.27972

3. Morgens, Deans et al. Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes. Nat Biotechnol 2016;634. doi: https://doi.org/10.1038/nbt.3567

4. Kantor, McClements, MacLaren. CRISPR-Cas9 DNA base-editing and prime-editing. Int J Mol Sci 2020:6240. doi: https://doi.org/10.3390/ijms21176240

5. Uddin, Rudin, Sen. CRISPR gene therapy: applications, limitations, and implications for the future. Front Oncol 2020:1387.

6. Amberger, Hamosh. Searching online Mendelian inheritance in man (OMIM): a knowledgebase of human genes and genetic phenotypes. Curr Protoc Bioinform 2017. doi: https://doi.org/10.1002/cpbi.27

7. Piñero, Queralt-Rosinach, Bravo et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028. doi: https://doi.org/10.1093/database/bav028

8. Landrum, Lee, Benson et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 2016;D862. doi: https://doi.org/10.1093/nar/gkv1222

9. Seal, Braschi, Gray et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res 2023;D1003. doi: https://doi.org/10.1093/nar/gkac888

10. Schriml, Munro, Schor et al. The human disease ontology 2022 update. Nucleic Acids Res 2022;D1255. doi: https://doi.org/10.1093/nar/gkab1063

11. Harrison, Weber, Jakob et al. ICD-11: an international classification of diseases for the twenty-first century. BMC Med Inform Decis Mak 2021. doi: https://doi.org/10.1186/s12911-021-01534-6

12. Nelson, Schulman. Orthopaedic literature and MeSH. Clin Orthop Relat Res 2010;468:2621. doi: https://doi.org/10.1007/s11999-010-1387-4

13. Esmaeili, Narimani, Vasighi. Discovering SNP-disease relationships in genome-wide SNP data using an improved harmony search based on SNP locus and genetic inheritance patterns. PLoS One 2023:e0292266. doi: https://doi.org/10.1371/journal.pone.0292266

14. Klimova, Kuca, Novotny et al. Cystic fibrosis revisited - a review study. Med Chem 2017;102. doi: https://doi.org/10.2174/1573406412666160608113235

15. Ponti, De Angelis, Ponti et al. Hereditary breast and ovarian cancer: from genes to molecular targeted therapies. Crit Rev Clin Lab Sci 2023;640. doi: https://doi.org/10.1080/10408363.2023.2234488

16. Nienhuis, Nathwani, Davidoff. Gene therapy for hemophilia. Mol Ther 2017;1163. doi: https://doi.org/10.1016/j.ymthe.2017.03.033

17. Prasher, Greenway, Singh. The impact of epigenetics on cardiovascular disease. Biochem Cell Biol 2020. doi: https://doi.org/10.1139/bcb-2019-0045

18. Martínez-Jiménez, Muiños, Sentís et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 2020;555. doi: https://doi.org/10.1038/s41568-020-0290-x

19. Zhang, Zou, Zhu et al. Gene Expression Nebulas (GEN): a comprehensive data portal integrating transcriptomic profiles across multiple species at both bulk and single-cell levels. Nucleic Acids Res 2022;D1016. doi: https://doi.org/10.1093/nar/gkab878

20. Love, Huber, Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014. doi: https://doi.org/10.1186/s13059-014-0550-8

21. Sehnal, Bittrich, Deshpande et al. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res 2021;W431. doi: https://doi.org/10.1093/nar/gkab314

22. Franz, Lopes, Huck et al. Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics 2016;309. doi: https://doi.org/10.1093/bioinformatics/btv557

23. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res 2023;D523.