curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome

Abstract

This article introduces a manually curated data collection for gene expression meta-analysis of patients with ovarian cancer and software for reproducible preparation of similar databases. This resource provides uniformly prepared microarray data for 2970 patients from 23 studies with curated and documented clinical metadata. It allows users to efficiently identify studies and patient subgroups of interest for analysis and to perform meta-analysis immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods and clinical data formats. We confirm that the recently proposed biomarker CXCL12 is associated with patient survival, independently of stage and optimal surgical debulking, which was possible only through meta-analysis owing to insufficient sample sizes of the individual studies. The database is implemented as the curatedOvarianData Bioconductor package for the R statistical computing language, providing a comprehensive and flexible resource for clinically oriented investigation of the ovarian cancer transcriptome. The package and pipeline for producing it are available from http://bcb.dfci.harvard.edu/ovariancancer.

Database URL: http://bcb.dfci.harvard.edu/ovariancancer

Introduction

A wealth of genomic data, in particular microarray data, is publicly available through diverse online resources. Major databases of gene expression data, e.g. the Gene Expression Omnibus (GEO) (1) and ArrayExpress (2), offer the potential to identify sets of genes predictive of cancer survival and of patient resistance to chemotherapy using thousands of samples from multiple laboratories. Such high numbers of samples are needed to robustly identify and validate gene signatures for incorporation into routine clinical practice (3). However, inconsistencies among database interfaces, expression data storage formats and clinical metadata annotations present formidable obstacles to making efficient use of these resources.

Existing resources that aim to enable large-scale, high-dimensional analysis across multiple studies tend to serve only a few specifically targeted needs. To develop reproducible biomarker discovery methods appropriate for clinical translation, a data resource must be accurate and retain clinical variables of known importance as much as possible. The insilicoDB (4) project provides many curated gene expression data sets; however, it is not a focused resource in terms of retention or quality assurance of clinical annotations, or retention of all relevant data sets and clinical variables for any one cancer type. The other major database of curated gene expression studies, the Gene Expression Atlas (2), provides machine-annotated rather than manually annotated data, resulting in reduced consistency of annotation across studies. These are among the only databases that offer basics such as uniform gene identifiers to enable cross-study analysis, and then only for the most common microarray technologies. Carey et al. (5) describe a framework for the curation, annotation and storage of microarray and high-throughput data in general. This framework allows, for example, institutions to provide researchers access to in-house and public data in a standardized and convenient fashion. However, there is no existing database that provides these resources for ovarian cancer.

Ovarian cancer is the fifth-leading cause of cancer deaths among women (6) and has been the focus of numerous clinical transcriptome investigations. The curatedOvarianData database is the result of a focused effort to enable meta-analysis of these studies and to provide the highest quality and most comprehensive gene expression data resource for any cancer. It provides standardized gene expression and clinical data for 2970 ovarian cancer patients from 23 studies spanning 11 gene expression measurement platforms, in the form of documented ExpressionSet objects for R/Bioconductor (7). Gene expression data were collected from public databases and author websites, processed in a consistent manner and mapped uniformly to official HUGO Gene Nomenclature Committee (HGNC) (8) gene symbols. Curated clinical annotations were machine-checked for correctness of syntax and human-checked by two individuals to ensure accuracy. This data package is geared primarily towards bioinformatic and statistical researchers, providing an ideal resource for development and assessment of algorithms for high-dimensional classification, clustering and survival analysis. It will also be valuable to ovarian cancer researchers for biomarker identification and validation. In addition to providing all publicly available gene expression studies with patient survival in common forms of ovarian cancer, it includes tumours of rare histologies, normal tissues and uncommon early-stage tumours. Special effort is made to retain the most important clinical variables from author-provided metadata and from the original publications: overall survival, optimal debulking surgery and tumour stage, grade and histology.

We also developed a software pipeline for automated and reproducible production of this and comparable data libraries. The pipeline includes a controlled language for curation of clinical annotations, defined by a template, which is intuitive for non-programmers to create and edit, but which is also used directly for machine syntax checking of curated annotations. The pipeline handles all steps of the process including data download, microarray preprocessing, merging of duplicate probe sets and sample technical replicates, up-to-date probe-set to gene mapping and building of the R/Bioconductor objects and package.

One important application of the database is testing of hypothesized prognostic markers of ovarian cancer using multiple independent studies. We validated a recently proposed independent prognostic indicator of ovarian cancer, CXCL12 (9), using 13 published studies, demonstrating for this biomarker that numerous studies are needed to overcome the lack of power in individual studies of smaller sample size. We provide code in the documentation of the curatedOvarianData package demonstrating how this comprehensive analysis, which was previously impractical to achieve, is a straightforward application of the database.

Methods and implementation

The pipeline for creating the data package from public databases (Table 1) is fully automated, with the exception of the manual curation of clinical annotations (Figure 1). This manual curation was integrated in the pipeline with short R scripts that reformat user-provided annotations into a standardized template, which largely follows the format of The Cancer Genome Atlas (29). This template is provided in Table 2 and used as a unit test in the curatedOvarianData package, i.e. the curation is automatically checked for valid values in the package building process. Downloading phenotype data and expression data from GEO (1), syntax validation of curated clinical metadata, microarray data preprocessing, normalization, gene mapping and the creation of Bioconductor ‘ExpressionSet’ objects, which link gene expression data and phenotype annotations, were fully automated. The generation of the package is reproducible using the pipeline provided at https://bitbucket.org/lwaldron/curatedovariandata.

Figure 1

Flowchart of the data collection and curation pipeline. The software implementing this pipeline reproduces all steps from downloading of data to final packaging, requiring manual intervention only for identifying studies, curation of clinical metadata and documentation of the package.

Table 1

Data sets in the curatedOvarianData database

Data set	Reference	Platform	Samples	Late Stage^a (%)	Serous Subtype (%)	Median Survival (Months)	Median Follow-up (Months)	Censoring (%)
E.MTAB.386	(10)	Ill. HumanRef-8 v2	129	99	100	42	55	43
GSE12418	(11)	SWEGENE v2.1.1_27k	54	100	100	N/A	N/A	N/A
GSE12470	(12)	Agilent G4110b	53	66	81	N/A	N/A	N/A
GSE13876	(13)	Operon Human v3	157	100	100	25	72	28
GSE14764	(14)	Affy U133a	80	89	85	54	37	74
GSE17260	(15)	Agilent G4112a	110	100	100	53	47	58
GSE18520	(16)	Affy U133 Plus 2.0	63	84	84	25	140	23
GSE19829.GPL570	(17)	Affy U133 Plus 2.0	28	N/A	N/A	47	62	39
GSE19829.GPL8300	(17)	Affy U95 v2	42	N/A	N/A	45	50	45
GSE20565	(18)	Affy U133 Plus 2.0	140	48	51	N/A	N/A	N/A
GSE2109	N/A	Affy U133 Plus 2.0	204	42	42	N/A	N/A	N/A
GSE26712	(19)	Affy U133a	195	96	95	46	90	30
GSE30009	(20)	TaqMan qRT-PCR 380	103	100	99	41	53	45
GSE30161	(21)	Affy U133 Plus 2.0	58	100	81	50	83	38
GSE32062.GPL6480	(22)	Agilent G4112a	260	100	100	59	56	53
GSE32063	(22)	Agilent G4112a	40	100	100	53	81	45
GSE6008	(23)	Affy U133a	99	54	41	N/A	N/A	N/A
GSE6822	(24)	Affy Hu6800	66	N/A	62	N/A	N/A	N/A
GSE9891	(25)	Affy U133 Plus 2.0	285	85	93	47	36	59
PMID15897565^b	(26)	Affy U133a	63	83	100	N/A	N/A	N/A
PMID17290060^c	(27)	Affy U133a	117	98	100	63	82	43
PMID19318476	(28)	Affy U133a	42	93	100	34	89	48
TCGA	(29)	Affy HT U133a	578	90	98	45	52	48

These data sets provide curated gene expression and clinical data for a total of 2970 samples, including all publicly available ovarian cancer gene expression experiments with individual patient survival information at the time of press.

^aOnly FIGO Stages III and IV.

^bData set is a subset of the samples from the retracted paper PMID17290060, Dressman et al. (27).

^cPaper was retracted because of a misalignment of genomic and survival data (30); the corrected data are provided here.

N/A, not available.

Table 2

Curated clinical annotations

Characteristic	Allowed values	Description
sample_type	tumour, metastatic, cellline, healthy, adjacentnormal	healthy, only from individuals without cancer; adjacentnormal, from individuals with cancer
histological_type	ser, endo, clearcell, mucinous, other, mix, undifferentiated	ser, serous; endo, endometrioid; clearcell, clear cell; mix, mixture of ser + endo. Other includes sarcomatoid, endometrioid, papillary serous, adenocarcinoma, dysgerminoma
primarysite	ov, ft, other	ov, ovary; ft, fallopian tube
arrayedsite	ov, ft, other	ov, ovary; ft, fallopian tube
summarygrade^a	low, high	low, 1, 2, LMP (low malignant potential); high, 3, 2/3
summarystage	early, late	early, FIGO I, II, I/II; late, FIGO III, IV, II/III, III/IV
tumourstage	1, 2, 3, 4	FIGO Stage (I–IV, translated to 1–4 for R usage)
substage	a, b, c, d	Substage (abcd)
grade^a	1, 2, 3	Grade (1–3)
age_at_initial_pathologic_diagnosis	1-99	Age at initial pathologic diagnosis in years
pltx	y/n	Patient treated with Platin
tax	y/n	Patient treated with Taxol
neo	y/n	Neoadjuvant treatment
days_to_tumour_recurrence	decimal	Time to recurrence or last follow-up in days
recurrence_status	recurrence, no recurrence	Recurrence censoring variable
days_to_death	decimal	Time to death or last follow-up in days
vital_status	living, deceased	Overall survival censoring variable
os_binary	short, long	Dichotomized overall survival time; as defined by study
relapse_binary	short, long	Dichotomized relapse variable; as defined by the study
site_of_tumour_first_recurrence	metastasis, locoregional, etc.	Site of the first recurrence
primary_therapy_outcome_success	completeresponse, etc.	Response to any kind of therapy
debulking	optimal, suboptimal	Amount of residual disease (optimal ≤ 1 cm)
percent_normal_cells	0–100+/−	Estimated percentage of normal cells; 20− ≤ 20%
percent_stromal_cells	0–100+/−	Estimated percentage of stromal cells
percent_tumour_cells	0–100+/−	Estimated percentage of tumour cells; 80+ ≥ 80%
batch	character	Hybridization date or other available batch variable
uncurated_author_metadata	character	All original, uncurated metadata

Additional study-specific details are provided in the package manual.

^aMost ovarian cancer pathologists follow the FIGO grading system, although some exceptions (15, 22, 25) are noted in the package manual.
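
The summary categories in Table 2 can be read as simple lookup rules. The following Python sketch is an illustration only and is not code from the curatedOvarianData pipeline, which is written in R; the function and dictionary names are hypothetical, and the handling of whitespace, letter case and unrecognized labels is an assumption.

```python
# Lookup tables transcribed from the summarystage and summarygrade rows of Table 2.
SUMMARY_STAGE = {
    "I": "early", "II": "early", "I/II": "early",
    "III": "late", "IV": "late", "II/III": "late", "III/IV": "late",
}

SUMMARY_GRADE = {
    "1": "low", "2": "low", "LMP": "low",   # LMP: low malignant potential
    "3": "high", "2/3": "high",
}


def summarize_stage(figo_stage):
    """Return 'early' or 'late', or None if the stage is missing or unrecognized."""
    if figo_stage is None:
        return None
    return SUMMARY_STAGE.get(str(figo_stage).strip().upper())


def summarize_grade(grade):
    """Return 'low' or 'high', or None if the grade is missing or unrecognized."""
    if grade is None:
        return None
    return SUMMARY_GRADE.get(str(grade).strip().upper())
```

Returning None for unrecognized labels, rather than guessing, leaves ambiguous author annotations for manual review.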

Data acquisition and curation

Our search for clinically annotated ovarian cancer microarray studies identified 21 published studies, which provided 23 publicly available data sets from various sources (Table 1). The search not only targeted studies of primary tumours annotated with patient survival but also included studies providing other potentially valuable clinical annotation. Other main factors of interest included drug resistance, outcome of the primary tumour debulking surgery, histology, stage and grade. We excluded studies not measuring gene expression (e.g. studies of genomic copy number), studies of cell lines, animal models or non-primary tumours, and data sets not providing clinical information. Expression and clinical data were obtained from the two major public repositories, GEO (1) and ArrayExpress (2), or otherwise from the supplementary data of the original publications. Data from GEO were obtained using the GEOquery package (31). Clinical annotations were manually curated using one R script per data set, and original uncurated annotations were retained as a single field. The syntax of curated annotations was checked against a template, which standardized all the known clinically relevant indicators and allowable data values. Clinical data were curated twice independently (authors B.G. and T.R.), and all discrepancies were resolved for the final version. The availability of clinical data varied substantially across data sets (Figure 2).
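
To make the idea of template-based syntax checking concrete, the following Python sketch validates one curated record against a small subset of the allowed values in Table 2. It is a minimal illustration under stated assumptions, not the package's own unit test (which is implemented in R): the names TEMPLATE and validate_record, the regular expression for decimal values and the decision to accept missing values are all hypothetical.

```python
import re

# A subset of the Table 2 template. Each field maps either to a set of
# allowed values or to a regular expression the value must fully match.
TEMPLATE = {
    "sample_type": {"tumour", "metastatic", "cellline", "healthy", "adjacentnormal"},
    "summarystage": {"early", "late"},
    "vital_status": {"living", "deceased"},
    "tumourstage": {"1", "2", "3", "4"},
    "days_to_death": re.compile(r"\d+(\.\d+)?"),  # assumed form of "decimal"
}


def validate_record(record, template=TEMPLATE):
    """Return a list of (field, value) pairs that violate the template.

    Missing values (None) are accepted, because not every study reports
    every characteristic. Field names absent from the template are errors.
    """
    errors = []
    for field, value in record.items():
        if field not in template:
            errors.append((field, value))
            continue
        if value is None:
            continue
        rule = template[field]
        text = str(value)
        if isinstance(rule, set):
            ok = text in rule
        else:
            ok = rule.fullmatch(text) is not None
        if not ok:
            errors.append((field, value))
    return errors
```

For example, a record with vital_status set to "dead" would be reported as a violation, prompting the curator to recode it as "deceased".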

Figure 2

Available clinical annotation. This heatmap visualizes for each curated clinical characteristic (rows) the availability in each data set (columns). Red indicates that the corresponding characteristic is available for at least one sample in the data set. See Table 2 for descriptions of these characteristics.

Gene expression processing and gene mapping

Where raw data from Affymetrix U133a or U133 Plus 2.0 platforms were available, these were pre-processed by frozen Robust Multi-array Analysis (fRMA) (32). Raw data from other Affymetrix platforms were pre-processed by Robust Multi-array Average (RMA) (33); otherwise, we used pre-processed data as provided by the authors. Up-to-date maps from probe set IDs to gene symbols were obtained from BioMart (34). Where BioMart maps were not available but target sequences were provided for the microarray platforms, we used the BLAST algorithm (35) to map these sequences against the human genome (build GRCh37) and to identify the gene transcript targeted by each probe. Otherwise, the annotations provided with the platform on GEO were used. In the curatedOvarianData version of the package, genes with multiple probe sets were represented by the probe set with the highest mean across all data sets of the same platform (36), and this original probe set identifier was also stored in the ExpressionSet object (7). We selected the same representative probe set for all studies of a common microarray platform. Finally, we provide two alternative versions of the package. In NormalizerVcuratedOvarianData, redundant probe sets are averaged after filtering probe sets with low correlation to their redundant probe sets, using the Normalizer function of the Sleipnir library for computational functional genomics (37). FULLVcuratedOvarianData does not collapse redundant probe sets targeting the same gene transcript but instead provides a probe set to gene symbol map in the featureData slot of each ExpressionSet.
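
The highest-mean rule for choosing one representative probe set per gene can be sketched in a few lines of Python. This is an illustration rather than the pipeline's R implementation: the function name and input layout are hypothetical, and it assumes that the expression values of each probe set have already been pooled over all samples from data sets on the same platform.

```python
from statistics import mean


def select_representative_probes(expression, probe_to_gene):
    """Pick, for each gene, the probe set with the highest mean expression.

    expression    -- dict: probe set ID -> list of expression values pooled
                     over all samples measured on one platform
    probe_to_gene -- dict: probe set ID -> gene symbol
    Returns a dict: gene symbol -> chosen probe set ID.
    """
    best = {}  # gene symbol -> (probe set ID, mean expression)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None or not values:
            continue  # skip unmapped probe sets and probe sets without data
        probe_mean = mean(values)
        if gene not in best or probe_mean > best[gene][1]:
            best[gene] = (probe, probe_mean)
    return {gene: probe for gene, (probe, _) in best.items()}
```

Because the selection is made once per platform, every study on that platform is represented by the same probe set, and each gene-level value remains traceable to a single probe set identifier.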

Final packaging

Technical replicate samples were merged by averaging. Microarray expression data and clinical metadata were then represented as ExpressionSet objects (7) for each study. The ExpressionSet objects were also populated with citations, platform identifiers and details, data preprocessing methods and warnings of retracted papers (27) and specimens also used in other studies (26, 28, 29, 38). ExpressionSets were packaged as the curatedOvarianData R library, which provides a reference manual including descriptions of the syntax template and summaries of the annotations, citation, microarray platform and other information for each study.
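
Merging technical replicates by averaging amounts to an element-wise mean over the expression profiles that belong to the same specimen. The Python sketch below illustrates this under simple assumptions: the function name and the two input dictionaries are hypothetical, and all profiles are assumed to list genes in the same order.

```python
from collections import defaultdict


def merge_technical_replicates(profiles, sample_to_specimen):
    """Average the expression profiles of samples from the same specimen.

    profiles           -- dict: sample ID -> list of expression values
    sample_to_specimen -- dict: sample ID -> specimen ID
    Returns a dict: specimen ID -> element-wise mean profile.
    """
    groups = defaultdict(list)
    for sample, values in profiles.items():
        groups[sample_to_specimen[sample]].append(values)
    merged = {}
    for specimen, rows in groups.items():
        # zip(*rows) iterates over genes, yielding one value per replicate
        merged[specimen] = [sum(column) / len(column) for column in zip(*rows)]
    return merged
```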

Discussion

We introduce a data package for the R/Bioconductor statistical programming environment that includes all current major ovarian cancer gene expression data sets (Table 1). The process of downloading clinically annotated public genomic data and proceeding to a final computational analysis is, despite recent efforts (4, 5), still long and prone to errors. This is particularly true when the various data sets need to be comparable for meta-analyses, which requires fully standardized annotation. Our package provides a comprehensive and highly curated resource for efficient meta-analysis of the ovarian cancer transcriptome, for biological analysis and for bioinformatic methods development. It additionally provides a complete computational pipeline to reproduce this process for other cancers or data sources.

Two common problems of publicly available genomic data are the scarcity of clinical annotation and inconsistent definitions of clinical characteristics across independent data sets (5). In our review of original papers and curation of clinical annotations, we were nevertheless able to retain, in most studies, the clinical variables of proven importance: overall survival, age, optimal debulking surgery, tumour histology, grade and stage (Figure 2). Other characteristics, such as detailed treatment information or recurrence-free survival times, were rarely available; however, ovarian cancer has a relatively standard treatment regimen of platinum chemotherapy and no radiotherapy. The most important clinical variables were in general consistently defined between studies, with these definitions provided in Table 2. Notably, all studies used the International Federation of Gynecology and Obstetrics (FIGO) staging system, and all but one study (11) defined suboptimal debulking surgery as residual tumour mass > 1 cm (Table 2). The relatively large number of well-annotated data sets in this database may allow interesting future work addressing the problem of recovering missing annotations from genomic data only (40).

One important use of this database is the assessment of prognostic biomarkers. As a demonstration, we examined a recent study by Popple et al. (9), which analysed the expression of the chemokine protein CXCL12 using a tissue microarray of 289 primary ovarian cancers. CXCL12/CXCR4 is a chemokine/chemokine receptor axis that has previously been shown to be directly involved in cancer pathogenesis (41, 42). Ovarian cancer constitutively expresses CXCL12 and CXCR4, and both tumour CXCL12/CXCR4 expression and stroma-derived CXCL12 expression have been reported to be prognostic factors in human ovarian cancer (41). Popple et al. found that patients whose tumours produced high levels of CXCL12 protein had significantly poorer survival than patients whose tumours produced low amounts of this chemokine, independently of stage, residual disease (optimal debulking) and adjuvant chemotherapy. The patient cohort was heterogeneous, with various histologic types, grades and stages, leaving open the question of whether this biomarker would be generalizable to other patient populations. Furthermore, differences in protein abundance may not be associated with RNA abundance.

To investigate these questions, we analysed CXCL12 expression in all primary tumour samples included in curatedOvarianData for which overall survival information was available. To ensure that the expression values were on the same scale across studies, all data sets were centred by their means and scaled by their standard deviations. A population hazard ratio (HR) was then pooled with a fixed-effects model, in which the HR for each cohort was weighted with the inverse of the standard error. This is visualized as a forest plot in Figure 3. Although the effect is only significant (P < 0.05) in three cohorts individually, the pooled HR is significantly larger than 1 (HR = 1.15, 95% CI 1.09–1.23). HR refers to the HR between patients differing by one standard deviation in CXCL12 expression. This confirms the hypothesis that upregulation of CXCL12 is associated with poor outcome in 2108 patients from 13 independent studies with mixed stage, grade and histologies. The effect is thus small but consistently detected, emphasizing the importance of biomarker validation in sufficiently large data collections. To assess the independence of CXCL12 with stage and residual disease, we also analysed the 1776 patients from 10 studies where both FIGO tumour stage and success of debulking surgery were known. Adjustment for these two established predictors in multivariate analysis had little effect on the observed association between CXCL12 and overall survival (HR = 1.13, 95% CI 1.05–1.21). These HRs are comparable in magnitude to that reported by Popple et al. for ‘moderate’ CXCL12 staining (HR = 1.215, 95% CI 0.892–1.655), but lower than reported for ‘high’ staining (HR = 1.684, 95% CI 1.180–2.404). This potentially reflects that the function of this gene is at the protein level. Consistent with previous reports (9, 38), we found no significant association of the receptor CXCR4 with overall survival (HR = 0.95, 95% CI 0.9–1.01, P = 0.09). 
These analyses are straightforward and fully reproduced as examples in the package documentation. Additional analyses limited to more homogeneous patient subsets, e.g. limited to tumours of the same histology, are needed, but they are another straightforward application of the package.
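
The pooling step described above can be illustrated with a short, self-contained sketch. It is not the package's code: it is written in Python with the standard library only, it assumes the conventional inverse-variance weights (1/SE²) of a fixed-effects model as one reading of the weighting described in the text, and the per-cohort values are hypothetical numbers chosen for illustration. The function names `se_from_ci` and `pool_fixed_effect` are likewise invented here.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def se_from_ci(ci_low, ci_high):
    """Recover the standard error of a log hazard ratio from its 95% CI."""
    return (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)


def pool_fixed_effect(log_hrs, std_errs):
    """Fixed-effects pooling of per-cohort log HRs (inverse-variance weights)."""
    weights = [1.0 / se ** 2 for se in std_errs]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_hrs)) / total
    pooled_se = math.sqrt(1.0 / total)
    return {
        "hr": math.exp(pooled),
        "ci_low": math.exp(pooled - Z_95 * pooled_se),
        "ci_high": math.exp(pooled + Z_95 * pooled_se),
        "se": pooled_se,
    }


# Hypothetical per-cohort estimates (NOT taken from the article):
# each HR is per one standard deviation of expression.
cohort_log_hrs = [math.log(1.30), math.log(1.05), math.log(1.18)]
cohort_ses = [0.10, 0.08, 0.12]

result = pool_fixed_effect(cohort_log_hrs, cohort_ses)
print("pooled HR = {hr:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})".format(**result))
```

Pooling is done on the log scale because log HRs are approximately normally distributed. The helper `se_from_ci` shows how a published HR with its confidence interval, such as those reported by Popple et al., could be put on the same footing before comparison.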

Figure 3

The database confirms CXCL12 as prognostic of overall survival in patients with ovarian cancer. Forest plot of the expression of the chemokine CXCL12 as a univariate predictor of overall survival, using all 14 data sets with applicable expression and survival information. HR indicates the factor by which overall risk of death increases with a one standard deviation increase in CXCL12 expression. A summary HR significantly larger than 1 indicates that patients with high CXCL12 levels had poor outcome and confirms in several lines of code the previously reported association between CXCL12 abundance and patient survival (9). Consideration of important clinicopathological features such as stage, grade, histology and residual disease (optimal surgical debulking) is also straightforward; examples are provided in the package vignette.


In constructing curatedOvarianData, we took several steps to minimize across-study batch effects. Where raw Affymetrix microarray data were available, we used a standardized pre-processing protocol. All data sets from the same platform were normalized with the same algorithms and parameters. For the Affymetrix U133A and U133 Plus 2.0, we chose the fRMA (32) normalization algorithm, a variant of the standard RMA (33) algorithm that uses publicly available microarray databases to estimate probe-specific effects and variances, instead of using only the samples from the data set to be normalized. We provide example code in the database documentation for removing between-platform batch effects with the ComBat method (43). Such batch effect removal is typically necessary when data sets are merged.
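
For single-gene analyses such as the CXCL12 example, the per-data-set centring and scaling described earlier is a much simpler way to bring expression values onto a common scale. The sketch below illustrates only that step, in Python with the standard library. It is not ComBat and not a substitute for it. The study names and expression values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def standardize_within_study(values):
    """Centre one gene's expression values by the study mean and scale by the study SD."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator); an assumption of this sketch
    if s == 0:
        raise ValueError("expression is constant within this study; cannot scale")
    return [(v - m) / s for v in values]


# Hypothetical expression of one gene in two studies measured on different scales.
studies = {
    "study_A": [7.1, 8.4, 9.0, 6.5],
    "study_B": [120.0, 95.0, 143.0, 110.0],
}

scaled = {name: standardize_within_study(vals) for name, vals in studies.items()}

for name, vals in scaled.items():
    print(name, [round(v, 2) for v in vals])
```

After this transformation, a one-unit difference corresponds to one within-study standard deviation in every study, which is what allows hazard ratios to be read as per standard deviation.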

If different platforms are compared, then the mapping of probe sets to common identifiers such as gene symbols is a critical and error-prone step. In particular when older platforms are considered, care must be taken to ensure that the probe sets target identical transcripts; gene identification is a persistent problem in genome-scale data integration. We used the BioMart database (34) to map stable manufacturer probe set identifiers or Genbank IDs to current standard gene symbols. For cases in which no stable identifiers were available, we used the BLAST algorithm (35) to identify gene symbols from the probe oligonucleotide sequences. Because many genes are targeted by more than one probe set, several approaches to collapsing probe sets to single genes have been proposed (36, 44, 45). In the main version of the package, we selected the probe set with the highest mean across all data sets from the same platform to represent each gene transcript, a method shown to perform well (36) and with the advantage of being traceable back to a single oligonucleotide probe sequence for each platform. We also provide two alternative packages with averaged and un-collapsed probe sets. The version with un-collapsed probe sets provides current HGNC symbols in the featureData slot of the ExpressionSet objects, which makes the application of alternative methods for collapsing probe sets to unique gene symbols straightforward, e.g. with the WGCNA R package (46).
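
The highest-mean selection rule can be sketched in a few lines. This is an illustration of the idea only, not the package's pipeline, which is implemented in R. The probe identifiers, gene symbols and values are hypothetical, and the sketch takes the mean over a single list of values per probe set rather than over all data sets from a platform.

```python
from statistics import mean


def collapse_by_highest_mean(expression, probe_to_gene):
    """Pick, for each gene, the probe set with the highest mean expression.

    expression:    dict mapping probe set ID -> list of expression values
    probe_to_gene: dict mapping probe set ID -> gene symbol
    Returns a dict mapping gene symbol -> selected probe set ID.
    """
    best = {}  # gene -> (probe, mean expression)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unmapped probe sets are skipped
        m = mean(values)
        if gene not in best or m > best[gene][1]:
            best[gene] = (probe, m)
    return {gene: probe for gene, (probe, _) in best.items()}


# Hypothetical probe sets: two target GENE1, one targets GENE2, one is unmapped.
expression = {
    "probe_a": [5.0, 6.0, 7.0],
    "probe_b": [8.0, 9.0, 10.0],
    "probe_c": [3.0, 3.5, 4.0],
    "probe_d": [11.0, 12.0, 13.0],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

selected = collapse_by_highest_mean(expression, probe_to_gene)
print(selected)
```

Because each gene is represented by exactly one existing probe set, every value in the collapsed matrix remains traceable to a single probe sequence, which averaging across probe sets would not allow.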

We demonstrated meta-analytical use of the package by showing a survival association of the recently proposed prognostic biomarker CXCL12 (9). Other possible uses include the validation of multi-gene signatures, and identification of novel gene signatures and biomarkers for patient survival and response to chemotherapy. Finally, this package enables rigorous assessment of high-dimensional machine-learning algorithms in terms of their performance and computational requirements. We plan to continually include newly published ovarian cancer data sets in future versions of this package.

Conclusions

The curatedOvarianData package provides a comprehensive resource of curated gene expression and clinical data for the development and validation of ovarian cancer prognostic models, the investigation of ovarian cancer subtypes (10, 25, 29), and the comparative assessment of machine learning algorithms for gene expression data. This database greatly reduces the burden of time, expertise and error involved in assembling a compendium of curated gene expression data from tumours of known histopathology and from patients with known clinical progression. These advantages will be appealing to biostatisticians and bioinformaticians for development of analytical methods from high-dimensional genomic data, but the database will additionally provide a common, version-controlled and transparent platform for reproducible investigation of the ovarian cancer transcriptome. The pipeline for creating this database is published under an open license and will facilitate creating similar resources for other cancers. As such, we hope this database and pipeline will provide one part of the solution to reproducibility in high-dimensional genomic research.

Acknowledgements

The authors thank Shaina Andelman for her contributions to graphic design, and also Steve Skates, Jie Ding and Dave Zhao.

Funding

National Cancer Institute at the National Institutes of Health [1RC4CA156551-01 to G.P. and M.B.]; the National Science Foundation [CAREER DBI-1053486 to C.H.]. M.R. acknowledges support from the National Cancer Institute initiative to found Physical Science Oncology Centers [U54CA143798]. Funding for open access charge: National Science Foundation [CAREER DBI-1053486 to C.H.].

Conflict of interest. None declared.

References

1. Edgar, Domrachev, Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 2002 (pg. 207–210).
2. Parkinson, Sarkans, Kolesnikov, et al. ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res., 2011 (pg. D1002–D1004).
3. McDermott, Downing, Stratton. Genomics and the continuum of cancer care. N. Engl. J. Med., 2011, vol. 364 (pg. 340–350).
4. Taminau, Steenhoff, Coletta, et al. inSilicoDb: an R/Bioconductor package for accessing human Affymetrix expert-curated datasets from GEO. Bioinformatics, 2011 (pg. 3204–3205).
5. Carey, Gentry, Sarkar, et al. SGDI: system for genomic data integration. Pac. Symp. Biocomput., 2008 (pg. 141–152).
6. Siegel, Naishadham, Jemal. Cancer statistics, 2012. CA Cancer J. Clin., 2012.
7. Gentleman, Carey, Bates, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 2004 (pg. R80).
8. Seal, Gordon, Lush, et al. genenames.org: the HGNC resources in 2011. Nucleic Acids Res., 2011 (pg. D514–D519).
9. Popple, Durrant, Spendlove, et al. The chemokine, CXCL12, is an independent predictor of poor survival in ovarian cancer. Br. J. Cancer, 2012, vol. 106 (pg. 1306–1313).
10. Bentink, Haibe-Kains, Risch, et al. Angiogenic mRNA and microRNA gene expression signature predicts a novel subtype of serous ovarian cancer. PLoS One, 2012 (pg. e30269).
11. Partheen, Levan, Osterberg, et al. Expression analysis of stage III serous ovarian adenocarcinoma distinguishes a sub-group of survivors. Eur. J. Cancer, 2006 (pg. 2846–2854).
12. Yoshida, Furukawa, Haruta, et al. Expression profiles of genes involved in poor prognosis of epithelial ovarian carcinoma: a review. Int. J. Gynecol. Cancer, 2009 (pg. 992–997).
13. Crijns, Fehrmann, de Jong, et al. Survival-related profile, pathways, and transcription factors in ovarian cancer. PLoS Med., 2009 (pg. e24).
14. Denkert, Budczies, Darb-Esfahani, et al. A prognostic gene expression index in ovarian cancer - validation across different independent data sets. J. Pathol., 2009, vol. 218 (pg. 273–280).
15. Yoshihara, Tajima, Yahata, et al. Gene expression profile for predicting survival in advanced-stage serous ovarian cancer across two independent datasets. PLoS One, 2010 (pg. e9615).
16. Mok, Bonome, Vathipadiekal, et al. A gene signature predictive for outcome in advanced ovarian cancer identifies a survival factor: microfibril-associated Glycoprotein 2. Cancer Cell, 2009 (pg. 521–532).
17. Konstantinopoulos, Spentzos, Karlan, et al. Gene expression profile of BRCAness that correlates with responsiveness to chemotherapy and with outcome in patients with epithelial ovarian cancer. J. Clin. Oncol., 2010 (pg. 3555–3561).
18. Meyniel, Cottu, Decraene, et al. A genomic and transcriptomic approach for a differential diagnosis between primary and secondary ovarian carcinomas in patients with a previous history of breast cancer. BMC Cancer, 2010 (pg. 222).
19. Bonome, Levine, Shih, et al. A gene signature predicting for survival in suboptimally debulked patients with ovarian cancer. Cancer Res., 2008 (pg. 5478–5486).
20. Gillet, Calcagno, Varma, et al. Multidrug resistance-linked gene signature predicts overall survival of patients with primary ovarian serous carcinoma. Clin. Cancer Res., 2012 (pg. 3197–3206).
21. Ferriss, Kim, Duska, et al. Multi-gene expression predictors of single drug responses to adjuvant chemotherapy in ovarian carcinoma: predicting platinum resistance. PLoS One, 2012 (pg. e30550).
22. Yoshihara, Tsunoda, Shigemizu, et al. High-risk ovarian cancer based on 126-gene expression signature is uniquely characterized by downregulation of antigen presentation pathway. Clin. Cancer Res., 2012 (pg. 1374–1385).
23. Murph, Liu, et al. Lysophosphatidic acid-induced transcriptional profile represents serous epithelial ovarian carcinoma and worsened prognosis. PLoS One, 2009 (pg. e5583).
24. Ouellet, Provencher, Maugard, et al. Discrimination between serous low malignant potential and invasive epithelial ovarian tumors using molecular profiling. Oncogene, 2005 (pg. 4672–4687).
25. Tothill, Tinker, George, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin. Cancer Res., 2008 (pg. 5198–5208).
26. Berchuck, Iversen, Lancaster, et al. Patterns of gene expression that characterize long-term survival in advanced stage serous ovarian cancers. Clin. Cancer Res., 2005 (pg. 3686–3696).
27. Dressman, Berchuck, Chan, et al. An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer. J. Clin. Oncol., 2007 (pg. 517–525).
28. Berchuck, Iversen, Luo, et al. Microarray analysis of early stage serous ovarian cancers shows profiles predictive of favorable outcome. Clin. Cancer Res., 2009 (pg. 2448–2455).
29. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 2011, vol. 474 (pg. 609–615).
30. Dressman, Berchuck, Chan, et al. Retraction. An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer. J. Clin. Oncol., 2012 (pg. 678).
31. Sean, Meltzer. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 2007 (pg. 1846–1847).
32. McCall, Bolstad, Irizarry. Frozen robust multiarray analysis (fRMA). Biostatistics, 2010 (pg. 242–253).
33. Bolstad, Irizarry, Astrand, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 2003 (pg. 185–193).
34. Durinck, Moreau, Kasprzyk, et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 2005 (pg. 3439–3440).
35. Altschul, Madden, Schaffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 1997 (pg. 3389–3402).
36. Miller, Cai, Langfelder, et al. Strategies for aggregating gene expression data: the collapseRows R function. BMC Bioinformatics, 2011 (pg. 322).
37. Huttenhower, Schroeder, Chikina, et al. The sleipnir library for computational functional genomics. Bioinformatics, 2008 (pg. 1559–1561).
38. Bild, Yao, Chang, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 2006, vol. 439 (pg. 353–357).
39. Kauffmann, Rayner, Parkinson, et al. Importing ArrayExpress datasets into R/Bioconductor. Bioinformatics, 2009 (pg. 2092–2094).
40. Shah, Jonquet, Chiang, et al. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 2009, Suppl. 2.
41. Kajiyama, Shibata, Terauchi, et al. Involvement of SDF-1alpha/CXCR4 axis in the enhanced peritoneal metastasis of epithelial ovarian carcinoma. Int. J. Cancer, 2008, vol. 122.
42. Kulbe, Chakravarty, Leinster, et al. A dynamic inflammatory cytokine network in the human ovarian cancer microenvironment. Cancer Res., 2012.
43. Johnson, Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 2007 (pg. 118–127).
44. Dai, Wang, Boyd, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res., 2005 (pg. e175).
45. Birkbak, Gyorffy, et al. Jetset: selecting the optimal microarray probe set to represent a gene. BMC Bioinformatics, 2011 (pg. 474).
46. Langfelder, Horvath. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 2008 (pg. 559).


Author notes

†These authors contributed equally to this work.

Citation details: Ganzfried,B.F., Riester,M., Haibe-Kains,B., et al. curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database (2013) Vol. 2013: article ID bat013; doi:10.1093/database/bat013

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
