The COMPARE Data Hubs Open Access

List of the bacterial lineages for which the CGE analysis pipeline has databases

Database	Included lineages/plasmids
PlasmidFinder	Enterobacteriaceae, Acinetobacter baumannii, Enterococcus, Streptococcus and Staphylococcus
VirulenceFinder	Escherichia coli, Shigella
SalmonellaTypeFinder	Salmonella
cgMLST	Escherichia coli, Campylobacter, Listeria, Yersinia, Salmonella
MLST	has a database for each scheme in the pubMLST database
ResFinder	all databases in the ResFinder database
pMLST	IncF, IncHI1, IncHI2, IncI1, IncN, IncAC

Table 1

UAntwerp: BACPIPE—bacterial analysis pipeline

BacPipe (Xavier et al., 2019, under review) is a collection of open-access bioinformatics tools arranged into a logical workflow for the analysis of microbial whole-genome sequencing (WGS) data. It was developed to reduce the level of bioinformatics experience required for microbial genome analysis in a clinical setting. This computationally low-resource pipeline enables direct analysis of bacterial whole-genome sequences (raw reads, contigs or scaffolds) obtained from second- and third-generation sequencing technologies. BacPipe covers the full analysis workflow, from read quality assessment to genome assembly, annotation and finally the identification of resistance and virulence genes. The outbreak module (single nucleotide polymorphisms [SNPs] and patient metadata) can analyse many strains simultaneously to elucidate their evolutionary relationships and derive transmission routes. Importantly, parallelization of tools in BacPipe considerably reduces the time-to-result. BacPipe was initially validated using a methicillin-resistant Staphylococcus aureus (MRSA) outbreak WGS data set, amongst other data sets from hospital, community and food-borne outbreaks and from transmission studies of important pathogens. These validations demonstrated the speed and simplicity of the pipeline, which reconstructed the same analyses and conclusions within a few hours. BacPipe consolidates the analysis results into a single worksheet to aid rapid interpretation by clinicians, making it an ideal tool for WGS data analysis and interpretation for routine patient care in hospitals and for infection monitoring in public health settings. The complete package can be obtained at: https://github.com/wholeGenomeSequencingAnalysisPipeline/BacPipe

SELECTA: future pipelines

FLI: RIEMS—metagenomic analysis pipeline

Metagenomics has in recent years proved to be a powerful tool for the analysis of microbial communities for both clinical diagnostic and scientific purposes. However, a major bottleneck is the extraction of relevant, actionable information from these often large metagenomics data sets. Reliable Information Extraction from Metagenomics Sequence data set (RIEMS, 13) was developed to address this challenge by accurately assigning each read in a data set to a taxonomic group. RIEMS has proved highly accurate when compared with similar metagenomics tools on simulated sequence reads, and in 2011 it was used to detect the orthobunyavirus sequence in metagenomics reads that prompted the discovery of the Schmallenberg virus. A manuscript describing an optimized version of RIEMS will be published in due course (Höper et al., 2019, in preparation). The code for the current RIEMS version being integrated into the COMPARE platform can be found at: https://github.com/EBI-COMMUNITY/fli-RIEMS.

EBI-ISS: Parasite analysis pipeline

The parasite pipeline aims to assist wet-lab scientists and clinicians in pathogen surveillance and monitoring through in-depth genomic sequence analysis of clinical samples coupled to their geo-location origins, with an inherent audit trail that facilitates reproducibility. The parasite workflow leverages a set of open-source bioinformatics analysis tools for comparative analysis of the genetic variability, structural variations and recombination events of pathogen strains. It can provide insights into specific pathogen features, such as zoonotic potential, and into the identification of potential targets for drug development. To characterize and formalize the pipeline reports, we conducted a pilot study of intra- and inter-genome comparative analysis of Cryptosporidium hominis, a major cause of diarrheal disease in humans worldwide. This involved comparative analysis of SNPs, insertion-deletions and simple repeats, and the identification of genes under selection pressure. A major outcome of this research will be the implementation of informative pathogen typing schemes derived from genomic data. A manuscript describing the pipeline is in preparation (Alako et al., 2019). The source code of the parasite pipeline is available at https://github.com/EBI-COMMUNITY/ebi-parasite.

RIVM: Jovian—metagenomics analysis pipeline

Jovian is a metagenomics and viromics analysis workflow designed for wet-lab scientists and clinicians. Jovian consolidates established open-source bioinformatics tools into a single easy-to-use, transparent environment. Its analytical workflows automatically process and analyse Illumina sequence reads from human clinical samples into clinically actionable information. The sequence reads first undergo quality control and removal of human sequence data to protect patient privacy. The pipeline then assembles the remaining sequence reads into scaffolds and, where possible, full viral genomes. Assembled sequences are classified up to species level, and viral sequences are taxonomically labelled at the sub-species level. Jovian provides a taxonomic classification of the microbial and viral species detected in the clinical samples via an interactive web report and Jupyter notebook visualisation. Moreover, Jovian enforces an audit trail, which is particularly important for the reproducibility and reporting of clinical analyses. A manuscript describing Jovian will be submitted in the near future (Schmitz et al., 2019, in preparation), and the code will be made public after publication.

EMC: SLIM—viral identification pipeline

SLIM is a Python-based wrapper for two main functions: de novo assembly of reads into contigs, and contig classification. After a quality control step on the short read data to remove common adapters, short reads and low-quality reads, SLIM performs a de novo assembly using SPAdes, described by Bankevich et al. (14), with conditions for either Illumina HiSeq paired-end or Ion Torrent single reads. Next, SLIM classifies the contigs based on a translation of all six reading frames of each contig and a usearch (Edgar, 15) screen for homologies to viral proteins. The output of this classification is a single tab-separated value (TSV) table listing each contig showing protein homology above 30% to sequences in virus family-based databases. Contigs showing homology above the threshold to a virus family entry are copied into family-specific fasta files. A short annotation of the closest identified match (the INSDC accession and a truncated ID derived from the INSDC entry) is added to the contig’s fasta id. Contigs that fail to return a homology to any entry in any of the databases are gathered as ‘mystery contigs’. The tool provides the usearch alignment file for each virus family to allow users to verify classifications. Each analysis run also produces a TSV summary table, which is a useful starting document for further exploration of the results. Because of the need to limit user-defined and run-specific parameters, generic conditions were chosen that provide a reasonable de novo assembly performance across a broad range of sequence data types. Thus, the results serve as a useful starting point for analysis but can usually be improved by tuning the run-specific parameters. A manuscript describing SLIM is in preparation (Cotten, 2019, in preparation). The source code will be released at the time of manuscript publication.
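The six-frame translation step described above can be illustrated with a minimal Python sketch. This is not SLIM's code, which had not been released at the time of writing. It assumes the standard genetic code, and the function names (reverse_complement, translate, six_frame_translation) are hypothetical. Codons containing ambiguous bases are translated as ‘X’.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA sequence codon by codon; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(contig):
    """Translate a contig in the three forward and three reverse frames."""
    frames = {}
    strands = (("+", contig.upper()), ("-", reverse_complement(contig)))
    for strand, seq in strands:
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCATTGTAATGGGCCGC").items():
        print(frame, protein)
```

Each of the six protein strings would then be searched against the viral protein databases; that homology search is outside the scope of this sketch.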

Discovery

The raw data from providers and the automatically processed analysis results from SELECTA workflows are discoverable through the PP and associated authenticated APIs. The PP acts as a single access point to the wealth of raw and processed analysis data. The PP includes an advanced search query builder (https://www.ebi.ac.uk/ena/pathogens/search) that assists the user in creating powerful searches to identify data sets and analyses of interest. Similarly, the Discovery API has a Swagger (https://swagger.io/) interface to assist users with query construction and to give access to documentation on the usage and status codes of each of the endpoints (https://www.ebi.ac.uk/ena/portal/api/#!/Portal_API/downloadDocUsingGET). The documentation can also be downloaded as a PDF. Both the PP advanced search and the Discovery API support user authentication, so that the search is extended to the Data Hubs to which the authenticated user has been granted access, in addition to any public data sets. For complex API queries that may take longer to complete, the user can request an email notification when the search result is ready for download. Email notification is particularly helpful where network speed and robustness may be of concern. For both the PP interactive web interface and the API, the returned result can be set to include all searchable fields, or it can be customized to include only a set of fields of interest, to return the fields in a particular order and to limit the number of results returned. Please also refer to the ‘Usage and access’ section.

Retrieval

Once users have identified data sets of interest, they can utilize a range of tools to download the data files. Because the CDH system was designed to utilize existing ENA technology at EMBL-EBI, CDH users have access to ENA's range of powerful download applications, including its APIs for programmatic users. For non-programmatic users, there is a range of graphical interface options; for pathogen data, the primary tools are the ENA File Downloader and the ENA Browser tools. The ENA File Downloader (https://github.com/enasequence/ena-ftp-downloader) is a stand-alone Java-based graphical user interface application, which allows a user to search directly by either an accession or a search query, or alternatively to upload a search report generated from within the PP to initiate the download of those sequences. The download interface supports selecting multiple data files at once, download progress indication and automatic verification that files have been successfully downloaded using MD5 checksums. The interface supports downloads using either FTP or, for less stable connections, IBM Aspera transfer software (https://downloads.asperasoft.com/). The ENA Browser tools (https://github.com/enasequence/enaBrowserTools/) are a set of Python-based tools that allow command-line (or programmatic) downloads without requiring scripting knowledge. The tools allow downloading all data for a given accession, or data of a particular group for a given accession, with minimal effort from the user. These tools likewise support both FTP and Aspera-based downloads.
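For users who download files by other means, the MD5 verification mentioned above can be reproduced with a few lines of Python. The following is a minimal sketch using only the standard library; it is not the ENA File Downloader's own code. The function names are hypothetical, and the expected checksum is assumed to have been obtained separately, for example from a search report.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequence files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file whose checksum does not match should be downloaded again rather than analysed.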

Exploration

The notebooks

Data sharing has significantly advanced several disciplines, but sharing of the data analysis process in a reproducible way is lagging behind. Scientific results are traditionally published as human-readable articles, but human language lacks the precision and detail of computer code. Therefore, reproducing results, even with the data at hand, is often a long and tiresome process, if it is possible at all. The figures in traditional articles, which cannot be zoomed or subsetted, also hide many of the details present in the data, and multidimensional data sets in particular are hard to represent in passive two-dimensional prints. In the late 1980s, Mathematica’s first Notebook frontend was released. Since then, other languages such as R and Python have slowly but steadily picked up the concept of ‘reproducible research’ (16). In the last couple of years, Jupyter Notebooks (17), which can handle Python, R and several other languages, have become a standard for data analytics and visualisation. The Notebooks integrate the analysis and visualisation process and produce output that can be rendered as rich interactive web pages. Thanks to virtualization techniques, it was straightforward to integrate these tools into the CDH (Figure 2). We use a collaborative cloud-accessible platform, Kooplex (Visontai et al., 2019, in preparation), to perform the final step of data analysis and visualisation. The rendering of Notebooks for presentation through the PP is automated, with the daily generated Notebooks immediately available for distribution. Notebooks can be rendered within the user’s web browser or downloaded. Daily reports are archived, so that previous reports can still be accessed. Two types of Notebooks are available: a simpler, static HTML version that gives a quick overview of the content of the Data Hub on any given day, and a more complex version that users can use to filter, sort, arrange and visualize the data in their web browser without writing code.
In the current version, the primary aim of these Notebooks is to lower the initial effort needed to explore the content of a Data Hub, but further development may allow even richer data exploration and visualisation. Exploration and visualisation are available in the PP under the ‘Explore’ tab. Please also see the ‘Usage and access’ section.

Figure 3

PP interactive interface. 1: select a domain (e.g. read_run); 2: narrow down the search (where desired) by specifying taxonomy and sample collection details (e.g. collection date and country); 3: specify (where desired) which fields should be returned in the result report (e.g. centre name, study accession, sample accession, etc.); 4: result page with download options in TSV and JSON (JavaScript Object Notation).

The Virome browser

Whereas a broad overview of families or genera is usually sufficient in bacterial metagenomics, analysing the virome requires species or sub-species level resolution of sequence annotations to extract useful information. In addition, the detection of a partial sequence of a highly pathogenic virus can require further investigation, while a large quantity of plant virus-derived DNA may be of less interest. Therefore, there is a need for easy, interactive and in-depth browsing of the analysis results. To that end, the Virome Browser (Nieuwenhuijse et al., 2019, in preparation) Shiny app was built, which allows users to download virus-related analysis data from the PP and to visualize and browse these locally. The Virome Browser allows users to import annotation data and assembled contigs from the PP, which can be interactively filtered based on quality thresholds such as amino acid identity and hit length. Subsequently, contigs with a specific annotation can be selected based on user interest. The selected contigs can then be inspected individually, allowing the user to visualize the sequence of the contig and the open reading frame (ORF) structure. Both the nucleotide and amino acid sequences derived from the ORFs can be saved as a FASTA file for further analysis. Metagenomic sequencing of the virome typically requires in-depth post-processing of the analysis results, which can be a daunting task for users without programming experience. The in-depth analysis functionalities of the Virome Browser enable users without a bioinformatics background to extract useful information from raw analysis results. The Virome Browser has been made into an R package, which can be downloaded from GitHub (https://github.com/dnieuw/ViromeBrowser) and installed locally using RStudio.
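The kind of threshold-based filtering described above can be sketched in a few lines of Python. This is an illustration only and is not taken from the Virome Browser, which is an R/Shiny application. The record layout and field names (contig_id, annotation, aa_identity, hit_length) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One annotated contig (hypothetical layout for illustration)."""
    contig_id: str
    annotation: str      # e.g. virus family or species label
    aa_identity: float   # amino acid identity to the best hit, in percent
    hit_length: int      # length of the aligned region, in amino acids


def filter_annotations(records, min_identity=0.0, min_hit_length=0, annotation=None):
    """Keep records that pass both thresholds and, optionally, match a label."""
    return [
        rec for rec in records
        if rec.aa_identity >= min_identity
        and rec.hit_length >= min_hit_length
        and (annotation is None or rec.annotation == annotation)
    ]


if __name__ == "__main__":
    hits = [
        Annotation("contig_1", "Picornaviridae", 92.5, 310),
        Annotation("contig_2", "Picornaviridae", 41.0, 55),
        Annotation("contig_3", "Virgaviridae", 88.0, 420),
    ]
    for rec in filter_annotations(hits, min_identity=80, min_hit_length=100):
        print(rec.contig_id, rec.annotation)
```

Raising the identity threshold trades sensitivity for confidence, which is why an interactive tool that lets the user adjust the thresholds is helpful.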

Configuration

As mentioned in the CDH design section, the Data Hubs allow for flexibility and choice of their configuration to cater for different collaborations. The configuration of a Data Hub depends on the data type being shared and analysed and hence includes choices of an appropriate sample metadata standard as well as the analysis pipeline needed and the visualisation process. Both choices of sample metadata reporting standard (checklists, see also under ‘Upload and Standards’) and the analysis workflow depend on the sample type and provenance, for example whether the sequenced sample is a bacterial isolate, viral, or whether it is derived from an environmental setting.

Usage and access to the platform

Authenticated access via the PP interactive interface (https://www.ebi.ac.uk/ena/pathogens/) and the associated API (https://www.ebi.ac.uk/ena/portal/api/) enables the user to search pre-publication data in a desired Data Hub. To do this, the user needs to authenticate using a username and password combination. The PP/API can also be used to access data in the public domain without user authentication. The search functionality is the same as for an authenticated user, but results are limited to microbial data in the public domain. Figure 3 shows the use of the PP interactive interface, going through the steps of building a query to look in the public domain for read data of Ebola virus collected in Zaire in 1976, and customising the result page to return ‘centre name’, ‘study accession’, ‘sample accession’, ‘run accession’ and ‘fastq_ftp’ (the FTP file path to the processed fastq files). The results can be downloaded in TSV and JSON formats.

The curl command for the same query used in the PP:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'dataPortal=pathogen&result=read_run&query=tax_tree(1570291) AND country="Zaire" AND collection_date=1976-01-01&fields=center_name,study_accession,sample_accession,run_accession,fastq_ftp&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
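For users who prefer scripting, the same request could be prepared in Python along the following lines. This is a minimal sketch using only the standard library. The parameter names (dataPortal, result, query, fields, format) and the endpoint are taken from the curl command above, while the helper function names are hypothetical. The sketch builds the request and parses a TSV report, but sending the request is left to the caller.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_request(query, fields, result="read_run",
                         data_portal="pathogen", fmt="tsv"):
    """Build a form-encoded POST request mirroring the curl command."""
    params = {
        "dataPortal": data_portal,
        "result": result,
        "query": query,
        "fields": ",".join(fields),
        "format": fmt,
    }
    # urlencode percent-encodes spaces and quotes, so the body is plain ASCII.
    body = urlencode(params).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(SEARCH_URL, data=body, headers=headers, method="POST")


def parse_tsv_report(text):
    """Parse a TSV report (header line first) into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


if __name__ == "__main__":
    request = build_search_request(
        'tax_tree(1570291) AND country="Zaire" AND collection_date=1976-01-01',
        ["center_name", "study_accession", "sample_accession",
         "run_accession", "fastq_ftp"],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
```

The prepared request can be sent with urllib.request.urlopen(request), and the decoded response body passed to parse_tsv_report.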

To explore a data set that has been included in a Data Hub and run through one of the integrated analysis pipelines (SELECTA workflows), the user can go to the ‘Explore’ tab in the PP and visualize the analysis results via a connected Notebook. To demonstrate how this works, we selected three Salmonella enterica data sets that are available in the public domain. We configured a Data Hub, called dcc_benoit, to include these data and run them through the CGE bacterial analysis pipeline (BAP). The output of the analysis was used to create a public Notebook; the visualisation has been rendered and can be accessed via the PP: https://www.ebi.ac.uk/ena/pathogens/explore

There are three tabs available here: ‘Data Hub content’ shows summary tables for the data content of the Data Hub; ‘Primary Analysis’ shows, in this case, the visualisation of the CGE analysis pipeline output; and finally there is an ‘AMR’ (antimicrobial resistance) tab, because antimicrobial resistance profiles (antibiograms) exist for 19 of the sequenced samples. For a demonstration video showing how to change views or filter data, please go to https://www.ebi.ac.uk/ena/pathogens/explore and select the ‘View Demo’ button.

At the time of writing, the CDH model has been used for 13 different projects by the COMPARE collaboration. Use and assessment of the CDHs have also been described in Poen M. et al. (2019) and Matamoros S. et al. (2019), both submitted. Table 2 lists the CDHs and their descriptions. Where data are already in the public domain, this is indicated by the corresponding URL to ENA. In other cases, data are still pre-publication confidential at the time of writing.

Table 2

CDHs

Name	Topic	Description
dcc_sibelius	Pilot: Influenza H5N8	This data hub contains RNA sequence data and metadata of three highly pathogenic avian influenza H5N8 viruses used in the H5N8 pilot project. The aim of this project is to determine the similarity and reproducibility of viral genome consensus determination and minority variant detections between three different but commonly used workflows. Data public and described in Poen M., et al. (2019), submitted.
dcc_berlioz	Pilot: Ebola	This data hub contains Ebola sequence data generated as part of a relief effort in West Africa. Data public at: https://www.ebi.ac.uk/ena/data/view/PRJEB10265
dcc_liszt	Global Sewage	This data hub contains the DNA sequences, metagenomics data from the Global Sewage Surveillance project. The data represent all urban sewage samples collected prior to treatment plant inlet from major cities around the world. The study aims to establish a global surveillance of infectious disease agents and AMR.
dcc_strauss	Kibera Sewage	This data hub contains the DNA sequences, metagenomics data from the informal settlement of Kibera, Nairobi, Kenya. The data represent sewage samples collected over 3 months from 2 of the 10 population clusters under US CDC surveillance. The study aims to establish a hot spot disease surveillance.
dcc_vivaldi	Foodborne pathogen surveillance and epidemiological analysis	This data hub contains the DNA sequences and metagenomics data from the COMPARE Work package 4/7 workgroup of food-borne pathogens. The majority of the data consist of the bacterial pathogens Salmonella, E. coli, Listeria and Campylobacter. Samples are collected from human infections, food and animals, and the work done on these sequences aimed to improve and evaluate new and established microbiological and epidemiological analysis methods used in Public Health and the food and veterinary sector. Data were also collected with the goal of developing Source Attribution methods for WGS data.
dcc_schubert	AMR working group	This data hub contains DNA sequences of bacterial isolates’ genomes coupled with phenotypic AMR data. The phenotypic data are presented in standardized antibiograms, describing for each isolate the antibiotics tested, the levels of resistance observed, the antimicrobial susceptibility testing method employed, etc. The aim is to provide a platform for exchange of genomic and phenotypic information regarding AMR, thus encouraging surveillance and the development of innovative projects for the prediction of AMR from sequence data. Data public and described in Matamoros S., et al. (2019), submitted.
dcc_brahms	Diagnostic metagenomics on clinical samples	Diagnostic metagenomics on clinical samples: Prediction of antibiotic resistance genes and pathogen discovery in shot-gun metagenomic data from swine faeces. The data-hub contains shot-gun metagenomic NGS data sets generated from swine holdings with acute diarrhoea. The causative agent, porcine epidemic diarrhoea virus, was identified. Metadiagnostic analysis via FLI RIEMS pipeline revealed co-infection with bacteria. Prediction of antibiotic resistance markers and pathogen discovery will be done as a pilot to test pipelines and algorithms with clinical metagenomics data.
dcc_handel	Virus metagenomics	This data hub is used to share Fastq files of NGS experiments, mostly with a metagenomics approach, on clinical samples of patients with hepatitis A and norovirus gastroenteritis. The human reads in the files have been removed before being uploaded.
dcc_puccini	Parasites (Comparative Genomics of Intestinal Protozoa)	This data hub contains the raw DNA sequences from isolates of the protozoan Cryptosporidium. The data were generated in collaboration with the UK Cryptosporidium Reference Unit and represent both sporadic and outbreak cases. The study aims to understand the major factors that structure parasites’ genomes by using a comparative genomics approach. Data public at: https://www.ebi.ac.uk/ena/data/view/PRJEB15112
dcc_beard	Global Sewage snapshot Virome sequencing part	This data hub contains the raw read data of the virus specific part of the Global Sewage Surveillance project. The data aim to capture the complete DNA and RNA virome of the sampled locations. In addition, the analysis results of the SELECTA-SLIM pipeline can be found in this datahub.
dcc_cole	Metagenomics ring trial	This data hub contains the DNA sequences, metagenomics data from the Food Metagenomics ring trial 2018. The data represent DNA- and RNA-derived metagenomics data sets processed from a piece of smoked salmon spiked with a complex mock community consisting of viruses, bacteria, fungi and a parasite. The study aims to compare wet lab protocols using the same starting material.
dcc_bromhead	CoVetLab (Colistin resistant Enterobacteriaceae project)	This data hub contains the DNA sequences of single bacterial isolates of primarily antimicrobial-resistant Enterobacteriaceae from European National Reference Laboratories. The data represent, amongst others, isolates collected for the EU antimicrobial resistance monitoring in zoonotic and indicator bacteria from humans, animals and food, as well as for the CoVetLab colistin resistance project. The intention is that all data related to the EURL-AR will be submitted to the data hub.
dcc_schumann	Bioaccumulation experiment	This data hub contains DNA sequences and metadata to analyse bioaccumulated oysters. In addition, the analysis results of the SELECTA-SLIM pipeline can be found in this datahub.

Updates will be made available through https://www.ebi.ac.uk/ena/pathogens/datahubs.

Table 2

CDHs

Name	Topic	Description
dcc_sibelius	Pilot: Influenza H5N8	This data hub contains RNA sequence data and metadata of three highly pathogenic avian influenza H5N8 viruses used in the H5N8 pilot project. The aim of this project is to determine the similarity and reproducibility of viral genome consensus determination and minority variant detections between three different but commonly used workflows. Data public and described in Poen M., et al. (2019), submitted.
dcc_berlioz	Pilot: Ebola	This data hub contains Ebola sequence data generated as part of a relief effort in West Africa. Data public at: https://www.ebi.ac.uk/ena/data/view/PRJEB10265
dcc_liszt	Global Sewage	This data hub contains the DNA sequences and metagenomics data from the Global Sewage Surveillance project. The data represent all urban sewage samples collected prior to treatment plant inlet from major cities around the world. The study aims to establish global surveillance of infectious disease agents and AMR.
dcc_strauss	Kibera Sewage	This data hub contains the DNA sequences and metagenomics data from the informal settlement of Kibera, Nairobi, Kenya. The data represent sewage samples collected over 3 months from 2 of the 10 population clusters under US CDC surveillance. The study aims to establish hot-spot disease surveillance.
dcc_vivaldi	Foodborne pathogen surveillance and epidemiological analysis	This data hub contains the DNA sequences and metagenomics data from the COMPARE Work package 4/7 workgroup of food-borne pathogens. The majority of the data consist of the bacterial pathogens Salmonella, E. coli, Listeria and Campylobacter. Samples are collected from human infections, food and animals, and the work done on these sequences aimed to improve and evaluate new and established microbiological and epidemiological analysis methods used in Public Health and the food and veterinary sector. Data were also collected with the goal of developing Source Attribution methods for WGS data.
dcc_schubert	AMR working group	This data hub contains DNA sequences of bacterial isolates’ genomes coupled with phenotypic AMR data. The phenotypic data are presented in standardized antibiograms, describing for each isolate the antibiotics tested, the levels of resistance observed, the antimicrobial susceptibility testing method employed, etc. The aim is to provide a platform for exchange of genomic and phenotypic information regarding AMR, thus encouraging surveillance and the development of innovative projects for the prediction of AMR from sequence data. Data public and described in Matamoros S., et al. (2019), submitted.
dcc_brahms	Diagnostic metagenomics on clinical samples	Diagnostic metagenomics on clinical samples: Prediction of antibiotic resistance genes and pathogen discovery in shotgun metagenomic data from swine faeces. The data hub contains shotgun metagenomic NGS data sets generated from swine holdings with acute diarrhoea. The causative agent, porcine epidemic diarrhoea virus, was identified. Metadiagnostic analysis via the FLI RIEMS pipeline revealed co-infection with bacteria. Prediction of antibiotic resistance markers and pathogen discovery will be done as a pilot to test pipelines and algorithms with clinical metagenomics data.
dcc_handel	Virus metagenomics	This data hub is used to share Fastq files of NGS experiments, mostly with a metagenomics approach, on clinical samples of patients with hepatitis A and norovirus gastroenteritis. The human reads in the files have been removed before being uploaded.
dcc_puccini	Parasites (Comparative Genomics of Intestinal Protozoa)	This data hub contains the raw DNA sequences from isolates of the protozoan Cryptosporidium. The data were generated in collaboration with the UK Cryptosporidium Reference Unit and represent both sporadic and outbreak cases. The study aims to understand the major factors that structure parasites’ genomes by using a comparative genomics approach. Data public at: https://www.ebi.ac.uk/ena/data/view/PRJEB15112
dcc_beard	Global Sewage snapshot Virome sequencing part	This data hub contains the raw read data of the virus-specific part of the Global Sewage Surveillance project. The data aim to capture the complete DNA and RNA virome of the sampled locations. In addition, the analysis results of the SELECTA-SLIM pipeline can be found in this data hub.
dcc_cole	Metagenomics ring trial	This data hub contains the DNA sequences and metagenomics data from the Food Metagenomics ring trial 2018. The data represent DNA- and RNA-derived metagenomics data sets processed from a piece of smoked salmon spiked with a complex mock community consisting of viruses, bacteria, fungi and a parasite. The study aims to compare wet lab protocols using the same starting material.
dcc_bromhead	CoVetLab (Colistin resistant Enterobacteriaceae project)	This data hub contains the DNA sequences of single bacterial isolates of primarily antimicrobial-resistant Enterobacteriaceae from European National Reference Laboratories. The data represent, among others, isolates collected for the EU antimicrobial resistance monitoring in zoonotic and indicator bacteria from humans, animals and food, as well as for the CoVetLab colistin resistance project. The intention is that all data related to the EURL-AR will be submitted to the data hub.
dcc_schumann	Bioaccumulation experiment	This data hub contains DNA sequences and metadata to analyse bioaccumulated oysters. In addition, the analysis results of the SELECTA-SLIM pipeline can be found in this data hub.

Updates will be made available through https://www.ebi.ac.uk/ena/pathogens/datahubs.
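
Because Table 2 is laid out as tab-separated rows (Name, Topic, Description), it can be loaded into simple records for scripting. The following is a minimal sketch, not part of the COMPARE tooling: the `DataHub` record, the `parse_hub_rows` helper and the abridged sample rows are illustrative assumptions, and the only pattern relied on is the "Data public at:" phrase used in the table.

```python
import re
from typing import List, NamedTuple, Optional


class DataHub(NamedTuple):
    """One row of the data hub table (hypothetical record type)."""
    name: str
    topic: str
    description: str
    public_url: Optional[str]


# The table marks released data with the phrase "Data public at: <URL>".
PUBLIC_URL = re.compile(r"Data public at:\s*(https?://\S+)")


def parse_hub_rows(text: str) -> List[DataHub]:
    """Parse tab-separated table rows, skipping the header and malformed lines."""
    hubs = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not parts[0].startswith("dcc_"):
            continue
        name, topic, description = (part.strip() for part in parts)
        match = PUBLIC_URL.search(description)
        hubs.append(DataHub(name, topic, description,
                            match.group(1) if match else None))
    return hubs


# Two rows from Table 2; the second description is abridged for brevity.
SAMPLE = (
    "Name\tTopic\tDescription\n"
    "dcc_berlioz\tPilot: Ebola\tThis data hub contains Ebola sequence data "
    "generated as part of a relief effort in West Africa. Data public at: "
    "https://www.ebi.ac.uk/ena/data/view/PRJEB10265\n"
    "dcc_handel\tVirus metagenomics\tThis data hub is used to share Fastq "
    "files of NGS experiments.\n"
)

hubs = parse_hub_rows(SAMPLE)
for hub in hubs:
    print(hub.name, hub.public_url or "no public link listed")
```

Hubs without a "Data public at:" link are kept with `public_url` set to `None`, so the same records can be used to list pre-publication and public hubs separately.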

Data hub requests

Requests to configure a Data Hub, together with a short description of the project or collaboration, should be sent to datasubs@ebi.ac.uk. If requesting groups already know which scope they wish to use, i.e. the sample metadata standard and analysis pipelines described in this work, these should also be specified. Alternatively, the design of new metadata standards and the integration of other analysis workflows will become part of a consultation process. We will send out a document for the requesting group to complete, listing the names, affiliations and contact details of the data providers and users who will need to provide consent to access pre-publication data (if required). Once the scope is finalized, a Data Hub will be assigned and configured accordingly, and users will be sent access credentials to their assigned Data Hub. Data providers’ Webin accounts will be associated with the Data Hub so that they can use the PP to share submitted data with other Data Hub users (see also under ‘Sharing’).

Data hub sharing agreements

To govern access to and use of data within the CDHs, and to address the concerns of the Parties involved regarding pre-publication access, confidentiality and due diligence (compliance with relevant regulations) while clarifying specific rights and obligations, a Code of Conduct was agreed by the COMPARE members in the Consortium Agreement. Additionally, to set up pilot projects that were relevant to the COMPARE platform and required the use of Data Hubs but were executed by external users (non-Consortium members), the Code of Conduct (http://www.compare-europe.eu/project-organisation/work-packages/workpackage-12) had to be signed by all parties involved in order to gain access and permission to use the available data for the purpose of specified activities. Within this Code of Conduct, the confidentiality and due diligence agreement states the ownership of, and responsibilities for, the sharing of data in the Data Hubs, and the Terms of Reference clarify the do’s and don’ts for the participating parties. The ultimate goal of this document is to protect the integrity of the developed database and the Platform as a whole, to promote open access while also facilitating temporarily confidential sharing when needed, and to promote constructive peer collaboration.

Efforts have been made to draw up the agreements in a way that is legally unambiguous and at the same time readable and understandable for participants who are not legally trained. An additional disclaimer is included that safeguards the data providers from liability for fitness for use and for faults or errors in the data. Special consideration was given to the flexibility of terms, leaving it to the participants to decide, at any stage, on (i) who can participate: parties can join (if agreed by all participants) and/or drop out at any point in time; and (ii) the nature of the data to be shared: the amount and type of metadata attached to the raw sequences, subject to a defined minimum set of metadata (see also under ‘Upload and Standards’). The main stakeholder concerns addressed in the Code of Conduct are stratified between two defined phases: (i) a protected space, to which only eligible participants who have signed the agreements have access, and (ii) public availability of the data. For the first phase, issues of confidentiality and ownership are addressed directly: parties agree not to share the data outside the closed space and not to use the data in commercial applications and/or scientific publications without the consent and acknowledgement of the data providers. For the second phase, to comply with open data policies and as a pre-emptive response to third-party requests for publication of the data, parties declare that they will share the data in the public domain with a minimum set of metadata. In both phases, parties declare that they will refrain from any attempt to identify individuals when using the data, and that data are uploaded and made available in accordance with the applicable laws and regulations of the European Union and of the country of origin.

Ongoing developments

One of the visions of the COMPARE initiative was to promote open data sharing and help scientists who may not have access to suitable resources by providing workflows and tools and to consolidate these through a single platform. The CDHs have been implemented over the last several years in an approach that saw a simple first implementation with rapid new deployments as additional functions became available. We will continue with this approach in order to provide maximum benefit as early as possible to our users. Particular areas of focus over the next year will likely be updates to analysis workflows, tools to search based on queries within the outputs of analyses, such as species identified, typing information and resistance gene calls, searchability within AMR profile antibiograms and implementation of the Evergreen tree-building system (https://cge.cbs.dtu.dk/services/Evergreen/, Szarvas et al. submitted, pre-print: https://www.biorxiv.org/content/10.1101/540138v1).

A current priority is to pilot the system in a variety of contexts, spanning pathogens, sectors and domains, including public and animal health agencies, food safety monitoring organisations and the commercial food industry. We invite interested groups from these and other contexts to contact us with ideas for pilots.

Conclusions

The COMPARE initiative was established to address the challenges of open data sharing (see also under ‘Introduction’, ‘Data Hub design’ and ‘Data Hub Sharing Agreements’) and to comply with the Findable, Accessible, Interoperable and Reusable (FAIR) guiding principles for data management and discovery (18). The design principles for the COMPARE platform are founded on being FAIR. This is ensured by reusing resources that already have a portfolio of engagement with various communities and existing infrastructure that supports these principles. The ENA at the EMBL-EBI, an ELIXIR (https://elixir-europe.org/) Core Data Resource and participant in FAIR activities (e.g. FAIR Data Resources implementation study; https://elixir-europe.org/about-us/implementation-studies), has been used for COMPARE’s integration, storage, presentation and retrieval of data. Data reported into the CDHs use community standards (see ‘data standards’ under Design principles and the standard section). Data in the CDHs, both in pre-publication status (shared between stakeholders) and ultimately released into the public domain, are ‘Findable’ through search and discovery tools covering both programmatic and interactive options to provide maximum flexibility and adaptability (see Discovery and Retrieval sections). Data are ‘Accessible’ both directly through the ENA and globally through the INSDC mirrors (NCBI and DDBJ; www.insdc.org; see also under ‘Data Hub design’ section). ‘Interoperability’ is provided through structured data and metadata formats, which are validated at the time of reporting (see under ‘Upload and Standards’ section and examples: https://www.ebi.ac.uk/ena/data/view/PRJEB12582; https://www.ebi.ac.uk/ena/data/view/PRJEB9687; https://www.ebi.ac.uk/ena/data/view/SAMEA3493153; https://www.ebi.ac.uk/ena/data/view/SAMEA4390577; https://www.ebi.ac.uk/ena/data/view/PRJEB21546). Finally, data become ‘Reusable’ through promotion of data sharing and clear terms of use: https://www.ebi.ac.uk/about/terms-of-use.
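
The example links above mix two kinds of record: project accessions (PRJEB…) and sample accessions (SAMEA…). As a hedged illustration of how such identifiers might be handled programmatically, the sketch below classifies an accession by its prefix and rebuilds the browser link in the form cited in this article. The regular expressions reflect the INSDC project and sample accession conventions as we understand them, the function names are hypothetical, and no network request is made; the URL form is simply the one quoted in the text and may since have been superseded.

```python
import re

# URL form as cited in this article; assumed here, not verified against the live service.
ENA_VIEW_BASE = "https://www.ebi.ac.uk/ena/data/view/"

# Assumed accession conventions: projects look like PRJEB12582,
# samples look like SAMEA3493153.
ACCESSION_PATTERNS = {
    "project": re.compile(r"^PRJ[EDN][A-Z]\d+$"),
    "sample": re.compile(r"^SAM[EDN][A-Z]?\d+$"),
}


def classify_accession(accession: str) -> str:
    """Return 'project' or 'sample', or raise ValueError for anything else."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return kind
    raise ValueError(f"unrecognised accession: {accession!r}")


def accession_from_url(url: str) -> str:
    """Take the last path component of a view link as the accession."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def view_url(accession: str) -> str:
    """Build the browser link for a recognised accession."""
    classify_accession(accession)  # reject unknown formats early
    return ENA_VIEW_BASE + accession


# The five example links given in the Conclusions.
EXAMPLE_LINKS = [
    "https://www.ebi.ac.uk/ena/data/view/PRJEB12582",
    "https://www.ebi.ac.uk/ena/data/view/PRJEB9687",
    "https://www.ebi.ac.uk/ena/data/view/SAMEA3493153",
    "https://www.ebi.ac.uk/ena/data/view/SAMEA4390577",
    "https://www.ebi.ac.uk/ena/data/view/PRJEB21546",
]

accessions = [accession_from_url(link) for link in EXAMPLE_LINKS]
kinds = [classify_accession(acc) for acc in accessions]
for acc, kind in zip(accessions, kinds):
    print(f"{acc}: {kind}")
```

Validating the accession before building a link keeps malformed identifiers out of downstream scripts; other record types (runs, experiments, analyses) would need additional patterns.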

Author contributions

Clara Amid led the development of the article; Guy Cochrane, Marion Koopmans and Frank Møller Aarestrup conceived the CDH; Clara Amid, Sam Holt and Jeffrey E. Skiby operated user support systems; Nima Pakseresht, Blaise Alako and Nadim Rahman provided the cloud compute environment used in the CDHs; Nicole Silvester, Peter Harrison, Suran Jayathilaka and Abdulrahman Hussein provided metadata search, data access APIs and access management tools; Ole Lund and Lukasz D. Dynovski provided the DTU Uploader; Rasko Leinonen provided the Webin data submission system; Ole Lund, Jose J. L. Cisneros, Rolf S. Kaas, Martin C. F. Thomsen and Camilla Hundahl provided the CGE analysis workflow; Surbhi Malhotra-Kumar and Basil Britto Xavier provided the BacPIPE analysis workflow; Dirk Höper and Ariane Belka contributed the RIEMS analysis workflow; Simone Cacciò, Blaise Alako and Xin Liu worked on the parasite workflow; István Csabai, Dávid Visontai, Bálint Á. Pataki, József Stéger and János M. Szalai-Gindl supported data visualisation; Matthew Cotten and David Nieuwenhuijse contributed the SLIM analysis workflow and visualisation system; Dennis Schmitz and Annelies Kroneman provided Jovian; and George B. Haringhuizen and Carolina dos S Ribeiro provided user data agreement support.

Funding

This work was supported by the COMPARE Consortium, which has received funding from the European Union’s Horizon 2020 research and innovation programme [grant agreement number 643476]. Additional support has been provided by the National Research, Development and Innovation Office of Hungary [NVKP_16-1-2016-0004 to I.C. and J.S.].

Conflict of interest

None declared.

References

1. Whitty C.J., Mundel, Farrar et al. (2015) Providing incentives to share data early in health emergencies: the role of journal editors. Lancet, 386, 1797–1798.
2. Dos Ribeiro, Koopmans M.P. et al. (2018) Threats to timely sharing of pathogen sequence data. Science, 362, 404–406.
3. Aarestrup F.M. and Koopmans M.G. (2016) Sharing data for global infectious disease surveillance and outbreak detection. Trends Microbiol., 241–245.
4. Van Panhuis W.G., Paul, Emerson et al. (2014) A systematic review of barriers to data sharing in public health. BMC Public Health, 1144.
5. Ribeiro C.D.S., van Roode M.Y., Haringhuizen G.B. et al. (2018) How ownership rights over microorganisms affect infectious disease control and innovation: a root-cause analysis of barriers to data sharing as experienced by key stakeholders. PLoS One, e0195885.
6. Reichman J.H., Uhlir P.F. and Dedeurwaerdere (2015) Governing Digitally Integrated Genetic Resources, Data, and Literature: Global Intellectual Property Strategies for a Redesigned Microbial Research Commons. Cambridge University Press: Belgium.
7. Sane and Edelstein (2015) Overcoming barriers to data sharing in public health: a global perspective. In: Chatham House.
8. Modjarrad, Moorthy V.S., Millett et al. (2016) Developing global norms for sharing data and results during public health emergencies. PLoS Med., e1001935.
9. Karsch-Mizrachi, Takagi, Cochrane and International Nucleotide Sequence Database Collaboration (2018) The international nucleotide sequence database collaboration. Nucleic Acids Res.
10. Harrison P.W., Alako, Amid et al. (2019) The European nucleotide archive in 2018. Nucleic Acids Res.
11. Yilmaz, Kottman, Field et al. (2011) Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat. Biotechnol., 415–420.
12. Thomsen M.C.F., Ahrenfeldt, Cisneros J.L.B. et al. (2016) A bacterial analysis platform: an integrated system for analysing bacterial whole genome sequencing data for clinical diagnostics and surveillance. PLoS One, e0157718. doi: 10.1371/journal.pone.0157718
13. Scheuch, Höper and Beer (2015) RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets. BMC Bioinformatics. doi: 10.1186/s12859-015-0503-6
14. Bankevich, Nurk, Antipov et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 455–477. doi: 10.1089/cmb.2012.0021
15. Edgar R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 2460–2461. doi: 10.1093/bioinformatics/btq461
16. Munafò M.R., Nosek B.A., Bishop D.V. et al. (2017) A manifesto for reproducible science. Nat. Hum. Behav., (1), p. 0021.
17. Kluyver, Ragan-Kelley, Pérez et al. (2016) Jupyter notebooks-a publishing format for reproducible computational workflows. ELPUB. doi: 10.3233/978-1-61499-649-1-87