SalmonDB: a bioinformatics resource for Salmo salar and Oncorhynchus mykiss Open Access

Final assembly details

Assembly statistics	Salmo salar	Oncorhynchus mykiss
Number of total reads	495 257	285 359
Total Unigenes in first assembly	150 720	125 077
Total Unigenes in reassembly (BLAST-CAP3)	103 221	97 667
Total Unigenes after CDS prediction	59 336	62 233
Number of reads in final assembly	387 294	213 218
Number of singletons	31 915	38 884
Average read length	619	666
Unigene length (average ± SD)	872 ± 434	880 ± 322
Average unigene depth	7	3
Maximum unigene depth	2005	1444

Total number of reads and unigenes assembled using the described pipeline.

In Phase II, unigenes were analyzed to annotate putative protein products and to identify sequence features. First, all unigenes were searched with BLASTX against the Uniref (31) database to predict putative CDS and to determine the percentage of the full-length cDNA contained in each unigene. A CDS was assigned when the unigene had a significant hit (E < 10⁻¹⁰). Unigenes without a significant hit against the Uniref database were further analyzed using ESTscan (32). Putative CDS of at least 30 amino acids were included in the database; the rest were discarded.
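The decision rule in this paragraph can be sketched as below. This is only an illustration: the function name, its arguments and the assumption that the BLASTX and ESTscan peptides have already been computed are hypothetical and do not come from the SalmonDB pipeline itself.

```python
from typing import Optional, Tuple

BLASTX_EVALUE_CUTOFF = 1e-10  # significance threshold quoted in the text
MIN_CDS_LENGTH_AA = 30        # minimum putative CDS length quoted in the text


def assign_cds(blastx_evalue: Optional[float],
               blastx_peptide: Optional[str],
               estscan_peptide: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (source, peptide) for one unigene, or None if it is discarded.

    All three arguments are hypothetical inputs: the best BLASTX E-value
    against Uniref, the peptide implied by that hit, and the peptide
    predicted by ESTscan (None when a tool reported nothing).
    """
    if blastx_evalue is not None and blastx_evalue < BLASTX_EVALUE_CUTOFF:
        source, peptide = "blastx", blastx_peptide
    else:
        # No significant Uniref hit: fall back to the ESTscan prediction.
        source, peptide = "estscan", estscan_peptide
    if peptide is None or len(peptide) < MIN_CDS_LENGTH_AA:
        return None  # shorter than 30 amino acids: not stored
    return source, peptide
```

A unigene with a strong Uniref hit keeps its BLASTX-derived CDS, a unigene without one is passed to the ESTscan branch, and anything shorter than 30 residues is dropped in either case.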

The functional annotation was based on homology detection with known proteins using MPIBLAST (33) searches against Uniref (31), Swissprot (34), KEGG (35) and KOG (36) databases. To improve the functional assignment and classification, we used MPIHMMER against PFAM (37), SMART (38), TIGRFAM (39), SUPERFAMILY (40) and PIRSF (41) databases to search for motifs in all unigenes. Motifs were assigned when a domain was detected with an E < 10⁻⁵.

Putative SNPs within sequences in the assembly were detected using the AMOS toolkit (42) and in-house developed scripts optimized for SNP discovery in complex genomes. Those sites within CDS regions, with more than four covering reads, that differ at least 20% from the consensus sequence and that were not inside repetitive sequences were marked as putative SNPs. Also, non-synonymous alleles and protein positions were predicted for each SNP.
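A minimal sketch of the site filter described above is given here. It is not the in-house script: the function name and arguments are hypothetical, and it assumes that "differ at least 20% from the consensus" means that at least 20% of the covering reads carry a non-consensus base.

```python
from collections import Counter

MIN_COVERAGE = 5             # "more than four covering reads"
MIN_VARIANT_FRACTION = 0.20  # assumed reading of the 20% criterion


def is_putative_snp(column_bases, consensus_base, in_cds, in_repeat):
    """Decide whether one consensus position is marked as a putative SNP.

    column_bases: bases of all reads covering the position (hypothetical input).
    in_cds / in_repeat: flags assumed to come from earlier annotation steps.
    """
    if not in_cds or in_repeat:
        return False
    depth = len(column_bases)
    if depth < MIN_COVERAGE:
        return False
    counts = Counter(base.upper() for base in column_bases)
    non_consensus = depth - counts[consensus_base.upper()]
    return non_consensus / depth >= MIN_VARIANT_FRACTION
```

Under these assumptions, a column of five reads with two non-consensus bases passes, whereas a column of ten reads with a single non-consensus base does not.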

Finally, we used OrthoMCL (43) clustering to predict orthologs between the reference fish genomes and the salmonid species. A total of 273 395 proteins were clustered with OrthoMCL using an E-value cutoff of 10⁻¹⁰ and a moderate inflation value of 2.5. The analysis produced a total of 28 365 clusters, of which only the ortholog and paralog clusters containing salmonid species were stored in the database.
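The final selection step, keeping only clusters that contain at least one salmonid protein, might look like the following sketch. The data layout (a dictionary of cluster ids mapped to species and protein pairs) and all identifiers in the example are hypothetical and are not the OrthoMCL output format.

```python
SALMONIDS = {"Salmo salar", "Oncorhynchus mykiss"}


def clusters_with_salmonids(clusters):
    """Keep clusters having at least one member from a salmonid species.

    clusters maps a cluster id to a list of (species, protein_id) pairs.
    """
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        if any(species in SALMONIDS for species, _ in members)
    }


# Illustrative clusters with made-up protein identifiers.
example = {
    "cluster_1": [("Salmo salar", "ss_prot_1"), ("Danio rerio", "dr_prot_1")],
    "cluster_2": [("Danio rerio", "dr_prot_2"), ("Oryzias latipes", "ol_prot_1")],
    "cluster_3": [("Oncorhynchus mykiss", "om_prot_1"),
                  ("Oncorhynchus mykiss", "om_prot_2")],
}
kept = clusters_with_salmonids(example)
```

Here `cluster_1` (an ortholog group) and `cluster_3` (a paralog group within one salmonid) are retained, while `cluster_2` is dropped because it has no salmonid member.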

In Phase III, all sequence features were stored in MySQL as Bio::SeqFeatureI objects, cross-referenced to the corresponding external databases. Cross-references include EC, KO and KOG numbers for the Blast hits, and SMART, PIRSF, SUPERFAMILY, TIGRFAM, PFAM, INTERPRO and GO numbers for HMMER domain hits. The idea was to create a controlled vocabulary useful in other applications. A summary of SalmonDB contents is shown in Table 2.

Table 2.

General SalmonDB statistics

Database	Salmo salar	Oncorhynchus mykiss
Unigenes	59 336	62 233
Total SNP	35 879	42 238
UNIREF	50 067	52 351
KEGG	30 085	31 908
SWISSPROT	41 472	44 803
KOG	33 000	35 436
PFAM	20 625	22 306
TIGRFAM	3191	3715
SMART	10 493	11 088
PIRSF	1658	1978
SUPERFAMILY	24 394	25 447

Total number of unigenes matching a database hit. On average each S. salar unigene has 4.2 attributes, while O. mykiss unigenes have 4.4.
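The caption does not state how the per-unigene averages were derived. One reading that reproduces both reported values is to sum every row of Table 2 other than Unigenes (including Total SNP) and divide by the number of unigenes; the short check below works under that assumption.

```python
# Values copied from Table 2; the averaging rule itself is an assumption.
table2 = {
    "Salmo salar": {
        "Unigenes": 59336, "Total SNP": 35879, "UNIREF": 50067,
        "KEGG": 30085, "SWISSPROT": 41472, "KOG": 33000, "PFAM": 20625,
        "TIGRFAM": 3191, "SMART": 10493, "PIRSF": 1658, "SUPERFAMILY": 24394,
    },
    "Oncorhynchus mykiss": {
        "Unigenes": 62233, "Total SNP": 42238, "UNIREF": 52351,
        "KEGG": 31908, "SWISSPROT": 44803, "KOG": 35436, "PFAM": 22306,
        "TIGRFAM": 3715, "SMART": 11088, "PIRSF": 1978, "SUPERFAMILY": 25447,
    },
}


def mean_attributes(column):
    """Total of all non-Unigenes rows divided by the number of unigenes."""
    total = sum(value for name, value in column.items() if name != "Unigenes")
    return total / column["Unigenes"]


for species, column in table2.items():
    print(species, round(mean_attributes(column), 1))
```

With these inputs the ratios are 250 864 / 59 336 ≈ 4.2 for S. salar and 271 270 / 62 233 ≈ 4.4 for O. mykiss, matching the caption.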

Using SalmonDB

The database can be accessed through a web interface as seen in Figure 2. The main views are the Unigene, Genome, GO and KEGG browsers, the Blast server and the BioMart interface. It also has a help navigation page that explains step by step how to use the different tools in the website.

Figure 2.

Snapshots of the SalmonDB web interface. (a) Unigene browser: the Unigene SS2U057650 is shown with several tracks (features); the BLAST alignment can be shown for each hit. (b) BioMart: the MartView interface is shown using the S. salar dataset and several filters selected on the left navigation panel. It also shows the output table, with the multiple attributes listed on the left. (c) GO Browser: result of the search for GO term GO:003872 in the S. salar Unigene database. (d) KEGG Browser: the pathway associated with alanine and aspartate metabolism is shown using the S. salar Unigene database.

The Unigene Browser (Figure 2a) contains different sequence features, including CDS prediction, unigene coverage, BLAST and HMMER hits, GC content and putative SNPs, each presented as a GBrowse track. The Genome Browser includes the complete D. rerio, O. latipes, T. rubripes, G. aculeatus and T. nigroviridis genomes and shows the genomic localization of genes, their exon/intron organization and their corresponding transcripts. Every genome contains external links to the Ensembl database for more detail. SalmonDB provides access to KEGG pathway information through a KEGG Browser (Figure 2d), where users can select a specific EC number or browse through any pathway to find all participating unigenes. This is useful for mapping the relationships within a whole system of annotated enzymes and is especially valuable for those interested in biological pathways. Moreover, SalmonDB can be queried for Gene Ontology terms using the GO Browser (Figure 2c).

A web form allows the use of BLAST to find matches to a user-supplied sequence in the SalmonDB unigene databases (S. salar, O. mykiss) or the SalmonDB reference genome databases (one can search against the genome, the mRNA dataset or the peptide dataset from any of the aquaculture species stored in SalmonDB). The BLAST output is dynamically linked to the Unigene and Genome Browsers (Figure 2).

BioMart (Figure 2b) is an outstanding feature of SalmonDB. It provides a step-by-step interface that allows the entire database to be searched with predefined criteria. Any combination of data filters can be selected, and only the information needed is retrieved by clicking on the corresponding attributes. It is fast and depends on the information stored in the local database. Complex questions can be answered through a simple query. As an example, suppose that a researcher wants to find all unigenes annotated with nutrient reservoir activity (they all share a specific GO number, GO:0045735) that also contain a putative SNP within their sequence. First, one would click on the GO filter and specify the GO number GO:0045735. The next step is to click on the ‘SNP predicted only’ filter to restrict the search to unigenes that have a SNP. The search returns an output table with all unigene hits and the information that was selected in the attributes form. This information can be useful to identify potential SNP markers associated with dietary responses related to nutrient storage in salmon. The website has a step-by-step help navigation page that explains the use of BioMart in more detail. Recently, the SalmonDB BioMart has been included as part of the BioMart Central Portal (44).
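The logic of this example query, a GO filter combined with the ‘SNP predicted only’ filter, can be illustrated with a small in-memory sketch. It does not use the BioMart API; the record class, its field names and the unigene identifiers are all hypothetical and only mimic how the two filters combine.

```python
from dataclasses import dataclass, field


@dataclass
class UnigeneRecord:
    """Hypothetical, simplified view of one annotated unigene."""
    unigene_id: str
    go_terms: set = field(default_factory=set)
    snp_count: int = 0


def filter_unigenes(records, go_term=None, snp_predicted_only=False):
    """Return ids of unigenes passing the GO filter and, optionally, the SNP filter."""
    hits = []
    for rec in records:
        if go_term is not None and go_term not in rec.go_terms:
            continue
        if snp_predicted_only and rec.snp_count == 0:
            continue
        hits.append(rec.unigene_id)
    return hits


# Made-up records; "GO:other" is a placeholder, not a real GO identifier.
records = [
    UnigeneRecord("unigene_A", {"GO:0045735"}, snp_count=2),
    UnigeneRecord("unigene_B", {"GO:0045735"}, snp_count=0),
    UnigeneRecord("unigene_C", {"GO:other"}, snp_count=1),
]
print(filter_unigenes(records, go_term="GO:0045735", snp_predicted_only=True))
```

Only `unigene_A` satisfies both conditions; dropping the SNP filter would also return `unigene_B`.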

Additionally, database searches can be performed with a keyword, an accession number or any ID from the cross-referenced databases mentioned above by entering the term in the quick search box or in the GBrowse search box.

The other available databases rely on EST assembly and gene annotation data (17, 18), or on the physical map based on BAC fingerprinting with BAC end sequence data (16). A comparison of the assemblies based on percent similarity among the final unigenes is shown in Figure 3. This plot shows the expected peak for a recent genome duplication event (45). In addition, the complementary capabilities of each database and assembly statistics such as the percentage of full-length cDNAs are shown in Tables 3 and 4, respectively.

Figure 3.

Frequency of aligned Unigenes plotted against percent identity. The figure (modified from [45]) shows the frequency of top pairwise alignments (E < 1e-10; query and subject coverage = 0.9) between Unigenes generated through our assembly pipeline plotted against identity score (SalmonDB, orange). It also shows the relationships among the contig consensus sequences of the Gene Index EST assembly (Gene Index, blue) and the cGRASP EST assembly (CGRASP, yellow) for Atlantic salmon. The same analysis is included for Fugu (Takifugu rubripes, light blue) and Medaka (Oryzias latipes, dark red) mRNAs obtained from Ensembl and for African clawed frog (Xenopus laevis, green) Unigenes obtained from NCBI. Since there is no standard methodology for comparing EST assemblies (unlike genome assemblies, which have the N50 value), a good approximation is to look for the pattern expected for a duplicated genome. We include the African clawed frog because it has a well-documented recent genome duplication. The expected pattern is shown in the figure, with a peak around 93–94%. The same is expected for salmon, which underwent a whole-genome duplication ∼100 million years ago. The SalmonDB and Gene Index assemblies show this accumulation of paralogs around 93–94% identity.
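The kind of frequency profile described in this caption can be approximated with a few lines of code. The sketch below is not the procedure used to draw Figure 3: the input layout is hypothetical, the coverage value of 0.9 is treated as a minimum, and identities are binned to whole percentages, all of which are assumptions.

```python
from collections import Counter


def identity_histogram(alignments, max_evalue=1e-10, min_coverage=0.9):
    """Relative frequency of top pairwise alignments per whole-percent identity bin.

    alignments: iterable of dicts with the hypothetical keys 'identity'
    (percent), 'evalue', 'query_cov' and 'subject_cov' (fractions).
    """
    counts = Counter()
    for aln in alignments:
        if (aln["evalue"] < max_evalue
                and aln["query_cov"] >= min_coverage
                and aln["subject_cov"] >= min_coverage):
            counts[int(aln["identity"])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pct: n / total for pct, n in sorted(counts.items())}
```

For a duplicated genome, the resulting profile would be expected to show an excess of alignments in the bins around the paralog identity, in addition to the near-100% bin produced by redundant or allelic sequences.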

Table 3.

Global comparison of available salmon databases

	SalmonDB	GRASP	ASALBASE	Gene index
Data
Data source	All public ESTs	Public ESTs, BAC ends	BAC clones, BAC ends and EST cluster	NCBI ESTs
Base pair quality	No	Yes	No	No
EST assembly	CAP3, clustering	Phrap	No	Clustering, CAP3
Physical map	No	No	Yes	No
Genetic map	No	No	Yes	No
Expression data	No	Yes	No	No
Tools
Blast homology search	Yes	Yes	No	Yes
Quick search box	Yes	No	Yes	No
Primer design	Yes	No	No	No
RepeatMasking	No	Yes	No	No
GO annotation browser	Yes	No	No	Yes
KEGG annotation browser	Yes	No	No	Yes
Advanced search with Biomart	Yes	No	No	No
Analysis
Ortholog prediction	Yes	Yes	Yes	No
Paralog prediction	Yes	No	No	No
SNP prediction	Yes	No	No	Yes
CDS prediction	Yes	Yes	No	Yes
Other markers	No	No	Yes	No
Full-length cDNA prediction	Yes	Yes	Yes	No
Alternative splicing forms prediction	No	No	No	Yes
Others
Web interface	Gbrowse	Gbrowse	Gbrowse, custom	custom
Other organism data	5 fish species	Other salmonids and salmon lice	4 fish species and Human	Other TIGR organisms

cGRASP information was extracted directly from the http://web.uvic.ca/grasp/ website that includes features from external links. Gene index information was obtained from the website http://compbio.dfci.harvard.edu/tgi/.

Table 4.

Assembly statistics comparison of available salmon databases

	SalmonDB	Gene index	cGRASP
Unigenes	59 336	99 285	81 236
Total length (Mb)	51	84	71
Min length	100	100	75
Max length	4563	5828	4780
Average length	872	854	881
Median length	771	755	758
Full-length cDNA	5939^a	7124	7625
% Full-length protein	10.01	7.18	9.39

The table shows statistics for the three Atlantic salmon assemblies: the total number of unigenes constructed using each database pipeline, the total sequence length of all unigenes and their length statistics. We also show the number of full-length cDNAs calculated using blastx against the nr database (a unigene is counted as full-length when it covers 99% or more of the protein).

^aNumber of full-length cDNAs from SalmonDB biomart is 7465. This number was calculated using translated sequences (blastp) instead of blastx against nr.
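The percentages in the last row of Table 4 follow directly from the two rows above it, as the short check below shows. The helper `is_full_length` is a hypothetical illustration of the 99% coverage criterion, assuming that coverage is measured as aligned protein length over total protein length.

```python
# (unigenes, full-length cDNAs) copied from Table 4.
table4 = {
    "SalmonDB": (59336, 5939),
    "Gene index": (99285, 7124),
    "cGRASP": (81236, 7625),
}


def percent_full_length(n_unigenes, n_full_length):
    """Share of unigenes counted as full-length, in percent."""
    return 100.0 * n_full_length / n_unigenes


def is_full_length(aligned_protein_len, protein_len, min_coverage=0.99):
    """Hypothetical check: the unigene covers 99% or more of the protein."""
    return aligned_protein_len / protein_len >= min_coverage


for name, (n_unigenes, n_full) in table4.items():
    print(name, round(percent_full_length(n_unigenes, n_full), 2))
```

The rounded results are 10.01, 7.18 and 9.39, in agreement with the table.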

SalmonDB is intended to fully exploit the genetic information available for salmon and provides several tools and pre-calculated analyses that can be easily browsed through the BioMart interface. It is also possible to perform fast comparative genomic research with other salmon databases and fish reference genomes. For example, users can design primers within salmon sequences and search for these primer sequences in the other genomes, which could enable an effective comparison of intron/exon boundaries between salmon and other fishes. SalmonDB also provides putative SNPs that are accessible to the whole scientific community for validation and use in genotyping experiments. Together, this information can help researchers conduct their experiments and, therefore, improve results.

Future development of SalmonDB

In the near future, we will incorporate genomic information provided by the Atlantic salmon genome sequencing project (1) and publicly available transcriptomic data from Illumina/Solexa or Roche/454 sequencing (46).

We expect to incorporate additional tools in order to allow scientists to explore the genetic and physical maps of S. salar. Also, we are integrating our database with the existing resources for salmonids using cross references to Gene Index TCs and cGRASP unigenes. Therefore, a link between similar unigenes (98% identity and 95% coverage for both sequences) will be provided in order to navigate through the different databases.

Several ongoing projects on salmon require an easy-to-access database with several tools available. Next-generation sequencing technologies will boost the amount of sequence-related information. Thus, our experience in constructing databases (44, 47), NGS pipeline development and SNP discovery for salmon sequences will allow us to build a new version of the database every year, with the goal of providing up-to-date information to end users.

Finally, Chile is part of the International Collaboration to sequence the Atlantic Salmon Genome (ICSASG) (1). Thus, the access to data will allow us to exploit different pipelines, tools and methodologies regarding salmon genome sequences. In the future, our goal is to become an important reference database for the salmonid species.

Funding

The development, creation and hosting of SalmonDB was supported by CORFO-INNOVA (grant 07CN13PBT-41); Fondecyt (1110427) and Fondap (No 15090007); Basal Grant CMM Projects. Funding for open access charge: Fondap (No 15090007).

Conflict of interest. None declared.

Acknowledgements

We would like to thank Dr William Davidson for his comments on the first manuscript.

References

Davidson, Koop, Jones, et al. Sequencing the genome of the Atlantic salmon (Salmo salar). Genome Biol. 2010, pg. 403.

Thorsen, Zhu, Frengen, et al. A highly redundant BAC library of Atlantic salmon (Salmo salar): an important tool for salmon projects. BMC Genomics 2005.

Artieri, Bosdet, et al. A physical map of the genome of Atlantic salmon, Salmo salar. Genomics 2005, pg. 396–404.

Moen, Hayes, Baranski, et al. A linkage map of the Atlantic salmon (Salmo salar) based on EST-derived SNP markers. BMC Genomics 2008, pg. 223.

Danzmann, Davidson, Ferguson, et al. Distribution of ancestral proto-actinopterygian chromosome arms within the genomes of 4R-derivative salmonid fishes (rainbow trout and Atlantic salmon). BMC Genomics 2008, pg. 557.

Palti, Genet, Luo, et al. A first generation integrated map of the rainbow trout genome. BMC Genomics 2011, pg. 180.

Rise, von Schalburg, Brown, et al. Development and application of a salmonid EST database and cDNA microarray: data mining and interspecific hybridization characteristics. Genome Res. 2004, pg. 478–490.

Koop, von Schalburg, Leong, et al. A salmonid EST genomic study: genes, duplications, phylogeny and microarrays. BMC Genomics 2008, pg. 545.

Hayes, Laerdahl, Lien, et al. An extensive resource of single nucleotide polymorphism markers associated with Atlantic salmon (Salmo salar) expressed sequences. Aquaculture 2007, vol. 265.

Chang, Brown, et al. Type I microsatellite markers from Atlantic salmon (Salmo salar) expressed sequence tags. Mol. Ecol. Notes 2005, pg. 762–766.

Vasemägi, Nilsson, Primmer. Expressed sequence tag-linked microsatellites as a source of gene-associated polymorphisms for detecting signatures of divergent selection in Atlantic salmon (Salmo salar L.). Mol. Biol. Evol. 2005, pg. 1067–1076.

Taggart, Bron, Martin, et al. A description of the origins, design and performance of the TRAITS–SGP Atlantic salmon (Salmo salar L.) cDNA microarray. J. Fish Biol. 2008, pg. 2071–2094.

Rise, von Schalburg, Cooper, et al. Salmonid DNA microarrays and other tools for functional genomics research. In: Zhangjiang (ed.), Aquaculture Genome Technologies. Oxford, UK: Blackwell Publishing, 2007, pg. 369–412.

Leaver, Villeneuve, Obach, et al. Functional genomics reveals increases in cholesterol biosynthetic genes and highly unsaturated fatty acid biosynthesis after dietary substitution of fish oil with vegetable oils in Atlantic salmon (Salmo salar). BMC Genomics 2008, pg. 299.

Panserat, Kaushik. Regulation of gene expression by nutritional factors in fish. Aquaculture Res. 2010, pg. 751–762.

The Atlantic salmon genomics database (ASALBASE). http://www.asalbase.org/ (10 June 2011, date last accessed).

The University of Victoria Grasp site (GRASP). http://web.uvic.ca/grasp/ (10 June 2011, date last accessed).

The TIGR gene index database (TIGR). http://compbio.dfci.harvard.edu/tgi/ (10 June 2011, date last accessed).

Allendorf, Thorgaard. Tetraploidy and the evolution of salmonid fishes. In: Turner (ed.), Evolutionary Genetics of Fishes. New York: Plenum Press, 1984.

http://compbio.dfci.harvard.edu/tgi/software/

de Boer, Yazawa, Davidson, et al. Bursts and horizontal evolution of DNA transposons in the speciation of pseudotetraploid salmonids. BMC Genomics 2007, pg. 422.

Leong, Jantzen, von Schalburg, et al. Salmo salar and Esox lucius full-length cDNA sequences reveal changes in evolutionary pressures on a post-tetraploidization genome. BMC Genomics 2010, pg. 279.

Kasahara, Naruse, Sasaki, et al. The medaka draft genome and insights into vertebrate genome evolution. Nature 2007, vol. 447, pg. 714–719.

Aparicio, Chapman, Stupka, et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 2002, vol. 297, pg. 1301–1310.

Stein, Mungall, Shu, et al. The generic genome browser: a building block for a model organism system database. Genome Res. 2002, pg. 1599–1610.

Smedley, Haider, Ballester, et al. BioMart – biological queries made easy. BMC Genomics 2009.

Meyer, Goesmann, McHardy, et al. GenDB – an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 2003, pg. 2187–2195.

Consortium for genomic research on all salmonids program (cGRASP). http://www.cgrasp.org/ (10 June 2011, date last accessed).

Seqclean: a script for automated trimming and validation of ESTs or other DNA sequences by screening for various contaminants, low quality and low-complexity sequences

Huang, Madan. CAP3: a DNA sequence assembly program. Genome Res. 1999, pg. 868–877.

Altschul, Madden, Schäffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, pg. 3389–3402.

Suzek, Huang, McGarvey, et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007, pg. 1282–1288.

Iseli, Jongeneel, Bucher. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1999, pg. 138–148.

Archuleta, Tilevich, chunFeng. IEEE Int. Conf Softwar. Maint. 2007.

Bairoch, Boeckmann, Ferro, et al. Swiss-Prot: juggling between evolution and stability. Brief. Bioinformatics 2004.

Kanehisa. The KEGG database. Novartis Found. Symp. 2002, vol. 247, pg. 101; discussion 101–103, 119–128, 244–252.

Tatusov, Fedorova, Jackson, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003.

Finn, Tate, Mistry, et al. The Pfam protein families database. Nucleic Acids Res. 2008, pg. D281–D288.

Letunic, Doerks, Bork. SMART 6: recent updates and new developments. Nucleic Acids Res. 2009, pg. D229–D232.

Haft, Selengut, White. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003, pg. 371–373.

Wilson, Madera, Vogel, et al. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 2007, pg. D308–D313.

Nikolskaya, Arighi, Huang, et al. PIRSF family classification system for protein functional and evolutionary analysis. Evol. Bioinform. Online 2006, pg. 197–209.

Phillippy, Schatz, Pop. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008, pg. R55.

Feng, Aaron Mackey, Christian J Stoeckert, David S Roos. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006, pg. D363–D368.

Guberman, Arnaiz, Baran, et al. BioMart Central Portal: an open database network for the biological community. 2011. In press.

Koop, von Schalburg, Leong, et al. A salmonid EST genomic study: genes, duplications, phylogeny and microarrays. BMC Genomics 2008, pg. 545.

Salem, Rexroad, Wang, et al. Characterization of the rainbow trout transcriptome using Sanger and 454-pyrosequencing approaches. BMC Genomics 2010, pg. 564.

The Potato Genome Sequencing Consortium (PGSC). Genome sequence and analysis of the tuber crop potato. Nature 2011, vol. 475, pg. 189–195.