Abstract

In mammals, approximately 10% of genome sequences correspond to endogenous viral elements (EVEs), which are derived from ancient viral infections of germ cells. Although most EVEs have been inactivated, some open reading frames (ORFs) of EVEs have acquired functions in their hosts. However, EVE ORFs usually remain unannotated in genomes, and no databases are available for EVE ORFs. To investigate the function and evolution of EVEs in mammalian genomes, we developed EVE ORF databases for 20 genomes of 19 mammalian species. A total of 736,771 non-overlapping EVE ORFs were identified and archived in a database named gEVE (http://geve.med.u-tokai.ac.jp). The gEVE database provides nucleotide and amino acid sequences, genomic loci and functional annotations of EVE ORFs for all 20 genomes. By analyzing RNA-seq data with the gEVE database, we successfully identified expressed EVE genes, suggesting that the gEVE database facilitates genomic analyses of various mammalian species.

Database URL: http://geve.med.u-tokai.ac.jp

Introduction

Approximately 10% of mammalian genome sequences correspond to endogenous viral elements (EVEs), including endogenous retroviruses (ERVs), which are thought to be derived from ancient viral infections of germ cells (1–4). In general, most EVEs have been inactivated by insertions, deletions, substitutions and/or epigenetic modifications. For this reason, they were once regarded solely as legacies of ancestral viral infections, and they therefore remain unannotated even when they contain open reading frames (ORFs). However, various EVE ORFs are still active and express viral proteins in their hosts, and some of these proteins play important roles in mammalian development. For example, proteins originally derived from retroviral envelope proteins, many of which are called syncytins, are known to be involved in placental development in various mammalian species (5–16).

EVEs are unique in that their evolutionary histories differ among mammalian lineages. Various mammalian species have different syncytin genes with similar molecular functions, but these genes were acquired independently in each lineage during mammalian evolution (17, 18). For example, human syncytin-1 and -2 were captured in the ancestral lineages of Catarrhini and Simiiformes, respectively (6, 19), and mouse syncytin-A and -B were captured in the ancestral lineage of Muridae (7). Although this unique evolution of EVEs might have contributed to maintaining the genetic basis of mammalian traits, it complicates the comprehensive discovery of functional EVEs in mammalian genomes.

At present, there are no integrated databases of EVEs. Previously, EVE (ERV) databases for the human and mouse genomes were constructed as HERVd (20) and the ERE database (21), respectively. However, these databases have several problems (summarized in Table 1). For HERVd (http://herv.img.cas.cz), the reference human genome sequence is out of date, and the database is apparently not maintained, as its last update was on September 19, 2003. The ERE database is not web-based and requires Microsoft Windows. Neither database provides ORFs for each EVE sequence. Further, no computational program for EVE detection can comprehensively identify EVE ORFs in a given genome sequence. RetroTector (22) is a well-known program that identifies EVE sequences in a given genome sequence, but it has been reported to miss some EVE sequences (23). RepeatMasker (24) with Repbase (25) is another well-known system for detecting EVEs. However, it was originally developed as a ‘masking’ tool for repetitive sequences in a given genome and cannot annotate ORFs originating from viruses. Although there are no established programs for EVE ORF detection, a combination of these programs and databases, together with sequence similarity searches using endogenous and exogenous viral sequences, can be used to identify comprehensive sets of EVEs in a genome.

Table 1.

Comparison of EVE databases.

Database (URL) | Species | Methods | Release date | Last update | Reference
HERVd (http://herv.img.cas.cz) | Human | RepeatMasker with Repbase | Jul 2000 | Sep 2003 | 20
ERE database (http://eredatabase.ucdmc.ucdavis.edu/) | Mouse | PCR library for LTR U3 sequences; Homology search (Megablast) | Nov 2007 | Feb 2008 | 21
gEVE database (http://geve.med.u-tokai.ac.jp) | 19 mammalian species | RetroTector; RepeatMasker with Repbase; Homology search (BLAT) | Apr 2014 | Apr 2015 | This paper

To investigate the function and evolution of EVEs in mammalian genomes, we developed a genome-based EVE database named gEVE (http://geve.med.u-tokai.ac.jp) using 20 genomes of 19 mammalian species (Table 2). We comprehensively identified and annotated EVE ORF sequences that (i) encode >80 amino acids (aa) and (ii) harbor viral sequence motifs. The sequences and annotations of all EVEs can be downloaded from the database without registration. Our new annotations of EVE ORFs offer a useful resource for studies of EVEs, such as expression analyses using next-generation sequencing (NGS) data, and should facilitate studies of functional EVE sequences in various mammalian species.

Table 2.

Genome data used in the gEVE database and EVE ORF viral profiles for each genome.

Species | Genome ID | Genome, release date | EVEs (Met)^a | gag | pro | pol (LINE)^b | env | others
Human (Homo sapiens) | Hsap38 | GRCh38, Dec 2013 | 33 966 (31 292) | 1782 | 1482 | 29 120 (21 087) | 1731 | 11
Chimpanzee (Pan troglodytes) | Ptro214 | CSAC 2.1.4, Feb 2011 | 30 099 (28 136) | 1813 | 1125 | 25 572 (19 043) | 1719 | 10
Gorilla (Gorilla gorilla) | Ggor31 | gorGor3.1, May 2011 | 26 335 (24 409) | 1456 | 1034 | 22 462 (16 140) | 1486 | 8
Orangutan (Pongo pygmaeus abelii) | Pabe2 | PPYG2, Sep 2007 | 28 315 (26 716) | 1214 | 846 | 24 919 (19 492) | 1400 | 14
Baboon (Papio anubis) | Panu2 | Panu_2.0, Jun 2012 | 27 230 (25 192) | 2101 | 1240 | 22 125 (15 476) | 1962 | 5
Macaque (Macaca mulatta) | Mmul1 | MMUL 1.0, Feb 2006 | 26 941 (25 043) | 1980 | 1130 | 21 968 (15 745) | 2020 | 7
Marmoset (Callithrix jacchus) | Cjac321 | C_jacchus3.2.1, Jan 2010 | 21 802 (20 614) | 992 | 406 | 19 575 (16 070) | 888 | 3
Mouse (Mus musculus) | Mmus38 | GRCm38.p1, Jan 2012 | 61 184 (58 805) | 7494 | 5602 | 46 784 (29 122) | 3075 | 16
Rat (Rattus norvegicus) | Rnor50 | Rnor_5.0, Mar 2012 | 34 861 (32 525) | 2570 | 1491 | 29 258 (21 517) | 1771 | 6
Rabbit (Oryctolagus cuniculus) | Ocun2 | oryCun2, Nov 2009 | 13 214 (12 909) | 438 | 237 | 12 275 (10 473) | 292 | 2
Cow (Bos taurus) | BtauUMD31 | UMD3.1, Dec 2009 | 105 654 (104 674) | 1023 | 673 | 103 402 (98 952) | 648 | 1
Cow (Bos taurus) | Btau461 | Btau_4.6.1, Nov 2011 | 98 016 (97 150) | 860 | 641 | 96 065 (92 153) | 585 | 0
Dog (Canis lupus familiaris) | Cfam31 | CanFam3.1, Sep 2011 | 11 393 (11 011) | 399 | 135 | 10 815 (10 019) | 78 | 0
Cat (Felis catus) | Fcat62 | Felis_catus_6.2, Sep 2011 | 11 132 (10 625) | 694 | 203 | 9 898 (8 505) | 391 | 1
Horse (Equus caballus) | Ecab2 | EquCab2.0, Sep 2007 | 14 391 (13 972) | 190 | 142 | 13 904 (12 554) | 167 | 0
Sheep (Ovis aries) | Oari31 | Oar_v3.1, Sep 2012 | 61 093 (60 184) | 1099 | 517 | 58 940 (55 274) | 628 | 1
Pig (Sus scrofa) | Sscr102 | Sscrofa10.2, Aug 2011 | 15 210 (14 761) | 456 | 155 | 14 350 (13 207) | 285 | 9
Goat (Capra hircus) | Chir1 | CHIR_1.0, Jan 2013 | 37 003 (36 060) | 1106 | 508 | 34 797 (31 146) | 653 | 0
Opossum (Monodelphis domestica) | Mdom5 | monDom5, Oct 2006 | 77 190 (73 029) | 2546 | 2723 | 71 821 (46 874) | 1134 | 0
Platypus (Ornithorhynchus anatinus) | Oana5 | OANA5, Dec 2005 | 1742 (1365) | 2 | 1 | 1732 (1658) | 7 | 0
^a The number of EVE sequences containing at least one methionine residue is shown in parentheses.

^b Numbers in parentheses indicate pol genes thought to be derived from LINEs, i.e. those annotated as ‘LINE’ by RepeatMasker and/or as ‘YP_073558.1’ or ‘NP_048132.1’ by BLASTP against the NCBI Viral Genome Database.

gEVE database

Statistics and annotation

The procedure used to identify sequences derived from viral infection is summarized in Figure 1. We first applied RetroTector version 1.01 (22) and RepeatMasker version 4.03 (24) with RMblast (version 2.2.28) and Repbase (25, version 20140423) to each genome sequence (Figure 1A, STEP1). We used default parameters for each program, except that RepeatMasker was run with the ‘-species’ option set according to the target genome: human, mouse, rat, cow, pig, cat, dog or mammal. For each identified candidate region, we scanned all possible reading frames, three in each direction (i.e. six frames). If the longest stop-codon-free reading frame in the region encoded >80 amino acids (aa), its amino acid sequence was searched using HMMER 3.1b1 (hmmer.org) against viral motif profiles, as illustrated in Figure 1A, STEP2. Hidden Markov models (HMMs) of the viral motif profiles used in this process were downloaded from the Pfam (26) and Gypsy (27) databases (39 and 304 profiles, respectively, summarized in Supplementary Table S1). Each ORF with at least one HMM profile hit was stored in the database for the corresponding genome. Note that we used an arbitrary minimum ORF cut-off of 80 aa to reduce the number of non-coding RNAs falsely extracted as EVE ORFs (28). In our annotation, sequences lacking a start codon (ATG) are also defined as ORFs because they could function as exons in a spliced transcript.
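The six-frame scan with the 80 aa cut-off can be illustrated with a minimal Python sketch. This is not the pipeline used to build gEVE: the function names (`six_frame_orfs`, `translate`, `reverse_complement`) are hypothetical, the standard genetic code is assumed, and every stop-free stretch longer than 80 aa is reported rather than only the longest one in a candidate region. No start codon is required, in line with the ORF definition given above.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

MIN_AA = 80  # ORFs must encode more than 80 amino acids


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; codons not in the table become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(seq, min_aa=MIN_AA):
    """Yield (strand, frame, aa_start, peptide) for each stop-free stretch
    longer than min_aa; aa_start is the 0-based offset in the translated frame."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(s[frame:])
            pos = 0
            for segment in protein.split("*"):  # '*' marks a stop codon
                if len(segment) > min_aa:
                    yield strand, frame, pos, segment
                pos += len(segment) + 1
```

In a workflow like the one described, the peptides returned by such a scan would then be passed to an HMM search against the viral motif profiles; that step is not reproduced here.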

Figure 1.

A schematic workflow of a four-step procedure for identifying EVE ORFs in 20 mammalian genomes. (A) First extraction of EVE candidates by RetroTector and RepeatMasker (STEP1) followed by ORF extraction processes in each genome (STEP2). (B) Second extraction of EVE ORFs by BLAT search for retrieving missed EVE candidates in STEP2 (STEP3). Similarly to the first extraction, EVE ORF datasets are generated by ORF extraction processes (STEP4). This is the final dataset of the gEVE database. The numbers for EVE ORF sequences in (A) and (B) indicate the total numbers of non-overlapping sequences in the 20 mammalian genomes. The numbers of extracted EVE sequences at STEP2 and STEP4 for each genome are shown in the Supplementary Table S3.

Next, to retrieve EVEs missed by the two programs, we performed similarity searches against each genome using BLAT (29) (Figure 1B, STEP3) with the following amino acid sequences as queries: (i) all viral protein sequences stored in the NCBI RefSeq database (viral.1.protein.faa, version July 10, 2014), (ii) 131 known EVE genes (see Supplementary Table S2) and (iii) all 774 172 EVE sequences identified in STEP2. We then compiled the EVE ORF sequences that harbor viral motifs and encode >80 aa, removing overlapping sequences while accounting for reading frames (Figure 1B, STEP4). The number of EVE ORF sequences for each gene annotation is shown in Table 2 and in the gEVE database (see ‘About’ page).

To further annotate each EVE ORF sequence, we conducted BLASTP searches separately against (i) all viral protein sequences (viral.1.protein.faa, version July 10, 2014), (ii) the non-redundant protein database (nr, version June 26, 2014) and (iii) known EVE sequences (see Supplementary Table S2). For each EVE gene, a description of the best hit was stored in the database. The number of best hits against all viral protein sequences for each genome is summarized in the gEVE database (see ‘About’ page). We also examined the correspondence between 131 known EVEs and sequences in the database (Supplementary Table S2 and ‘About’ page of the gEVE database). Additional annotations such as overlaps between exons of all annotated genes and our EVE sequences are provided in ‘Annotation Datasheet’ of the gEVE database. Detailed annotations are presented in the next section ‘Service and data download’.

In the database, each EVE ORF sequence is named according to its genome sequence and location, using a combination of genome ID, chromosome number, 5’ position, 3’ position and strand (+ or –). For example, the gEVE ID Hsap38.chr1.100259758.100261128.– indicates that the EVE ORF is located on chromosome 1 of the human genome (version GRCh38) from position 100 259 758 to position 100 261 128 on the negative strand. With this system, every EVE has a unique ID within each genome.
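Because the gEVE ID encodes the locus directly, it can be split back into its components. The following Python sketch shows one way to do so; the names `GeveId` and `parse_geve_id` are hypothetical and not part of the gEVE website or its download files. It assumes that genome IDs contain no dots (as is the case for the IDs listed in Table 2) and treats a typographic dash in the strand field as a minus sign.

```python
from typing import NamedTuple


class GeveId(NamedTuple):
    genome: str
    chrom: str
    start: int
    end: int
    strand: str


def parse_geve_id(geve_id):
    """Split a gEVE ID into genome ID, chromosome, start, end and strand."""
    # Normalise an en dash or a Unicode minus sign to an ASCII '-'.
    text = geve_id.strip().replace("\u2013", "-").replace("\u2212", "-")
    # Split from the right so that a sequence name containing dots stays intact.
    head, start, end, strand = text.rsplit(".", 3)
    genome, chrom = head.split(".", 1)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in {geve_id!r}")
    return GeveId(genome, chrom, int(start), int(end), strand)
```

For the example above, `parse_geve_id("Hsap38.chr1.100259758.100261128.-")` yields the genome ID `Hsap38`, the chromosome `chr1`, the two positions as integers and the strand `-`.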

Service and data download

All EVE sequences and their annotations for the 20 mammalian genomes are available in the database. Annotation tables are displayed with optional searches (such as species, chromosomes, amino acid lengths and HMM profiles) and can be downloaded as tab-delimited text files (Figure 2). Annotation tables include the following information: ID, gEVE ID (genome ID, chromosome, start, end and strand); Amino acid length; method, method used for EVE identification; Number of N letters, the number of Ns (undetermined nucleotides) in the region; MetORF ID, ID for the EVE ORF starting with methionine; Amino acid length of MetORF ID; HMM profile, significant motif profile(s); Viral BLAST, BLASTP best hit(s) against the NCBI Viral Genome Database (viral.1.protein.faa, version July 10, 2014); NR BLAST, BLASTP best hit(s) against the NCBI nr (non-redundant) database; EVE BLAST, BLASTP best hit(s) against known EVE sequences; RetroTector, annotation by RetroTector (22); Repbase, annotation by RepeatMasker with the Repbase database (24, 25); and Overlapping, overlaps between EVE sequences and all annotated genes in the NCBI/UCSC/Ensembl databases. IDs, BLAST results and overlapping genes are linked to NCBI/UCSC/Ensembl resources depending on their contexts. Visible annotation columns can be selected using the ‘Display’ option (Figure 2b). Annotation search tools are also available (Figure 2c). FASTA files of nucleotide and/or amino acid sequences and annotation tables of selected EVE sequences can be downloaded via the website (Figure 2d). Bulk download of all the EVE ORF sequences and their annotations is available on the ‘Download’ page. Further, a BLAST search powered by SequenceServer (30) is implemented in the gEVE database so that any sequence of interest can be searched online against all sequences in the database.

Figure 2.

Web interface of the gEVE database. (a) A menu bar is shown at the top, and the current page is ‘Annotation Datasheet’. (b) Display option is available to select annotations of interest (boxed in gray dashed line, left). (c) Advanced searches for the EVE annotations such as genome IDs, viral HMM profiles, chromosome ID and amino acid lengths can be given in a new window (boxed in gray dashed line, right). (d) The annotation table or sequences (nucleotide and/or amino acid) shown in the window can be downloaded in tab-delimited format or FASTA format, respectively.

Application of the gEVE database

As described in the Introduction, one of the difficulties in EVE analysis is the lack of sequence conservation among mammalian lineages. We therefore present a phylogenetic analysis as an example application of the gEVE database (Figure 3). The human syncytin-1 amino acid sequence was used as a query in BLASTP searches against all EVE sequences in the gEVE database with an e-value threshold of <1e-40. Then, a maximum likelihood phylogenetic tree was constructed using RAxML version 8 (31). We obtained syncytin-1 genes in all apes, as reported by Kim et al. (19), and we also found syncytin-1-like sequences in non-hominid primates, rodents and even cows, goats, dogs and cats. Interestingly, the known annotated syncytin genes in cows, goats, dogs and cats are different from these syncytin-1-like sequences. This result does not directly indicate that all these syncytin-1-like sequences are functional. However, the tree makes it easy to infer when these syncytin-1-like sequences were integrated into mammalian genomes. Phylogenetic analysis using the gEVE database can thus save researchers time in obtaining EVE ORFs from mammalian genomes and in selecting species for further comparative analysis.
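The hit-selection step before tree building (e-value <1e-40, sequences over 400 aa) could be scripted along the following lines. This is only a sketch under stated assumptions: it supposes that the BLASTP results were saved in the standard 12-column tabular format (`-outfmt 6`, with the subject ID in column 2 and the e-value in column 11), and that sequence lengths are supplied separately as a dictionary. The function name `select_hits` and its parameters are hypothetical, and the source does not state how this filtering was actually performed.

```python
import csv


def select_hits(blast_lines, seq_lengths, max_evalue=1e-40, min_length=400):
    """Return subject IDs, in order of first appearance, whose e-value is
    below max_evalue and whose sequence length (aa) exceeds min_length.

    blast_lines: iterable of tab-delimited BLAST lines (assumed outfmt 6).
    seq_lengths: dict mapping subject ID to amino acid length (assumed input).
    """
    selected = []
    seen = set()
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        subject, evalue = row[1], float(row[10])
        if subject in seen:
            continue  # keep each subject once, even with several alignments
        if evalue < max_evalue and seq_lengths.get(subject, 0) > min_length:
            seen.add(subject)
            selected.append(subject)
    return selected
```

The selected IDs would then be used to pull the corresponding amino acid sequences from a FASTA file for alignment and tree construction.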

Figure 3.

Phylogenetic tree of syncytin-1-like sequences. All sequences over 400 amino acids were extracted from BLASTP hits with e-values <1e-40, and the tree was built with RAxML (31) using the substitution model (JTT + G + I) selected by ProtTest3 (32). Bootstrap values (1,000 replicates) are shown at the nodes. Known syncytin-1 and -2 genes in primates are indicated by the bar on the right. External nodes show EVE IDs (see also Table 2).

The most powerful application of the gEVE database is in NGS analyses. We provide a General Transfer Format (GTF) file of EVE gene loci for each genome stored in the gEVE database (see ‘Download’ page). Using these GTF files with NGS data, dynamic expression profiles of EVE genes can be examined. As an example, we examined RNA-seq data from human placenta (ID: ERR315374) stored in the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra/). The FASTQ sequences were obtained and mapped onto the human genome (GRCh38) using TopHat2 (33). The expression levels of EVE sequences were computed using Cufflinks (34) with the GTF file of gEVE Hsap38. The 10 EVE sequences with the highest FPKM values (i.e. highly expressed EVE sequences) are summarized in Table 3. We successfully identified known EVEs expressed in human placenta, namely PEG10 (35), suppressyn (10), syncytin-1 (5) and syncytin-2 (6), as well as novel EVE sequences. This result shows that NGS data analyses combined with our annotation data enable the discovery of hidden functional EVE sequences in genomes.

Table 3.

Top 10 highly expressed gEVE sequences in the RNA-seq data of ERR315374

gEVE ID | HMM profile | Known EVE^a | FPKM
Hsap38.chr7.94664474.94665679.+ | pro | PEG10 | 481.4
Hsap38.chr7.94663299.94664531.+ | gag | PEG10 | 392.9
Hsap38.chr3.129171078.129171320.- | gag | – | 210.5
Hsap38.chr21.42917294.42917818.- | env | (suppressyn) | 158.5
Hsap38.chr21.42918527.42919045.- | env | suppressyn | 131.1
Hsap38.chr7.92468768.92470387.- | env | syncytin-1 | 44.5
Hsap38.chr21.42919026.42919586.- | pol | (suppressyn) | 30.7
Hsap38.chr6.11103697.11105316.- | env | syncytin-2 | 24.6
Hsap38.chr21.42921853.42922110.- | env | (suppressyn) | 24.2
Hsap38.chr16.20680984.20681253.+ | pol | – | 20.7
^a A gene name in parentheses indicates that the EVE sequence is located close to the known functional EVE sequence. A dash (‘–’) indicates that, to our knowledge, the corresponding sequence has not been reported.
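A ranking such as the one in Table 3 can be derived from the expression table that Cufflinks writes. The sketch below assumes a tab-delimited FPKM tracking file with a header row containing the columns `tracking_id` and `FPKM`, and further assumes that the tracking IDs correspond to the gEVE IDs in the supplied GTF file; the function name `top_expressed` is hypothetical. It illustrates the ranking step only and is not the script used for the analysis above.

```python
import csv


def top_expressed(fpkm_lines, n=10):
    """Return the n (tracking_id, FPKM) pairs with the largest FPKM values.

    fpkm_lines: iterable of tab-delimited lines whose header includes the
    columns 'tracking_id' and 'FPKM' (assumed Cufflinks tracking format).
    """
    reader = csv.DictReader(fpkm_lines, delimiter="\t")
    rows = [(record["tracking_id"], float(record["FPKM"])) for record in reader]
    # Sort in descending order of FPKM and keep the first n entries.
    return sorted(rows, key=lambda item: item[1], reverse=True)[:n]
```

With an open file handle, the call would look like `top_expressed(handle)`, where the file name depends on the Cufflinks output directory chosen by the user.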

Future perspectives

We developed the gEVE database to provide EVE sequences encoding >80 aa in the 20 mammalian genomes. In other words, the current database does not yet support non-coding sequences derived from EVEs. Accumulating reports indicate the functional importance in host species of non-coding EVE sequences such as long terminal repeats (LTRs). Some LTRs in humans (such as LTR7) retain functional promoter–enhancer activity and control the stem cell potency of embryonic stem (ES) and induced pluripotent stem (iPS) cells (36). Furthermore, various long non-coding (lnc) RNAs that are functional in host species are expected to be derived from non-coding EVE sequences (37). Thus, another task for the gEVE database is to add more detailed annotations for EVE sequences. For example, the evolutionary relationships among EVE sequences in the gEVE database have not yet been examined, although the annotation of BLASTP best hits in the database would be partially useful. By addressing these points, the gEVE database will be continuously improved and expanded to contribute to the further understanding of EVE sequences in host genomes.

Acknowledgements

We are grateful to Akira Hashimoto (Kyoto University, Kyoto, Japan) for helpful comments on this manuscript. Computations were performed partially on the NIG supercomputer at the ROIS National Institute of Genetics.

Funding

This work was financially supported by the Japan Society for the Promotion of Science KAKENHI [grant numbers and 16K21386 to S.N.].

Conflict of interest. None declared.

References

1. Bock M., Stoye J.P. (2000) Endogenous retroviruses and the human germline. Curr. Opin. Genet. Dev., 10, 651–655.

2. Holmes E.C. (2011) The evolution of endogenous viral elements. Cell Host Microbe, 10, 368–377.

3. Dewannieux M., Heidmann T. (2013) Endogenous retroviruses: acquisition, amplification and taming of genome invaders. Curr. Opin. Virol., 3, 646–656.

4. Aiewsakun P., Katzourakis A. (2015) Endogenous viruses: Connecting recent and ancient viral evolution. Virology, 479–480, 26–37.

5. Mi S., Lee X., Li X. et al. (2000) Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature, 403, 785–789.

6. Blaise S., de Parseval N., Benit L. et al. (2003) Genomewide screening for fusogenic human endogenous retrovirus envelopes identifies syncytin 2, a gene conserved on primate evolution. Proc. Natl. Acad. Sci. USA, 100, 13013–13018.

7. Dupressoir A., Marceau G., Vernochet C. et al. (2005) Syncytin-A and syncytin-B, two fusogenic placenta-specific murine envelope genes of retroviral origin conserved in Muridae. Proc. Natl. Acad. Sci. USA, 102, 725–730.

8. Heidmann O., Vernochet C., Dupressoir A. et al. (2009) Identification of an endogenous retroviral envelope gene with fusogenic activity and placenta-specific expression in the rabbit: a new “syncytin” in a third order of mammals. Retrovirology, 6, 107.

9. Cornelis G., Heidmann O., Bernard-Stoecklin S. et al. (2012) Ancestral capture of syncytin-Car1, a fusogenic endogenous retroviral envelope gene involved in placentation and conserved in Carnivora. Proc. Natl. Acad. Sci. USA, 109, E432–E441.

10. Sugimoto J., Sugimoto M., Bernstein H. et al. (2013) A novel human endogenous retroviral protein inhibits cell-cell fusion. Sci. Rep., 3, 1462.

11. Cornelis G., Heidmann O., Degrelle S.A. et al. (2013) Captured retroviral envelope syncytin gene associated with the unique placental structure of higher ruminants. Proc. Natl. Acad. Sci. USA, 110, E828–E837.

12. Nakagawa S., Bai H., Sakurai T. et al. (2013) Dynamic Evolution of Endogenous Retrovirus-Derived Genes Expressed in Bovine Conceptuses during the Period of Placentation. Genome Biol. Evol., 5, 296–306.

13. Nakaya Y., Koshi K., Nakagawa S. et al. (2013) Fematrin-1 Is Involved in Fetomaternal Cell-to-Cell Fusion in Bovinae Placenta and Has Contributed to Diversity of Ruminant Placentation. J. Virol., 87, 10563–10572.

14. Redelsperger F., Cornelis G., Vernochet C. et al. (2014) Capture of syncytin-Mar1, a Fusogenic Endogenous Retroviral Envelope Gene Involved in Placentation in the Rodentia Squirrel-Related Clade. J. Virol., 88, 7915–7928.

15. Cornelis G., Vernochet C., Malicorne S. et al. (2014) Retroviral envelope syncytin capture in an ancestrally diverged mammalian clade for placentation in the primitive Afrotherian tenrecs. Proc. Natl. Acad. Sci. USA, 111, E4332–E4341.

16. Cornelis G., Vernochet C., Carradec Q. et al. (2015) Retroviral envelope gene captures and syncytin exaptation for placentation in marsupials. Proc. Natl. Acad. Sci. USA, 112, E487–E496.

17. Lavialle C., Cornelis G., Dupressoir A. et al. (2013) Paleovirology of ‘syncytins’, retroviral env genes exapted for a role in placentation. Philos. Trans. R. Soc. Lond. B. Biol. Sci., 368, 20120507.

18. Imakawa K., Nakagawa S., Miyazawa T. (2015) Baton pass hypothesis: successive incorporation of unconserved endogenous retroviral genes for placentation during mammalian evolution. Genes Cells, 20, 771–788.

19. Kim H.S., Takenaka O., Crow T.J. (1999) Isolation and phylogeny of endogenous retrovirus sequences belonging to the HERV-W family in primates. J. Gen. Virol., 80, 2613–2619.

20. Paces J., Pavlícek A., Zika R. et al. (2004) HERVd: the Human Endogenous RetroViruses Database: update. Nucleic Acids Res., 32, D50.

21. Kao D., Hsu K., Chiu S. et al. (2012) ERE database: a database of genomic maps and biological properties of endogenous retroviral elements in the C57BL/6J mouse genome. Genomics, 100, 157–161.

22. Sperber G.O., Airola T., Jern P. et al. (2007) Automated recognition of retroviral sequences in genomic data–RetroTector. Nucleic Acids Res., 35, 4964–4976.

23. Garcia-Etxebarria K., Jugo B.M. (2010) Genome-Wide Detection and Characterization of Endogenous Retroviruses in Bos taurus. J. Virol., 84, 10852–10862.

24. Smit A.F.A., Hubley R., Green P. RepeatMasker Open-4.0. 2013-2015. <http://www.repeatmasker.org>.

25. Bao W., Kojima K.K., Kohany O. (2015) Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA, 6, 11.

26. Finn R.D., Bateman A., Clements J. et al. (2014) Pfam: the protein families database. Nucleic Acids Res., 42, D222–D230.

27. Llorens C., Futami R., Covelli L. et al. (2011) The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Res., 39, D70–D74.

28. Ruiz-Orera J., Messeguer X., Subirana J.A. et al. (2014) Long non-coding RNAs as a source of new peptides. Elife, 3, e03523.

29. Kent W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res., 12, 656–664.

30. Priyam A., Woodcroft B.J., Rai V. et al. (2015) Sequenceserver: A modern graphical user interface for custom BLAST databases. Biorxiv, doi: 10.1101/033142.

31. Stamatakis A. (2014) RAxML Version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.

32. Darriba D., Taboada G.L., Doallo R. et al. (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics, 27, 1164–1165.

33. Kim D., Pertea G., Trapnell C. et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 14, R36.

34. Trapnell C., Roberts A., Goff L. et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc., 7, 562–578.

35. Ono R., Nakamura K., Inoue K. et al. (2006) Deletion of Peg10, an imprinted gene acquired from a retrotransposon, causes early embryonic lethality. Nat. Genet., 38, 101–106.

36. Macfarlan T.S., Gifford W.D., Driscoll S. et al. (2012) Embryonic stem cell potency fluctuates with endogenous retrovirus activity. Nature, 487, 57–63.

37. Gibb E.A., Warren R.L., Wilson G.W. et al. (2015) Activation of an endogenous retrovirus-associated long non-coding RNA in human adenocarcinoma. Genome Med., 7, 22.

Author notes

Citation details: Nakagawa,S. and Ueda Takahashi,M. gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes. Database (2016) Vol. 2016: article ID baw087; doi:10.1093/database/baw087

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
