TEITbase: a database for transposable element (TE)-initiated transcripts in human cancers

Author Notes

Abstract

Transposable elements (TEs) are abundant and play a crucial regulatory role in the human genome. Functioning as alternative promoters, TEs can be reactivated to produce TE-initiated transcripts. In cancer, TE-initiated transcription may upregulate oncogene expression and generate novel tumour-specific antigens, which could serve as potential targets for immunotherapy. However, there remains a lack of comprehensive databases that systematically investigate TE-initiated transcription in cancer. To address this gap, we developed a deep learning-based method and identified 38 995 TE-initiated transcripts across 33 cancer types from The Cancer Genome Atlas. Among these, 6203 were tumour-specifically expressed and strongly associated with the tumourigenesis. Using these annotations, we created TEITbase (http://teitbase.medbioinfo.org/), a user-friendly database that provides researchers with tools to conduct various downstream analyses and investigations for TE-initiated transcripts, including 546 novel onco-exaptation events. The establishment of TEITbase provides new insights into transcriptional reprogramming in cancers and enables further investigation into the potential roles of TE-initiated transcripts in cancer diagnostics and therapeutic strategies.

Introduction

Transposable elements (TEs) are mobile genetic elements that are widespread in eukaryotic genomes [1]. Approximately 45% of the human genome is derived from various classes of TEs, such as LINEs (long interspersed nuclear elements), SINEs (short interspersed nuclear elements), LTRs (long terminal repeats), and DNA transposons [2, 3]. In humans, only a small proportion of TEs, including active human-specific L1 (L1HS) elements, remain capable of transposition and inducing genetic changes [4, 5]. The majority of TEs in the human genome are incomplete and no longer transposable [6]. However, TE-derived sequences retain significant regulatory potential, acting as promoters, enhancers, splicing sites, and more, playing crucial roles in both physiological and pathological processes [6–9].

As potential threats to human genome, most TEs have been kept inactive in somatic cells by mutational events and epigenetic repression [6, 10]. However, accumulating evidence suggests that certain TEs are transcriptionally reactivated in human cancers as a result of epigenetic dysregulation [11, 12]. Acting as alternative promoters, TEs can enhance their own transcription or regulate the expression of neighbouring genes, leading to the production of TE-initiated transcripts with TE-derived transcription start sites (TSSs) [13]. In cancer, TE-initiated transcription can upregulate the expression of oncogenes, a phenomenon known as onco-exaptation [14–16]. Notable examples include Alujb-LIN28B, L1-FGGY, and HERVH-CALB1 [15, 17, 18]. Furthermore, TE-initiated transcription may produce novel protein isoforms and tumour-specific antigens, which could serve as potential targets for immunotherapy [19, 20]. For instance, the tumour-specific membrane protein GABRG2, transcribed from the L1PA2 promoter, can generate aberrant epitopes on the extracellular surface of cancer cells [12]. Moreover, endogenous retrovirus (ERV) envelope glycoproteins have been identified as a dominant anti-tumour antibody target, and ERV-reactive antibodies exhibit anti-tumour activity that extends survival in mouse models [21].

Given the recognized role of TEs in tumour progression, several databases, including ERVcancer and CancerHERVdb, have been developed to investigate TE expression in human cancers [22, 23]. However, these databases only provide TE expression data at the TE family level, losing locus information and failing to distinguish between TE-initiated transcription and TE exonization. Recently, Gu et al. established TE-TSS, a database compiling 2681 RNA sequencing datasets that identifies 5768 human TE-derived TSSs and 2797 mouse TE-derived TSSs [24]. Nevertheless, due to the repetitive nature of TEs and the limitations of short-read RNA-seq, TE-TSS cannot provide accurate annotations of TE-initiated transcripts. Deep learning techniques, which can automatically extract features from DNA sequence information, have been widely applied to predict TSSs [25, 26]. By integrating RNA-seq data with DNA sequence information, convolutional neural networks could accurately predict TSSs for each assembled transcript, thereby enabling the identification of TE-initiated transcripts. However, there remains a lack of databases that specifically annotate TE-initiated transcripts in human cancers.

In this work, we developed the TEITbase (http://teitbase.medbioinfo.org/), an integrated database that systematically studies the TE-initiated transcription across 33 cancer types from The Cancer Genome Atlas (TCGA), offering data at both the TE family and transcript levels. To identify TE-initiated transcripts, we first developed a deep learning-based method to predict TSSs for each assembled transcript, utilizing both DNA sequences and short-read RNA sequencing data. Subsequently, we applied this method to 10 079 tumour samples from TCGA, and identified 38 995 TE-initiated transcripts. These annotations enable TEITbase to provide researchers with the tools to conduct various downstream analyses, such as promoter activity of TE families, tumour-specific TE-initiated transcripts, differential expression analysis, stage analysis, and survival analysis. We believe that TEITbase will serve as a valuable resource for further understanding the role of TE-initiated transcription in epigenetically dysregulated tumour tissues and will be instrumental in investigating novel tumour-specific antigens that could potentially be therapeutically targeted.

Methods

Data collection

We downloaded short-read RNA-seq data from 10 819 samples across 33 cancer types from the TCGA (phs000178.v11.p8) [27]. Additionally, we obtained RNA-seq data from 10 620 samples across 45 human body sites from the GTEx project (by dbGaP, study accession: phs000424.v8.p2) [28]. Short-read RNA-seq and CapTrap-seq datasets were downloaded from the ENCODE project and the ArrayExpress repository (E-MTAB-13063) [29, 30]. Ten samples of these samples were used as the training set (H1 cell line and heart), and three samples were used as the validation set (endodermal cells and brain). Short-read RNA-seq and RAMPAGE datasets of testis were downloaded from the Gene Expression Omnibus database (GSE135791) and the ENCODE project [31]. Short-read RNA-seq and CAGE datasets of six cancer cell lines were downloaded from the ENCODE project. GENCODE Version 46 was downloaded from https://www.gencodegenes.org/human/ and used as the transcript reference [32]. TE annotations and consensus sequences of TE family were downloaded from the Dfam database (https://www.dfam.org/releases/Dfam_3.8/) [33].

Deep learning model for TSS prediction

CapTrap-seq and matched short-read RNA-seq datasets were used to train the model. For each 3′ end of the first exon, the 10 kb DNA sequence on either side, along with the corresponding RNA-seq coverage (including both unique-mapping and multi-mapping reads), was extracted and used as input to the model. The corresponding TSS signal from the CapTrap-seq data served as the model’s label. Our model is based on the architecture of the Puffin-D model, a convolutional neural network specifically designed for TSS prediction using DNA sequences [25]. To enable TSS prediction for each sample, we adjusted the model to extract information from both the DNA sequence and RNA-seq coverage. To address the imbalance between positive and negative samples during model training, we employed the KL divergence loss function:

$$\begin{eqnarray} L = \mathop \sum \limits_i {\mathrm{ target}}_i^*\left[ {\ln \left( {{\mathrm{ target}}_i^* + \epsilon } \right) - \ln \left( {{\mathrm{ pred}}_i^* + \epsilon } \right)} \right]. \end{eqnarray}$$

The target represents the true TSS signal at position i, pred represents the TSS signal predicted by the model, and ϵ is a constant set to e⁻¹⁰.

The model utilized the Adam optimizer for parameter optimization, with a learning rate of 0.005, a batch size of 64, and a maximum of 10 training epochs. Model construction and training were carried out using PyTorch. The distance between the predicted and actual TSS was used to evaluate the model’s performance. The example of an alternative promoter was visualized on WashU Epigenome Browser [34].

Pipeline for the detection of TE-initiated transcripts

First, RNA-seq data were aligned to the reference human genome (GRCh38) using STAR v2.7.10b [35]. Sambamba was employed to remove PCR duplicates, and deepTools was used to extract RNA-seq coverage [36, 37]. StringTie v2.1.7 was used to assemble the reads into full-length transcripts (stringtie –m 200 –c 1 -t) [38]. The TSS for each assembled transcript was then predicted using the trained deep learning model. These assembled transcripts across different samples were subsequently merged with a custom script. Briefly, the first exons of multi-exon transcripts were extracted. Following this, first exons overlapping with other internal exons were excluded, as they may have resulted from RNA degradation. Then, for each first exon, the TSS was corrected according to the score calculated as follows [39]:

$$\begin{eqnarray} {{S}_t} = \ {{s}_t} + \mathop \sum \limits_{k \in T} \frac{{{\rm{w}} - {{d}_k}}}{{{\rm{w*}}{{s}_k}}}. \end{eqnarray}$$

The S is the weight score for position t; s represents the number of samples supporting the position as a TSS of the first exon; w is the window size used to calculate the weight score; T is the set of all positions within a distance from t that is less than the window size; and d is the distance between positions t and k.

Corrected TSSs overlapping with TEs were retained, and the full-length transcripts of TE-initiated transcripts were merged using a reference-guided approach across samples. The count of junction reads supporting the first exon and the expression of TE-initiated transcripts at the transcript level were then calculated using StringTie. TE-initiated transcripts with at least one junction read supporting the first exon were considered expressed in the sample. The count of junction reads supporting the first exon of TE-initiated transcripts from the same TE family was summed, and the total read count mapped to the GRCh38 genome was used for normalization to obtain TE promoter activity at the family level.

Computational validation of TE-derived TSSs and TE promoter activity

To evaluate the precision of TE-derived TSSs identified by our pipeline, we used CapTrap-seq or RAMPAGE data as the gold standard. A TE-derived TSS with supporting reads within 100 bp was considered positive. Plots of the CapTrap-seq and RAMPAGE signals surrounding the TSSs were generated using deepTools. To assess the precision of TE promoter activity calculated by our pipeline, the Spearman correlation with CAGE data was calculated.

Cell culture and experimental validation of TE-initiated transcripts

The NCI-H520 and BEAS-2B cell lines were cultured under 37°C/5% CO₂ conditions in complete medium. RNA was isolated from cell pellets containing 5 × 10⁶ cells using the FastPure® Cell/Tissue Total RNA Isolation Kit V2 (Vazyme) and subsequently reverse-transcribed into cDNA using the HiScript IV 1st Strand cDNA Synthesis Kit (Vazyme). Transcribed cDNA was quantified using quantitative PCR with ChamQ Universal SYBR qPCR Master Mix (Vazyme). Primer sequences are listed in Additional file 1: Table S10.

Annotation and expression analysis of TE-initiated transcripts

Identified TE-initiated transcripts were annotated with Dfam database by using BEDTools v2.31.1 [33, 40]. Based on the relationship with reference transcripts, TE-initiated transcripts were classified into three categories: annotated, chimeric, and intergenic. Coding potential of TE-initiated transcripts was calculated by CPC2 [41]. For coding TE-initiated transcripts, we subsequently compared predicted CDS with annotated CDS using BLAST 2.14.1 [42]. Then, the coding TE-initiated transcripts were annotated into five classes: annotated, 5′ chimeric normal, 5′ chimeric truncated, 5′ truncated, and novel. Tumour-specific scores for each TE-initiated transcript were calculated as follows:

$$\begin{eqnarray} S = \frac{{({{T}_e} + 1)/{{T}_t}}}{{({{N}_e} + 1)/{{N}_t}}}. \end{eqnarray}$$

The S represents the tumour-specific score for the TE-initiated transcript. T_e represents the number of tumour samples expressing the TE-initiated transcript, N_e represents the number of tumour-adjacent or normal tissue samples expressing the TE-initiated transcript (excluding testicular tissue), and T_t and N_t represent the total number of tumour and normal samples, respectively. A TE-initiated transcript with a tumour-specific score greater than eight was considered tumour-specific expressed.

The reactome pathway enrichment analysis of gene-chimeric TE-initiated transcripts was conducted by the ReactomePA package [43]. The tumour-type preferential expression analysis and TE family enrichment analysis were conducted using a two-tailed Fisher’s exact test. All P-values were corrected using the Benjamini–Hochberg (BH) procedure.

Clinically associated TE families or TE-initiated transcripts

The Wilcoxon rank-sum test was used to assess statistical differences in the expression levels of TE families and TE-initiated transcripts between tumour and adjacent normal tissues. The Kruskal–Wallis test was applied to evaluate statistical differences in the expression levels of TE families and TE-initiated transcripts across patients with different tumour stages. The survdiff function from the survival package was used to perform the log-rank test, comparing differences between two survival curves, and calculating the hazard ratio and P-value. All P-values were adjusted using the BH procedure.

Construction of the TEITbase website

The backend application and frontend interface of the TEITbase website were developed using the Django framework, with Gunicorn serving as the WSGI HTTP server to execute the Django code. NGINX was employed as a reverse proxy server to forward client requests to Gunicorn. To ensure system stability, Supervisor was used to monitor the operation of the Gunicorn and Django processes. The frontend interface was designed and implemented using HTML, JavaScript, and CSS. Additionally, annotations for TE families and TE-initiated transcripts were securely stored in an SQLite3 database.

Results

A deep learning-based method for the identification of TE-initiated transcript

Due to the repetitive nature of TEs and the limitations of short-read RNA-seq, identifying TE-derived TSSs based on RNA-seq coverage is not sufficiently accurate. To overcome the challenge, we developed a deep learning-based method that integrates both DNA sequence and RNA-seq coverage information. Using one-hot encoding and convolutional neural network, we predicted TSSs for each assembled transcript and overlapped them with TE annotations to identify TE-initiated transcripts (Fig. 1A).

For image description, please refer to the figure legend and surrounding text.

Figure 1

Detection of TE-initiated transcripts. (A) Pipeline designed to identify TE-initiated transcripts by using a deep learning model. (B) RAMPAGE and CapTrap-seq signals of the PRKACB gene, along with TSS predictions of our method. (C) The distances to the true TSSs of predicted TSSs from our method (left) and StringTie (right). (D) Prediction score and distance of predicted TSSs to the true TSSs. (E) CapTrap-seq and CAGE signals around the TE-derived TSSs identified by our method or DeeReCT-TSS (left) and those identified in this work, TE-TSS database, or both (right) in H1 cell lines. (F) RAMPAGE signals around the TE-derived TSSs identified by our method or DeeReCT-TSS (left) and those identified in this work, TE-TSS database, or both (right) in testis tissues. (G) Spearman correlation coefficient between CAGE data and the quantitative results from our method or TEtranscripts (left). The Kimura value of high-accuracy TE families and others (right).

Open in new tab Download slide

To train our model, we downloaded CapTrap-seq datasets along with matched short-read RNA-seq datasets [29]. Unlike CAGE and RAMPAGE, CapTrap-seq combines the cap-trapping strategy with long-read sequencing technologies, enabling the identification of both TSSs and full-length transcripts. Using these CapTrap-seq datasets for training, our model can predict TSSs at the transcript level. For example, PRKACB has alternative promoters in brain tissue. While RAMPAGE identifies only two TSS signals, it cannot assign them to specific transcripts. In contrast, both CapTrap-seq and our model can identify the TSS for each transcript (Fig. 1B).

Compared to StringTie, which relies solely on RNA-seq coverage to determine TSS, our model significantly enhances the accuracy of TSS prediction (P = 4.6 × 10⁻⁴, Fig. 1C, Table S1). In the validation set, our model accurately identified an average of ~26.7% of TSSs, with ~84.7% of TSSs exhibiting identification errors within 100 bp, a notable improvement over StringTie’s performance of 2.4% and 70.6%, respectively. As expected, our model performed significantly better in predicting highly expressed transcripts (Fig. 1C, P = 1.2 × 10⁻⁵). For transcripts with counts per million (CPM) greater than 2, an average of ~71.0% of TSSs were accurately identified by our model, with around 99.2% of TSSs having identification errors within 100 bp. This is substantially better than for low-expressed transcripts (CPM < 0.1), which had 9.1% and 72.7% accuracy within 100 bp, respectively. Furthermore, the prediction score of our model was significantly and negatively correlated with TSS prediction errors (Fig. 1D, Spearman’s r = −0.50, P < 2.2 × 10⁻¹⁶), enabling us to improve TSS identification accuracy by setting an appropriate prediction score threshold.

Through the application of the deep learning model and strict criteria, our method significantly outperforms DeeReCT-TSS in identifying TE-TSS (Table S2, P = 0.029) [26]. In the H1 cell line, our method identified 954 TE-TSS, of which 626 (65.6%) were supported by CapTrap-seq data. In contrast, DeeReCT-TSS identified only 390 TE-TSS, with just 12 (3.1%) supported by CapTrap-seq data (Fig. 1E). When compared to the TE-TSS database [24], our method identified 64 previously annotated TE-TSS and 908 novel TE-TSS, of which 580 (63.9%) were supported by CapTrap-seq data (Fig. 1E). To exclude the potential impact of model overfitting on TE-TSS identification, we downloaded short-read RNA-seq data from 18 testis tissues as an external validation dataset [31], with RAMPAGE data serving as the gold standard. Our method identified 3979 TE-TSS, of which 2909 (73.1%) were supported by RAMPAGE data, while DeeReCT-TSS identified only 286 TE-TSS, with just 107 (37.4%) supported by RAMPAGE data (Fig. 1F). Compared to the TE-TSS database, our method identified 157 previously annotated TE-TSS and 3822 novel TE-TSS, of which 2752 (72.0%) were supported by RAMPAGE data (Fig. 1F). These results suggest that we have developed a deep learning-based method that can accurately identify TE-initiated transcripts using multiple biological replicates of short-read RNA-seq data, significantly outperforming existing methods.

Due to the relatively accurate annotation of TE-initiated transcripts, our method effectively distinguishes between TE-initiated transcription and TE exonization (Fig. S1), enabling the measurement of promoter activity in TE families. To evaluate the quantitative performance at the family level, we downloaded short-read RNA-seq data and corresponding CAGE data (used as the gold standard for quantitative evaluation) from six cancer cell lines (A549, Hela, HepG2, K562, MCF-7, and SK-N-SH) from the ENCODE database [30]. Our method demonstrated significantly better consistency with CAGE data compared to existing TE family expression quantification methods, TEtranscripts (Fig. 1G, P < 2.2 × 10⁻¹⁶). Among all TE families, the median Spearman correlation coefficient between the quantitative results from our method and CAGE data across these six cell lines was 0.26, while TEtranscripts had a value of −0.37. Furthermore, our method identified 105 TE families with high accuracy (Spearman’s r ≥ 0.7), whereas TEtranscripts identified only 6. Notably, the age of these 105 high-accuracy TE families (measured by the evolutionary distance to ancestral TEs: Kimura value) was significantly lower than that of other TE families (Fig. 1G, P = 9.2 × 10⁻⁴), which may be related to the higher retention of promoter activity in younger TEs.

Genome-wide identification of TE-initiated transcripts in pan-cancers

In 10 079 tumour samples across 33 cancer types from the TCGA, we identified 38 995 TE-initiated transcripts (Table S3), of which 34 670 were not annotated in GENCODE.v46 [32]. The expression of these TE-initiated transcripts was associated with the activation of 39 703 TE loci, including 11 857 SINE elements, 11 758 LTR elements, 10 911 LINE elements, and 2580 DNA transposons (Fig. 2A). These active TEs contributed to the expression of 14 453 neighbouring genes, including 9315 protein-coding genes and 4771 lncRNA genes (Fig. 2B), suggesting that TEs play a widespread role in regulating gene expression in tumour tissues.

Figure 2

Tumour-specific TE-initiated transcription promotes tumourigenesis. (A) The distribution of the activated TE loci across the TE classes. (B) The distribution of the neighbouring genes across the gene type. (C) The reactome pathway enrichment analysis of neighbouring genes associated with the tumour-specific TE-initiated transcripts. (D) The TE family enrichment analysis of tumour-specific TE-initiated transcripts. (E) The 10 TE-initiated transcripts with the highest tumour-specific scores are presented. The left-most panel shows the TE-gene along with the tumour-specific score. The next two panels display the number of tumour samples in which each TE-initiated transcript is expressed and the distribution across cancer types. The final two panels show the expression levels of the TE-initiated transcript and the percentage of total oncogene expression contributed by TE-initiated transcription. (F) Kaplan–Meier plots demonstrating overall survival associated with the expression of TE-initiated transcripts.

Open in new tab Download slide

Tumour-specific expressed transcripts play a critical role in tumourigenesis and represent important sources of tumour-specific antigens, which may provide new targets for cancer immunotherapy. To investigate this, we performed the expression analysis of TE-initiated transcripts in 10 079 tumour samples, 740 tumour-adjacent samples, and 10 620 non-testicular normal tissue samples. Tumour-specific scores were calculated, identifying 6203 tumour-specific TE-initiated transcripts (scores ≥ 8, Table S4). These activated tumour-specific TE-initiated transcripts were associated with the expression of 1532 adjacent protein-coding genes. As anticipated, enrichment analysis revealed significant overrepresentation in tumourigenesis-related pathways, including regulation of PTEN gene transcription and TP53 activity (Fig. 2C, Table S5). Among the 702 oncogenes annotated in the ONGene database [44], we found that TEs upregulated the expression of 333 oncogenes (a process known as onco-exaptation), generating 696 TE-initiated transcripts (Table S6), including previously reported onco-exaptations such as AluJb-LIN28B, L1HS-SYT1, L1HS-FGGY, and L1HS-MET [15, 18, 45]. Compared to a recent similar study [15], we identified 150 reported and 546 novel onco-exaptation events, with 263 (48.2%) supported by CAGE data from at least one cancer cell line. Among them, AluJr-AKT2 was expressed across all cancer types and detected in 619 tumour samples, further highlighting the universality of onco-exaptation events. Furthermore, TE family enrichment analysis revealed significant overrepresentation of young families, including L1HS, L1P1, and HERVH (Fig. 2D, Table S7). These evolutionarily young TE families retain regulatory sequences that are typically suppressed in somatic cells but are reactivated in the epigenetically dysregulated tumour tissues, driving the TE-initiated transcription and potentially promoting tumour progression.

To illustrate, we analysed the 10 TE-initiated transcripts with the highest tumour-specific scores and found that 8 were initiated by the L1HS family (Fig. 2E). These 10 TE-initiated transcripts were expressed in 275–1069 tumour samples but only in 1 or 2 normal tissue samples, indicating their high tumour specificity. Moreover, these activated TEs were critical in regulating the expression of adjacent genes, with L1HS-CNBD1, L1HS-CFAP47, and L1P2-PATE2 serving as the primary transcripts for their respective genes. Furthermore, EYA4 and NLRP7 have been shown to promote cell proliferation, invasion, and migration in various cancers [46, 47], suggesting that the expression of L1HS-EYA4 and AluSz-NLRP7 may have significant oncogenic effects. Survival analysis further indicated that the expression of L1HS-EYA4, and L1HS-CRPPA was associated with poor prognosis (Fig. 2F). In conclusion, these highly tumour-specific TE-initiated transcripts may play a role in tumourigenesis and hold potential as biomarkers for tumour diagnosis and prognosis.

To experimentally validate these computational predictions, we selected seven highly tumour-specific TE-initiated transcripts in lung squamous cell carcinoma (LUSC), including L1HS-ADGRL3, L1HS-CFAP47, L1HS-CNTLN, L1HS-EMBP1, L1HS-EYS, L1HS-NELL2, and L1HS-PKHD1. Expression of these transcripts was compared between the LUSC cell line NCI-H520 and the normal lung epithelial cell line BEAS-2B. All seven transcripts were robustly detected and significantly upregulated in NCI-H520 tumour cells, whereas all targets except L1HS-EMBP1 were below the detection limit in BEAS-2B normal cells (Fig. S2), confirming their tumour-restricted expression. These findings provide strong experimental validation for the accuracy of tumour-specific TE-initiated transcript predictions in TEITbase.

Furthermore, we identified 5911 of 6203 tumour-specific TE-initiated transcripts (95.3%) that showed significant tumour-type preference (Table S8, odds ratio ≥ 3, FDR < 0.05). Among them, ovarian cancer exhibited the largest number of preferentially expressed TE-initiated transcripts (Fig. S3A, n = 1973). Family level enrichment analysis revealed that, while L1HS elements were broadly enriched across multiple tumour types, LTR7, LTR5Hs, and LTR17 were specifically enriched in testicular germ cell tumours (TGCT), suggesting their potential involvement in TGCT tumourigenesis (Fig. S3B and Table S9). Notably, the LTR7 promoter drives TGCT-specific expression of TE_15281 (odds ratio = 1887, FDR < 2.2 × 10⁻¹⁶), producing a 5′-truncated isoform of MYCT1, which may contribute to TGCT tumourigenesis.

Content and usage of TEITbase

Based on the identification of TE-initiated transcripts across various cancer types, we have developed a free, user-friendly website, TEITbase (http://teitbase.medbioinfo.org/), which provides expression data for 1027 TE families and 38 995 TE-initiated transcripts (Fig. 3A). The website consists of several main pages: Home, Browse, Analyse, Statistics, Download, Help, and Contact. TEITbase offers an intuitive interface and a range of core functionalities, enabling researchers to explore and understand the role of TE-initiated transcription in cancers.

Figure 3

Applications of TEITbase. (A) The homepage of TEITbase. (B) Box plot showing the promoter activity of L1HS in TCGA tumours (top) and GTEx normal tissues (bottom). (C) Box plot showing the promoter activity of L1HS in LUSC and adjacent normal tissues. (D) Box plot illustrating the promoter activity of L1HS across different stages of LUSC. (E) Kaplan–Meier plots showing overall survival associated with the promoter activity of L1HS in LUSC. (F) The reactome pathway enrichment analysis of genes positively correlated with the promoter activity of L1HS in LUSC. (G, H) Proportion of TCGA tumour samples (G) and GTEx normal tissue samples (H) expressing TE_11880. (I) Box plot showing the expression level of TE_11880 in LUSC, STAD, and adjacent normal tissues. (J) Box plot illustrating the expression level of TE_11880 across different stages of STAD. (K) Kaplan–Meier plots showing overall survival associated with the expression of TE_11880 in COAD.

Open in new tab Download slide

The Browse page is divided into two sections: TE Family and TE-initiated Transcripts. Users can query TE family entries by TE family, TE superfamily, and species. The query results display the TE family, Dfam database ID, TE superfamily, TE classification, species, consensus sequence length, Kimura value, and tissue-specificity score (Tau index). Users can also query TE-initiated transcripts using TEITbase database ID, TE family, and gene. The results for these queries include the TEITbase database ID, TE family, TE superfamily, TE classification, chimeric gene, transcript classification, coding types, tumour specificity score, and TSS location. According to the GENCODE v46 annotation, TE-initiated transcripts are classified into three categories: annotated, chimeric (with a novel splice site at the 5′ end), and intergenic (without known splice sites). Based on predicted open reading frames, TE-initiated transcripts are further categorized into six types: annotated, 5′ truncated, 5′ chimeric, chimeric truncated, other coding types, and non-coding. The Analyses page allows users to query tumour-specific TE-initiated transcripts, as well as survival-associated, stage-associated, and differentially expressed TE-initiated transcripts (or TE families). Additionally, the Statistics page provides the basic information of TEITbase, such as the number of TE-initiated transcripts. The Help page includes a user guide for the database, while the Download page offers links to download all annotation information, expression matrices, and analysis results.

Using the functions of TEITbase, we investigated the dysregulation of the L1HS family in cancer. We found that lung squamous carcinoma (LUSC), oesophageal carcinoma, and bladder urothelial carcinoma exhibit higher L1HS promoter activity, while normal tissues show lower activity (Fig. 3B). Differential expression analysis revealed significant activation of the L1HS promoter in 16 out of 21 cancer types, with the most pronounced increase in LUSC (Fig. 3C, P < 2.2 × 10⁻¹⁶). In LUSC, L1HS promoter activity in Stage II patients was significantly higher than in Stage I patients (Fig. 3D, P = .035). Survival analysis indicated that in 5 out of 33 cancer types, including liver hepatocellular carcinoma, abnormal activation of the L1HS promoter was associated with poorer tumour prognosis. Conversely, activation of the L1HS promoter is significantly correlated with better tumour prognosis in LUSC (Fig. 3E), adrenocortical carcinoma, and mesothelioma. To explore how L1HS promoter activation improves prognosis, we conducted a correlation analysis between L1HS promoter activity and gene expression, followed by enrichment analysis of genes positively correlated with its expression. We discovered that in LUSC, activation of the L1HS promoter was linked to apoptosis-regulating signalling pathways (Fig. 3F), suggesting that active transposition of L1HS may disrupt tumour genome stability, leading to cancer cell apoptosis [48].

By utilizing the ‘Tumour-specific transcripts’ function of TEITbase, we investigated the role of TE_11880 (L1HS-IRF1) in cancers. TE_11880 is initiated by L1HS with a tumour-specific score of 32.5 and is chimeric with IRF1, a transcriptional regulator and tumour suppressor that activates genes involved in anti-tumour immunity [49]. These findings suggest that the activation of TE promoters in cancer tissues may also confer benefits to the host. Further details about TE_11880, including its sequence and other relevant information, can be accessed in the TEITbase database. TE_11880 is silent in most normal tissue samples but activated in several cancer types (Fig. 3G, H). We selected specific cancer types to explore the differential expression, stage, and survival associations of TE_11880. Our differential expression analysis revealed that TE_11880 expression was primarily restricted to LUSC and stomach adenocarcinoma (STAD), with no expression in adjacent normal tissues (Fig. 3I). In STAD, TE_11880 expression in Stage III patients was significantly higher than in both Stage I and Stage II patients (Fig. 3J). Additionally, TE_11880 expression is associated with a favourable prognosis in colon adenocarcinoma (COAD), despite not reaching statistical significance (P = .069) (Fig. 3K), suggesting its potential prognostic value.

Discussion

The activation of TE promoters can significantly impact the transcriptomic and proteomic profiles of cancer [20]. Recent studies suggest that TE-initiated transcription may be linked to tumourigenesis and could generate tumour-specific antigens, offering potential targets for immunotherapies [12, 15]. However, the widespread use of short-read RNA sequencing technologies poses challenges due to the loss of TSS information. To address this, we developed a deep learning-based approach that predicts TSSs and accurately identifies TE-initiated transcripts, enabling the creation of a specialized database, TEITbase. TEITbase offers abundant data resources and various analytical tools, allowing users to explore the roles of TE-initiated transcripts in cancer progression. Importantly, our findings have identified over 6000 tumour-specific TE-initiated transcripts, with the L1HS family being notably enriched. As the youngest L1 family in humans, the L1HS family retains transposable ability and may play a significant role in tumourigenesis [48].

Due to its powerful model fitting capabilities, deep learning has become widely applied to predicting key features from DNA sequences such as gene expression, RNA editing, TSS, alternative splicing, and polyadenylation [25, 50, 51–53]. Several CNN-based tools have been developed for the accurate prediction of promoters and TSS, including Puffin, TSSPlant, TransPrise, and DeeReCT-TSS [25, 26, 54, 55]. Among them, DeeReCT-TSS accurately identifies active TSS across the genome by integrating DNA sequences and RNA-seq coverage information. However, DeeReCT-TSS uses CAGE-seq data to train the model, which loses relevant transcript structure information and cannot assign identified active TSS to specific transcripts. Besides, reads derived from TE sequences often produce ambiguous alignments, leading to errors in RNA-seq coverage estimation and reducing the accuracy of the identification of TE-derived TSS [56]. To address this, we downloaded CapTrap-seq data to train the model [29], enabling the prediction of TSS for each transcript, and applied strict criteria to identify TE-initiated transcript. Our method significantly outperforms DeeReCT-TSS in identifying TE-derived TSS, providing a new tool to investigate TE-initiated transcription.

Aberrantly activated TE promoters have been shown to drive the expression of oncogenes, thereby contributing to tumour initiation and progression. In this study, we identified 696 onco-exaptation events, further supporting the role of TE promoter activation in tumourigenesis. Interestingly, the expression of TE-initiated transcripts can exert both oncogenic and tumour-suppressive effects. Previous studies have reported that TE activation may facilitate the occurrence of transposition events, disrupt genomic stability, and consequently suppress tumour proliferation. Consistently, our findings revealed that in a subset of LUSC samples, L1HS promoters were significantly activated, which may be associated with cancer cell apoptosis and an improved prognosis in LUSC patients.

Our analysis revealed that among the 14 453 neighbouring genes regulated by active TEs, 4771 were lncRNA genes, accounting for a substantial proportion of TE-associated transcriptional regulation. It may reflect the relatively relaxed selective constraints on lncRNA promoters compared to protein-coding genes, making lncRNA loci more permissive to the acquisition of TE-derived TSSs [57]. Moreover, given that lncRNAs frequently exhibit tissue and tumour-specific expression [58], their preferential association with reactivated TEs in epigenetically dysregulated tumour tissues may contribute to transcriptional reprogramming during tumourigenesis. Future studies are warranted to further characterize the functional roles of these TE-driven lncRNA transcripts in cancer.

In summary, TEITbase is a user-friendly database that provides valuable insights and tools for investigating TE-initiated transcription in tumourigenesis, as well as exploring the potential roles of these transcripts in cancer diagnostics and therapeutic strategies. While it already offers a comprehensive data collection, there are still opportunities for optimization and further development. Future versions will incorporate TE exonization, which results from noncanonical splice junctions between exons and TEs, serving as a source of tumour-specific antigens [8]. Furthermore, integrating ribosome profiling or mass spectrometry-based proteomics, particularly with strategies optimized to resolve multi-mapping reads from repetitive TEs, may validate predicted coding potential and refine the identification of these tumour-specific antigens. With ongoing advancements in full-length long-read RNA-seq technologies [16], TEITbase will integrate publicly available long-read RNA-seq data to capture the full length of TE-initiated and TE-exonization transcripts. These updates are expected to further enhance TEITbase’s utility and broaden its applicability across various research fields.

Acknowledgements

We would like to express our gratitude to the TCGA and GTEx committees for their valuable contributions. We also acknowledge the ENCODE Consortium and the ENCODE production laboratories.

Author contributions

E.Y. conceived and designed the study. Y.Z., J.Q.S., X.Y.H., Y.Q.J., C.Y.T., and M.H.D. analysed the data. Y.Z., J.Q.S., M.H.D., and E.Y. interpreted the data and wrote the manuscript. All authors approved the final version of the manuscript.

Conflicts of interest

The authors declare that they have no competing interests.

Funding

This work was supported by grant from the Young Scientists Fund of the National Natural Science Foundation of China (32400534), the China Postdoctoral Science Foundation (2024M760136), and the Ministry of Science and Technology of China (2021ZD0203203).

Data availability

Details about data analysed in this study were included in the Methods section. The modified pipeline, together with the deep learning model for TSS prediction developed in this study, is available at https://github.com/yunzhang1998/Deep-TEIRI.

References

Britten

Kohne

Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms

Science

1968

;

161

529

–

10.1126/science.161.3841.529

Kapitonov

Jurka

A universal classification of eukaryotic transposable elements implemented in Repbase

Nat Rev Genet

2008

;

411

–

author reply 414

Wells

Feschotte

A field guide to eukaryotic transposable elements

Annu Rev Genet

2020

;

539

–

10.1146/annurev-genet-040620-022145

Beck

Collier

Macfarlane

et al.

LINE-1 retrotransposition activity in human genomes

Cell

2010

;

141

1159

–

10.1016/j.cell.2010.05.021

Brouha

Schustak

Badge

et al.

Hot L1s account for the bulk of retrotransposition in the human population

Proc Natl Acad Sci

2003

;

100

5280

–

10.1073/pnas.0831042100

Chuong

Elde

Feschotte

Regulatory activities of transposable elements: from conflicts to benefits

Nat Rev Genet

2016

;

–

Bourque

Leong

Vega

et al.

Evolution of the mammalian transcription factor binding repertoire via transposable elements

Genome Res

2008

;

1752

–

10.1101/gr.080663.108

Merlotti

Sadacca

Arribas

et al.

Noncanonical splicing junctions between exons and transposable elements represent a source of immunogenic recurrent neo-antigens in patients with lung cancer

Sci Immunol

2023

;

eabm6359

10.1126/sciimmunol.abm6359

Miao

Lyu

et al.

Tissue-specific usage of transposable element-derived promoters in mouse development

Genome Biol

2020

;

255

10.1186/s13059-020-02164-3

10.

Deniz

Frost

Branco

Regulation of transposable elements by DNA modifications

Nat Rev Genet

2019

;

417

–

10.1038/s41576-019-0106-6

11.

Kong

Rose

Cass

et al.

Transposable element expression in tumors is associated with immune infiltration and increased antigenicity

Nat Commun

2019

;

5228

10.1038/s41467-019-13035-2

12.

Rodriguez-Martin

Alvarez

Baez-Ortega

et al.

Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition

Nat Genet

2020

;

306

–

10.1038/s41588-019-0562-0

13.

Modzelewski

Shao

Chen

et al.

A mouse-specific retrotransposon drives a conserved Cdk2ap1 isoform essential for development

Cell

2021

;

184

5541

–

5558.e22

10.1016/j.cell.2021.09.021

14.

Babaian

Mager

Endogenous retroviral promoter exaptation in human cancer

Mobile DNA

2016

;

10.1186/s13100-016-0080-x

15.

Jang

Shah

et al.

Transposable elements drive widespread expression of oncogenes in human cancers

Nat Genet

2019

;

611

–

10.1038/s41588-019-0373-3

16.

Shi

Liu

et al.

FLIBase: a comprehensive repository of full-length isoforms across human cancers and tissues

Nucleic Acids Res

2023

;

D124

–

17.

Attig

Pape

Doglio

et al.

Human endogenous retrovirus onco-exaptation counters cancer cell senescence through calbindin

J Clin Invest

2023

;

133

e164397

18.

Sun

Zhang

et al.

LINE-1 promotes tumorigenicity and exacerbates tumor progression via stimulating metabolism reprogramming in non-small cell lung cancer

Mol Cancer

2022

;

147

10.1186/s12943-022-01618-5

19.

Laumont

Vincent

Hesnard

et al.

Noncoding regions are the main source of targetable tumor-specific antigens

Sci Transl Med

2018

;

eaau5516

10.1126/scitranslmed.aau5516

20.

Liang

Shah

et al.

Towards targeting transposable elements for cancer therapy

Nat Rev Cancer

2024

;

123

–

10.1038/s41568-023-00653-8

21.

Boumelha

Enfield

KSS

et al.

Antibodies against endogenous retroviruses promote lung cancer immunotherapy

Nature

2023

;

616

563

–

10.1038/s41586-023-05771-9

22.

Lei

Mao

et al.

ERVcancer: a web resource designed for querying activation of human endogenous retroviruses across major cancer types

J Genet Genomics

2024

;

583

–

10.1016/j.jgg.2024.09.004

23.

Stricker

Peckham-Gregory

Scheurer

CancerHERVdb: human endogenous retrovirus (HERV) expression database for human cancer accelerates studies of the retrovirome and predictions for HERV-based therapies

J Virol

2023

;

e0005923

24.

Wang

Zhang

TE-TSS: an integrated data resource of human and mouse transposable element (TE)-derived transcription start site (TSS)

Nucleic Acids Res

2023

;

D322

–

25.

Dudnyk

Cai

Shi

et al.

Sequence basis of transcription initiation in human genome

Science

2024

;

384

eadj0116

10.1126/science.adj0116

26.

Zhou

Zhang

et al.

Annotating TSSs in multiple cell types based on DNA sequence and RNA-seq data via DeeReCT-TSS

Genomics Proteomics Bioinformatics

2022

;

959

–

10.1016/j.gpb.2022.11.010

27.

Cancer Genome Atlas Research Network

Comprehensive genomic characterization defines human glioblastoma genes and core pathways

Nature

2008

;

455

1061

–

28.

GTEx Consortium, Laboratory

Data Analysis &Coordinating Center (LDACC)—Analysis Working Group

Statistical Methods groups—Analysis Working Group

et al.

Genetic effects on gene expression across human tissues

Nature

2017

;

550

204

–

29.

Carbonell-Sala

Perteghella

Lagarde

et al.

CapTrap-seq: a platform-agnostic and quantitative approach for high-fidelity full-length RNA sequencing

Nat Commun

2024

;

5278

10.1038/s41467-024-49523-3

30.

Project Consortium ENCODE

An integrated encyclopedia of DNA elements in the human genome

Nature

2012

;

489

–

31.

Özata

Mou

et al.

Evolutionarily conserved pachytene piRNA loci are highly divergent among modern humans

Nat Ecol Evol

2020

;

156

–

10.1038/s41559-019-1065-1

32.

Harrow

Frankish

Gonzalez

et al.

GENCODE: the reference human genome annotation for The ENCODE Project

Genome Res

2012

;

1760

–

10.1101/gr.135350.111

33.

Storer

Hubley

Rosen

et al.

The Dfam community resource of transposable element families, sequence models, and genome annotations

Mobile DNA

2021

;

10.1186/s13100-020-00230-y

34.

Zhou

Wang

Using the Wash U Epigenome Browser to examine genome-wide sequencing data

Curr Prot Bioinformatics

2012

;

Chapter 10:10.10.1-10.10.14

10.1002/0471250953.bi1010s40

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

35.

Dobin

Davis

Schlesinger

et al.

STAR: ultrafast universal RNA-seq aligner

Bioinformatics

2012

;

–

10.1093/bioinformatics/bts635

36.

Ramírez

Ryan

Grüning

et al.

deepTools2: a next generation web server for deep-sequencing data analysis

Nucleic Acids Res

2016

;

W160

–

37.

Tarasov

Vilella

Cuppen

et al.

Sambamba: fast processing of NGS alignment formats

Bioinformatics

2015

;

2032

–

10.1093/bioinformatics/btv098

38.

Kovaka

Zimin

Pertea

et al.

Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Genome Biol

2019

;

278

10.1186/s13059-019-1910-1

39.

Tang

Soulette

van Baren

et al.

Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns

Nat Commun

2020

;

1438

10.1038/s41467-020-15171-6

40.

Quinlan

Hall

BEDTools: a flexible suite of utilities for comparing genomic features

Bioinformatics

2010

;

841

–

10.1093/bioinformatics/btq033

41.

Kang

Yang

Kong

et al.

CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features

Nucleic Acids Res

2017

;

W12

–

42.

Camacho

Coulouris

Avagyan

et al.

BLAST+: architecture and applications

BMC Bioinf

2009

;

421

10.1186/1471-2105-10-421

Google Scholar

Crossref

WorldCat

43.

ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization

Mol Biosyst

2016

;

477

–

44.

Liu

Sun

Zhao

ONGene: a literature-based database for human oncogenes

J Genet Genomics

2017

;

119

–

10.1016/j.jgg.2016.12.004

45.

Hur

Cejas

Feliu

et al.

Hypomethylation of long interspersed nuclear element-1 (LINE-1) leads to activation of proto-oncogenes in human colorectal cancer metastasis

Gut

2014

;

635

–

10.1136/gutjnl-2012-304219

46.

de la Peña Avalos

Tropée

Duijf

PHG

et al.

EYA4 promotes breast cancer progression and metastasis through its role in replication stress avoidance

Mol Cancer

2023

;

158

10.1186/s12943-023-01861-4

47.

et al.

NLRP7 deubiquitination by USP10 promotes tumor progression and tumor-associated macrophage polarization in colorectal cancer

J Exp Clin Cancer Res

2021

;

126

10.1186/s13046-021-01920-y

48.

Liu

Zhang

et al.

Silencing of LINE-1 retrotransposons is a selective dependency of myeloid leukemia

Nat Genet

2021

;

672

–

10.1038/s41588-021-00829-8

49.

Ghislat

Cheema

Baudoin

et al.

NF-κB-dependent IRF1 activation programs cDC1 dendritic cells to drive antitumor immunity

Sci Immunol

2021

;

eabg3570

10.1126/sciimmunol.abg3570

50.

Avsec

Agarwal

Visentin

et al.

Effective gene expression prediction from sequence by integrating long-range interactions

Nat Methods

2021

;

1196

–

203

10.1038/s41592-021-01252-x

51.

Bogard

Linder

Rosenberg

et al.

A deep neural network for predicting and engineering alternative polyadenylation

Cell

2019

;

178

–

106.e23

10.1016/j.cell.2019.04.046

52.

Gao

Nan

et al.

DEMINING: a deep learning model embedded framework to distinguish RNA editing from DNA mutations in RNA sequencing data

Genome Biol

2024

;

258

10.1186/s13059-024-03397-2

53.

Jaganathan

Kyriazopoulou Panagiotopoulou

McRae

et al.

Predicting splicing from primary sequence with deep learning

Cell

2019

;

176

535

–

548.e24

10.1016/j.cell.2018.12.015

54.

Pachganov

Murtazalieva

Zarubin

et al.

TransPrise: a novel machine learning approach for eukaryotic promoter prediction

PeerJ

2019

;

e7990

55.

Shahmuradov

Umarov

Solovyev

TSSPlant: a new tool for prediction of plant Pol II promoters

Nucleic Acids Res

2017

;

gkw1353

56.

Lanciano

Cristofari

Measuring and interpreting transposable element expression

Nat Rev Genet

2020

;

721

–

10.1038/s41576-020-0251-y

57.

Kapusta

Kronenberg

Lynch

et al.

Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs

PLos Genet

2013

;

e1003470

10.1371/journal.pgen.1003470

58.

Schmitt

Chang

Long noncoding RNAs in cancer pathways

Cancer Cell

2016

;

452

–

10.1016/j.ccell.2016.03.010

Author notes

Yun Zhang and Jianqi She contributed equally.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Article Contents

TEITbase: a database for transposable element (TE)-initiated transcripts in human cancers

Abstract

Introduction

Methods

Data collection

Deep learning model for TSS prediction

Pipeline for the detection of TE-initiated transcripts

Computational validation of TE-derived TSSs and TE promoter activity

Cell culture and experimental validation of TE-initiated transcripts

Annotation and expression analysis of TE-initiated transcripts

Clinically associated TE families or TE-initiated transcripts

Construction of the TEITbase website

Results

A deep learning-based method for the identification of TE-initiated transcript

Genome-wide identification of TE-initiated transcripts in pan-cancers

Content and usage of TEITbase

Discussion

Acknowledgements

Author contributions

Conflicts of interest

Funding

Data availability

References

Author notes

Supplementary data

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

Article Contents

TEITbase: a database for transposable element (TE)-initiated transcripts in human cancers Open Access

Abstract

Introduction

Methods

Data collection

Deep learning model for TSS prediction

Pipeline for the detection of TE-initiated transcripts

Computational validation of TE-derived TSSs and TE promoter activity

Cell culture and experimental validation of TE-initiated transcripts

Annotation and expression analysis of TE-initiated transcripts

Clinically associated TE families or TE-initiated transcripts

Construction of the TEITbase website

Results

A deep learning-based method for the identification of TE-initiated transcript

Genome-wide identification of TE-initiated transcripts in pan-cancers

Content and usage of TEITbase

Discussion

Acknowledgements

Author contributions

Conflicts of interest

Funding

Data availability

References

Author notes

Supplementary data

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Gift article access

Gift article access

Gift article access

Gift article access

TEITbase: a database for transposable element (TE)-initiated transcripts in human cancers