TRedD—A database for tandem repeats over the edit distance

Table 1.

The first six rows of the database table for chromosomes of Homo sapiens

Sequence	Repeats Found	Date
Homo_Sapiens.dna.chromosome.1	91 814 View Table View Graph	12/17/2008
Homo_Sapiens.dna.chromosome.2	92 525 View Table View Graph	12/19/2008
Homo_Sapiens.dna.chromosome.3	69 829 View Table View Graph	12/20/2008
Homo_Sapiens.dna.chromosome.4	69 485 View Table View Graph	1/22/2009
Homo_Sapiens.dna.chromosome.5	65 195 View Table View Graph	1/23/2009
Homo_Sapiens.dna.chromosome.6	62 481 View Table View Graph	1/24/2009

This table contains one row per chromosome of Homo sapiens (1–22, X, Y).

This table has columns for the sequence name, the number of repeats found and the date the program was run. In the ‘Repeats Found’ column, the number of repeats found in the chromosome is displayed, as well as two links for viewing the results, ‘View Table’ and ‘View Graph’. Since this is the most important aspect of the database for the novice user, we devote the following two subsections to the viewing of results.

Table view

If the user selects ‘View Table’ in the ‘Repeats Found’ column (see Table 1), a table of the results for that specific chromosome is displayed. Table 2 shows the first few lines of the table of repeats for chromosome 1 of Homo sapiens. The table contains one line per tandem repeat found, with the following data on each repeat: Alignment, Start, End, Length, Period, Repetitions, Errors and Percent Match. The Alignment column is a link that allows the user to view the actual multiple alignment of the copies of the repeat. Table 3 shows the alignment that appears when the user clicks on the link in the second line of Table 2. The ‘Definition’ section describes how the alignment is generated.

Table 2.

The table view of the repeats shows details about the repeats found in a chromosome

Homo Sapiens Chromosome 1
Alignment	Start	End	Length	Period	Repetitions	Errors	% Match
View	1	468	468	6.1	77.2	20	95.75
View	621	860	240	74.8	3.2	7	95.83
View	9169	9308	140	70.0	2.0	5	92.96
View	20 718	20 755	38	1.9	20.0	4	89.19
View	20 726	20 785	60	13.0	4.6	6	87.50

The first five repeats found in chromosome 1 of Homo sapiens are shown in this table.

Table 3.

The multiple alignment of the periods of the second repeat found in chromosome 1 of Homo sapiens, occurring at locations 621–860

621 GGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGC--AGACACATGCTAGCGCGTC--GGGGTGGAGGCGT 692
693 GGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGT 768
769 GGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGT 844
845 GGCGCAGGCGCAGAGA 860

This corresponds to row 2 in Table 2.

Start and End give the positions in the chromosome at which the tandem repeat starts and ends. Length is the number of bases in the repeat (End - Start + 1). Since the period lengths of a given repeat vary due to insertions and deletions, the Period column gives the average period length. The Repetitions column gives the number of copies of the period in the tandem repeat; for example, caggcaggcaggcag has 3.75 copies. This corresponds to the number of rows in the multiple alignment. The Errors column gives the sum of the edit operations (insertions, deletions and mismatches) between consecutive periods of the repeat.
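The Length and Repetitions columns can be reproduced with a short calculation. The sketch below assumes that Repetitions is the repeat length divided by the average period, which agrees with the caggcaggcaggcag example and with the rounded values in Table 2. The function names are illustrative and are not part of TRedD.

```python
def repeat_length(start: int, end: int) -> int:
    """Number of bases covered by a repeat with inclusive start and end loci."""
    return end - start + 1


def repetitions(length: int, average_period: float) -> float:
    """Number of copies, assuming copies = length / average period."""
    return length / average_period


# Worked example from the text: caggcaggcaggcag has period 4.
example = "caggcaggcaggcag"
copies = repetitions(len(example), 4)

# Second row of Table 2: Start 621, End 860, Period 74.8.
row2_length = repeat_length(621, 860)
row2_copies = round(repetitions(row2_length, 74.8), 1)
```

Because the Period column is itself rounded, values recomputed this way may differ slightly from the Repetitions column for some rows.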

The table is paginated, and 100 repeats are shown per page. It is possible to sort each page by any one of the columns, in ascending or descending order, by clicking on the arrows in the column heading. Furthermore, it is possible to filter the output by one or more of several criteria. Suppose a user is interested in viewing long repeats, or repeats with large period size. Since there are so many repeats, sorting each page separately would be too cumbersome. We have therefore provided querying capabilities using MySQL queries.

To create a query, click the ‘Filter’ link immediately above the table. A query form will appear on the page, with the following fields: Start Location, End Location, Length, Errors, Minimum Period Size and Percent Match. The user can enter any starting and ending location in text boxes. This facilitates the search for repeats within known genes. The rest of the fields have pull-down menus for the user to choose from. These menus are dynamic; for example, the Length menu ranges from the shortest repeat to the longest repeat found in this specific chromosome. The user simply chooses values for the fields and clicks Submit, and only those repeats that satisfy the criteria are shown in the table below. To close the filter, click the Filter link again. This query feature is extremely useful for analyzing the large volume of output in the database.
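The kind of query that the filter form issues can be sketched as follows. This is only an illustration: the table name, column names and thresholds are hypothetical, since the actual TRedD schema is not given here, and SQLite (from the Python standard library) stands in for MySQL so that the example is self-contained. The sample rows are the five repeats of Table 2.

```python
import sqlite3

# Hypothetical schema; the real TRedD tables and column names are not described.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repeats (start_pos INTEGER, end_pos INTEGER, length INTEGER, "
    "period REAL, repetitions REAL, errors INTEGER, pct_match REAL)"
)
rows = [
    (1, 468, 468, 6.1, 77.2, 20, 95.75),
    (621, 860, 240, 74.8, 3.2, 7, 95.83),
    (9169, 9308, 140, 70.0, 2.0, 5, 92.96),
    (20718, 20755, 38, 1.9, 20.0, 4, 89.19),
    (20726, 20785, 60, 13.0, 4.6, 6, 87.50),
]
conn.executemany("INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def filter_repeats(conn, start_min, end_max, min_length, min_period, min_pct_match):
    """Return repeats inside [start_min, end_max] that meet every threshold."""
    query = (
        "SELECT start_pos, end_pos, length, period, pct_match FROM repeats "
        "WHERE start_pos >= ? AND end_pos <= ? AND length >= ? "
        "AND period >= ? AND pct_match >= ? ORDER BY start_pos"
    )
    params = (start_min, end_max, min_length, min_period, min_pct_match)
    return conn.execute(query, params).fetchall()


# Long repeats with a large period in the first 25 000 bases.
long_repeats = filter_repeats(conn, 1, 25000, 100, 50.0, 90.0)
```

Passing the form values as query parameters, rather than pasting them into the SQL string, keeps free-text fields such as the start and end locations from altering the query.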

Graphical view using TandemGraph

Most genomes have a high content of tandem repeats, and Homo sapiens is no exception. In the sequence of chromosome 1, TRedD contains 91 814 repeats. In the table view (described in the previous section), these repeats are displayed to the end user in tables, one line per tandem repeat, 100 lines per page. This yields 919 pages for chromosome 1 alone. In order for biologists to be able to analyze this data, we felt that it must be presented in a clear graphical visualization, allowing both a high-level overview and varying levels of detail. To this end, we have developed a new software tool called ‘TandemGraph’ to graphically depict the tandem repeats in a sequence (25). TandemGraph allows one to view the entire set of tandem repeats in a chromosome in a single image, and then to continuously zoom in to see further details.

The representation used in TandemGraph is largely based upon the model of the pygram (or pyramid diagram) (26), which uses overlaid triangles, in a similar manner to an earlier design called the ‘landscape’ (27). Our model uses overlaid outlined colored triangles to represent the tandem repeats (see Figure 1). Each triangle represents one tandem repeat; thus, each triangle in the graph corresponds to one row in the table view. Given a sequence S of length n, and a list of the substrings of S that are tandem repeats, a representation is a two-dimensional graph, where the x-axis is labeled with the actual sequence, and triangles are drawn in the area above it, with the height of each triangle representing the length of the repeat. The left x-coordinate of each triangle represents the first nucleotide of the repeat sequence, while the right x-coordinate represents the end of the repeat. Because the triangles are outlined rather than filled, all overlapping repeats remain clearly visible.
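The geometry described above can be sketched in a few lines. This is an illustration rather than the TandemGraph implementation: the function name is hypothetical, and placing the apex midway between the two base corners is an assumption, since the text only fixes the base coordinates and the height.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def triangle_for_repeat(start: int, end: int) -> List[Point]:
    """Vertices of the outlined triangle for one repeat on a linear scale."""
    length = end - start + 1          # height equals the repeat length
    apex_x = (start + end) / 2        # assumption: apex centred over the repeat
    return [(start, 0.0), (apex_x, float(length)), (end, 0.0)]


# Two repeats from Table 2, each mapped to one triangle.
triangles = [triangle_for_repeat(621, 860), triangle_for_repeat(9169, 9308)]
```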

Figure 1.

This view of TandemGraph shows the repeats that occur in a segment of chromosome 11 of Homo sapiens. Each triangle represents a tandem repeat. Below the graph, the green bar represents a new zooming segment, and the text boxes allow entry of actual indexes for zooming.

Information about other attributes, such as period size and percent error, appears in a triangle as the user mouses over it. In addition, when the mouse is placed in a triangle, the triangle is filled in, to clarify which of the overlapping triangles is selected. Once a triangle is selected, the user can click on it to view the multiple alignment of the actual repeat. This corresponds exactly to the user clicking ‘View Alignment’ in the specific row of the table. In Figure 2, we show the same view as in Figure 1 with a triangle selected.

Figure 2.

The same view of chromosome 11 as shown in Figure 1 is shown here. In this figure, a triangle has been selected, and the multiple alignment of this repeat is displayed in a pop-up window.

Zooming features: the highest level view represents an entire chromosome. At this level the graph generally degenerates to a column graph, each column representing a repeat, with the height of the column representing the repeat's length. If more than one repeat is located at the same (or a nearby) location, the column will appear as a multicolored line: the bottom part represents the shortest repeat, the piece above it, in another color, represents the next longer repeat and so on.

Zooming has been implemented in several different user-friendly ways. The slider at the right provides a continuous zoom, with zoom-out and zoom-in buttons on top and bottom. The red rectangle on the bottom allows the user to drag a range of the chromosome to zoom, while the text boxes underneath the rectangle allow the user to enter actual start and end locations in the chromosome. All three of these zooming features are fully integrated, so that the actual indices appear in the start and end box as the user drags through a range.

Log-graph: in some chromosomes there are repeats that are extremely long, possibly spanning over 100 000 bases. These long repeats cause the short tandem repeats (STRs), which are much more common, to be barely visible, as their heights scale to close to zero. In situations where data covers a large range of values it is common to present the data on a logarithmic scale. Thus, the y-axis is labeled with powers of 10 (10, 100, 1000, etc.). A repeat with length l has height log10(l). Using the log-scale, a triangle that represents a repeat that is a substring of another repeat will not necessarily fit entirely inside its superstring's triangle. Therefore, we represent repeats as trapezoids. The left and right x values are the same as described above for the triangles, and the height of the trapezoid is log10(l), where l represents the repeat's actual length. TandemGraph includes radio buttons to allow the user to switch the graph from triangles and linear heights to trapezoids and log-scale heights. In Figure 3, we show the same area in chromosome 11 that is shown in both Figures 1 and 2, with the trapezoid button selected.
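A small numeric check illustrates why triangles stop nesting once heights are taken on a base-10 logarithmic scale. The sketch assumes isosceles triangles with the apex centred over the repeat, and the two repeats are hypothetical; it only tests whether the apex of the inner triangle lies under the outline of the outer one.

```python
import math


def outline_height_at(x, start, end, apex_height):
    """Height of an isosceles triangle over [start, end] at position x."""
    mid = (start + end) / 2
    half_width = (end - start) / 2
    return apex_height * max(0.0, 1 - abs(x - mid) / half_width)


def apex_fits_inside(inner, outer, scale):
    """True if the inner repeat's apex lies within the outer repeat's triangle."""
    inner_start, inner_end = inner
    outer_start, outer_end = outer
    inner_height = scale(inner_end - inner_start + 1)
    outer_height = scale(outer_end - outer_start + 1)
    apex_x = (inner_start + inner_end) / 2
    limit = outline_height_at(apex_x, outer_start, outer_end, outer_height)
    return inner_height <= limit


outer = (1, 1000)   # hypothetical repeat of length 1000
inner = (101, 200)  # hypothetical sub-repeat of length 100

fits_linear = apex_fits_inside(inner, outer, scale=float)      # linear heights
fits_log = apex_fits_inside(inner, outer, scale=math.log10)    # log10 heights
```

On the linear scale the inner apex (height 100) sits well below the outer outline at that position, whereas on the log scale the inner height is 2 while the outer triangle, whose apex is only 3, has already dropped below 1 there.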

Figure 3.

This view in TandemGraph is the same region of Homo sapiens chromosome 11 shown in Figures 1 and 2. Here, the trapezoid button is selected, and the log-graph displays trapezoids to represent the repeats.

We have run TandemGraph on all 24 chromosomes of Homo sapiens (1–22, X, Y) and the results are excellent. TandemGraph provides a graphical interface to huge amounts of data, previously available as text only. We have fully integrated the TandemGraph tool with the TRedD database. Thus, when the user chooses ‘View Graph’, the TandemGraph application opens, automatically connects to the TRedD database and downloads the information about the repeats. TandemGraph includes a menu of all of the chromosomes in the human genome, so that the user can switch to a different chromosome without returning to the browser.

Note: In order to run TandemGraph directly from the browser, it is necessary to have Java installed on your computer. Java is available as a free download from Sun Microsystems at http://java.com/en/download/. It is also preferable to use Firefox to open the Java program. With Internet Explorer, it is necessary to save the executable and rename it with a .jar filename extension.

Current and future work

Several enhancements to the TRedD database are under way. In this section, we discuss two of the projects that we are currently working on: comparing our results with other tandem repeats data, and merging our data with existing annotation.

Evaluation of the repeats found in TRedD

There are two different general approaches to defining an approximate tandem repeat. The first is a consensus-type repeat, that is, there is some string called a consensus, that is similar to all copies of the repeat. Note that it is possible that the consensus string does not appear as an exact copy in the repeat. We say that a repeat has K errors, if the pairwise sum of the errors between each copy and the consensus is K. Benson et al. in TRDB (11) follow the consensus approach.

Our approach is different, in that we consider evolutive repeats, where we relate each copy to the preceding and following copy. We count the errors between adjacent copies, and there is not necessarily any agreement over all of the copies. The assumption here is that each copy is derived from a neighboring copy, possibly with mutations.

Following is an interesting observation relating the two different definitions.

Observation 1. Every consensus-type repeat with K errors is also an evolutive repeat with no more than 2K errors.

This observation can be proven by considering the changes necessary to convert a copy into the consensus, and then converting the consensus into the following copy. We point out that this observation does not work vice versa.
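The argument can be written out with the triangle inequality for the edit distance. The notation is introduced here for illustration and does not appear in the original text: the copies of the repeat are c_1, ..., c_m, the consensus is s, and d denotes the edit distance.

```latex
\begin{align*}
K &= \sum_{i=1}^{m} d(c_i, s), \\
\sum_{i=1}^{m-1} d(c_i, c_{i+1})
  &\le \sum_{i=1}^{m-1} \bigl( d(c_i, s) + d(s, c_{i+1}) \bigr) \\
  &\le 2 \sum_{i=1}^{m} d(c_i, s) = 2K.
\end{align*}
```

The last step holds because each copy appears at most twice in the middle sum, once as c_i and once as c_{i+1}.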

Following this observation, we see that any program that finds evolutive tandem repeats should find consensus-type repeats as well. This has been confirmed to some extent by tests that we have been running to compare our results with the results of TRDB. Many of the repeats found in TRDB are found almost identically in TRedD. In addition, TRedD should contain repeats that are inherently evolutive; those repeats are not found by a consensus-repeat program. It is stated in the literature that evolutive tandem repeats occur in biological sequences (20). We are in the midst of analyzing the additional results of our program to determine what kind and how many evolutive tandem repeats actually occur in the human genome. In chromosome 1, our preliminary tests have revealed that 36% of the repeats found in TRedD have no overlapping repeats in the output of TRDB. Below we give a simple example of a repeat that occurs in chromosome 1 and is inherently evolutive.

Example of evolutive-type repeat which does not have a consensus

The period begins as TGTA, changes to TATA, TATG, TAT, and then TT.

227807473 TGTA 227807476
227807477 TGTA 227807480
227807481 TATA 227807484
227807485 TATA 227807488
227807489 TATA 227807492
227807493 TATG 227807496
227807497 TAT- 227807499
TAT
227807500 T-T 227807501
TT
227807502 TT 227807503
227807504 TT 227807505
227807506 TT 227807507
227807508 T 227807508

Errors: 4 Percent Matching: 88.24%
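The error count for this example can be checked by summing the edit distance between each copy and the next. The sketch below uses a standard dynamic-programming edit distance with unit costs; it is not the TRedD algorithm itself, and it leaves out the trailing partial copy, which is an assumption about how incomplete copies are handled.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and mismatches cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or mismatch
        previous = current
    return previous[-1]


def evolutive_errors(copies):
    """Sum of edit distances between each copy and the one that follows it."""
    return sum(edit_distance(x, y) for x, y in zip(copies, copies[1:]))


# Complete copies from the alignment above; the trailing partial copy "T" is omitted.
copies = ["TGTA", "TGTA", "TATA", "TATA", "TATA",
          "TATG", "TAT", "TT", "TT", "TT", "TT"]
total_errors = evolutive_errors(copies)
```

The four changes are TGTA to TATA, TATA to TATG, TATG to TAT and TAT to TT, in agreement with the reported error count of 4.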

Following is another very interesting example that we noticed during evaluation. In TRDB, four different repeats are reported, beginning and ending at the same locations as the following repeat. Our program reports this as one repeat, with a changing period size, averaging 24.2. This different view of (perhaps) the same repeat illustrates the importance of using different definitions and software for locating tandem repeats in the human genome.

TRDB output:

Start End Period Size
110832 110998 11
110832 110998 9
110832 110998 20
110832 110998 26

TRedD output:

Start: 110832 End: 111001 Period Size: 24.2
110832 TATATATTATATATCTATTA 110851
110852 TATATAATATATATCTATTA 110871
TATAT-A-ATAT---ATATCTATTA
110872 CATATTATATATTGTATATCTATTA 110896
CATAT-TAT-ATAT-TGTATATCTATTA
110897 CATATATATTATATATGTAT-TATAT-A 110922
CAT-ATATATTATATATGTATTATATA
110923 TATTATATATTATATATGTATTATATA 110949
110950 TATTATATATTATATATCTATTATATA 110976
TATTATATATTATATATCTATTATAT-A
110977 TA-TA-ATATTATATAT-TA-TATATCA 111000
111001 T 111001

Relating repeats to known genes and diseases

There are numerous popular biological databases that include information about the human genome and the genes occurring in it. It is important that our data be understood in the context of this existing information, in a setting that is familiar to researchers in biology. In order to integrate our data with other known genomic features that are interesting to biologists, we are working on joining our data with a well-known genome browser. Our first attempt is through the UCSC Genome Browser (28) (http://genome.ucsc.edu/cgi-bin/hgCustom?hgsid=148664807), due to its popularity among biologists. We are using the genomic annotations, i.e. standard tracks, and the mechanism for adding extra information, called ‘custom annotation tracks’.

We have begun parsing our data into GFF format in order to be able to upload it to the UCSC browser as a custom track. In Figure 4, we show an example of the UCSC genome browser for gene TNFSF9 from human chromosome 19, which extends from start location 6 482 010 to end location 6 486 939. We chose this particular range of chromosome 19, since there are known disease genes in this range (18). We have turned off most of the tracks that come with UCSC's browser, and kept only the genes, repeat masker and simple repeat tracks. Our custom track is red and it is labeled TRED.
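As a rough sketch of this conversion, the snippet below formats one repeat as a nine-column, tab-separated GFF line preceded by a custom track line. The source, feature and group labels are assumptions made for illustration, as is the choice to leave the score empty; the text does not say how TRedD fills these fields. The example repeat is the second row of Table 2.

```python
def repeat_to_gff(chrom: str, start: int, end: int) -> str:
    """Format one repeat as a nine-column, tab-separated GFF line."""
    fields = [
        chrom,                     # seqname, e.g. "chr1"
        "TRedD",                   # source (assumed label)
        "tandem_repeat",           # feature (assumed label)
        str(start),                # start position
        str(end),                  # end position, inclusive
        ".",                       # score: none assigned in this sketch
        ".",                       # strand: not applicable
        ".",                       # frame: not applicable
        f"repeat_{start}_{end}",   # group (assumed naming, unique per repeat)
    ]
    return "\t".join(fields)


# Track line naming the custom track and colouring it red.
track_header = 'track name="TRED" description="TRedD tandem repeats" color=255,0,0'
gff_line = repeat_to_gff("chr1", 621, 860)
custom_track = "\n".join([track_header, gff_line])
```

Giving each repeat its own group name keeps the browser from joining separate repeats into one feature.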

Figure 4.

This figure is the view of the UCSC genome browser of gene TNFSF9 from human chromosome 19. The custom track is the red one, and it is labeled Tred.

We have checked eight genes manually, and our results appear to agree with the repeat masker and simple repeat tracks on a larger scale, without being exactly the same (which is expected). In Figure 4, we chose an example that shows that our repeats are similar to known repeat predictions but are still different enough to warrant further investigation.

Currently, we are in the midst of:

Adding repeat results from all of the chromosomes as custom tracks. This will integrate our results with the genomic features available in UCSC's genome browser.
Adding all repeat results from other repeats databases, such as TRDB. This will facilitate the comparison of our results with results from other repeat finding software.
Evaluating pros and cons between the UCSC genome browser and other browsers such as JBrowse (http://gmod.org/wiki/JBrowse).
Implementing two-way search between repeat elements and annotated genomic features.

Conclusion

We have created a database of tandem repeats in the human genome based upon a new and innovative definition of evolutive tandem repeats. We have also developed a tool to graphically depict the repeats occurring in a sequence which will greatly facilitate analysis of results. This tool can be used as well with other repeat-finding software, and we will distribute it freely.

Some questions that may be asked about evolutive tandem repeats concern the frequency and the levels of mutational difference between adjacent copies within a repeat. Non-uniform patterns of difference may suggest that the mutation process favors a restricted range of copy-to-copy similarity. It is our hope that the scientific community will use our database to gain new insights into tandem repeats in DNA, and we invite users to give feedback and suggestions.

Funding

National Science Foundation (grant number DB&I 0542751); Professional Staff Congress-City University of New York Research Award Program (grant number 62280-0040i). Funding for the Open Access charge: National Science Foundation Grant.

Conflict of interest statement. None declared.

References

1. Gatchel JR, Zoghbi H. Diseases of unstable repeat expansion: mechanisms and common principles, Nat. Rev. Genet., 2005, vol. 6 (pg. 743-755)

2. Mirkin SM. DNA structures, repeat expansions and human hereditary disorders, Curr. Opin. Struct. Biol., 2008, vol. 16 (pg. 351-358)

3. Usdin K. The biological effects of simple tandem repeats: lessons from the repeat expansion diseases, Genome Res., 2008, vol. 18 (pg. 1011-1019)

4. Jeffreys AJ. DNA typing: approaches and applications, J. Forensic Sci. Soc., 1993, vol. 3 (pg. 204-211)

5. Uform M, Wayne R. Microsatellites and their application to population genetic studies, Curr. Opin. Genet. Dev., 1993, vol. 3 (pg. 939-943)

6. Spong G, Hellborg L. A near-extinction event in lynx: do microsatellite data tell the tale?, Conservat. Ecol., 2002, vol. 6 pg. 15

7. Benson G. Sequence alignment with tandem duplication, J. Comp. Biology, 1997, vol. 4 (pg. 351-367)

8. Kitada H, Tono K, Yamamoto MT, et al. Multiple alignment of biological sequences containing tandem repeats, Genome Informat., 1996, vol. 7 (pg. 276-277)

http://www.loria.fr/mreps/

9. Frazier ME, Johnson GM, Thomassen DG, et al. Realizing the potential of the genome revolution: the genomes to life program, Science, 2003, vol. 300 (pg. 290-293)

10. Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology, Science, 2003, vol. 300 (pg. 286-290)

11. Benson G. Tandem repeats finder – a program to analyze DNA sequences, Nucleic Acids Res., 1999, vol. 27 (pg. 573-580)

12. Kolpakov R, Kucherov G. mreps: efficient and flexible detection of tandem repeats in DNA, Nucleic Acids Res., 2003, vol. 31 (pg. 3672-3678)

13. Wexler Y, Yakhini Z, Kashi Y, Geiger D. Finding approximate tandem repeats in genomic sequences. In Bourne PE, Gusfield D (eds), RECOMB, 2004, New York, NY, ACM (pg. 223-232)

http://bioinf.dms.med.uniroma1.it/JSTRING

14. Parisi V, Fonzo VD, Aluffi-Pentini F. STRING: finding tandem repeats in DNA sequences, Bioinformatics, 2003, vol. 19 (pg. 1733-1738)

15. Boeva V, Makeev V, Régnier M. SWAN: searching for highly divergent tandem repeats in DNA sequences and statistical significance, Proc. IEEE Comp. Soc., JOBIM'04, 2004, Montréal, IEEE Computer Society

16. Gelfand Y, Rodriguez A, Benson G. TRDB - the Tandem Repeats Database, Nucleic Acids Res., 2007, vol. 35 (pg. 80-87)

17. Ruitberg C, Reeder D, Butler J. STRBase: a short tandem repeat DNA database for the human identity testing community, Nucleic Acids Res., 2001, vol. 29 (pg. 320-322)

18. Boby T, Patch A, Aves S. TRbase: a database relating tandem repeats to disease genes for the human genome, Bioinformatics, 2005, vol. 21 (pg. 860-921)

19. Sokol D, Benson G, Tojeira J. Tandem repeats over the edit distance, Bioinformatics, 2007, vol. 23 (pg. e30-e35)

20. Groult R, Leonard M, Mouchard L. Speeding up the detection of evolutive tandem repeats, Theor. Comput. Sci., 2004, vol. 310 (pg. 309-328)

21. Landau GM, Schmidt J, Sokol D. An algorithm for approximate tandem repeats, J. Comput. Biol., 2001, vol. 8 (pg. 1-18)

22. Main M, Lorentz R. An O(n log n) algorithm for finding all repetitions in a string, J. Algorithms, 1984, vol. 5 (pg. 422-432)

23. Landau GM, Vishkin U. Fast parallel and serial approximate string matching, J. Algorithms, 1989, vol. 10 (pg. 157-169)

24. Landau GM, Myers EW, Schmidt JP. Incremental string comparison, SIAM J. Comput., 1998, vol. 27 (pg. 557-582)

25. Sokol D, Rakhamimov R. TandemGraph: a graphical tool for modeling string regularities. In Arabnia HR, Yang MQ (eds), BIOCOMP, 2009, Athens, GA, CSREA Press (pg. 536-540)