Repeat Browser Tutorial

Last Updated Oct 2019

This tutorial walks you through visualizing data on the UCSC Repeat Browser. The Repeat Browser provides an easy way of visualizing genomic data on consensus versions of repeat families. This can be useful in a variety of ways; for instance, if you'd like to study a particular transcription factor and its binding to transposable elements (TEs), the Repeat Browser can aggregate the data from every TE of the same class and display its binding on a consensus. The Repeat Browser is further described in Fernandes, Haeussler et al., 2018.

Step 1: Process your data on the human genome assembly of your choice (hg19 or hg38).

The Repeat Browser is most commonly used to examine ChIP-SEQ data, but potentially any coordinate data (e.g. ATAC-SEQ, annotation sets) can be lifted. Indeed, many standard annotations are already lifted and available as default tracks. While nothing stops you from lifting RNA-SEQ data, you might want to stop and think about whether that's what you really want to do (see FAQ).

For most ChIP-SEQ workflows you will map your reads to an assembly of the human genome. The two most recent assemblies are hg19 and hg38. Genomic mapping is typically done using a mapping algorithm like bowtie2 or bwa. Since you are studying repeats, you probably don't want to get rid of multi-mapping reads (reads which map equally well to multiple parts of the genome)! Note that bowtie2 should probably be run in non-deterministic mode to assign multi-mapping reads randomly.

After mapping, you will call peaks with peak calling software like macs2. The result will be something like a bed file containing coordinates on the human genome that you now wish to view on the Repeat Browser. In most cases we are most interested in the summits of peaks, which we can extend by an arbitrary number of nucleotides (typically +/- 5-50 bases) to smooth Repeat Browser peaks.

We provide two sample files that you can use for this tutorial. These files are ChIP-SEQ summits from this highly recommended paper. ZNF765 is a KRAB Zinc Finger Protein which binds the transposable element families L1PA6, L1PA5 and L1PA4 in a quite characteristic way.

Sample Files:
ZNF765_Imbeault_hg19.bed [summits of hg19 mapping and peak calling; summits extended to 40 nt]
ZNF765_Imbeault_hg38.bed [the above file "lifted" to hg38]
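
The summit extension described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the sample files: it assumes a tab-separated summits file in which the second column holds the 0-based summit position (as in a macs2 summits.bed), and the function name `extend_summits` and the default flank of 20 nt per side (40 nt total) are choices made for this example.

```python
def extend_summits(lines, flank=20):
    """Extend each summit by `flank` bases on both sides.

    Assumes tab-separated BED-like lines whose second column is the
    0-based summit position. Extra columns (name, score, ...) are kept.
    """
    extended = []
    for line in lines:
        # Skip blank lines and UCSC header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, summit = fields[0], int(fields[1])
        start = max(0, summit - flank)  # never go below the start of the sequence
        end = summit + flank            # BED end coordinate is exclusive
        extended.append("\t".join([chrom, str(start), str(end)] + fields[3:]))
    return extended
```

The sketch does not clip intervals at chromosome ends; if your summits can fall close to the end of a sequence you would need the chromosome sizes to do that.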

Step 2: Lift from the human genome assembly (hg19 or hg38) to the Repeat Browser (repeats2).

To lift you need to download the liftOver tool. "Lifting" is usually a process by which you can transform coordinates from one genome assembly to another. For the Repeat Browser we are "lifting" from the human genome to a library of consensus sequences.

You can download the appropriate binary from here:
http://hgdownload.soe.ucsc.edu/admin/exe/.
For instance, the tool for Mac OSX (x86, 64bit) is:
http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/liftOver

Once you have downloaded it, put it in your PATH or working directory so that typing "liftOver" at the command prompt prints a usage message for liftOver.
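
If you drive your analysis from a script, a quick check like the one below can confirm that the tool is reachable before you start. It is a minimal sketch using Python's standard `shutil.which`; the helper name `find_tool` is made up for this example. On Mac OSX and Linux a freshly downloaded binary usually also needs to be made executable (for example with `chmod +x liftOver`) before it can be found this way.

```python
import shutil


def find_tool(name="liftOver"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)
```

Calling `find_tool()` returns the path to liftOver when it is installed correctly and `None` otherwise; the same helper works for the other tools used later in this tutorial.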

Once you have liftOver you need the liftOver file which provides mappings from the appropriate human genome assembly (hg19 or hg38) to the Repeat Browser (repeats2).

You can download the files here:
hg19Torepeats2.liftOver [transforms hg19 coordinates to Repeat Browser coordinates]
hg38Torepeats2.liftOver [transforms hg38 coordinates to Repeat Browser coordinates]

Now you have all three ingredients to lift to the Repeat Browser:
1) Your hg38/hg19 data
2) Your hg38 or hg19 to repeats2 liftover file
3) The liftOver tool

You can use the following syntax to lift:

liftOver <hg38/hg19 data> <hg38/hg19 liftover file> <Repeat Browser output file> <unmapped file>

The "Repeat Browser output file" is your data, now in Repeat Browser coordinates. The unmapped file contains all the genomic data that could not be lifted. This should mostly be data which is not on repeat elements. You don't need this file for the Repeat Browser, but it is nice to have.

This procedure implemented on the demo file is:

liftOver ZNF765_Imbeault_hg38.bed hg38Torepeats2.liftOver ZNF765_Imbeault_hg38_repeats2.bed ZNF765_Imbeault_hg38_repeats2.unmapped
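
To get a feel for how much of your data did not lift, you can tally the unmapped file. The sketch below assumes the layout liftOver normally writes for unmapped records: a comment line starting with "#" that gives the reason, followed by the original record. The function name `summarize_unmapped` is hypothetical, and the exact reason strings may differ between liftOver versions.

```python
from collections import Counter


def summarize_unmapped(lines):
    """Count unmapped records per reason line in a liftOver unmapped file."""
    reasons = Counter()
    current = "unknown"  # used if a record appears before any reason line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A reason line applies to the record(s) that follow it.
            current = line.lstrip("#").strip()
        else:
            reasons[current] += 1
    return reasons
```

For the demo you could call it as `summarize_unmapped(open("ZNF765_Imbeault_hg38_repeats2.unmapped"))` and compare the total with the number of lines in the input bed file.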

Now you have a file which can be visualized on the Repeat Browser! If you wish to turn it into a coverage track, do the following. This requires bedtools, the repeats2 "genome file" (repeats2.chromInfo), and the UCSC tools bedSort and bedGraphToBigWig, which are available in the same download directory as liftOver (http://hgdownload.soe.ucsc.edu/admin/exe/). First sort the lifted file:

bedSort ZNF765_Imbeault_hg38_repeats2.bed ZNF765_Imbeault_hg38_repeats2_sort.bed

followed by

bedtools genomecov -bg -split -i ZNF765_Imbeault_hg38_repeats2_sort.bed -g repeats2.chromInfo > ZNF765_Imbeault_hg38_repeats2_sort.bg

followed by

bedGraphToBigWig ZNF765_Imbeault_hg38_repeats2_sort.bg repeats2.chromInfo ZNF765_Imbeault_hg38_repeats2_sort.bw
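
If you run these three steps often, it can help to generate the commands from a single input name so the intermediate files stay consistent. The following is a sketch only: the function names `coverage_commands` and `run_pipeline` and the `_sort` naming scheme are choices made for this example, and it assumes bedSort, bedtools and bedGraphToBigWig are all on your PATH.

```python
import subprocess


def coverage_commands(lifted_bed, chrom_sizes="repeats2.chromInfo"):
    """Build the sort, coverage and bigWig commands for one lifted bed file.

    Returns a list of (argv, stdout_path) pairs; stdout_path is None when
    the tool writes its own output file.
    """
    stem = lifted_bed[:-4] if lifted_bed.endswith(".bed") else lifted_bed
    sorted_bed = stem + "_sort.bed"
    bedgraph = stem + "_sort.bg"
    bigwig = stem + "_sort.bw"
    return [
        (["bedSort", lifted_bed, sorted_bed], None),
        # genomecov prints the bedGraph to stdout, so it is redirected to a file.
        (["bedtools", "genomecov", "-bg", "-split",
          "-i", sorted_bed, "-g", chrom_sizes], bedgraph),
        (["bedGraphToBigWig", bedgraph, chrom_sizes, bigwig], None),
    ]


def run_pipeline(lifted_bed, chrom_sizes="repeats2.chromInfo"):
    """Run the three steps in order, stopping at the first failure."""
    for argv, stdout_path in coverage_commands(lifted_bed, chrom_sizes):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

For the demo file, `run_pipeline("ZNF765_Imbeault_hg38_repeats2.bed")` would produce `ZNF765_Imbeault_hg38_repeats2_sort.bw`, which can be loaded as a custom track.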

Step 3: Visualizing on the Repeat Browser

Go to the Repeat Browser.
Click on My Data -> Custom Tracks

You can now upload the file (or copy and paste links to multiple files).

Click on Genome Browser

Your track will appear either as "User Track" (if no track information is in the file) or as a named track in the (Other) section.
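
To get a named track instead of "User Track", a text-based custom track such as a bed file can start with a UCSC `track` line. The helper below is a small sketch that adds one if it is missing; the function name `with_track_line` and the example track name are illustrative, and only the `name` and `description` attributes are set here.

```python
def with_track_line(bed_lines, name, description=""):
    """Prepend a UCSC custom-track header unless the data already has one."""
    lines = list(bed_lines)
    if any(line.startswith("track ") for line in lines):
        return lines  # leave existing track information untouched
    header = 'track name="{}" description="{}"\n'.format(name, description or name)
    return [header] + lines
```

For example, `with_track_line(open("ZNF765_Imbeault_hg38_repeats2.bed"), "ZNF765", "ZNF765 summits")` returns the lines to write to a new file for upload.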

The sample file (hg19) should look as below on L1PA5: [click here for interactive session]

You can go to any other repeat type by simply typing the name of the repeat into the search bar. A full list of all consensus repeats and their lengths is here.

Step 4: Analyze your data with existing tracks.

You can click on the Table Browser to perform intersections, unions, etc. through its user interface, as you normally would with the Table Browser on the UCSC Genome Browser. You can also download tracks and perform this analysis on the command line with many of the UCSC tools.

FAQ

Can I visualize RNA-SEQ data on the Repeat Browser?

It is possible to map coverage tracks or other aspects of your RNA-SEQ data to the Repeat Browser, but it's not really clear what this means. The consensuses we provide have unequal coverage from all their genomic instances (see "Mapping Coverage" track in "Mapping to the Human Genome"). In simpler terms, if you look at all L1PA elements, there are many, many more bits of the 3' end scattered throughout the genome than full-length elements, so some sort of normalization would be needed to meaningfully interpret results mapped to the consensus.

If you are interested in quantifying the expression of TE families we recommend using existing tools designed for that purpose. TETranscripts, REdiscoverTE, and SalmonTE all do this in slightly different ways and can tell you if TEs are expressed in your dataset.

Can I visualize non-reference TE insertions on the Repeat Browser?

Depends on what you mean. Liftover files only exist for hg19 and hg38, so non-reference TEs don't have pre-computed coordinates on our consensuses and therefore can't be lifted. You can generate your own liftOver file for another assembly. Instructions on how to do that coming soon!

What you can also do is BLAT your TEs to the references and visualize them, similar to what we did for non-human primate TEs in our manuscript.
1) Download BLAT.
2) Then download the consensus sequence you wish to map to (you can View DNA for the element of interest in the browser, download the sequences from the Table Browser or download this file). You can also map to all the consensus sequences. Note that mapping to all consensus sequences is slightly different from what we do when lifting: we only ever lift from an annotated L1PA2 to the L1PA2 consensus. If you BLAT a list of L1PA2 sequences against all consensuses (repeats2) you will almost certainly get mappings to other L1PAs. Be sure that's what you want to do.
3) BLAT them to create a psl file.
4) Load the psl as a custom track.
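
Before loading the psl, it can be worth checking which consensuses your sequences actually hit, particularly if you ran BLAT against all of repeats2. The sketch below counts alignments per target sequence. It assumes the standard 21-column tab-separated psl layout, where the target name is the 14th column, and skips header lines; the function name `hits_per_consensus` is made up for this example.

```python
from collections import Counter


def hits_per_consensus(psl_lines):
    """Count psl alignments per target (consensus) name."""
    counts = Counter()
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 columns and start with the integer match count;
        # anything else (the psLayout header, blank lines) is skipped.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        counts[fields[13]] += 1  # column 14 is tName, the consensus hit
    return counts
```

If a list of L1PA2 sequences shows many hits to other L1PA consensuses, that is the cross-mapping described in step 2, and you may prefer to BLAT against the single consensus you are interested in.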

Are your references the RepBase consensuses?

No. Our consensuses are generated from RepeatMasker output. RepBase consensuses are available from GIRI with a subscription.

Why is <favorite TE element> not included?

Your TE probably isn't in the hg19 RepeatMasker output. Is your TE present in humans? If so, it's probably one of the TEs only present in hg38 annotations. If there is sufficient interest in it, let us know and we can try adding the consensus in a later update, but only hg38 data would be able to lift to it.

Where is L1PA9?

Good question, it's not in RepeatMasker output!

What is the deal with L1PA3long and L1PA3short?

Each of these is a subclass of L1PA3 elements. L1PA3 elements deleted a 129-bp region of themselves to escape ZNF93-mediated repression. See Jacobs et al., 2014 for details. Please note that the provided liftover files only lift to L1PA3long, which is the L1PA3 consensus produced through our standard workflow.

Does this exist for mouse (mm10)?

No. Would you find a mouse browser useful? Let us know!

I have a question! I guess it's not frequently asked.

Please email

jferna10 at ucsc.edu
max at soe.ucsc.edu