Description

This track shows Open Reading Frame annotations. For consensuses from Dfam these come directly from the Dfam 3.1 annotations stored here.

You can also view them on Dfam.org under the "features" tab.

For consensuses built by the RepeatBrowser "buildSeqs" script we used tblastn to search the RepeatMasker peptide library and took the highest scoring hit (the highest scoring hit also had to pass a score threshold of 50).

Note that the Repeat Masker Peptide Library does not contain peptides for every element. That is, L1PA2 proteins are assigned to L1HS not because of a poor match, but because L1PA2 is the representative protein for all young L1PA (i.e. there is no L1HS peptide in the library).

Methods

#for exact matches

awk '($6=="exact")' summary.tsv | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s " " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > exact_annotations.tab

#for renamed matches

awk '($6=="int-strip")' summary.tsv | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s " " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > intstrip_annotations.tab

#for manual matches

awk '($6!="int-strip" && $6!="exact" && $6!="cons")' summary.tsv | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s " " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > manual_annotations.tab

cat intstrip_annotations.bed manual_annotations.bed exact_annotations.bed | awk '{print $0"\t0\t+"}'| sort | uniq > Dfam_annotations.bed

#built cons

awk '($6=="cons")' ../summary.tsv | cut -f 1 | while read i; do sed -n -e '/^>'${i}'$/,/^>/ p' RepeatBrowserHub/hg38reps/hg38reps.fa | head -n-1 >> built_cons.fa; done

cd peps/

#RepeatMasker pep library from Hubley

wget https://github.com/rmhubley/RepeatMasker/raw/master/Libraries/RepeatPeps.lib

#

makeblastdb -in built_cons.fa -dbtype nucl -parse_seqids

#phylogenetic groupings for TEs found in hg38 (this list was arrived at by taking the Dfam name, reduce repeat peptide libs to peptides from those groups

cut -f 1 summary.tsv | while read i ; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep "^OS" | cut -f2; done > hg38_TE_phlyo.txt

cat hg38_TE_phlyo.txt | tr -s " " "\t" | cut -f2 | sort | uniq | tr "\n" "|"

#copy and paste output to grep,reduces peptide library significantly, prevents aligning proteins from random creatures to human

grep -E 'Afrotheria|Amniota|Boreoeutheria|Carnivora|Catarrhini|Cetartiodactyla|Euarchontoglires|Euteleostomi|Eutheria|Glires|Haplorrhini|Hominidae|Homininae|Hominoidea|Homo|Mammalia|Metatheria|Metazoa|Monotremata|Primates|Rodentia|Sauropsida|Scandentia|Simiiformes|Tetrapoda|Theria|Vertebrata|Xenarthra' RepeatPeps.lib -A1 > HumanRepeatPeps.lib

grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -A1 "$i" HumanRepeatPeps.lib;done > Human_built_cons_peps.lib

sed -zi "s/--\n//g" Human_built_cons_peps.lib

#find peptides in consensuses

tblastn -db built_cons.fa -query Human_built_cons_peps.lib -outfmt 6 > built_consensus_annotations.tab

cat built_consensus_annotations.tab | sort | uniq | sort -k 2,12 -n -r > temp.tab

#take the highest scoring gag and pol, (since all built consesnsus are L1s); Also impose filter that peptide score must also be above 50

grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -P "\t$i\t" temp.tab | grep "gag" | sort -k 12 -n -r | head -n+1 | awk '($12> 50) {print $2"\t"$9-1"\t"$10"\t"$1"\t"$12"\t+"}';done | sort | uniq > cons_gag.bed

grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -P "\t$i\t" temp.tab | grep "pol" | sort -k 12 -n -r | head -n+1 | awk '($12> 50) {print $2"\t"$9-1"\t"$10"\t"$1"\t"$12"\t+"}';done | sort | uniq > cons_pol.bed

mv *.bed ../

cd ..

#combine and change score to 0 (avoid issues with decimals etc that make browser unhappy)

cat Dfam_annotations.bed cons_gag.bed cons_pol.bed HERV_full.bed| awk '{print $1"\t"$2"\t"$3"\t"$4"\t"0"\t"$6}' | sed "s/L1PBA1/L1PBa1/g" | bedtools sort > ORFs.bed

bedToBigBed ORFs.bed ../hg38reps/hg38reps.sizes ORFs.bb

mv ORFs.bb ../hg38reps/orfs.bb

References

https://www.dfam.org/home

The Dfam database of repetitive DNA families, Robert Hubley; Robert D. Finn; Jody Clements; Sean R. Eddy; Thomas A. Jones; Weidong Bao; Arian F.A. Smit; Travis J. Wheeler;Nucleic Acids Research (2016) Database Issue 44:D81-89. doi: 10.1093/nar/gkv1272

< a href = "http://www.repeatmasker.org/libraries/"> RepeatPeps library from RepeatMasker

Email:markd@ucsc.edu or max@soe.ucsc.edu