CSHL ENCODE Small RNA Production Pipeline 
Jan 2011 Data Freeze

Carrie A. Davis, Wei Lin, Alex Dobin and Thomas Gingeras

Structure of Small RNA Library:  

5’ Adapter (This is the RNA ligated onto the 5’ end): “r” = ribose, RNA base
	5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrU
Alternate Barcoded 5’ Adapter (This is the RNA ligated onto the 5’ end): “r” = ribose, RNA base
	5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrU

[Note some libraries may contain a counter at the 3’ end of the 5’ inker that contain some random bases NNN followed by a constant region, CG.  These can be used to more accurately determine expression values and will be read off as the first 5 bases in read 1].

3’ Adapter: Polyadenines added by Poly-A Polymerase
RT Primer:
	5’-TCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTTTTTTTTTTTTVN
PE 5’ PCR (PCR Primer):
	5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC
PE 3’ PCR (PCR Primer):
	5’-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTC

Sequencing:  The libraries are sequenced in SE36 format from the 5’ end of the transcript.  

Pre-Mapping Filtration of the Reads:  Biochemistry is not perfect. Some residual DNA and small nonfunctional decay products are likely to be surreptitiously, yet reproducibly cloned in any protocol and will hence be mapped and survive IDR.  To avoid misinterpretation of these false positives we apply a variety of filters.  

It is important to remove the A-tails and Illumina 3’ linker added in vitro prior to mapping else the inclusion of these sequences can mismap these reads to sites that are genomically rich in polyadenine stretches and have homology to the Illumina adapter.  Prior to mapping, the sequences are scanned through and sequences with homology to the adapters are clipped off according to the following rule. The clipper algorithm first aligns the adapter sequence to the reads and finds alignment with the fewest number of mismatches. If the number of mismatches is less than 20% of the aligned length it clips off to sequence from the position of the first aligned base to the end of the read.  After clipping, the remaining sequence as well as the other unclipped sequences are then mapped.

Mapping Parameters:  STAR is used to map the reads with the following parameters: 

ntTrim5p 0
ntTrim3p 0
QasciiSubtract 64
Qsplit 10
Qtop 10
penGgap 1000
penNoncan 41
minDel 5
penDel 11
penDelBase 10
winFlank 250000
anchorMaxDist 501000
anchorMaxMapN 20
nMatch 16
nMM 10
pMM 0.1
multMapMaxScoreDiff 19
minDonorAcceptor 5
chimL 10000000
minChimericDA 20
maxChimReadGap 0
multMapNmaxOut 10
adapter3pSeq AAAAAAAAAAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA
strandType 1
adapter3pClipMMp 0.2

All mapped reads satisfying these criteria are reported in the .BAM and .Wig files at UCSC.

Post-Mapping Filtration of Mapped Reads:  Some reads that would result in false mapping survive the pre-mapping filter so we apply additional post-mapping filters prior to element calling.
(1) If the mapped read contains 5 or more consecutive A’s in its sequence it is discarded.  We believe these are artifacts escaping the current pre-mapping filter.

Other types of false positives can only be identified post-mapping and also need to be filtered: 

The smaller the sequence is, the more times it is likely to map non-specifically to multiple locations hence we filter (2) all reads that map less than 16 nucleotides in length and (3) all reads that map to more than one locus (i.e. all multimappers).  Note: We don’t think that all the multimappers are artifacts. Indeed, many of the small RNA classes are by nature heavily repeated. Simply that we don’t yet know how to treat multimappers and their removal streamlines analysis.  

(4) Yes, RT prefers to use RNA as a template. However, it is also active on DNA.  Anchored oligo-dT is used to prime the RT reaction. Reads mapping upstream of genomically encoded poly-A tracks could represent RT priming off of residual DNA.  We filter mapped reads if there are 4 or more consecutive A’s in the 7 nucleotides 3’ of the mapped sequence. 

Filtered data from both biological replicates is then pooled and contigs are called.

Contig Calling (Annotation Agnostic):  Contigs represent the union of the qualified/post-filtered mapped reads from each biological replicate.  Each contig gets a score representing the sum of the reads from both bioreps.  Contigs with only a single read (0,1; 1,0) are discarded.  The surviving contigs are then independently scored for each replicate according to their mapped read count.  These values are used to compute Non-parametric-IDR (A Dobin, CSHL) values for each contig. The original non-parametric-IDR is based on the presumption of correlation between element abundances and their reproducibility.  The npIDR vs. read number curve is smoothed by taking the logarithm on both values and then performing linear fitting as equation.
, where ‘r’ is the total read number in both replicates. These new npIDR are reported in .BED format. The contigs with (0,X; X,0) are then filtered out from the bed file although they have been included in Non-parametric-IDR calculation.

.BED Format:
Column 1: Chromosome ID
Column 2: Contig Start
Column 3: Contig Stop
Column 4: Contig Name (named by co-ordinates and cross-reference on gencode v7)
Column 5: Contig Expression Score (RPKM computed using the summed reads from both bioreplicate. Value capped at 1,000). 
Column 6: Strand
Column 7:  Real RPKM (non-capped)
Column 8: IDR score (.dot if there is only 1 replicate)
Column 9: Sum of the read counts for both biological replicates. 

Expression Values for Gencode v7 Annotate Smaller RNAs (Annotation Dependent): Reads per million of mapped read (RPM) values were computed for all Gencode v7 exons.

Similar filtering criteria is applied here.  Namely, prior to determining the expression values for the above exons we filter out multi-mappers, read that map < 16 nucleotides in length, mapped reads with internal A-stretches, mapped reads that have genomically encoded A-stretch at their 3’ ends. The exons are then scored independently for each biological replicate according to their RPM values.  A read is counted only if it’s 5’ end maps into the exon.  Exons with a score of zero in both biological replicates are removed.  Smoothed Non-parametric IDR (described above) is then calculated.  The results are reported in GFF format. 

.GFF Format:
Column 1: Chromosome ID
Column 2: Gencode Annotation version (v7)
Column 3: Gencode Annotation type (exon)
Column 4: Start
Column 5: Stop 
Column 6: Strand
Column 7:  Exon Expression Score (RPM computed using the summed reads from both bioreplicate. Value capped at 1,000). 
Column 8: Strand
Column 9: Description, IDR value determined on read number in both biological replicates.