CSHL ENCODE Small RNA Production Pipeline Jan 2011 Data Freeze Carrie A. Davis, Wei Lin, Alex Dobin and Thomas Gingeras Structure of Small RNA Library: 5’ Adapter (This is the RNA ligated onto the 5’ end): “r” = ribose, RNA base 5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrU Alternate Barcoded 5’ Adapter (This is the RNA ligated onto the 5’ end): “r” = ribose, RNA base 5’- rArCrArCrUrCrUrUrUrCrCrCrUrArCrArCrGrArCrGrCrUrCrUrUrCrCrGrArUrCrU [Note some libraries may contain a counter at the 3’ end of the 5’ inker that contain some random bases NNN followed by a constant region, CG. These can be used to more accurately determine expression values and will be read off as the first 5 bases in read 1]. 3’ Adapter: Polyadenines added by Poly-A Polymerase RT Primer: 5’-TCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTTTTTTTTTTTTVN PE 5’ PCR (PCR Primer): 5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC PE 3’ PCR (PCR Primer): 5’-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTC Sequencing: The libraries are sequenced in SE36 format from the 5’ end of the transcript. Pre-Mapping Filtration of the Reads: Biochemistry is not perfect. Some residual DNA and small nonfunctional decay products are likely to be surreptitiously, yet reproducibly cloned in any protocol and will hence be mapped and survive IDR. To avoid misinterpretation of these false positives we apply a variety of filters. It is important to remove the A-tails and Illumina 3’ linker added in vitro prior to mapping else the inclusion of these sequences can mismap these reads to sites that are genomically rich in polyadenine stretches and have homology to the Illumina adapter. Prior to mapping, the sequences are scanned through and sequences with homology to the adapters are clipped off according to the following rule. The clipper algorithm first aligns the adapter sequence to the reads and finds alignment with the fewest number of mismatches. If the number of mismatches is less than 20% of the aligned length it clips off to sequence from the position of the first aligned base to the end of the read. After clipping, the remaining sequence as well as the other unclipped sequences are then mapped. Mapping Parameters: STAR is used to map the reads with the following parameters: ntTrim5p 0 ntTrim3p 0 QasciiSubtract 64 Qsplit 10 Qtop 10 penGgap 1000 penNoncan 41 minDel 5 penDel 11 penDelBase 10 winFlank 250000 anchorMaxDist 501000 anchorMaxMapN 20 nMatch 16 nMM 10 pMM 0.1 multMapMaxScoreDiff 19 minDonorAcceptor 5 chimL 10000000 minChimericDA 20 maxChimReadGap 0 multMapNmaxOut 10 adapter3pSeq AAAAAAAAAAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA strandType 1 adapter3pClipMMp 0.2 All mapped reads satisfying these criteria are reported in the .BAM and .Wig files at UCSC. Post-Mapping Filtration of Mapped Reads: Some reads that would result in false mapping survive the pre-mapping filter so we apply additional post-mapping filters prior to element calling. (1) If the mapped read contains 5 or more consecutive A’s in its sequence it is discarded. We believe these are artifacts escaping the current pre-mapping filter. Other types of false positives can only be identified post-mapping and also need to be filtered: The smaller the sequence is, the more times it is likely to map non-specifically to multiple locations hence we filter (2) all reads that map less than 16 nucleotides in length and (3) all reads that map to more than one locus (i.e. all multimappers). Note: We don’t think that all the multimappers are artifacts. Indeed, many of the small RNA classes are by nature heavily repeated. Simply that we don’t yet know how to treat multimappers and their removal streamlines analysis. (4) Yes, RT prefers to use RNA as a template. However, it is also active on DNA. Anchored oligo-dT is used to prime the RT reaction. Reads mapping upstream of genomically encoded poly-A tracks could represent RT priming off of residual DNA. We filter mapped reads if there are 4 or more consecutive A’s in the 7 nucleotides 3’ of the mapped sequence. Filtered data from both biological replicates is then pooled and contigs are called. Contig Calling (Annotation Agnostic): Contigs represent the union of the qualified/post-filtered mapped reads from each biological replicate. Each contig gets a score representing the sum of the reads from both bioreps. Contigs with only a single read (0,1; 1,0) are discarded. The surviving contigs are then independently scored for each replicate according to their mapped read count. These values are used to compute Non-parametric-IDR (A Dobin, CSHL) values for each contig. The original non-parametric-IDR is based on the presumption of correlation between element abundances and their reproducibility. The npIDR vs. read number curve is smoothed by taking the logarithm on both values and then performing linear fitting as equation. , where ‘r’ is the total read number in both replicates. These new npIDR are reported in .BED format. The contigs with (0,X; X,0) are then filtered out from the bed file although they have been included in Non-parametric-IDR calculation. .BED Format: Column 1: Chromosome ID Column 2: Contig Start Column 3: Contig Stop Column 4: Contig Name (named by co-ordinates and cross-reference on gencode v7) Column 5: Contig Expression Score (RPKM computed using the summed reads from both bioreplicate. Value capped at 1,000). Column 6: Strand Column 7: Real RPKM (non-capped) Column 8: IDR score (.dot if there is only 1 replicate) Column 9: Sum of the read counts for both biological replicates. Expression Values for Gencode v7 Annotate Smaller RNAs (Annotation Dependent): Reads per million of mapped read (RPM) values were computed for all Gencode v7 exons. Similar filtering criteria is applied here. Namely, prior to determining the expression values for the above exons we filter out multi-mappers, read that map < 16 nucleotides in length, mapped reads with internal A-stretches, mapped reads that have genomically encoded A-stretch at their 3’ ends. The exons are then scored independently for each biological replicate according to their RPM values. A read is counted only if it’s 5’ end maps into the exon. Exons with a score of zero in both biological replicates are removed. Smoothed Non-parametric IDR (described above) is then calculated. The results are reported in GFF format. .GFF Format: Column 1: Chromosome ID Column 2: Gencode Annotation version (v7) Column 3: Gencode Annotation type (exon) Column 4: Start Column 5: Stop Column 6: Strand Column 7: Exon Expression Score (RPM computed using the summed reads from both bioreplicate. Value capped at 1,000). Column 8: Strand Column 9: Description, IDR value determined on read number in both biological replicates.