dukeDnaseCd4Signal Duke DNase Sig Duke DNaseI Hypersensitivity Signal in CD4+ T-cells Regulation Description This track shows genome-wide DNaseI hypersensitive (HS) sites as determined in human CD4+ T-cells. DNaseI HS sites identify regions of open chromatin believed to be largely free of nucleosomes and have been shown to accurately identify the locations of genetic regulatory elements, including promoters, enhancers, silencers, insulators, and locus control regions. DNaseI HS sites are believed to differ in degree of openness. Scores shown in this annotation reflect an estimate of this openness. Display Conventions and Configuration In full and pack display modes, DNaseI HS scores are displayed as a "wiggle" (histogram), where the height reflects the size of the score. In dense display mode, DNaseI HS is shown in grayscale using darker values to indicate higher levels of overall hypersensitivity. The DNaseI HS wiggle can be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Methods DNaseI HS sites were determined for human CD4+ T-cells using DNase-sequencing and DNase-chip. Over 12.5 million uniquely aligned DNase sequence tags were generated using Illumina (Solexa) and 454 Life Sciences (Roche) sequencers from the same biological sample. Each base was assigned a score using Parzen window kernel density estimation. The Nimblegen 38-array whole genome platform was used to generate data from 2 biological replicates. Ratio scores for the two replicates were averaged. The sequencing and array data were combined by re-scaling each using Z-scores and then summing. Resulting scores above 0 are considered hypersensitive. Credits This annotation was created by Alan Boyle, Terry Furey, and Greg Crawford at Duke University's Institute for Genome Sciences & Policy (IGSP). References Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE.
High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008 Jan 25;132(2):311-22. Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS. DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006 Jul;3(7):503-9. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D, Zhou D, Luo S, Vasicek TJ, Daly MJ, Wolfsberg TG, Collins FS. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31. dukeDnaseSuper Duke DNaseI HS Duke DNaseI Hypersensitivity in CD4+ T-cells Regulation Overview This super-track combines related tracks of DNaseI sensitivity data from Duke University. These tracks contain DNaseI analysis of CD4+ T-cells, using DNase-sequencing and DNase-chip methods. CD4+ T-cells, also known as helper or inducer T cells, are involved in generating an immune response. CD4+ T-cells are also one of the primary targets of HIV. DNaseI has long been used to map general chromatin accessibility and the DNaseI "hyperaccessibility" or "hypersensitivity" that is a universal feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, insulators, promoters, locus control regions and novel elements. DNaseI hypersensitivity signifies chromatin accessibility following binding of trans-acting factors in place of a canonical nucleosome, and is a universal feature of active cis-regulatory sequences in vivo. Credits These annotations were created by Alan Boyle, Terry Furey, and Greg Crawford at Duke University's Institute for Genome Sciences & Policy (IGSP). References Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE.
High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008 Jan 25;132(2):311-22. Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS. DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006 Jul;3(7):503-9. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D, Zhou D, Luo S, Vasicek TJ, Daly MJ, Wolfsberg TG, Collins FS. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31. dukeDnaseCd4Sites Duke DNase Sites Duke DNaseI Hypersensitive Sites in CD4+ T-cells Regulation Description This track shows a discretized version of genome-wide DNaseI hypersensitive (HS) sites as determined in human CD4+ T-cells. DNaseI HS sites identify regions of open chromatin believed to be largely free of nucleosomes and have been shown to accurately identify the locations of genetic regulatory elements, including promoters, enhancers, silencers, insulators, and locus control regions. DNaseI HS sites are believed to differ in degree of openness. Scores shown in this annotation reflect an estimate of this openness. Display Conventions and Configuration DNaseI HS sites are represented as gray or black boxes. The intensity of the box reflects the HS score and represents the estimated degree of hypersensitivity of the site. Methods DNaseI HS sites were determined for human CD4+ T-cells using DNase-sequencing and DNase-chip. Over 12.5 million uniquely aligned DNase sequence tags were generated using Illumina (Solexa) and 454 Life Sciences (Roche) sequencers from the same biological sample. Each base was assigned a score using Parzen window kernel density estimation. The Nimblegen 38-array whole genome platform was used to generate data from 2 biological replicates. Ratio scores for the two replicates were averaged.
These data were combined by re-scaling each using Z-scores and then summing. This annotation reflects those contiguous regions with combined scores above zero. Each record contains the maximum base pair score within each identified region. Credits This annotation was created by Alan Boyle, Terry Furey, and Greg Crawford at Duke University's Institute for Genome Sciences & Policy (IGSP). References Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008 Jan 25;132(2):311-22. Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS. DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006 Jul;3(7):503-9. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D, Zhou D, Luo S, Vasicek TJ, Daly MJ, Wolfsberg TG, Collins FS. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31. cons17way Conservation Multiz Alignment & Conservation (17 Species) Comparative Genomics Description This track shows a measure of evolutionary conservation in 17 vertebrates, including mammalian, amphibian, bird, and fish species, based on a phylogenetic hidden Markov model, phastCons (Siepel et al., 2005). 
Multiz alignments of the following assemblies were used to generate this track: human (May 2004 (NCBI35/hg17), hg17) chimp (Nov 2003, panTro1) macaque (Jan 2006, rheMac2) mouse (May 2004, mm7) rat (Jun 2003, rn3) rabbit (May 2005, oryCun1) dog (May 2005, canFam2) cow (Mar 2005, bosTau2) armadillo (May 2005, dasNov1) elephant (May 2005, loxAfr1) tenrec (Jul 2005, echTel1) opossum (Jun 2005, monDom2) chicken (Feb 2004, galGal2) frog (Oct 2004, xenTro1) zebrafish (May 2005, danRer3) Tetraodon (Feb 2004, tetNig1) Fugu (Aug 2002, fr1) Display Conventions and Configuration In full and pack display modes, conservation scores are displayed as a "wiggle" (histogram), where the height reflects the size of the score. Pairwise alignments of each species to the human genome are displayed below as a grayscale density plot (in pack mode) or as a "wiggle" (in full mode) that indicates alignment quality. In dense display mode, conservation is shown in grayscale using darker values to indicate higher levels of overall conservation as scored by phastCons. The conservation wiggle can be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Checkboxes in the track configuration section allow excluding species from the pairwise display; however, this does not remove them from the conservation score display. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Gap Annotation The "Display chains between alignments" configuration option enables display of gaps between alignment blocks in the pairwise alignments in a manner similar to the Chain track display. The following conventions are used: Single line: No bases in the aligned species. 
Possibly due to a lineage-specific insertion between the aligned blocks in the human genome or a lineage-specific deletion between the aligned blocks in the aligning species. Double line: Aligning species has one or more unalignable bases in the gap region. Possibly due to excessive evolutionary distance between species or independent indels in the region between the aligned blocks in both species. Pale yellow coloring: Aligning species has Ns in the gap region. Reflects uncertainty in the relationship between the DNA of both species, due to lack of sequence in relevant portions of the aligning species. Genomic Breaks Discontinuities in the genomic context (chromosome, scaffold or region) of the aligned DNA in the aligning species are shown as follows: Vertical blue bar: Represents a discontinuity that persists indefinitely on either side, e.g. a large region of DNA on either side of the bar comes from a different chromosome in the aligned species due to a large scale rearrangement. Green square brackets: Enclose shorter alignments consisting of DNA from one genomic context in the aligned species nested inside a larger chain of alignments from a different genomic context. The alignment within the brackets may represent a short misalignment, a lineage-specific insertion of a transposon in the human genome that aligns to a paralogous copy somewhere else in the aligned species, or other similar occurrence. Base Level When zoomed-in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. 
To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes: No codon translation: The gene annotation is not used; the bases are displayed without translation. Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment. Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding. Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation. Codon translation uses the following gene tracks as the basis for translation, depending on the species chosen:
Gene Track        Species
Known Genes       human, mouse, rat
RefSeq Genes      chicken
MGC Genes         X. tropicalis
Ensembl Genes     Fugu, chimp
mRNAs             rhesus, rabbit, dog, cow, zebrafish
not translated    armadillo, elephant, tenrec, opossum, Tetraodon
Methods Best-in-genome pairwise alignments were generated for each species using blastz, followed by chaining and netting. The pairwise alignments were then multiply aligned using multiz, following the ordering of the species tree diagrammed above. The resulting multiple alignments were then assigned conservation scores by phastCons, using a tree model with branch lengths derived from the ENCODE project Multi-Species Sequence Analysis group, September 2005 tree model. This tree was generated from TBA alignments over 23 vertebrate species and is based on 4D sites.
The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Note that, unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size, so short highly-conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al. (2005). PhastCons currently treats alignment gaps as missing data, which sometimes has the effect of producing undesirably high conservation scores in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps. 
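The posterior probability described above can be illustrated with a small forward-backward computation on a two-state HMM. This is only a sketch under stated assumptions, not the phastCons implementation: the function name `conserved_posteriors`, the transition and start probabilities, and the per-column likelihood inputs are all hypothetical. In phastCons the per-column likelihoods come from the phylogenetic models of the two states; here they are simply supplied as positive numbers.

```python
def conserved_posteriors(lik_cons, lik_non,
                         p_stay_cons=0.9, p_stay_non=0.95, p_start_cons=0.3):
    """Posterior probability of the conserved state at each alignment column.

    lik_cons[i] / lik_non[i]: likelihood of column i under the conserved /
    non-conserved model (hypothetical inputs, assumed strictly positive).
    The transition and start probabilities are illustrative defaults only.
    """
    n = len(lik_cons)
    if n == 0:
        return []
    emit = list(zip(lik_cons, lik_non))        # emit[i][s]; s=0 conserved, s=1 non-conserved
    trans = [[p_stay_cons, 1.0 - p_stay_cons],
             [1.0 - p_stay_non, p_stay_non]]   # trans[from][to]
    start = [p_start_cons, 1.0 - p_start_cons]

    # Forward pass, rescaled at every column to avoid numeric underflow.
    fwd = []
    for i in range(n):
        if i == 0:
            cur = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            cur = [emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        total = sum(cur)
        fwd.append([c / total for c in cur])

    # Backward pass, rescaled the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        bwd[i] = [c / total for c in cur]

    # Posterior of the conserved state: normalize fwd * bwd at each column.
    post = []
    for i in range(n):
        w_cons = fwd[i][0] * bwd[i][0]
        w_non = fwd[i][1] * bwd[i][1]
        post.append(w_cons / (w_cons + w_non))
    return post
```

Because the state at each column depends on its neighbors through the transition probabilities, no fixed window size is involved: a short run of strongly conserved columns and a long run of moderately conserved columns can both yield high posteriors.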
Credits This track was created using the following programs: Alignment tools: blastz and multiz by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC Conservation scoring: PhastCons, phyloFit, tree_doctor, msa_view by Adam Siepel while at UCSC, now at Cold Spring Harbor Laboratory MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community. References Phylo-HMMs and phastCons Felsenstein, J. and Churchill, G.A. A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13, 93-104 (1996). Siepel, A. and Haussler, D. Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, pp. 325-351, Springer, New York (2005). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R. K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005). Yang, Z. A space-time process model for the evolution of DNA sequences. Genetics 139, 993-1005 (1995). Chain/Net: Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Multiz: Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F.A., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., Haussler, D., Miller, W.
Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner. Genome Res. 14(4), 708-15 (2004). Blastz: Chiaromonte, F., Yap, V.B., and Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003). Phylogenetic Tree: Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001). cons17wayViewalign Multiz Alignments Multiz Alignment & Conservation (17 Species) Comparative Genomics multiz17way Multiz Align Multiz Alignments of 17 Species Comparative Genomics Description This track shows a measure of evolutionary conservation in 17 vertebrates, including mammalian, amphibian, bird, and fish species, based on a phylogenetic hidden Markov model, phastCons (Siepel et al., 2005). Multiz alignments of the following assemblies were used to generate this track: human (May 2004 (NCBI35/hg17), hg17) chimp (Nov 2003, panTro1) macaque (Jan 2006, rheMac2) mouse (May 2004, mm7) rat (Jun 2003, rn3) rabbit (May 2005, oryCun1) dog (May 2005, canFam2) cow (Mar 2005, bosTau2) armadillo (May 2005, dasNov1) elephant (May 2005, loxAfr1) tenrec (Jul 2005, echTel1) opossum (Jun 2005, monDom2) chicken (Feb 2004, galGal2) frog (Oct 2004, xenTro1) zebrafish (May 2005, danRer3) Tetraodon (Feb 2004, tetNig1) Fugu (Aug 2002, fr1) Display Conventions and Configuration In full and pack display modes, conservation scores are displayed as a "wiggle" (histogram), where the height reflects the size of the score. Pairwise alignments of each species to the human genome are displayed below as a grayscale density plot (in pack mode) or as a "wiggle" (in full mode) that indicates alignment quality. 
In dense display mode, conservation is shown in grayscale using darker values to indicate higher levels of overall conservation as scored by phastCons. The conservation wiggle can be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Checkboxes in the track configuration section allow excluding species from the pairwise display; however, this does not remove them from the conservation score display. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Gap Annotation The "Display chains between alignments" configuration option enables display of gaps between alignment blocks in the pairwise alignments in a manner similar to the Chain track display. The following conventions are used: Single line: No bases in the aligned species. Possibly due to a lineage-specific insertion between the aligned blocks in the human genome or a lineage-specific deletion between the aligned blocks in the aligning species. Double line: Aligning species has one or more unalignable bases in the gap region. Possibly due to excessive evolutionary distance between species or independent indels in the region between the aligned blocks in both species. Pale yellow coloring: Aligning species has Ns in the gap region. Reflects uncertainty in the relationship between the DNA of both species, due to lack of sequence in relevant portions of the aligning species. Genomic Breaks Discontinuities in the genomic context (chromosome, scaffold or region) of the aligned DNA in the aligning species are shown as follows: Vertical blue bar: Represents a discontinuity that persists indefinitely on either side, e.g. a large region of DNA on either side of the bar comes from a different chromosome in the aligned species due to a large scale rearrangement. 
Green square brackets: Enclose shorter alignments consisting of DNA from one genomic context in the aligned species nested inside a larger chain of alignments from a different genomic context. The alignment within the brackets may represent a short misalignment, a lineage-specific insertion of a transposon in the human genome that aligns to a paralogous copy somewhere else in the aligned species, or other similar occurrence. Base Level When zoomed-in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes: No codon translation: The gene annotation is not used; the bases are displayed without translation. Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment. Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding. Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation. 
Codon translation uses the following gene tracks as the basis for translation, depending on the species chosen:
Gene Track        Species
Known Genes       human, mouse, rat
RefSeq Genes      chicken
MGC Genes         X. tropicalis
Ensembl Genes     Fugu, chimp
mRNAs             rhesus, rabbit, dog, cow, zebrafish
not translated    armadillo, elephant, tenrec, opossum, Tetraodon
Methods Best-in-genome pairwise alignments were generated for each species using blastz, followed by chaining and netting. The pairwise alignments were then multiply aligned using multiz, following the ordering of the species tree diagrammed above. The resulting multiple alignments were then assigned conservation scores by phastCons, using a tree model with branch lengths derived from the ENCODE project Multi-Species Sequence Analysis group, September 2005 tree model. This tree was generated from TBA alignments over 23 vertebrate species and is based on 4D sites. The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used.
Note that, unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size, so short highly-conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al. (2005). PhastCons currently treats alignment gaps as missing data, which sometimes has the effect of producing undesirably high conservation scores in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps. Credits This track was created using the following programs: Alignment tools: blastz and multiz by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC Conservation scoring: PhastCons, phyloFit, tree_doctor, msa_view by Adam Siepel while at UCSC, now at Cold Spring Harbor Laboratory MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community. References Phylo-HMMs and phastCons Felsenstein, J. and Churchill, G.A. A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13, 93-104 (1996). Siepel, A. and Haussler, D. Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, pp. 325-351, Springer, New York (2005). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R. K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 
15, 1034-1050 (2005). Yang, Z. A space-time process model for the evolution of DNA sequences. Genetics 139, 993-1005 (1995). Chain/Net: Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Multiz: Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F.A., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., Haussler, D., Miller, W. Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner. Genome Res. 14(4), 708-15 (2004). Blastz: Chiaromonte, F., Yap, V.B., and Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003). Phylogenetic Tree: Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001). cons17wayViewphastcons Element Conservation (phastCons) Multiz Alignment & Conservation (17 Species) Comparative Genomics phastCons17way 17 Species Cons 17 Species Conservation by PhastCons Comparative Genomics cons17wayViewelements Conserved Elements Multiz Alignment & Conservation (17 Species) Comparative Genomics phastConsElements17way 17 Species El 17 Species Conserved Elements Comparative Genomics Description This track shows predictions of conserved elements produced by the phastCons program. PhastCons is part of the PHAST (PHylogenetic Analysis with Space/Time models) package. The predictions are based on a phylogenetic hidden Markov model (phylo-HMM), a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next.
Methods Best-in-genome pairwise alignments were generated for each species using blastz, followed by chaining and netting. A multiple alignment was then constructed from these pairwise alignments using multiz. Predictions of conserved elements were then obtained by running phastCons on the multiple alignments with the --most-conserved option. PhastCons constructs a two-state phylo-HMM with a state for conserved regions and a state for non-conserved regions. The two states share a single phylogenetic model, except that the branch lengths of the tree associated with the conserved state are multiplied by a constant scaling factor rho (0 <= rho <= 1). The free parameters of the phylo-HMM, including the scaling factor rho, are estimated from the data by maximum likelihood using an EM algorithm. This procedure is subject to certain constraints on the "coverage" of the genome by conserved elements and the "smoothness" of the conservation scores. Details can be found in Siepel et al. (2005). The predicted conserved elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log-odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created at UCSC using the following programs: Blastz and multiz by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. AxtBest, axtChain, chainNet, netSyntenic, and netClass by Jim Kent at UCSC. PhastCons by Adam Siepel at Cornell University.
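The log-odds scoring and the a * log(x) + b transformation described in the Methods above can be sketched as follows. This is an illustration only: the constants `a` and `b` used by the track are not stated here, so the defaults below are hypothetical placeholders, and the function names and the per-column likelihood inputs are assumptions. The log-odds sum also treats alignment columns as independent, which is a simplification.

```python
import math

def element_log_odds(lik_cons, lik_non):
    """Raw log-odds of an element: log probability under the conserved model
    minus log probability under the non-conserved model, summed over columns.
    Inputs are hypothetical per-column likelihoods (strictly positive)."""
    return sum(math.log(c) - math.log(n) for c, n in zip(lik_cons, lik_non))

def browser_score(log_odds, a=100.0, b=200.0):
    """Map a raw log-odds score x to the 0-1000 range with a * log(x) + b.
    The values of a and b are placeholders, not the track's actual constants."""
    if log_odds <= 0:
        return 0  # log(x) is undefined for x <= 0; treat as the minimum score
    return int(max(0, min(1000, round(a * math.log(log_odds) + b))))
```

Because the transformation is monotonic, ranking elements by the "score" field gives the same order as ranking them by the raw log-odds value in the "name" field, apart from ties introduced by rounding and clamping.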
References PhastCons Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R. K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005). Chain/Net Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Multiz Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F.A., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., Haussler, D., Miller, W. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14(4), 708-15 (2004). Blastz Chiaromonte, F., Yap, V.B., Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003). cpgIslandExt CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. 
The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater; length greater than 200 bp; and a ratio greater than 0.6 of the observed number of CG dinucleotides to the number expected on the basis of the number of Gs and Cs in the segment. The entire genome sequence, masked areas included, was used to construct the Unmasked CpG track. The CpG Islands track is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G) where N = length of sequence.
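The island statistics defined above can be computed directly from a sequence. The sketch below is a minimal illustration, not the cpg_lh source: the function names `cpg_stats` and `meets_island_criteria` and the dictionary keys are hypothetical, while the formulas and the three thresholds are the ones stated in the text.

```python
def cpg_stats(seq):
    """Compute CpG count, Percentage CpG, GC content and Obs/Exp CpG for a sequence."""
    seq = seq.upper()
    n = len(seq)
    num_c = seq.count("C")
    num_g = seq.count("G")
    num_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    obs_exp = num_cpg * n / (num_c * num_g) if num_c and num_g else 0.0
    return {
        "length": n,
        "cpg_count": num_cpg,
        "perc_cpg": 100.0 * 2 * num_cpg / n if n else 0.0,   # CpG bases / length
        "perc_gc": 100.0 * (num_c + num_g) / n if n else 0.0,
        "obs_exp": obs_exp,                                   # CpG * N / (C * G)
    }

def meets_island_criteria(seq):
    """Apply the three criteria from the Methods text to one candidate segment."""
    s = cpg_stats(seq)
    return s["perc_gc"] >= 50.0 and s["length"] > 200 and s["obs_exp"] > 0.6
```

For example, the 8-base sequence CGCGATCG contains 3 Cs, 3 Gs and 3 CG dinucleotides, so its Obs/Exp CpG is 3 * 8 / (3 * 3), about 2.67, and its Percentage CpG is 75; it still fails the island criteria because it is far shorter than 200 bp.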
The calculation of the track data is performed by the following command sequence:

twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
  | cpg_lh /dev/stdin 2> cpg_lh.err \
  | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
  | sort -k1,1 -k2,2n > cpgIsland.bed

For the unmasked track, the twoBitToFa command is run with the -noMask option. Data access The CpG Islands track and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file"). Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447 cpgIslandSuper CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination.
The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat-masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated against the following criteria: (1) GC content of 50% or greater; (2) length greater than 200 bp; (3) ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment. The entire genome sequence, masked areas included, was used to construct the Unmasked CpG track. The CpG Islands track is constructed from the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula cited in Gardiner-Garden and Frommer (1987): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G), where N = length of sequence.
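The dinucleotide scoring step in the Methods text (+17 for CG, -1 for anything else) can be illustrated with a small Python sketch. This is a simplification under stated assumptions: it uses a Kadane-style scan to find only the single best-scoring segment, whereas the text says cpg_lh identifies maximally scoring segments (plural), and its exact procedure may differ. The names `dinucleotide_scores` and `best_segment` are hypothetical.

```python
def dinucleotide_scores(seq, cg_score=17, other_score=-1):
    """Score each overlapping dinucleotide: +17 for CG, -1 for all others."""
    seq = seq.upper()
    return [
        cg_score if seq[i:i + 2] == "CG" else other_score
        for i in range(len(seq) - 1)
    ]


def best_segment(scores):
    """Return (total, start, end) of the highest-scoring contiguous run.

    start and end index into the score list, with end exclusive.
    Assumption: a Kadane-style scan stands in for the real segment search.
    """
    best_total, best_start, best_end = 0, 0, 0
    run_total, run_start = 0, 0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help; restart the run here.
            run_total, run_start = score, i
        else:
            run_total += score
        if run_total > best_total:
            best_total, best_start, best_end = run_total, run_start, i + 1
    return best_total, best_start, best_end


if __name__ == "__main__":
    seq = "AAAACGCGCGTTTT"
    total, start, end = best_segment(dinucleotide_scores(seq))
    # Dinucleotide i covers bases i and i+1, so the base span is seq[start:end + 1].
    print(total, seq[start:end + 1])
```

The large positive weight for CG relative to the small penalty for other dinucleotides lets a segment extend across short CpG-poor stretches before its running score drops.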
The calculation of the track data is performed by the following command sequence:

twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
  | cpg_lh /dev/stdin 2> cpg_lh.err \
  | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
  | sort -k1,1 -k2,2n > cpgIsland.bed

For the unmasked track, the twoBitToFa command is run with the -noMask option. Data access The CpG Islands track and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file"). Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447 mrna Human mRNAs Human mRNAs from GenBank mRNA and EST Description The mRNA track shows alignments between human mRNAs in GenBank and the genome. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display.
For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods GenBank human mRNAs were aligned against the genome using the blat program. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 
GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 knownGene Known Genes UCSC Known Genes (June, 05) Based on UniProt, RefSeq, and GenBank mRNA Genes and Gene Predictions Description The UCSC Known Genes track shows known protein-coding genes based on protein data from UniProt (SWISS-PROT and TrEMBL) and mRNA data from the NCBI reference sequences collection (RefSeq) and GenBank. Each Known Gene is represented by an mRNA and a protein. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks with the following color scheme: Black: indicates the gene has a corresponding entry in the Protein Data Bank (PDB). Dark Blue: indicates the gene has either a corresponding RefSeq mRNA that is "Reviewed" or "Validated" or a corresponding Swiss-Prot protein. Medium Blue: indicates the gene has a corresponding RefSeq mRNA that is neither "Reviewed" nor "Validated". Light Blue: everything else. That is, the gene does not have a corresponding Protein Data Bank entry, RefSeq mRNA, or Swiss-Prot protein, but it has supporting evidence of a GenBank mRNA with a UniProt (TrEMBL) protein. This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. See the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods This release of UCSC Known Genes was built by a new process, KG II, as described below.
UniProt protein sequences (including alternative splicing isoforms) and mRNA sequences from RefSeq and GenBank were aligned against the base genome using BLAT. RefSeq alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. GenBank mRNA alignments having a base identity level within 0.2% of the best and at least 97% base identity with the genomic sequence were kept. Protein alignments having a base identity level within 0.2% of the best and at least 80% base identity with the genomic sequence were kept. Then the genomic mRNA and protein alignments were compared, and protein-mRNA pairings were determined from their overlaps. mRNA CDS data were obtained from RefSeq and GenBank data and supplemented by CDS structures derived from UCSC protein-mRNA BLAT alignments. The initial set of UCSC Known Genes candidates consists of all protein-mRNA pairs with valid mRNA CDS structures. A gene-check program (similar to the one used for the Consensus CDS (CCDS) project) is used to remove questionable candidates, such as those with in-frame stop codons, missing start or stop codons, etc. From each group of gene candidates that share the same CDS structure, the protein-mRNA pair having the best ranking and protein-mRNA alignment score is selected as a UCSC Known Gene. The ranking of a gene candidate depends on its gene-check quality measures. When all else is equal, a preference is given to RefSeq mRNAs and next to MGC mRNAs. Similarly a preference is given to gene candidates represented by Swiss-Prot proteins. The protein-mRNA alignment score is calculated based on protein to mRNA alignment using TBLASTN, plus weighted sub-scores according to the date and length of the mRNA. Credits The UCSC Known Genes track was produced using protein data from UniProt and mRNA data from NCBI RefSeq and GenBank. 
Data Use Restrictions The UniProt data have the following terms of use, UniProt copyright(c) 2002 - 2004 UniProt consortium: For non-commercial use, all databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. For commercial use, all databases and documents in the UniProt FTP directory except the files ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. More information for commercial users can be found at the UniProt License & disclaimer page. From January 1, 2005, all databases and documents in the UniProt FTP directory may be copied and redistributed freely by all entities, without advance permission, provided that this copyright statement is reproduced with each copy. References Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006;22(9):1036-46. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002 Apr;12(4):656-64. rmsk RepeatMasker Repeating Elements by RepeatMasker Repeats Description This track was created by using Arian Smit's RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences. The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track), as well as a modified version of the query sequence in which all the annotated repeats have been masked (generally available on the Downloads page). 
RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka (2000) in the References section below. Some newer assemblies have been made with Dfam, not Repbase. Details of how the database data are built can be found in our "makeDb/doc/" directory. Display Conventions and Configuration In full display mode, this track displays up to ten different classes of repeats: (1) Short interspersed nuclear elements (SINE), which include ALUs; (2) Long interspersed nuclear elements (LINE); (3) Long terminal repeat elements (LTR), which include retroposons; (4) DNA repeat elements (DNA); (5) Simple repeats (micro-satellites); (6) Low complexity repeats; (7) Satellite repeats; (8) RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA); (9) Other repeats, which include class RC (Rolling Circle); (10) Unknown. The level of color shading in the graphical display reflects the amount of base mismatch, base deletion, and base insertion associated with a repeat element. The higher the combined number of these, the lighter the shading. A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed. Methods Data are generated using the RepeatMasker -s flag. Additional flags may be used for certain organisms. Repeats are soft-masked. Alignments may extend through repeats, but are not permitted to initiate in them. See the FAQ for more information. Credits Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track. References Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. http://www.repeatmasker.org. 1996-2010. Repbase Update is described in: Jurka J. Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420.
PMID: 10973072 For a discussion of repeats in mammalian genomes, see: Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6):657-63. PMID: 10607616 Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8. PMID: 8994846 snp125 SNPs Simple Nucleotide Polymorphisms (dbSNP build 125) Variation and Repeats Description This track contains dbSNP build 125, available from ftp.ncbi.nih.gov/snp. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. 
The configuration categories reflect the following definitions (not all categories apply to this assembly): Location Type: Describes the alignment of the flanking sequence Range - the flank alignments leave a gap of 2 or more bases in the reference assembly Exact - the flank alignments leave exactly one base between them Between - the flank alignments are contiguous; the variation is an insertion RangeInsertion - the flank alignments surround a distinct polymorphism between the submitted sequence and reference assembly; the submitted sequence is shorter RangeSubstitution - the flank alignments surround a distinct polymorphism between the submitted sequence and reference assembly; the submitted sequence and the reference assembly sequence are of equal length RangeDeletion - the flank alignments surround a distinct polymorphism between the submitted sequence and reference assembly; the submitted sequence is longer Class: Describes the observed alleles Single - single nucleotide variation: all observed alleles are single nucleotides (can have 2, 3 or 4 alleles) In-del - insertion/deletion (applies to RangeInsertion, RangeSubstitution, RangeDeletion) Heterozygous - heterozygous (undetermined) variation: allele contains string '(heterozygous)' Microsatellite - the observed allele from dbSNP is variation in counts of short tandem repeats Named - the observed allele from dbSNP is given as a text name No Variation - no variation asserted for sequence Mixed - the cluster contains submissions from multiple classes Multiple Nucleotide Polymorphism - alleles of the same length, length > 1, and from set of {A,T,C,G} Insertion - the polymorphism is an insertion relative to the reference assembly Deletion - the polymorphism is a deletion relative to the reference assembly Unknown - no classification provided by data contributor Validation: Method used to validate the variant (each variant may be validated by more than one method) By Frequency - at least one submitted SNP in 
cluster has frequency data submitted By Cluster - cluster has at least 2 submissions, with at least one submission assayed with a non-computational method By Submitter - at least one submitter SNP in cluster was validated by independent assay By 2 Hit/2 Allele - all alleles have been observed in at least 2 chromosomes By HapMap - validated by HapMap project Unknown - no validation has been reported for this variant Function: dbSNP's predicted functional effect of variant on RefSeq transcripts, both curated (NM_* and NR_*) as in the RefSeq Genes track and predicted (XM_* and XR_*), not shown in UCSC Genome Browser. A variant may have more than one functional role if it overlaps multiple transcripts. Locus Region - variation is 3' to and within 500 bases of a transcript, or is 5' to and within 2000 bases of a transcript (dbSNP term: locus; Sequence Ontology term: feature_variant) Coding - Synonymous - no change in peptide for allele with respect to the reference assembly (dbSNP term: coding-synon; Sequence Ontology term: synonymous_variant) Coding - Non-Synonymous - change in peptide for allele with respect to the reference assembly (dbSNP term: coding-nonsynon; Sequence Ontology term: protein_altering_variant) Untranslated - variation is in a transcript, but not in a coding region interval (dbSNP term: untranslated; Sequence Ontology term: UTR_variant) Intron - variation is in an intron, but not in the first two or last two bases of the intron (dbSNP term: intron; Sequence Ontology term: intron_variant) Splice Site - variation is in the first two or last two bases of an intron (dbSNP term: splice-site; Sequence Ontology term: splice_site_variant) Reference (coding) - one of the observed alleles of a SNP in a coding region matches the reference assembly (dbSNP term: cds-reference; Sequence Ontology term: coding_sequence_variant) Unknown - no known functional classification Molecule Type: Sample used to find this variant Genomic - variant discovered using a genomic template cDNA
- variant discovered using a cDNA template Unknown - sample type not known Average heterozygosity: Calculated by dbSNP as described here Average heterozygosity should not exceed 0.5 for bi-allelic single-base substitutions. Weight: Alignment count Weight can be 1, 2, 3 or 10. Weight = 10 is excluded from the data set. A filter on maximum weight value is supported, which defaults to 3. Alignments to chrN_random are not included. You can configure this track such that the details page displays the function and coding differences relative to particular gene sets. Choose the gene sets from the list on the SNP configuration page displayed beneath this heading: On details page, show function and coding differences relative to. When one or more gene tracks are selected, the SNP details page lists all genes that the SNP hits (or is close to), with the same keywords used in the function category. The function usually agrees with NCBI's function, but can sometimes give a bit more detail (e.g. more detail about how close a near-gene SNP is to a nearby gene). Insertions/Deletions dbSNP uses a class called 'in-del'. This has been split into the 'insertion' and 'deletion' categories, based on location type. The location types 'range' and 'exact' are deletions relative to the reference assembly. The location type 'between' indicates insertions relative to the reference assembly. For the new location types, the class 'in-del' is preserved. UCSC Annotations In addition to presenting the dbSNP data, the following annotations are provided: The size of the dbSNP reference allele is checked to see if it matches the coordinate span; exceptions are noted. The dbSNP reference allele is compared to the UCSC reference allele, and a note is made if the dbSNP reference allele is the reverse complement of the UCSC reference allele. Single-base substitutions are noted where the alignments of the flanking sequences are adjacent or have a gap of more than one base. 
A note is made if the observed alleles are not available from the rs_fasta files. Observed alleles with an unexpected format are noted. The length of the observed alleles is checked for consistency with location types; exceptions are noted. Single-base substitutions are checked to see that one of the observed alleles matches the reference allele; exceptions are noted. Simple deletions are checked to see that the observed allele matches the reference allele; exceptions are noted. Tri-allelic and quad-allelic single-base substitutions are noted. Variants that have multiple mappings are noted. Data Sources Coordinates, orientation, location type and dbSNP reference allele data were obtained from b125_SNPContigLoc.bcp.gz. b125_SNPMapInfo.bcp.gz provided the alignment weights; alignments with weight = 10 were filtered out. Functional classification information was obtained from b125_SNPContigLocusId.bcp.gz. The internal database representation uses dbSNP's function terms, but for display in SNP details pages, these are translated into Sequence Ontology terms. Validation status and heterozygosity were obtained from SNP.bcp.gz. The header lines in the rs_fasta files were used for class, observed polymorphism and molecule type. dgv DGV Struct Var Database of Genomic Variants: Structural Variation (CNV, Inversion, In/del) Variation and Repeats Description This track displays copy number variants (CNVs), insertions/deletions (InDels), inversions and inversion breakpoints annotated by the Database of Genomic Variants (DGV), which contains genomic variations observed in healthy individuals. DGV focuses on structural variation, defined as genomic alterations that involve segments of DNA that are larger than 1000 bp. Insertions/deletions of 100 bp or larger are also included. Display Conventions Color is used to indicate the type of variation. Please note that, as of February 2013, variants link to DGV's new browser. Inversions and inversion breakpoints are purple.
CNVs and InDels are blue if there is a gain in size relative to the reference. CNVs and InDels are red if there is a loss in size relative to the reference. CNVs and InDels are brown if there are reports of both a loss and a gain in size relative to the reference. CNVs and InDels are black if there is an unknown change relative to the reference. Variants are displayed with accession numbers with the following DGV nomenclature. When possible, DGV uses accessions from peer archives of structural variation (dbVar at NCBI or DGVa at EBI). These accessions begin with either "essv", "esv", "nssv", or "nsv", followed by a number. Variant submissions processed by EBI begin with "e" and those processed by NCBI begin with "n". Accessions with ssv are for variant calls on a particular sample, and if they are copy number variants, they generally indicate whether the change is a gain or loss. Accessions with sv are for regions asserted by submitters to contain structural variants, and often span ssv elements for both losses and gains. dbVar and DGVa do not record numbers of losses and gains encompassed within sv regions. DGV merges clusters of variants that share at least 70% reciprocal overlap in size/location, and provides an sv-like record with an accession that begins with "dgv_". For most sv and dgv variants, DGV displays the total number of gains and/or losses at the bottom of their variant detail page. Since each ssv variant is for one sample, its total is 1. Methods DGV collects these variants by ongoing manual curation of the literature. A brief description of the method and sample used for a particular variant is included on the details page, along with a link to the PubMed abstract for the study from which the variants were collected. For data sets where the variation calls are reported at a sample-by-sample level, DGV merges calls with similar boundaries across the sample set. Only variants of the same type (i.e. 
CNVs, InDels, inversions) are merged, and gains and losses are merged separately. In addition, if several different platforms/approaches are used within the same study, these datasets are merged separately. Sample-level calls that overlap by ≥ 70% are merged in this process. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated access, this track, like all others, is available via our API. However, for bulk processing, downloading the dataset is recommended. The genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg19/gnomAD/structuralVariants/gnomad_v2.1_sv.sites.bb -chrom=chr6 -start=0 -end=1000000 stdout Credits Thanks to the Database of Genomic Variants for providing these data. In citing the Database of Genomic Variants please refer to Iafrate et al. (2004). References Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. Detection of large-scale variation in the human genome. Nat Genet. 2004 Sep;36(9):949-51. PMID: 15286789. Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res. 2006;115(3-4):205-14. PMID: 17124402. refSeqComposite NCBI RefSeq RefSeq genes from NCBI Genes and Gene Predictions Description The NCBI RefSeq Genes composite track shows human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq).
All subtracks use coordinates provided by RefSeq, except for the UCSC RefSeq track, which UCSC produces by realigning the RefSeq RNAs to the genome. This realignment may result in occasional differences between the annotation coordinates provided by UCSC and NCBI. For RNA-seq analysis, we advise using NCBI-aligned tables like RefSeq All or RefSeq Curated. See the Methods section for more details about how the different tracks were created. Please visit NCBI's Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track is a composite track that contains differing data sets. To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to hide. Note: Not all subtracks are available on all assemblies. The possible subtracks include: RefSeq aligned annotations and UCSC alignment of RefSeq annotations: RefSeq All – all curated and predicted annotations provided by RefSeq. RefSeq Curated – subset of RefSeq All that includes only those annotations whose accessions begin with NM, NR, NP or YP. (NP and YP are used only for protein-coding genes on the mitochondrion; YP is used for human only.) RefSeq Predicted – subset of RefSeq All that includes those annotations whose accessions begin with XM or XR. RefSeq Other – all other annotations produced by the RefSeq group that do not fit the requirements for inclusion in the RefSeq Curated or the RefSeq Predicted tracks, as they do not have a product and therefore no RefSeq accession. More than 90% are pseudogenes, T-cell receptor or immunoglobulin segments. The few remaining entries are gene clusters (e.g. protocadherin). RefSeq Alignments – alignments of RefSeq RNAs to the human genome provided by the RefSeq group, following the display conventions for PSL tracks.
RefSeq Diffs – alignment differences between the human reference genome(s) and RefSeq transcripts. (Track not currently available for every assembly.) UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the human genome. This track was previously known as the "RefSeq Genes" track. RefSeq Select+MANE (subset) – Subset of RefSeq Curated, transcripts marked as RefSeq Select or MANE Select. A single Select transcript is chosen as representative for each protein-coding gene. This track includes transcripts categorized as MANE, which are further agreed upon as representative by both NCBI RefSeq and Ensembl/GENCODE, and have a 100% identical match to a transcript in the Ensembl annotation. See NCBI RefSeq Select. Note that we provide a separate track, MANE (hg38), which contains only the MANE transcripts. RefSeq HGMD (subset) – Subset of RefSeq Curated, transcripts annotated by the Human Gene Mutation Database. This track is only available on the human genomes hg19 and hg38. It is the most restricted RefSeq subset, targeting clinical diagnostics. The RefSeq All, RefSeq Curated, RefSeq Predicted, RefSeq HGMD, RefSeq Select/MANE and UCSC RefSeq tracks follow the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), or reviewed (dark), as defined by RefSeq. Color Level of review Reviewed: the RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information. Provisional: the RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff. 
Predicted: the RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted. The item labels and codon display properties for features within this track can be configured through the check-box controls at the top of the track description page. To adjust the settings for an individual subtrack, click the wrench icon next to the track name in the subtrack list . Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name or OMIM identifier instead of the gene name, show all or a subset of these labels including the gene name, OMIM identifier and accession names, or turn off the label completely. Codon coloring: This track has an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. The RefSeq Diffs track contains five different types of inconsistency between the reference genome sequence and the RefSeq transcript sequences. The five types of differences are as follows: mismatch – aligned but mismatching bases, plus HGVS g. to show the genomic change required to match the transcript and HGVS c./n. to show the transcript change required to match the genome. short gap – genomic gaps that are too small to be introns (arbitrary cutoff of < 45 bp), most likely insertions/deletion variants or errors, with HGVS g. and c./n. showing differences. shift gap – shortGap items whose placement could be shifted left and/or right on the genome due to repetitive sequence, with HGVS c./n. position range of ambiguous region in transcript. Here, thin and thick lines are used -- the thin line shows the span of the repetitive sequence, and the thick line shows the rightmost shifted gap. 
double gap – genomic gaps that are long enough to be introns but that skip over transcript sequence (invisible in default setting), with HGVS c./n. deletion. skipped – sequence at the beginning or end of a transcript that is not aligned to the genome (invisible in default setting), with HGVS c./n. deletion HGVS Terminology (Human Genome Variation Society): g. = genomic sequence ; c. = coding DNA sequence ; n. = non-coding RNA reference sequence. When reporting HGVS with RefSeq sequences, to make sure that results from research articles can be mapped to the genome unambiguously, please specify the RefSeq annotation release displayed on the transcript's Genome Browser details page and also the RefSeq transcript ID with version (e.g. NM_012309.4 not NM_012309). Methods Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and converted to the genePred and PSL table formats for display in the Genome Browser. Information about the NCBI annotation pipeline can be found here. The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments. The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks. RefSeq RNAs were aligned against the human genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Data Access The raw data for these tracks can be accessed in multiple ways. It can be explored interactively using the REST API, Table Browser or Data Integrator. The tables can also be accessed programmatically through our public MySQL server or downloaded from our downloads server for local processing. 
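The near-best alignment filter described in the Methods above can be sketched in a few lines of Python. This is only an illustration, not the UCSC pipeline: the Alignment record and the near_best function are hypothetical names, "an alignment of less than 15%" is assumed to mean the fraction of the RNA that aligned, and "within 0.1% of the best" is assumed to mean 0.1 percentage points of base identity.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one RNA-to-genome alignment."""
    rna: str         # RefSeq accession, e.g. "NM_012309.4"
    chrom: str       # chromosome the RNA aligned to
    identity: float  # base identity as a fraction (0.0-1.0)
    coverage: float  # fraction of the RNA that aligned (assumed meaning of "15%")


def near_best(alignments, min_coverage=0.15, min_identity=0.96, window=0.001):
    """Keep, per RNA, alignments near the best identity and above the identity floor."""
    by_rna = defaultdict(list)
    for aln in alignments:
        # Step 1: discard alignments covering too little of the RNA.
        if aln.coverage >= min_coverage:
            by_rna[aln.rna].append(aln)

    kept = []
    for alns in by_rna.values():
        # Step 2: find the highest base identity among this RNA's alignments.
        best = max(a.identity for a in alns)
        # Step 3: keep alignments close to the best and at least 96% identical.
        kept.extend(
            a for a in alns
            if a.identity >= min_identity and best - a.identity <= window
        )
    return kept


example = [
    Alignment("NM_A", "chr1", 0.9990, 0.98),
    Alignment("NM_A", "chr5", 0.9985, 0.97),  # within the window of the best
    Alignment("NM_A", "chr9", 0.9900, 0.95),  # too far below the best
    Alignment("NM_A", "chrX", 1.0000, 0.10),  # aligned fraction too small
    Alignment("NM_B", "chr2", 0.9500, 0.99),  # best for NM_B, but below 96%
]
kept_chroms = [a.chrom for a in near_best(example)]
```

In this toy input only the chr1 and chr5 alignments of NM_A survive; NM_B is dropped entirely because even its best alignment falls below the 96% floor.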
The previous track versions are available in the archives of our downloads server. You can also access any RefSeq table entries in JSON format through our JSON API. The data in the RefSeq Other and RefSeq Diffs tracks are organized in bigBed file format; more information about accessing the information in this bigBed file can be found below. The other subtracks are associated with database tables as follows: genePred format: RefSeq All - ncbiRefSeq RefSeq Curated - ncbiRefSeqCurated RefSeq Predicted - ncbiRefSeqPredicted RefSeq HGMD - ncbiRefSeqHgmd RefSeq Select+MANE - ncbiRefSeqSelect UCSC RefSeq - refGene PSL format: RefSeq Alignments - ncbiRefSeqPsl The first column of each of these tables is "bin". This column is designed to speed up access for display in the Genome Browser, but can be safely ignored in downstream analysis. You can read more about the bin indexing system here. The annotations in the RefSeqOther and RefSeqDiffs tracks are stored in bigBed files, which can be obtained from our downloads server here, ncbiRefSeqOther.bb and ncbiRefSeqDiffs.bb. Individual regions or the whole set of genome-wide annotations can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system from the utilities directory linked below. For example, to extract only annotations in a given region, you could use the following command: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg17/ncbiRefSeq/ncbiRefSeqOther.bb -chrom=chr16 -start=34990190 -end=36727467 stdout You can download a GTF format version of the RefSeq All table from the GTF downloads directory. The genePred format tracks can also be converted to GTF format using the genePredToGtf utility, available from the utilities directory on the UCSC downloads server. 
The utility can be run from the command line like so: genePredToGtf hg17 ncbiRefSeqPredicted ncbiRefSeqPredicted.gtf Note that using genePredToGtf in this manner accesses our public MySQL server, and you therefore must set up your hg.conf as described on the MySQL page linked near the beginning of the Data Access section. A file containing the RNA sequences in FASTA format for all items in the RefSeq All, RefSeq Curated, and RefSeq Predicted tracks can be found on our downloads server here. Please refer to our mailing list archives for questions. Previous versions of the ncbiRefSeq set of tracks can be found on our archive download server. Credits This track was produced at UCSC from data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 refGene UCSC RefSeq UCSC annotations of RefSeq RNAs (NM_* and NR_*) Genes and Gene Predictions Description The RefSeq Genes track shows known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Please visit the Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. 
Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods RefSeq RNAs were aligned against the human genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. 
PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 intronEst Spliced ESTs Human ESTs That Have Been Spliced mRNA and EST Description This track shows alignments between human expressed sequence tags (ESTs) in GenBank and the genome that show signs of splicing when aligned against the genome. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. To be considered spliced, an EST must show evidence of at least one canonical intron (i.e., the genomic sequence between EST alignment blocks must be at least 32 bases in length and have GT/AG ends). By requiring splicing, the level of contamination in the EST databases is drastically reduced at the expense of eliminating many genuine 3' ESTs. For a display of all ESTs (including unspliced), see the human EST track. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, darker shading indicates a larger number of aligned ESTs. The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. 
To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. 
Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, human ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence are displayed in this track. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 cpgIslandExtUnmasked Unmasked CpG CpG Islands on All Sequence (Islands < 300 Bases are Light Green) Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. 
Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated against the following criteria: GC content of 50% or greater; length greater than 200 bp; and a ratio greater than 0.6 of the observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment. The entire genome sequence, masked areas included, was used for the construction of the Unmasked CpG track. The CpG Islands track is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al.
(1987)):
    Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
where N = length of sequence. The calculation of the track data is performed by the following command sequence:
    twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
      | cpg_lh /dev/stdin 2> cpg_lh.err \
      | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
      | sort -k1,1 -k2,2n > cpgIsland.bed
The unmasked track data are constructed by substituting twoBitToFa -noMask output for the twoBitToFa command. Data access The CpG Islands track and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447 consIndelsHgMmCanFam Cons Indels MmCf Indel-based Conservation for human hg17, mouse mm5 and dog canFam1 Comparative Genomics Description This track displays regions showing evidence for conservation with respect to mutations involving sequence insertions and deletions (indels). These “indel-purified sequences” (IPSs) were obtained by comparing the predictions of a neutral model of indel evolution with data obtained from human (hg17), mouse (mm5) and dog (canFam1) alignments (Lunter et al., 2006). The evidence for conservation is statistical, and each region is annotated with a posterior probability.
It may be interpreted as the probability that the segment shows the paucity of indels by selection, rather than by random chance. Apart from the underlying alignment, these data are independent of the conservation of the nucleotide sequence itself. Any inferred conservation of the sequence, e.g. as shown by phastCons, is therefore independent evidence for selection. It may happen that sequence is conserved with respect to indel mutations without concomitant evidence of conservation of the nucleotide sequence. The opposite may also happen. Display Conventions The score (based on the false discovery rate, FDR) is reflected in the bluescale density gradient coloring the track items. Lighter colours reflect a higher FDR. Methods In the absence of selection, indels have a certain predicted distribution over the genome. The actual distribution shows an over-abundance of ungapped regions, due to selection purifying functional sequence from the deleterious effects of indels. Neutrally evolving sequence, such as (by and large) ancestral repeats, show an exceedingly good fit to the neutral predictions. This accurate fit allows the identification of a good proportion of conserved sequence at a relatively low false discovery rate (FDR). For example, at an FDR of 10%, the predicted sensitivity is 75%. Each identified indel-purified sequence (IPS) is annotated by two numbers: a false discovery rate (FDR), and a posterior probability (p). The FDR refers to a set of segments, not a given segment by itself. In this case, it refers to the minimum FDR of all sets that include the segment of interest. For example, a segment annotated with a 10% FDR also belongs to a set with a 15% FDR, but not a set with a 5% FDR. The posterior probability does refer to the single segment by itself. 
It has a frequentist interpretation, namely, as the proportion of regions, annotated with the same posterior probability, that have been under purifying selection, rather than showing the observed lack of indels by random chance. The data include segments for a false-discovery rate of either 1% or 10%. The score directly reflects the FDR through the following formula, with the FDR expressed as a fraction (for example, 0.10 for 10%): score = 90 / (FDR + 0.08). This maps FDR 1% (the most restrictive category) to 999, and FDR 10% to 500. For further details of the Methods, see Lunter et al., 2006. Verification The neutral indel model was calibrated using ancestral repeats, against which it showed an excellent fit. Among the identified IPSs at 10% FDR and predicted sensitivity of 75%, we found 75% of annotated protein-coding exons (weighted by length), and 75% of the 222 microRNAs that were annotated at the time. Ancestral repeats were heavily depleted among the identified segments. Credits These data were generated by Gerton Lunter and Chris Ponting, MRC Functional Genetics Unit, University of Oxford, United Kingdom and Jotun Hein, Department of Statistics, University of Oxford, United Kingdom. References Lunter G, Ponting CP, Hein J. Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comp Biol. 2006 Jan;2(1):e5. The data may also be browsed here. hapmapLd HapMap LD HapMap Linkage Disequilibrium - Phase II Variation and Repeats Description Linkage disequilibrium (LD) is the association of alleles on chromosomes. It measures the difference between the observed frequency of a two-locus allele combination and its expected frequency, which is the product of the two single allele frequencies. When LD is low, the two loci tend to be inherited in a nearly random manner. This track shows three different measures of linkage disequilibrium (D', r2, and LOD, the log odds) between pairs of SNPs as genotyped by the HapMap consortium.
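As a rough illustration of the observed-versus-expected definition above, the following Python sketch computes D, D' and r2 from the four haplotype frequencies of two biallelic SNPs, using the standard textbook definitions. The function name ld_measures and its argument names are hypothetical. The sketch assumes the haplotype frequencies are already known; it does not perform the EM estimation that Haploview uses, and it does not compute LOD.

```python
def ld_measures(f11, f12, f21, f22):
    """Return (D, D', r2) for two biallelic loci.

    f11..f22 are haplotype frequencies or counts: the first index is the
    allele at locus A (1 or 2), the second index the allele at locus B.
    """
    total = f11 + f12 + f21 + f22
    if total <= 0:
        raise ValueError("haplotype frequencies must sum to a positive value")
    # Normalise so that counts and frequencies are both accepted.
    f11, f12, f21, f22 = (f / total for f in (f11, f12, f21, f22))

    p_a = f11 + f12  # frequency of allele 1 at locus A
    p_b = f11 + f21  # frequency of allele 1 at locus B
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("LD is undefined when either locus is monomorphic")

    # D: observed haplotype frequency minus the product of allele frequencies.
    d = f11 - p_a * p_b

    # D': |D| scaled by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max

    # r2: squared correlation coefficient between the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Only three of the four possible haplotypes are present: D' is 1, r2 is below 1.
d, d_prime, r2 = ld_measures(0.6, 0.0, 0.2, 0.2)
```

The example shows why the two measures can disagree: when one of the four haplotypes is absent, D' reaches 1 even though the two SNPs are not perfect proxies for each other (r2 is about 0.375 here).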
LD is useful for understanding the associations between genetic variants throughout the genome, and can be helpful in selecting SNPs for genotyping. By default, the display in full mode shows LOD values. Each diagonal represents a different SNP with each diamond representing a pairwise comparison between two SNPs. Shades are used to indicate linkage disequilibrium between the pair of SNPs, with darker shades indicating stronger LD. For the LOD values, additional colors are used in some cases: White diamonds indicate pairwise D' values less than 1 with no statistically significant evidence of LD (LOD < 2). Light blue diamonds indicate high D' values (>0.99) with low statistical significance (LOD < 2). Light pink diamonds are drawn when the statistical significance is high (LOD >= 2) but the D' value is low (less than 0.5). Methods Genotypes from HapMap Phase II release 20 were used with Haploview to infer phasing and calculate LD values for all SNP pairs within 250 kb. As the children in the trios are not independent samples, Haploview uses only the parents from those populations. The YRI and CEU tracks each use 60 unrelated individuals (parents from the trios), and the combined JPT+CHB track uses 90 unrelated individuals. Haploview uses a two marker EM (ignoring missing data) to estimate the maximum-likelihood values of the four gamete frequencies, from which the D', LOD, and r2 calculations derive. Haplotype phase is inferred using a standard EM algorithm with a partition-ligation approach for blocks with greater than 10 markers. Display Conventions and Configuration Display Mode Full mode shows the pairwise LD values in a Haploview-style mountain plot. Dense mode shows the pairwise LD values in a single line for each population, where the intensity at each position is the average of all of the LD values between the SNP at that position and all other SNPs within 250 kb. 
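The dense-mode averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the browser's drawing code: dense_intensity is a hypothetical name, SNP positions are given as a list of coordinates, and pairwise LD values are supplied in a dictionary keyed by index pairs (i, j) with i < j.

```python
def dense_intensity(positions, ld, max_dist=250_000):
    """Average, for each SNP, its LD values with all other SNPs within max_dist.

    positions: list of SNP coordinates on one chromosome.
    ld: dict mapping (i, j) index pairs (i < j) to a pairwise LD value.
    Returns one value per SNP, or None where no partner lies within range.
    """
    sums = [0.0] * len(positions)
    counts = [0] * len(positions)
    for (i, j), value in ld.items():
        # Only pairs within the distance limit contribute to either SNP.
        if abs(positions[i] - positions[j]) <= max_dist:
            sums[i] += value
            counts[i] += 1
            sums[j] += value
            counts[j] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Three nearby SNPs: every pair contributes to both of its members.
near = dense_intensity([0, 1000, 2000], {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.0})

# The third SNP is more than 250 kb from the others, so it has no value.
far = dense_intensity([100, 200, 400_000], {(0, 1): 0.8, (0, 2): 0.2, (1, 2): 0.4})
```

Returning None for SNPs without any partner in range is a choice made for this sketch; how the browser treats such positions is not stated in the description.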
LD Values: measures of linkage disequilibrium r2 displays the raw r2 value, or the square of the correlation coefficient for a given marker pair. SNPs that have not been separated by recombination have r2 = 1; in this case, these two markers are said to be redundant for genotyping, but may have different functional effects. Lower r2 values show a lower degree of LD, indicating that some recombination has occurred in this population. See Hill and Robertson (1966) for details. D' displays the raw D' value, which is the normalized covariance for a given marker pair. A D' value of 1 (complete LD) indicates that two SNPs have not been separated by recombination, while lower values indicate evidence of recombination in the history of the sample. Only D' values near 1 are a reliable measure of LD; lower values are difficult to interpret as the magnitude of D' depends strongly on sample size. See Lewontin (1988) for more details. LOD displays the log odds score for linkage disequilibrium between a given marker pair, and is shown by default. Track Geometry Trim to triangle shows the standard mountain plot (default); turning this option off will show LD values with SNPs outside the window. Inverting makes it easier to visually compare two adjacent populations. Colors LD Values can be drawn in a variety of colors, with red as default. The intensity of the color is proportional to the strength of the LD measure chosen above. Outlines can be drawn in contrasting colors or turned off. Outlines are automatically suppressed when the window is larger than 100,000 bp. Population Selection The HapMap populations can be individually displayed or hidden. 
YRI: Yoruba people in Ibadan, Nigeria (30 parent-and-adult-child trios) CEU: European samples from the Centre d'Etude du Polymorphisme Humain (CEPH) (30 trios) JPT+CHB: Combination of Japanese in Tokyo (45 unrelated individuals) and Han Chinese in Beijing (45 unrelated individuals) Credits This track was created by Daryl Thomas at UCSC using data from the International HapMap Project, following the display style from Haploview. References HapMap Project International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007 Oct 18;449(7164):851-61. International HapMap Consortium. A haplotype map of the human genome. Nature. 2005 Oct 27;437(7063):1299-320. The International HapMap Consortium. The International HapMap Project. Nature. 2003 Dec 18;426(6968):789-96. HapMap Data Coordination Center Thorisson GA, Smith AV, Krishnan L, Stein LD. The International HapMap Project Web site. Genome Res. 2005 Nov;15(11):1592-3. Haploview Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005 Jan 15;21(2):263-5. Epub 2004 Aug 5. General references on Linkage Disequilibrium Lewontin, RC. On measures of gametic disequilibrium. Genetics. 1988 Nov;120(3):849-52. Hill WG, Robertson A. The effect of linkage on limits to artificial selection. Genet Res. 1966 Dec;8(3):269-94. hapmapLdChbJpt LD JPT+CHB Linkage Disequilibrium for the Han Chinese and Japanese from Tokyo (JPT+CHB) Variation and Repeats hapmapLdCeu LD CEU Linkage Disequilibrium for the CEPH (CEU) Variation and Repeats hapmapLdYri LD YRI Linkage Disequilibrium for the Yoruba (YRI) Variation and Repeats hapmapSnps HapMap SNPs HapMap SNPs Variation and Repeats Description The HapMap Project identified a set of approximately four million common SNPs, and genotyped these SNPs in four populations. The intent is that this data can be used as a reference for future studies of human disease. 
This track displays the genotype counts and allele frequencies of those SNPs. The data displayed are from release 21a (HapMap Phase II), based on dbSNP build 125. This track also provides orthologous alleles from chimp and macaque. The HapMap data are from these four human populations: Yoruba in Ibadan, Nigeria (YRI) Japanese in Tokyo, Japan (JPT) Han Chinese in Beijing, China (CHB) CEPH (Utah residents with ancestry from northern and western Europe) (CEU) Each of the populations is displayed in a separate subtrack. The CEU and YRI data are comprised of 90 individuals in parent-child trios. The UCSC display removes the data for the children, leaving 60 individuals in each population. The CHB and JPT data are comprised of 45 individuals. Over 12% of HapMap SNPs are available for only a subset (1-3) of the populations. When available, the CHB and JPT SNPs were assayed in a minimum of 18 individuals, with over 97% of SNPs assayed in 45 or more individuals. The minimums for CEU and YRI are 26 and 24 respectively, with over 94% of SNPs assayed in 55 or more individuals. The HapMap assays provide biallelic results. Over 99.9% of HapMap SNPs are included in dbSNP build125 as biallelic; approximately 3,000 are more complex. Two-thirds of the HapMap SNPs are transitions: one-third are A/G, one-third are C/T. The orthologous alleles in chimp (panTro2) and macaque (rheMac2) were derived using liftOver. Chimp alleles are available for over 96% of the human HapMap SNPs; macaque alleles are available for 88%. 15% of HapMap SNPs are monomorphic in all individuals in all populations. Within single populations, 21.5% of the SNPs are monomorphic in YRI and 38% of the SNPs are monomorphic in JPT individuals. Approximately 20% of HapMap SNPs have a different major allele in different populations. No two HapMap SNPs occupy the same position. Aside from seven SNPs from the pseudo autosomal region of chrX, no rsIds are included more than once. 
No HapMap SNPs occur on chrM or on "random" chromosomes. Display Conventions and Configuration Note: calculation of heterozygosity has changed since this version of the track. In this track, expected heterozygosity is calculated as follows: the allele counts from all populations are summed (not normalized for population size) and used to determine overall major and minor allele frequencies. Assuming Hardy-Weinberg equilibrium, overall expected heterozygosity is calculated as two times the product of major and minor allele frequencies (see Modern Genetic Analysis, section 17-2). [In the HapMap SNPs track in the Mar. 2006 (hg18) assembly, observed heterozygosity is calculated as follows: each population's heterozygosity is computed as the proportion of heterozygous individuals in the population. The population heterozygosities are averaged to determine the overall observed heterozygosity.] The human SNPs are displayed in gray using a color gradient based on minor allele frequency. The higher the minor allele frequency, the darker the display. By definition, the maximum minor allele frequency is 50%. When zoomed to base level, the major allele is displayed for each population. Reversing the base position track will cause the HapMap display to reverse as well. This is the recommended configuration for SNPs on the negative strand. The orthologous alleles from chimp and macaque are displayed in brown using a color gradient based on quality score. Quality scores range from 0 to 100 representing low to high quality. For orthologous alleles, the higher the quality, the darker the display. Quality scores are not available for chimp chromosomes chr21 and chrY; these were set to 98, consistent with the panTro2 browser quality track. Filters are provided for the data attributes described above. Additionally, a filter is provided for heterozygosity over all populations. The measure of heterozygosity used is 2pq (from Hardy-Weinberg equilibrium).
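The two heterozygosity calculations described above can be sketched in Python. The function names and input layouts are hypothetical and chosen only for illustration: the first function takes per-population allele counts, the second takes per-population genotype counts.

```python
def expected_heterozygosity(allele_counts):
    """Expected heterozygosity (2pq) from allele counts summed over populations.

    allele_counts: dict mapping population name to (count_allele_1, count_allele_2).
    Counts are summed as-is, without normalising for population size.
    """
    total_1 = sum(a for a, _ in allele_counts.values())
    total_2 = sum(b for _, b in allele_counts.values())
    total = total_1 + total_2
    if total == 0:
        raise ValueError("no allele counts supplied")
    p = max(total_1, total_2) / total  # overall major allele frequency
    q = min(total_1, total_2) / total  # overall minor allele frequency
    return 2 * p * q


def observed_heterozygosity(genotype_counts):
    """Observed heterozygosity as described for the hg18 track.

    genotype_counts: dict mapping population name to
    (homozygous_1, heterozygous, homozygous_2) individual counts.
    Each population's heterozygous proportion is computed, then averaged.
    """
    proportions = [
        het / (hom_1 + het + hom_2)
        for hom_1, het, hom_2 in genotype_counts.values()
    ]
    return sum(proportions) / len(proportions)


# Summed allele counts: 150 vs 90 out of 240, so p = 0.625 and q = 0.375.
h_exp = expected_heterozygosity({"CEU": (90, 30), "YRI": (60, 60)})

# Heterozygous proportions of 20/60 and 10/60, averaged.
h_obs = observed_heterozygosity({"CEU": (30, 20, 10), "YRI": (40, 10, 10)})
```

Because 2pq is largest when p = q = 0.5, the expected heterozygosity computed this way never exceeds 0.5, and a SNP that is monomorphic in every population gets a value of 0.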
Filters are applied to all six subtracks. This is true, even if a subtrack is not displayed. Notes on orthologous allele filters: If the major allele is different between populations, no overall major allele for human is determined, thus the "matching" filters for orthologous alleles do not apply to these SNPs. If a SNP is monomorphic in all populations, the minor allele is not verified in the HapMap dataset. In these cases, the filter to match orthologous alleles to the minor human allele will yield no results. Credits This track is based on International HapMap Project release 21a data, provided by the HapMap Data Coordination Center. References HapMap Project The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007 Oct 18;449(7164):851-61. The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005 Oct 27;437(7063):1299-320. The International HapMap Consortium. The International HapMap Project. Nature. 2003 Dec 18;426(6968):789-96. HapMap Data Coordination Center Thorisson GA, Smith AV, Krishnan L, Stein LD. The International HapMap Project Web site. Genome Res. 2005 Nov;15(11):1592-3. A Sampling of HapMap Literature Gibson J, Morton NE, Collins A. Extended tracts of homozygosity in outbred human populations. Hum Mol Genet. 2006 Mar 1; 15(5):789-95. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W et al. Global variation in copy number in the human genome. Nature. 2006 Nov 23;444(7118):444-454. Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG. Common genetic variants account for differences in gene expression among ethnic groups. Nature Genet. 2007 Feb;39(2):226-31. Tenesa A, Navarro P, Hayes BJ, Duffy DL, Clarke GM, Goddard ME, Visscher PM. Recent human effective population size estimated from linkage disequilibrium. Genome Res. 2007 Apr;17(4):520-6. Voight BF, Kudaravalli S, Wen X, Pritchard JK. 
A Map of Recent Positive Selection in the Human Genome. PLoS Biol. 2006 Mar;4(3):e72. Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG. Measures of human population structure show heterogeneity among genomic regions. Genome Res. 2005 Nov;15(11):1468-76. Data Source The source for this track is the set of genotypes_chr*_*_r21a_nr.txt.gz files from http://www.hapmap.org/downloads/genotypes/2007-01/rs_strand/non-redundant. hapmapAllelesMacaque Macaque Alleles Orthologous Alleles from Macaque (rheMac2) Variation and Repeats hapmapAllelesChimp Chimp Alleles Orthologous Alleles from Chimp (panTro2) Variation and Repeats hapmapSnpsYRI SNPs YRI SNPs from the YRI Population Variation and Repeats hapmapSnpsJPT SNPs JPT SNPs from the JPT Population Variation and Repeats hapmapSnpsCHB SNPs CHB SNPs from the CHB Population Variation and Repeats hapmapSnpsCEU SNPs CEU SNPs from the CEU Population Variation and Repeats kiddEichlerDisc HGSV Discordant HGSV Discordant Clone End Alignments Variation and Repeats Description This track shows data from the Human Genome Structural Variation Project. Clone ends from nine individuals from Kidd, et al. were mapped to the reference human genome. This track shows clones whose end mappings were discordant with the reference genome in one of the following ways: deletion: Clone mapping too large relative to reference insertion: Clone mapping too small relative to reference inversion: In appropriate orientation, clone mapping spans potential inversion breakpoint OEA: One End Anchored clones (only one end could be mapped to reference) transchrm: Clone ends map to different chromosomes (name indicates identity of other chromosome after the underscore). Each individual's discordant clone end mappings are in a different subtrack. 
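The discordance categories listed above can be sketched as a small classifier in Python. This is only an illustration: the size thresholds (`min_span`, `max_span`), the expected '+' then '-' end orientation, the "concordant" and "unmapped" labels, and the exact form of the returned names are assumptions for this example, not values taken from the track or from Kidd, et al.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndMapping:
    chrom: str
    pos: int
    strand: str  # '+' or '-'


def classify_clone(left: Optional[EndMapping],
                   right: Optional[EndMapping],
                   min_span: int = 32_000,   # hypothetical lower size bound
                   max_span: int = 48_000):  # hypothetical upper size bound
    """Assign one clone to a discordance category from its two end mappings."""
    if left is None and right is None:
        return "unmapped"
    if left is None or right is None:
        return "OEA"  # only one end could be placed on the reference
    if left.chrom != right.chrom:
        # the other chromosome follows the underscore (format assumed here)
        return f"transchrm_{right.chrom}"
    first, second = sorted((left, right), key=lambda e: e.pos)
    # Assumption: a concordant pair maps '+' then '-' along the reference;
    # any other orientation is treated as a potential inversion.
    if not (first.strand == "+" and second.strand == "-"):
        return "inversion"
    span = second.pos - first.pos
    if span > max_span:
        return "deletion"   # mapping too large relative to reference
    if span < min_span:
        return "insertion"  # mapping too small relative to reference
    return "concordant"
```

A clone whose ends map farther apart on the reference than the insert size allows suggests that the individual lacks sequence present in the reference, hence "deletion"; the reverse case suggests an insertion.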
The nine individuals' labels used in Kidd, et al., populations of origin, and Coriell Cell Repository catalog IDs are shown here:
Individual  Population  Coriell ID
ABC14  CEPH  NA12156
ABC13  Yoruba  NA19129
ABC12  CEPH  NA12878
ABC11  China  NA18555
ABC10  Yoruba  NA19240
ABC9  Japan  NA18956
ABC8  Yoruba  NA18507
ABC7  Yoruba  NA18517
G248  Unknown  NA15510
Methods Excerpted from Kidd, et al.: We selected eight individuals as part of the first phase of the Human Genome Structural Variation Project. This included four individuals of Yoruba Nigerian ethnicity and four individuals of non-African ethnicity. For each individual we constructed a whole genomic library of about 1 million clones, using a fosmid subcloning strategy. Each library was arrayed and both ends of each clone insert were sequenced to generate a pair of high-quality end sequences (termed an end-sequence pair (ESP)). The overall approach generated a physical clone map for each individual human genome, flagging regions discrepant by size or orientation on the basis of the placement of end sequences against the reference assembly. Across all eight libraries, we mapped 6.1 million clones to distinct locations against the reference sequence (http://hgsv.washington.edu). Of these, 76,767 were discordant by length and/or orientation, indicating potential sites of structural variation. About 0.4% (23,742) of the ESPs mapped with only one end to the reference assembly despite the presence of high-quality sequence at the other end (termed one-end anchored (OEA) clones). Note: This track contains many more than the 76,767 + 23,742 items mentioned above because it also includes clones whose ends map to different chromosomes (transchrm). References Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008 May 1;453(7191):56-64. 
kiddEichlerDiscG248 Discordant G248 HGSV Individual G248 Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc7 Discordant ABC7 HGSV Individual ABC7 (Yoruba) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc8 Discordant ABC8 HGSV Individual ABC8 (Yoruba) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc9 Discordant ABC9 HGSV Individual ABC9 (Japan) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc10 Discordant ABC10 HGSV Individual ABC10 (Yoruba) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc11 Discordant ABC11 HGSV Individual ABC11 (China) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc12 Discordant ABC12 HGSV Individual ABC12 (CEPH) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc13 Discordant ABC13 HGSV Individual ABC13 (Yoruba) Discordant Clone End Alignments Variation and Repeats kiddEichlerDiscAbc14 Discordant ABC14 HGSV Individual ABC14 (CEPH) Discordant Clone End Alignments Variation and Repeats kiddEichlerValid HGSV Validated HGSV Validated Sites of Structural Variation Variation and Repeats Description Data from the Human Genome Structural Variation Project. This track shows validated regions of structural variation in nine individuals from Kidd, et al. Deletions, insertions and inversions are included. For inversions, sites corresponding to both breakpoints may be depicted. Clones corresponding to only a single breakpoint were selected to validate the site. Coordinates correspond to the variant region predicted by end-sequence pairs (ESPs), not to sequence-derived breakpoints. Each site was validated by at least one of these methods: Agi: Agilent CGH FISH: Inversion FISH assay MCD: Clone fingerprint NIL: Overlap with "novel" insertion locus Nim: NimbleGen CGH Seq: Clone sequencing Each individual's validated sites are in a different subtrack. 
The nine individuals' labels used in Kidd, et al., populations of origin, and Coriell Cell Repository catalog IDs are shown here:
Individual  Population  Coriell ID
ABC14  CEPH  NA12156
ABC13  Yoruba  NA19129
ABC12  CEPH  NA12878
ABC11  China  NA18555
ABC10  Yoruba  NA19240
ABC9  Japan  NA18956
ABC8  Yoruba  NA18507
ABC7  Yoruba  NA18517
G248  Unknown  NA15510
Methods Excerpted from Kidd, et al.: We selected eight individuals as part of the first phase of the Human Genome Structural Variation Project. This included four individuals of Yoruba Nigerian ethnicity and four individuals of non-African ethnicity. For each individual we constructed a whole genomic library of about 1 million clones by using a fosmid subcloning strategy. Each library was arrayed and both ends of each clone insert were sequenced to generate a pair of high-quality end sequences (termed an end-sequence pair (ESP)). The overall approach generated a physical clone map for each individual human genome, flagging regions discrepant by size or orientation on the basis of the placement of end sequences against the reference assembly. Across all eight libraries, we mapped 6.1 million clones to distinct locations against the reference sequence (http://hgsv.washington.edu). Of these, 76,767 were discordant by length and/or orientation, indicating potential sites of structural variation. About 0.4% (23,742) of the ESPs mapped with only one end to the reference assembly despite the presence of high-quality sequence at the other end (termed one-end anchored (OEA) clones). Fosmid clones discordant by size (n = 3,371 fosmid clones) were subjected to fingerprint analysis using four multiple complete restriction enzyme digests (MCD analysis) to confirm insert size and eliminate rearranged clones. Two high-density customized oligonucleotide microarrays (Agilent and NimbleGen) were designed to confirm sites of deletion and insertion (GEO accessions GSE10008 and GSE10037). 
We developed a new, expectation maximization-based clustering approach to genotype deletions with the use of data from the Illumina Human1M BeadChip collected for 125 HapMap DNA samples. We found that more than 98% of the children's genotypes were consistent with mendelian transmission on the basis of an analysis of 28 parent-child trios. References Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008 May 1;453(7191):56-64. kiddEichlerValidG248 Validated G248 HGSV Individual G248 Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc7 Validated ABC7 HGSV Individual ABC7 (Yoruba) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc8 Validated ABC8 HGSV Individual ABC8 (Yoruba) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc9 Validated ABC9 HGSV Individual ABC9 (Japan) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc10 Validated ABC10 HGSV Individual ABC10 (Yoruba) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc11 Validated ABC11 HGSV Individual ABC11 (China) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc12 Validated ABC12 HGSV Individual ABC12 (CEPH) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc13 Validated ABC13 HGSV Individual ABC13 (Yoruba) Validated Sites of Structural Variation Variation and Repeats kiddEichlerValidAbc14 Validated ABC14 HGSV Individual ABC14 (CEPH) Validated Sites of Structural Variation Variation and Repeats microsat Microsatellite Microsatellites - Di-nucleotide and Tri-nucleotide Repeats Variation and Repeats Description This track displays regions that are likely to be useful as microsatellite markers. 
These are sequences of at least 15 perfect di-nucleotide and tri-nucleotide repeats and tend to be highly polymorphic in the population. Methods The data shown in this track are a subset of the Simple Repeats track, selecting only those repeats of period 2 and 3, with 100% identity and no indels and with at least 15 copies of the repeat. The Simple Repeats track is created using the Tandem Repeats Finder. For more information about this program, see Benson (1999). Credits Tandem Repeats Finder was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 retroposons RIPs Retrotransposon Insertion Polymorphisms in Humans Variation and Repeats Description Retrotransposons constitute over 40% of the human genome and consist of several millions of family members. They play important roles in shaping the structure and evolution of the genome and in participating in gene functioning and regulation. Because L1, Alu, and SVA retrotransposons are currently active in the human genome, their recent and ongoing retrotranspositional insertions generate a unique and important class of genetic polymorphisms (for the presence or absence of an insertion) among and within human populations. As such, they are useful genetic markers in population genetics studies due to their identical-by-descent and essentially homoplasy-free nature. Additionally, some polymorphic insertions are known to be responsible for a variety of human genetic diseases. dbRIP, a database of human retrotransposon insertion polymorphisms (RIPs), contains all currently known Alu, L1, and SVA polymorphic insertion loci in the human genome. Items shown in blue are found on the human reference assembly; those displayed in red are not found in the human reference assembly. Methods The dbRIP annotation process is described in Wang, J. et al. (2006) in the References section below. 
Credits Thanks to the Liang Lab at Roswell Park Cancer Institute for providing these data. These data originated from the dbRIP database. References Wang J, Song L, Grover D, Azrak S, Batzer MA, Liang P. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum Mutat. 2006 Apr;27(4):323-9. dbRIPSVA Polymorphic SVA SVA Retrotransposon Insertion Polymorphisms in Humans Variation and Repeats dbRIPL1 Polymorphic L1 L1 Retrotransposon Insertion Polymorphisms in Humans Variation and Repeats dbRIPAlu Polymorphic Alu Alu Retrotransposon Insertion Polymorphisms in Humans Variation and Repeats genomicSuperDups Segmental Dups Duplications of >1000 Bases of Non-RepeatMasked Sequence Variation and Repeats Description This track shows regions detected as putative genomic duplications within the golden path. The following display conventions are used to distinguish levels of similarity: Light to dark gray: 90 - 98% similarity Light to dark yellow: 98 - 99% similarity Light to dark orange: greater than 99% similarity Red: duplications of greater than 98% similarity that lack sufficient Segmental Duplication Database evidence (most likely missed overlaps) For a region to be included in the track, at least 1 Kb of the total sequence (containing at least 500 bp of non-RepeatMasked sequence) had to align and a sequence identity of at least 90% was required. Methods Segmental duplications play an important role in both genomic disease and gene evolution. This track displays an analysis of the global organization of these long-range segments of identity in genomic sequence. Large recent duplications (>= 1 kb and >= 90% identity) were detected by identifying high-copy repeats, removing these repeats from the genomic sequence ("fuguization") and searching all sequence for similarity. The repeats were then reinserted into the pairwise alignments, the ends of alignments trimmed, and global alignments were generated. 
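The inclusion criteria and the similarity color key given in the Description can be expressed as a short Python sketch. The function names are hypothetical, identity is assumed to be a fraction between 0 and 1, and the handling of values that fall exactly on 98% or 99% is an assumption, because the text does not say which band owns the boundaries.

```python
def passes_inclusion(aligned_bp: int, non_masked_bp: int, identity: float) -> bool:
    """Apply the stated thresholds: >= 1 kb aligned in total, >= 500 bp of
    non-RepeatMasked sequence, and >= 90% sequence identity."""
    return aligned_bp >= 1000 and non_masked_bp >= 500 and identity >= 0.90


def similarity_color(identity: float, has_sdd_evidence: bool = True) -> str:
    """Map a duplication's similarity to the color family used for display.

    has_sdd_evidence is a hypothetical flag standing in for "sufficient
    Segmental Duplication Database evidence".
    """
    if identity < 0.90:
        raise ValueError("below the 90% identity required for inclusion")
    if identity > 0.98 and not has_sdd_evidence:
        return "red"
    if identity > 0.99:
        return "orange"
    if identity >= 0.98:  # boundary assignment is an assumption
        return "yellow"
    return "gray"
```

Within each family the track shades from light to dark; this sketch returns only the family name.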
For a full description of the "fuguization" detection method, see Bailey et al., 2001. This method has become known as WGAC (whole-genome assembly comparison); for example, see Bailey et al., 2002. Credits These data were provided by Ginger Cheng, Xinwei She, Archana Raja, Tin Louie and Evan Eichler at the University of Washington. References Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE. Recent segmental duplications in the human genome. Science. 2002 Aug 9;297(5583):1003-7. PMID: 12169732 Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001 Jun;11(6):1005-17. PMID: 11381028; PMC: PMC311093 simpleRepeat Simple Repeats Simple Tandem Repeats by TRF Variation and Repeats Description This track displays simple tandem repeats (possibly imperfect repeats) located by Tandem Repeats Finder (TRF) which is specialized for this purpose. These repeats can occur within coding regions of genes and may be quite polymorphic. Repeat expansions are sometimes associated with specific diseases. Methods For more information about the TRF program, see Benson (1999). Credits TRF was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 encodeGencodeGeneMar07 Gencode Genes Mar07 Gencode Gene Annotations (March 2007) ENCODE Regions and Genes Description The Gencode Genes track (v3.1, March 2007) shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project. The gene annotations are colored based on the HAVANA annotation type. See the table below for the color key, as well as more detail about the transcript and feature types. 
The Gencode project recommends that the annotations with known and validated transcripts; i.e., the types Known and Novel_CDS (which are colored dark green in the track display) be used as the reference gene annotation. The v3.1 release includes the following updates and enhancements to v2.2 (Oct. 2005): Apart from the usual additions and corrections, 69 loci (consisting of 132 transcripts) were re-annotated based on Rapid Amplification of cDNA Ends (RACE), array, and sequencing analyses performed within the Affymetrix/GENCODE collaboration (see the Methods section below, also Denoeud et al., 2007 and The ENCODE Project Consortium, 2007). The polymorphic gene type was added. PolyA features were added. A bug affecting frames of CDSs with missing start or stop codons was fixed. The experimental validation data contained in the Gencode Introns track from the previous release were integrated into the Gencode Genes track by annotators using the Human and Vertebrate Analysis and Annotation manual curation process (HAVANA). Type Color Description Known dark green Known protein-coding genes (i.e., referenced in Entrez Gene) Novel_CDS dark green Have an open reading frame (ORF) and are identical, or have homology, to cDNAs or proteins but do not fall into the above category. These can be known in the sense that they are represented by mRNA sequences in the public databases, but they are not yet represented in Entrez Gene or have not received an official gene name. They can also be novel in that they are not yet represented by an mRNA sequence in human. Novel_transcript light green Similar to Novel_CDS; however, cannot be assigned an unambiguous ORF. Putative light green Are identical, or have homology, to spliced ESTs, but are devoid of significant ORF and polyA features. These are generally short (two or three exon) genes or gene fragments. TEC light green (To Experimentally Confirm) Single-exon objects (supported by multiple unspliced ESTs with polyA sites and signals). 
Polymorphic purple Have functional transcripts in one haplotype and "pseudo" (non-functional) transcripts in another. Processed_pseudogene blue Pseudogenes that lack introns and are thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome. Unprocessed_pseudogene blue Pseudogenes that can contain introns, as they are produced by gene duplication. Artifact grey Transcript evidence and/or its translation equivocal. Usually these arise from high-throughput cDNA sequencing projects that submit automatic annotation, sometimes resulting in erroneous CDSs in what turns out to be, for example, 3' UTRs. In addition HAVANA has extended this category to include cDNAs with non-canonical splice sites due to deletion/sequencing errors. PolyA_signal brown Polyadenylation signal PolyA_site orange Polyadenylation site Pseudo_polyA pink "Pseudo"-polyadenylation signal detected in the sequence of a processed pseudogene. Warning: Pseudo_polyA features and processed_pseudogenes generally don't overlap. The reason is that pseudogene annotations are based solely on protein evidence, whereas pseudo_polyA signals are identified from transcript evidence; as they are found at the end of the 3' UTR, they can lie several kb downstream of the 3' end of the pseudogene. The current full set of GENCODE annotations is available for download here. Methods For a detailed description of the methods and references used, see Harrow et al., 2006 and Denoeud et al., 2007. 5' RACE/array experiments A combination of 5’ RACE and high-density tiling microarrays were used to empirically annotate 5’ transcription start sites (TSSs) and internal exons of all 410 annotated protein-coding loci across the 44 ENCODE regions (Oct. 2005 GENCODE freeze). The 5’ RACE reactions were performed with oligonucleotides mapping to a coding exon common to most of the transcripts of a protein-coding gene locus annotated by GENCODE (Oct. 
2005 freeze) on polyA+ RNA from twelve adult human tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta) and three cell lines (GM06990 (lymphoblastoid), HL60 (acute promyelocytic leukemia) and HeLaS3 (cervix carcinoma)). The RACE reactions were then hybridized to 20 nucleotide-resolution Affymetrix tiling arrays covering the non-repeated regions of the 44 ENCODE regions. The resulting "RACEfrags" -- array-detected fragments of RACE products -- were assessed for novelty by comparing their genome coordinates to those of GENCODE-annotated exons. Connectivity between novel RACEfrags and their respective index exon were further investigated by RT-PCR, cloning and sequencing. The resulting cDNA sequences (deposited in GenBank under accession numbers DQ655905-DQ656069 and EF070113-EF070122) were then fed into the HAVANA annotation pipeline as mRNA evidence (see "HAVANA manual annotations" below). HAVANA manual annotations The HAVANA process was used to produce these annotations. Before the manual annotation process begins, an automated analysis pipeline for similarity searches and ab initio predictions is run on a computer farm and stored in an Ensembl MySQL database using a modified Ensembl analysis pipeline system. All searches and prediction algorithms, except CpG island prediction (see cpgreport in the EMBOSS application suite), are run on repeat-masked sequence. RepeatMasker is used to mask interspersed repeats, followed by Tandem repeats finder to mask tandem repeats. Nucleotide sequence databases are searched with wuBLASTN, and significant hits are re-aligned to the unmasked genomic sequence using est2genome. The UniProt protein database is searched with wuBLASTX, and the accession numbers of significant hits are found in the Pfam database. The hidden Markov models for Pfam protein domains are aligned against the genomic sequence using Genewise to provide annotation of protein domains. 
Several ab initio prediction algorithms are also run: Genscan and Fgenesh for genes, tRNAscan to find tRNA genes and Eponine TSS to predict transcription start sites. Once the automated analysis is complete, the annotator uses a Perl/Tk based graphical interface, "otterlace", developed in-house at the Wellcome Trust Sanger Institute to edit annotation data held in a separate MySQL database system. The interface displays a rich, interactive graphical view of the genomic region, showing features such as database matches, gene predictions, and transcripts created by the annotators. Gapped alignments of nucleotide and protein blast hits to the genomic sequence are viewed and explored using the "Blixem" alignment viewer. Additionally, the "Dotter" dot plot tool is used to show the pairwise alignments of unmasked sequence, thus revealing the location of exons that are occasionally missed by the automated blast searches because of their small size and/or match to repeat-masked sequence. The interface provides a number of tools that the annotator uses to build genes and edit annotations: adding transcripts, exon coordinates, translation regions, gene names and descriptions, remarks and polyadenylation signals and sites. Verification See Harrow et al., 2006 for information on verification techniques. Credits This GENCODE release is the result of a collaborative effort among the following laboratories: Lab/Institution Contributors HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK Adam Frankish, Jonathan Mudge, James Gilbert, Tim Hubbard, Jennifer Harrow Genome Bioinformatics Lab CRG, Barcelona, Spain France Denoeud, Julien Lagarde, Sylvain Foissac, Robert Castelo, Roderic Guigó (GENCODE Principal Investigator) Department of Genetic Medicine and Development, University of Geneva, Switzerland Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. 
Antonarakis Center for Integrative Genomics, University of Lausanne, Switzerland Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond Affymetrix, Inc., Santa Clara, CA, USA Philipp Kapranov, Thomas R. Gingeras References Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J et al. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007 Jun;17(6):746-59. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9. ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816. encodeGencodeSuper Gencode Genes Gencode Gene Annotation ENCODE Regions and Genes Overview This super-track combines related tracks from the GENCODE project. The goal of this project is to identify all protein-coding genes in the ENCODE regions using a pipeline that uses computational predictions, experimental verification, and manual annotation, based on the Sanger Havana process. Gencode Genes Mar07 This track shows gene annotations from the GENCODE release v3.1 (March 2007). These annotations contain updates and corrections to the GENCODE October 2005 annotations, based on validation data from 5' RACE and RT-PCR experiments, which are displayed in the Gencode RACEfrags and Gencode Introns Oct05 tracks. Gencode RACEfrags This track shows the products of 5' RACE reactions performed on GENCODE genes in 12 tissues and 3 cell lines, as assayed on Affymetrix ENCODE 20nt tiling arrays. 
The results were used to annotate 5' transcription start sites and internal exons of all annotated protein-coding loci in the Oct. 2005 GENCODE freeze. Gencode Genes Oct05 This track shows gene annotations from the GENCODE release v2.2 (Oct 2005), which was released as part of the ENCODE October 2005 data freeze. Gencode Introns Oct05 This track shows validation status of the introns in selected gene models from the Gencode Oct 05 gene annotations, as identified by RT-PCR and RACE experiments in 24 human tissues. Credits This GENCODE release is the result of a collaborative effort among the following laboratories: Lab/Institution Contributors HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK Adam Frankish, Jonathan Mudge, James Gilbert, Tim Hubbard, Jennifer Harrow Genome Bioinformatics Lab CRG, Barcelona, Spain France Denoeud, Julien Lagarde, Sylvain Foissac, Robert Castelo, Roderic Guigó (GENCODE Principal Investigator) Department of Genetic Medicine and Development, University of Geneva, Switzerland Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. Antonarakis Center for Integrative Genomics, University of Lausanne, Switzerland Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond Affymetrix, Inc., Santa Clara, CA, USA Philipp Kapranov, Thomas R. Gingeras The RACEfrags result from a collaborative effort among the following laboratories: Lab/Institution Contributors Genome Bioinformatics Lab CRG, Barcelona, Spain France Denoeud, Julien Lagarde, Tyler Alioto, Sylvain Foissac, Robert Castelo, Roderic Guigó Department of Genetic Medicine and Development, University of Geneva, Switzerland Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. Antonarakis Center for Integrative Genomics, University of Lausanne, Switzerland Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond Affymetrix, Inc., Santa Clara, CA, USA Philipp Kapranov, Jorg Drenkow, Sujit Dike, Jill Cheng, Thomas R. 
Gingeras HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK Adam Frankish, James Gilbert, Tim Hubbard, Jennifer Harrow References Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto TS, Manzano C, Chrast J et al. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007 Jun;17(6):746-59. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9. encodeGencodeGenePolyAMar07 Gencode PolyA Gencode polyA Features ENCODE Regions and Genes encodeGencodeGenePseudoMar07 Gencode Pseudo Gencode Pseudogenes ENCODE Regions and Genes encodeGencodeGenePolymorphicMar07 Gencode Polymorph Gencode Polymorphic ENCODE Regions and Genes encodeGencodeGenePutativeMar07 Gencode Putative Gencode Putative Genes ENCODE Regions and Genes encodeGencodeGeneKnownMar07 Gencode Ref Gencode Reference Genes ENCODE Regions and Genes encodeGencodeRaceFrags Gencode RACEfrags 5' RACE-Array experiments on Gencode loci ENCODE Regions and Genes Description RACEfrags are the products of 5’ RACE reactions performed on GENCODE genes (using the primers displayed in the subtrack "Gencode 5’ RACE primer") in 12 tissues and 3 cell lines (15 subtracks) followed by hybridization on ENCODE tiling arrays. Each RACEfrag is linked to the 5’ RACE primer but no other connectivity information is available from this experiment. Methods For a detailed description of the methods and references used, see Denoeud et al., 2007. A combination of 5’ RACE and high-density tiling microarrays were used to empirically annotate 5’ transcription start sites (TSSs) and internal exons of all 410 annotated protein-coding loci across the 44 ENCODE regions (Oct. 2005 GENCODE freeze; Harrow et al., 2006). 
Oligonucleotides for 5’ RACE experiments were chosen such that they map to a coding exon (the index exon) common to most of the transcripts of protein-coding gene loci annotated by the GENCODE (Oct. 2005 freeze). The 5’ RACE reactions were performed with oligonucleotides mapping to a coding exon (the index exon) on polyA+ RNA from twelve adult human tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta) and three cell lines (GM06990 (lymphoblastoid), HL60 (acute promyelocytic leukemia) and HeLaS3 (cervix carcinoma)). The RACE reactions were then hybridized to 20 nucleotide-resolution Affymetrix tiling arrays covering the non-repeated regions of the 44 ENCODE regions. The resulting "RACEfrags" -- array-detected fragments of RACE products -- were assessed for novelty by comparing their genomic coordinates to those of GENCODE-annotated exons. Verification Connectivity between novel RACEfrags and their respective index exon were investigated by RT-PCR using the 5’ RACE primer as one of the primers, followed by hybridization on tiling arrays. 385 RT-PCR reactions corresponding to 199 GENCODE loci were positive after hybridization on tiling arrays (244 RACE reactions). All positive RT-PCR reactions and a subset of those that were negative in the hybridization experiments were further verified by cloning and sequencing of the RT-PCR products. In most cases, eight clones were selected from each set of RT-PCR products for sequencing. To be retained in the dataset, these sequences must unambiguously map to the correct location, show splicing and pass manual inspection by the HAVANA team. By these criteria, 89 of these RT-PCR reactions (69 GENCODE loci) were positive after cloning and sequencing. (see Denoeud et al., 2007 for further details). The resulting cDNA sequences were deposited in GenBank under accession numbers DQ655905-DQ656069 and EF070113-EF070122. See additional information about the sequences here. 
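The novelty assessment mentioned in the Methods, which compares RACEfrag coordinates to those of GENCODE-annotated exons, can be illustrated with a minimal Python sketch. The text does not state the exact rule, so this example assumes that a RACEfrag is "novel" when it overlaps no annotated exon on the same chromosome; the function names and the half-open (chrom, start, end) tuple layout are also assumptions.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates: start <= x < end
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def novel_racefrags(racefrags: List[Interval], exons: List[Interval]) -> List[Interval]:
    """Return the RACEfrags that overlap none of the annotated exons.

    Simple O(n * m) scan, which is adequate for an illustration; a sorted
    sweep or an interval tree would scale better on large inputs.
    """
    return [f for f in racefrags if not any(overlaps(f, e) for e in exons)]
```

For example, with one annotated exon at ("chr21", 1000, 1200), a RACEfrag at ("chr21", 1150, 1300) would be treated as known and one at ("chr21", 5000, 5100) as novel under this rule.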
Credits The RACEfrags result from a collaborative effort among the following laboratories: Lab/Institution Contributors Genome Bioinformatics Lab CRG, Barcelona, Spain France Denoeud, Julien Lagarde, Tyler Alioto, Sylvain Foissac, Robert Castelo, Roderic Guigó Department of Genetic Medicine and Development, University of Geneva, Switzerland Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. Antonarakis Center for Integrative Genomics, University of Lausanne, Switzerland Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond Affymetrix, Inc., Santa Clara, CA, USA Philipp Kapranov, Jorg Drenkow, Sujit Dike, Jill Cheng, Thomas R. Gingeras HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK Adam Frankish, James Gilbert, Tim Hubbard, Jennifer Harrow References Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J et al. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007 Jun;17(6):746-59. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9. ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816. 
encodeGencodeRaceFragsHela RACEfrags HeLaS3 Gencode RACEfrags from HeLaS3 cells ENCODE Regions and Genes encodeGencodeRaceFragsHL60 RACEfrags HL60 Gencode RACEfrags from HL60 cells ENCODE Regions and Genes encodeGencodeRaceFragsGM06990 RACEfrags GM06990 Gencode RACEfrags from GM06990 cells ENCODE Regions and Genes encodeGencodeRaceFragsTestis RACEfrags Testis Gencode RACEfrags from Testis ENCODE Regions and Genes encodeGencodeRaceFragsStomach RACEfrags Stomach Gencode RACEfrags from Stomach ENCODE Regions and Genes encodeGencodeRaceFragsSpleen RACEfrags Spleen Gencode RACEfrags from Spleen ENCODE Regions and Genes encodeGencodeRaceFragsSmallIntest RACEfrags Sm Int Gencode RACEfrags from Small Intestine ENCODE Regions and Genes encodeGencodeRaceFragsPlacenta RACEfrags Placenta Gencode RACEfrags from Placenta ENCODE Regions and Genes encodeGencodeRaceFragsMuscle RACEfrags Muscle Gencode RACEfrags from Muscle ENCODE Regions and Genes encodeGencodeRaceFragsLung RACEfrags Lung Gencode RACEfrags from Lung ENCODE Regions and Genes encodeGencodeRaceFragsLiver RACEfrags Liver Gencode RACEfrags from Liver ENCODE Regions and Genes encodeGencodeRaceFragsKidney RACEfrags Kidney Gencode RACEfrags from Kidney ENCODE Regions and Genes encodeGencodeRaceFragsHeart RACEfrags Heart Gencode RACEfrags from Heart ENCODE Regions and Genes encodeGencodeRaceFragsColon RACEfrags Colon Gencode RACEfrags from Colon ENCODE Regions and Genes encodeGencodeRaceFragsBrain RACEfrags Brain Gencode RACEfrags from Brain ENCODE Regions and Genes encodeGencodeRaceFragsPrimer RACEfrags Primer Gencode 5' RACE primer ENCODE Regions and Genes encodeGencodeGeneOct05 Gencode Genes Oct05 Gencode Gene Annotations (October 2005) ENCODE Regions and Genes Description The Gencode Gene track shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project. A companion track, Gencode Introns, shows experimental gene structure validations for these annotations. 
The gene annotations are colored based on the Havana annotation type. Known and validated transcripts are colored dark green, putative and unconfirmed are light green, pseudogenes are blue, and artifacts are grey. The transcript types are defined in more detail in the accompanying table. The Gencode project recommends that the annotations with known and validated transcripts; i.e., the types Known, Novel_CDS, Novel_transcript_gencode_conf, and Putative_gencode_conf (which are colored dark green in the track display) be used as the reference annotation. Type Color Description Known dark green Known protein coding genes (referenced in Entrez Gene, NCBI) Novel_CDS dark green Novel protein coding genes annotated by Havana (not referenced in Entrez Gene, NCBI) Novel_transcript_gencode_conf dark green Novel transcripts annotated by Havana (no ORF assigned) with at least one junction validated by RT-PCR Putative_gencode_conf dark green Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) with at least one junction validated by RT-PCR Novel_transcript light green Novel transcripts annotated by Havana (no ORF assigned) not validated by RT-PCR Putative light green Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) not validated by RT-PCR TEC light green Single exon objects (supported by multiple ESTs with polyA sites and signals) undergoing experimental validation/extension. Processed_pseudogene blue Pseudogenes arising via retrotransposition (exon structure of parent gene lost) Unprocessed_pseudogene blue Pseudogenes arising via gene duplication (exon structure of parent gene retained) Artifact grey Transcript evidence and/or its translation equivocal Methods The Human and Vertebrate Analysis and Annotation manual curation process (HAVANA) was used to produce these annotations. 
Finished genomic sequence was analyzed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases, as well as a series of ab initio gene predictions. Nucleotide sequence databases were searched with WUBLASTN and significant hits were realigned to the unmasked genomic sequence by EST2GENOME. WUBLASTX was used to search the Uniprot protein database, and the accession numbers of significant hits were retrieved from the Pfam database. Hidden Markov models for Pfam protein domains were aligned against the genomic sequence using Genewise to provide annotation of protein domains. A number of ab initio prediction algorithms were also run: Genscan and Fgenesh for genes, tRNAscan to find tRNA genes, and Eponine TSS for transcription start site predictions. The annotators used the (AceDB-based) Otterlace interface to create and edit gene objects, which were then stored in a local database named Otter. In cases where predicted transcript structures from Ensembl are available, these can be viewed from within the Otterlace interface and may be used as starting templates for gene curation. Annotation in the Otter database is submitted to the EMBL/Genbank/DDBJ nucleotide database. Verification The gene objects selected for verification came from various computational prediction methods and HAVANA annotations. RT-PCR and RACE experiments were performed on them, using a variety of human tissues, to confirm their structure. Human cDNAs from 24 different tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta, skin, peripheral blood leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal heart, fetal lung, thymus, pancreas, mammary gland, prostate) were synthesized using 12 poly(A)+ RNAs from Origene, eight from Clemente Associates/Quantum Magnetics and four from BD Biosciences as described in [Reymond et al., 2002a,b]. 
The relative amount of each cDNA was normalized by quantitative PCR using SYBR Green as intercalator and an ABI Prism 7700 Sequence Detection System. Predicted human gene junctions were assayed experimentally by RT-PCR as previously described and modified [Reymond, 2002b; Mouse Genome Sequencing Consortium, 2002; Guigo, 2003]. Similar amounts of Homo sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the next 30 cycles used an annealing temperature of 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia) and sequenced. RACE experiments were performed with the BD SMART RACE cDNA Amplification Kit following the manufacturer's instructions (BD Biosciences). Credits Click here for a complete list of people who participated in the GENCODE project. References Ashurst, J.L. et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33 (Database Issue), D459-65 (2005). Guigo, R. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003). Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915), 520-62 (2002). Reymond, A. et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420(6915), 582-6 (2002). Reymond, A. et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79(6), 824-32 (2002). 
encodeGencodeGenePseudoOct05 Gencode Pseudo Gencode Pseudogenes ENCODE Regions and Genes encodeGencodeGenePutativeOct05 Gencode Putative Gencode Putative Genes ENCODE Regions and Genes encodeGencodeGeneKnownOct05 Gencode Ref Gencode Reference Genes ENCODE Regions and Genes encodeGencodeIntronOct05 Gencode Introns Oct05 Gencode Intron Validation (October 2005) ENCODE Regions and Genes Description The Gencode Intron Validation track shows gene structure validations generated by the GENCODE project. This track serves as a companion to the Gencode Genes track. The items in this track are colored based on the validation status determined via RT-PCR of exons flanking the intron: Status Color Validation Result RT_positive green Intron validated (RT-PCR product corresponds to expected junction) RACE_validated green Intron validated (RACE product corresponds to expected junction) RT_negative red Intron not validated (no RT-PCR product was obtained) RT_wrong_junction gold Intron not validated, but another junction exists between the two (RT-PCR product does not correspond to the expected junction) Methods Selected gene models from the Gencode Genes track were picked for RT-PCR and RACE verification experiments. RT-PCR and RACE experiments were performed on the objects, using a variety of human tissues, to confirm their structure. Human cDNAs from 24 different tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta, skin, peripheral blood leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal heart, fetal lung, thymus, pancreas, mammary gland, prostate) were synthesized using twelve poly(A)+ RNAs from Origene, eight from Clemente Associates/Quantum Magnetics and four from BD Biosciences as described in [Reymond et al., 2002a,b]. 
The relative amount of each cDNA was normalized with glyceraldehyde-3-phosphate dehydrogenase (GAPDH) by quantitative PCR using SYBR Green as intercalator and an ABI Prism 7700 Sequence Detection System. Predicted human gene junctions were assayed experimentally by RT-PCR as previously described and modified [Reymond, 2002b; Mouse Genome Sequencing Consortium, 2002; Guigo, 2003]. Similar amounts of Homo sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the next 30 cycles used an annealing temperature of 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia) and sequenced. RACE experiments were performed with the BD SMART RACE cDNA Amplification Kit following the manufacturer's instructions (BD Biosciences). Credits Click here for a complete list of people who participated in the GENCODE project. References Ashurst, J.L. et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33 (Database Issue), D459-65 (2005). Guigo, R. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003). Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915), 520-62 (2002). Reymond, A. et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420(6915), 582-6 (2002). Reymond, A. et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79(6), 824-32 (2002). snpArray SNP Arrays SNP Genotyping Arrays Variation and Repeats Description This track displays the SNPs used in genotyping platforms. 
Affymetrix 500K (250K Nsp and 250K Sty) This annotation displays the SNPs available for genotyping with the GeneChip Human Mapping 500K Array Set from Affymetrix. It comprises two arrays, Nsp and Sty, which contain approximately 262,000 and 238,000 SNPs, respectively. Affymetrix 100K (50K HindIII and 50K XbaI) This annotation displays the SNPs available for genotyping with the GeneChip Human Mapping 100K Array Set from Affymetrix. It comprises two 50K arrays: HindIII and XbaI. Affymetrix 10K This annotation displays the SNPs available for genotyping with the GeneChip Human Mapping 10K Array Sets from Affymetrix. There are two versions of the 10K array; each contains approximately 10,000 SNPs. Illumina 300K This annotation displays the SNPs available for genotyping with Illumina's HumanHap300 BeadChip. The HumanHap300 contains over 317,000 tagSNP markers derived from the International HapMap Project. References Further information on the Affymetrix arrays is available at these sites: 500K 100K 10K 10K v2 Further information on the Illumina array is available from Illumina. Methods Position, strand, and polymorphism data were obtained from Affymetrix and supplemented with links to corresponding dbSNP rsIDs based on a positional lookup into dbSNP build 125. In fewer than 2% of the cases, a dbSNP rsID was not present in dbSNP build 125 at the Affymetrix array position. Reference allele information was retrieved from the UCSC database based on Affymetrix position and strand data. Illumina data were supplied as rsIDs and were supplemented with position, strand, and polymorphism data based on a name lookup into dbSNP build 125. Reference allele information was retrieved from the UCSC database based on dbSNP position and strand data. Credits Thanks to Venu Valmeekam from Affymetrix and Jeff Ohmen from Illumina for providing these data. 
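The positional lookup described in the Methods above (attaching dbSNP rsIDs to array SNPs by position, with a small fraction left unmatched) can be sketched as a dictionary join. This is a minimal illustration, not the pipeline actually used: the record layout, the field names `chrom`, `pos` and `rsId`, the function names, and all sample values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Position = Tuple[str, int]  # (chromosome, position); layout is an assumption


def attach_rsids(array_snps: List[dict],
                 dbsnp_by_position: Dict[Position, str]) -> List[dict]:
    """Copy each array SNP record and add the rsID found at its position.

    SNPs with no dbSNP entry at that position get rsId=None.
    """
    annotated = []
    for snp in array_snps:
        rs_id: Optional[str] = dbsnp_by_position.get((snp["chrom"], snp["pos"]))
        annotated.append({**snp, "rsId": rs_id})
    return annotated


def missing_fraction(annotated: List[dict]) -> float:
    """Fraction of records for which the positional lookup found no rsID."""
    if not annotated:
        return 0.0
    return sum(1 for s in annotated if s["rsId"] is None) / len(annotated)


# Made-up records and identifiers, for illustration only.
dbsnp = {("chr1", 1000): "rsA", ("chr1", 2000): "rsB"}
snps = [{"chrom": "chr1", "pos": 1000}, {"chrom": "chr1", "pos": 1500}]
result = attach_rsids(snps, dbsnp)
print(result, missing_fraction(result))
```

The name lookup described for the Illumina data would be the same join in the other direction, keyed by rsID instead of position.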
snpArrayIllumina300 Illumina 300K Illumina HumanHap300 BeadChip Variation and Repeats snpArrayAffy10v2 Affy 10Kv2 Affymetrix GeneChip Human Mapping 10K v2 Variation and Repeats snpArrayAffy10 Affy 10K Affymetrix GeneChip Human Mapping 10K Variation and Repeats snpArrayAffy50XbaI Affy 50KXbaI Affymetrix GeneChip Human Mapping 50K XbaI Variation and Repeats snpArrayAffy50HindIII Affy 50KHindIII Affymetrix GeneChip Human Mapping 50K HindIII Variation and Repeats snpArrayAffy250Sty Affy 250KSty Affymetrix GeneChip Human Mapping 250K Sty Variation and Repeats snpArrayAffy250Nsp Affy 250KNsp Affymetrix GeneChip Human Mapping 250K Nsp Variation and Repeats encodeEgaspFull EGASP Full ENCODE Gene Prediction Workshop (EGASP) All ENCODE Regions ENCODE Regions and Genes Description This track shows full sets of gene predictions covering all 44 ENCODE regions originally submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005. The following gene predictions are included: AceView DOGFISH-C Ensembl Exogean ExonHunter Fgenesh Pseudogenes Fgenesh++ GeneID-U12 GeneMark JIGSAW Pairagon/N-SCAN SGP2-U12 SPIDA Twinscan-MARS The EGASP Partial companion track shows original gene prediction submissions for a partial set of the 44 ENCODE regions; the EGASP Update track shows updated versions of the submitted predictions. These annotations were originally produced using the hg17 assembly. Display Conventions and Configuration Data for each gene prediction method within this composite annotation track are displayed in a separate subtrack. See the top of the track description page for configuration options allowing display of selected subsets of gene predictions. To remove a subtrack from the display, uncheck the appropriate box. The individual subtracks within this annotation follow the display conventions for gene prediction tracks. Display characteristics specific to individual subtracks are described in the Methods section. 
The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods. Methods AceView These annotations were generated using AceView. All mRNAs and cDNAs available in GenBank, excluding NMs, were co-aligned on the Gencode sections. The results were then examined and filtered to resemble Havana. The very restrictive view of Havana on CDS was not reproduced, due to a lack of experimental data. DOGFISH-C Candidate splice sites and coding starts/stops were evaluated using DNA alignments between the human assembly and seven other vertebrate species (UCSC multiz alignments, adding the frog and removing the chimp). Genes (single transcripts only) were then predicted using dynamic programming. Ensembl The Ensembl annotation includes two types of predictions: protein-coding genes (the Ensembl Gene Predictions subtrack) and pseudogenes of protein-coding genes (the Ensembl Pseudogene Predictions subtrack). The Ensembl Pseudo track is not intended as a comprehensive annotation of pseudogenes, but rather an attempt to identify and label those gene predictions made by the Ensembl pipeline that have pseudogene characteristics. Exons that lie partially outside the ENCODE region are not included in the data set. The "Alternate Name" field on the subtrack details page shows the Ensembl ID for the selected gene or transcript. ExonHunter ExonHunter is a comprehensive gene-finder based on hidden Markov models (HMMs) allowing the use of a variety of additional sources of information (ESTs, proteins, genome-genome comparisons). 
Exogean Exogean annotates protein coding genes by combining mRNA and cross-species protein alignments in directed acyclic colored multigraphs where nodes and edges respectively represent biological objects and human expertise. Additional predictions and methods for this subtrack are available in the EGASP Updates track. Fgenesh Pseudogenes Fgenesh is an HMM gene structure prediction program. This data set shows predictions of potential pseudogenes. Fgenesh++ These gene predictions were generated by Fgenesh++, a gene-finding program that uses both HMMs and protein similarity to find genes in a completely automated manner. GeneID-U12 The GeneID-U12 gene prediction set, generated using a version of GeneID modified to detect U12-dependent introns (both GT-AG and AT-AC subtypes) when present, employs a single-genome ab initio method. This modified version of GeneID uses matrices for U12 donor, acceptor and branch sites constructed from examples of published U12 intron splice junctions (both experimentally confirmed and expressed-sequence-validated predictions). Two GeneID-U12 subtracks are included: GeneID Gene Predictions and GeneID U12 Intron Predictions. The U12 splice sites for features in the U12 Intron Predictions track are displayed on the track details pages. Additional predictions and methods for this subtrack are available in the EGASP Updates track. GeneMark The eukaryotic version of the GeneMark.hmm (release 2.2) gene prediction program utilizes the HMM statistical model with duration or hidden semi-Markov model (HSMM). The HMM includes hidden states for initial, internal and terminal exons, introns, intergenic regions and single exon genes. It also includes the "border" states, such as start site (initiation codon), stop site (termination codons), and donor and acceptor splice sites. Sequences of all protein-coding regions were modeled by three periodic inhomogeneous Markov chains; sequences of non-coding regions were modeled by homogeneous Markov chains. 
Nucleotide sequences corresponding to the site states were modeled by position-specific inhomogeneous Markov chains. Parameters of the gene models were derived from the set of genes obtained by cDNA mapping to genomic DNA. To reflect variations in G+C composition of the genome, the gene model parameters were estimated separately for the three G+C regions. JIGSAW JIGSAW uses the output from gene-finders, splice-site prediction programs and sequence alignments to predict gene models. Annotation data downloaded from the UCSC Genome Browser and TIGR gene-finder output was used as input for these predictions. JIGSAW predicts both partial and complete genes. Additional predictions and methods for this subtrack are available in the EGASP Updates track. Pairagon/N-SCAN The pairHMM-based alignment program, Pairagon, was used to align high-quality mRNA sequences to the ENCODE regions. These were supplemented with N-SCAN EST predictions which are displayed in the Pairgn/NSCAN-E subtrack, and extended further with additional transcripts from the Brent Lab to produce the predictions displayed as the Pairgn/NSCAN-E/+ subtrack. The NSCAN subtrack contains only predictions from the N-SCAN program. SGP2-U12 The SGP2-U12 gene prediction set, generated using a version of GeneID modified to detect U12-dependent introns (both AT-AC and GT-AG subtypes) when present, employs a dual-genome method (SGP2) that utilizes similarity (tblastx) to mouse genomic sequence syntenic to the ENCODE regions (Oct. 2004 MSA freeze). This modified version of GeneID uses matrices for U12 donor, acceptor and branch sites constructed from examples of published U12 intron splice junctions (both experimentally confirmed and expressed-sequence-validated predictions). Two SGP2-U12 subtracks are included: SGP2 Gene Predictions and SGP2 U12 Intron Predictions. The U12 splice sites for features in the U12 Intron Predictions track are displayed on the track details pages. 
Additional predictions and methods for this subtrack are available in the EGASP Updates track. SPIDA This exon-only prediction set was produced using SPIDA (Substitution Periodicity Index and Domain Analysis). Exons derived by mapping ESTs to the genome were validated by seeking periodic substitution patterns in the aligned informant DNA sequences. First, all available ESTs were mapped to the genome using Exonerate. The resulting transcript structures were "flattened" to remove redundancy. Each exon of the flattened transcripts was subjected to SPI analysis, which involves identifying periodicity in the pattern of mutations occurring between the human and an informant species DNA sequence (the informant sequences and their TBA alignments were provided by Elliott Margulies). SPI was calculated for all available human-informant pairs for whole exons and in a sliding 48 bp window. SPI analysis requires that a threshold level of periodicity be identified in at least two of the informant species if the exon is to be accepted. If accepted, SPI provides the correct frame for translation of the exon. This exon was used as a starting point for extending the ORF coding region of the flattened transcript from which it came. This gave a full or partial CDS; different exons may give different CDSs. The CDSs were translated and searched for domains using hmmpfam and Pfam_fs. Only transcripts with a domain hit with e < 1.0 were retained. Heuristics were applied to the retained CDSs to identify problems with the transcript structure, particularly frame-shifts. Many transcripts may identify the same exon, but only a single instance of each exon has been retained. Twinscan-MARS This gene prediction set was produced by a version of Twinscan that employs multiple pairwise genome comparisons to identify protein-coding genes (including alternative splices) using nucleotide homology information. No expression or protein data were used. 
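The SPIDA acceptance rule described above (an exon is accepted only if a threshold level of periodicity is found in at least two informant species) can be sketched as follows. The SPI calculation itself is not specified in this description, so the scores are treated as given inputs; the assumption that a higher score means stronger periodicity, the threshold value, the species names and the function name are all hypothetical.

```python
from typing import Dict


def exon_accepted(spi_by_informant: Dict[str, float],
                  threshold: float,
                  min_species: int = 2) -> bool:
    """Accept an exon if enough informant species reach the SPI threshold.

    Assumes higher SPI means stronger periodicity; that direction and the
    threshold value are illustrative assumptions, not taken from the source.
    """
    passing = [sp for sp, spi in spi_by_informant.items() if spi >= threshold]
    return len(passing) >= min_species


# Made-up scores, for illustration only.
print(exon_accepted({"mouse": 0.8, "dog": 0.7, "chicken": 0.1}, threshold=0.5))
print(exon_accepted({"mouse": 0.8, "dog": 0.2}, threshold=0.5))
```

Keeping `min_species` as a parameter makes the "at least two informants" requirement explicit rather than hard-coded.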
Credits The following individuals and institutions provided the data for the subtracks in this annotation: AceView: Danielle and Jean Thierry-Mieg, NCBI, National Institutes of Health. DOGFISH-C: David Carter, Informatics Dept., Wellcome Trust Sanger Institute. Ensembl: Stephen Searle, Wellcome Trust Sanger Institute (joint Sanger/EBI project). Exogean: Sarah Djebali, Dyogen Lab, Ecole Normale Supérieure (Paris, France). ExonHunter: Tomas Vinar, Waterloo Bioinformatics, School of Computer Science, University of Waterloo. Fgenesh, Fgenesh++: Victor Solovyev, Department of Computer Science, Royal Holloway, London University. GeneID-U12, SGP2-U12: Tyler Alioto, Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM), Barcelona. GeneMark: Mark Borodovsky, Alex Lomsadze and Alexander Lukashin, Department of Biology, Georgia Institute of Technology. JIGSAW: Jonathan Allen, Steven Salzberg group, The Institute for Genomic Research (TIGR) and the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park. Pairagon/N-SCAN: Randall Brown, Laboratory for Computational Genomics, Washington University in St. Louis. SPIDA: Damian Keefe, Birney Group, EMBL-EBI. Twinscan: Paul Flicek, Brent Lab, Washington University in St. Louis. encodeEgaspSuper EGASP ENCODE Gene Prediction Workshop (EGASP) ENCODE Regions and Genes Overview This super-track combines related tracks from the ENCODE Gene Annotation Assessment Project (EGASP) 2005 Gene Prediction Workshop. The goal of the workshop was to evaluate automatic methods for gene annotation of the human genome, with a focus on protein-coding genes. Predictions were evaluated in terms of their ability to reproduce the high-quality manually assisted GENCODE gene annotations and to predict novel transcripts. The EGASP Full track shows gene predictions covering all 44 ENCODE regions submitted before the GENCODE annotations were released. 
The EGASP Partial track shows gene predictions that cover some of the ENCODE regions, submitted before the GENCODE release. The EGASP Update track shows gene predictions that cover all ENCODE regions, submitted after the GENCODE release. These annotations were originally produced using the hg17 assembly. The following gene predictions are included: ACEScan AceView DOGFISH-C Ensembl Exogean ExonHunter Fgenesh Pseudogenes Fgenesh++ GeneID-U12 GeneMark GeneZilla JIGSAW Pairagon/N-SCAN SAGA SGP2-U12 SPIDA Twinscan-MARS Yale pseudogenes Credits Click here for a complete list of people who participated in the GENCODE project. The following individuals and institutions provided the data for the subtracks in this annotation: AceView: Danielle and Jean Thierry-Mieg, NCBI, National Institutes of Health. DOGFISH-C: David Carter, Informatics Dept., Wellcome Trust Sanger Institute. Ensembl: Stephen Searle, Wellcome Trust Sanger Institute (joint Sanger/EBI project). Exogean: Sarah Djebali, Dyogen Lab, Ecole Normale Supérieure (Paris, France). ExonHunter: Tomas Vinar, Waterloo Bioinformatics, School of Computer Science, University of Waterloo. Fgenesh, Fgenesh++: Victor Solovyev, Department of Computer Science, Royal Holloway, London University. GeneID-U12, SGP2-U12: Tyler Alioto, Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM), Barcelona. GeneMark: Mark Borodovsky, Alex Lomsadze and Alexander Lukashin, Department of Biology, Georgia Institute of Technology. JIGSAW: Jonathan Allen, Steven Salzberg group, The Institute for Genomic Research (TIGR) and the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park. Pairagon/N-SCAN: Randall Brown, Laboratory for Computational Genomics, Washington University in St. Louis. SPIDA: Damian Keefe, Birney Group, EMBL-EBI. Twinscan: Paul Flicek, Brent Lab, Washington University in St. Louis. 
ACEScan: Gene Yeo, Crick-Jacobs Center for Computational Biology, Salk Institute. Augustus: Mario Stanke, Department of Bioinformatics, University of Göttingen, Germany. GeneZilla: William Majoros, Dept. of Bioinformatics, The Institute for Genomic Research (TIGR). SAGA: Sourav Chatterji, Lior Pachter lab, Department of Mathematics, U.C. Berkeley. References Ashurst JL, Chen CK, Gilbert JG, Jekosch K, Keenan S, Meidl P, Searle SM, Stalker J, Storey R, Trevanion S et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D459-65. Guigo R, Dermitzakis ET, Agarwal P, Ponting CP, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A. 2003 Feb 4;100(3):1140-5. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002 Dec 5;420(6915):520-62. Reymond A, Marigo V, Yaylaoglu MB, Leoni A, Ucla C, Scamuffa N, Caccioppoli C, Dermitzakis ET, Lyle R, Banfi S et al. Human chromosome 21 gene expression atlas in the mouse. Nature. 2002 Dec 5;420(6915):582-6. Reymond A, Camargo AA, Deutsch S, Stevenson BJ, Parmigiani RB, Ucla C, Bettoni F, Rossier C, Lyle R, Guipponi M et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics. 2002 Jun;79(6):824-32. Chatterji S, Pachter L. Multiple organism gene finding by collapsed Gibbs sampling. J Comput Biol. 2005 Jul-Aug;12(6):599-608. Siepel A, Haussler D. Computational identification of evolutionarily conserved exons. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology. 2004;177-186. Augustus Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19(Suppl. 2):ii215-ii225. Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 
2004 Jul 1;32(Web Server issue):W309-12. FGenesh++ Solovyev VV. "Statistical approaches in Eukaryotic gene prediction". In Handbook of Statistical Genetics (eds. Balding D et al.) (John Wiley & Sons, Inc., 2001). p. 83-127. GeneID Blanco E, Parra G, Guigó R. "Using geneid to identify genes". In Current Protocols in Bioinformatics, Unit 4.3. (eds. Baxevanis AD.) (John Wiley & Sons, Inc., 2002). Guigó R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol. 1998 Winter;5(4):681-702. Guigó R, Knudsen S, Drake N, Smith T. Prediction of gene structure. J Mol Biol. 1992 Jul 5;226(1):141-57. Parra G, Blanco E, Guigó R. GeneID in Drosophila. Genome Res. 2000 Apr;10(4):511-5. JIGSAW Allen JE, Pertea M, Salzberg SL. Computational gene prediction using multiple sources of evidence. Genome Res. 2004 Jan;14(1):142-8. Allen JE, Salzberg SL. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics. 2005 Sep 15;21(18):3596-603. SGP2 Guigó R, Dermitzakis ET, Agarwal P, Ponting CP, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A. 2003 Feb 4;100(3):1140-5. Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigó R. Comparative gene prediction in human and mouse. Genome Res. 2003 Jan;13(1):108-17. 
encodeEgaspFullTwinscan Twinscan Twinscan Gene Predictions ENCODE Regions and Genes encodeEgaspFullSpida SPIDA Exons SPIDA Exon Predictions ENCODE Regions and Genes encodeEgaspFullSgp2U12 SGP2 U12 SGP2 U12 Intron Predictions ENCODE Regions and Genes encodeEgaspFullSgp2 SGP2 SGP2 Gene Predictions ENCODE Regions and Genes encodeEgaspFullPairagonMultiple NSCAN N-SCAN Gene Predictions ENCODE Regions and Genes encodeEgaspFullPairagonAny Pairgn/NSCAN-E/+ Pairagon/NSCAN Any Evidence Gene Predictions ENCODE Regions and Genes encodeEgaspFullPairagonMrna Pairgn/NSCAN-E Pairagon/NSCAN-EST Gene Predictions ENCODE Regions and Genes encodeEgaspFullJigsaw Jigsaw Jigsaw Gene Predictions ENCODE Regions and Genes encodeEgaspFullGenemark GeneMark GeneMark Gene Predictions ENCODE Regions and Genes encodeEgaspFullGeneIdU12 GeneID U12 GeneID U12 Intron Predictions ENCODE Regions and Genes encodeEgaspFullGeneId GeneID GeneID Gene Predictions ENCODE Regions and Genes encodeEgaspFullSoftberryPseudo Fgenesh Pseudo Fgenesh Pseudogene Predictions ENCODE Regions and Genes encodeEgaspFullFgenesh Fgenesh++ Fgenesh++ Gene Predictions ENCODE Regions and Genes encodeEgaspFullExonhunter ExonHunter ExonHunter Gene Predictions ENCODE Regions and Genes encodeEgaspFullExogean Exogean Exogean Gene Predictions ENCODE Regions and Genes encodeEgaspFullEnsemblPseudo Ensembl Pseudo Ensembl Pseudogene Predictions ENCODE Regions and Genes encodeEgaspFullEnsembl Ensembl Ensembl Gene Predictions ENCODE Regions and Genes encodeEgaspFullDogfish DOGFISH-C DOGFISH-C Gene Predictions ENCODE Regions and Genes encodeEgaspFullAceview AceView AceView Gene Predictions ENCODE Regions and Genes snpRecombHotspot SNP Recomb Hots Recombination Hotspots from SNP Genotyping Variation and Repeats Description This track shows the location of recombination hotspots detected from patterns of genetic variation. It is based on the HapMap Phase I data, release 16a, and Perlegen data (Hinds et al., 2005). 
Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Recombination hotspot estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination hotspots would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Methods Recombination hotspots are identified using the likelihood-ratio test described in McVean et al. (2004) and Winckler et al. (2005), referred to as LDhot. For successive intervals of 200 kb, the maximum likelihood of a model with a constant recombination rate is compared to the maximum likelihood of a model in which the central 2 kb is a recombination hotspot (likelihoods are approximated by the composite likelihood method of Hudson 2001). The observed difference in log composite likelihood is compared against the null distribution, which is obtained by simulations. Simulations are matched for sample size, SNP density, background recombination rate and an approximation to the ascertainment scheme (a panel of 12 individuals with a Poisson number of chromosomes, mean 1, sampled from this panel, using a single hit ascertainment scheme for dbSNP and resequencing of 16 individuals for the ten HapMap ENCODE regions). 
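The comparison of an observed log composite likelihood difference against a simulated null distribution, as described above, can be sketched as an empirical p-value. This is only an illustration and not the LDhot implementation: the function names are hypothetical, and the add-one correction that keeps the p-value above zero is an assumption not stated in the track description.

```python
from typing import Sequence


def log_likelihood_difference(loglik_hotspot: float, loglik_constant: float) -> float:
    """Difference in log composite likelihood: hotspot model minus constant-rate model."""
    return loglik_hotspot - loglik_constant


def empirical_p_value(observed: float, null_values: Sequence[float]) -> float:
    """Share of simulated null statistics that are at least as large as the observed one.

    One is added to numerator and denominator (an assumption of this sketch)
    so that a finite number of simulations never yields p = 0.
    """
    if len(null_values) == 0:
        raise ValueError("at least one simulated null value is required")
    at_least_as_large = sum(1 for value in null_values if value >= observed)
    return (at_least_as_large + 1) / (len(null_values) + 1)
```

In this sketch, `null_values` would hold the same statistic computed on data sets simulated without a hotspot and matched to the real data as described above.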
Evidence for a hotspot was assessed in each analysis panel separately (YRI, CEU and combined CHB+JPT), and p-values were combined such that a hotspot requires that two of the three populations show some evidence of a hotspot (p < 0.05) and at least one population shows stronger evidence for a hotspot (p < 0.01). Hotspot centers were estimated at those locations where distinct recombination rate estimate peaks occurred with at least a factor of two separation between peaks, within the low p-value intervals. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The HapMap data are based on HapMap release 16a; the Perlegen data are from Hinds et al. (2005). The recombination hotspots were ascertained by Simon Myers from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Hudson, R.R. Two-locus sampling distributions and their application. Genetics 159(4), 1805-1817 (2001). Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., Cox, D.R. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. 
Science 307(5712), 1072-1079 (2005). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). snpRecombHotspotPerlegen Perlegen Oxford Recombination Hotspots from Perlegen Data Variation and Repeats snpRecombHotspotHapmap HapMap Oxford Recombination Hotspots from HapMap Phase I Release 16c.1 Variation and Repeats encodeEgaspPartial EGASP Partial ENCODE Gene Prediction Workshop (EGASP) for Partial ENCODE Regions ENCODE Regions and Genes Description This track shows gene predictions submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005 that cover only a partial set of the 44 ENCODE regions. The partial set excludes the 13 ENCODE regions for which high-quality annotations were released in late 2004. The following gene predictions are included: ACEScan Augustus GeneZilla SAGA The EGASP Full companion track shows original gene prediction submissions for the full set of 44 ENCODE regions using Gene Prediction algorithms other than those used here; the EGASP Update track shows updated versions of some of the submitted predictions. Display Conventions and Configuration Data for each gene prediction method within this composite annotation track is displayed in a separate subtrack. See the top of the track description page for a complete list of the subtracks available for this annotation. 
To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The individual subtracks within this annotation follow the display conventions for gene prediction tracks. The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods. Methods ACEScan ACEScan (Alternative Conserved Exons Scan) indicates alternative splicing that is evolutionarily conserved in human and mouse/rat. The Conserved Alternative Exon Predictions subtrack shows predicted alternative conserved exons. The Unconserved Alternative and Constitutive Exon Predictions subtrack shows exons that are predicted to be constitutive or may have species-specific alternative splicing. Augustus Augustus uses a generalized hidden Markov model (GHMM) that models coding and non-coding sequence, splice sites, the branch point region, translation start and end, and lengths of exons and introns. The track contains four different sets of predictions. Ab initio single genome predictions are based solely on the input sequence. EST and protein evidence predictions were generated using AGRIPPA hints based on alignments of human sequence from the dbEST and nr databases. Mouse homology gene predictions were produced using mouse genomic sequence only; BLAST, CHAOS, DIALIGN were used to generate the hints for Augustus. The combined EST/protein evidence and mouse homology gene predictions were created using human sequence from the dbEST and nr databases and mouse genomic sequence to generate hints for Augustus. 
Additional predictions and methods for this subtrack are available in the EGASP Updates track. GeneZilla GeneZilla is a program for the computational prediction of protein-coding genes in eukaryotic DNA, based on the generalized hidden Markov model (GHMM) framework. These predictions were generated using GeneZilla and IsoScan, which uses a four-state hidden Markov model to predict isochores (regions of homogeneous G+C content) in genomic DNA. SAGA SAGA is an ab initio multiple-species gene-finding program based on the Gibbs sampling-based method described in Chatterji et al. (2004). In addition to sampling parameters, SAGA also uses a phyloHMM based model to boost the scores, similar to the method described in Siepel et al. (2004). Credits The gene prediction data sets were submitted by the following individuals and institutions: ACEScan: Gene Yeo, Crick-Jacobs Center for Computational Biology, Salk Institute. Augustus: Mario Stanke, Department of Bioinformatics, University of Göttingen, Germany. GeneZilla: William Majoros, Dept. of Bioinformatics, The Institute for Genomic Research (TIGR). SAGA: Sourav Chatterji, Lior Pachter lab, Department of Mathematics, U.C. Berkeley. References Chatterji, S. and Pachter, L. Multiple organism gene finding by collapsed Gibbs sampling. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology, 187-193 (2004). Siepel, A. and Haussler, D. Computational identification of evolutionarily conserved exons. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology, 177-186 (2004). 
encodeEgaspPartSaga SAGA SAGA Gene Predictions ENCODE Regions and Genes encodeEgaspPartGenezilla GeneZilla GeneZilla Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusAny Augustus/EST/Mouse Augustus + EST/Protein Evidence + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusDual Augustus/Mouse Augustus + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusEst Augustus/EST Augustus + EST/Protein Evidence Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusAbinitio Augustus Augustus Ab Initio Gene Predictions ENCODE Regions and Genes encodeEgaspPartAceOther ACEScan Other ACEScan Unconserved Alternative and Constitutive Exon Predictions ENCODE Regions and Genes encodeEgaspPartAceCons ACEScan Cons Alt ACEScan Conserved Alternative Exon Predictions ENCODE Regions and Genes encodeEgaspUpdate EGASP Update ENCODE Gene Prediction Workshop (EGASP) Updates ENCODE Regions and Genes Description This track shows updated versions of gene predictions submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005. The following gene predictions are included: Augustus Exogean FGenesh++ GeneID-U12 Jigsaw SGP2-U12 Yale pseudogenes The original EGASP submissions are displayed in the companion tracks, EGASP Full and EGASP Partial. Display Conventions and Configuration Data for each gene prediction method within this composite annotation track are displayed in separate subtracks. See the top of the track description page for a complete list of the subtracks available for this annotation. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The individual subtracks within this annotation follow the display conventions for gene prediction tracks. Display characteristics specific to individual subtracks are described in the Methods section. 
The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods. Methods Augustus Augustus uses a generalized hidden Markov model (GHMM) that models coding and non-coding sequence, splice sites, the branch point region, the translation start and end, and the lengths of exons and introns. This version has been trained on a set of 1284 human genes. The track contains four sets of predictions: ab initio, EST and protein-based, mouse homology-based, and those using EST/protein and mouse homology evidence as additional input to Augustus for the predictions. The EST and protein evidence was generated by aligning sequences from the dbEST and nr databases to the ENCODE region using wublastn and wublastx. The resulting alignments were used to generate hints about putative splice sites, exons, coding regions, introns, translation start and translation stop. The mouse homology evidence was generated by aligning pairs of human and mouse genomic sequences using the program DIALIGN. Regions conserved at the peptide level were used to generate hints about coding regions. Exogean Exogean produces alternative transcripts by combining mRNA and cross-species sequence alignments using heuristic rules. The program implements a generic framework based on directed acyclic colored multigraphs (DACMs). In Exogean, DACM nodes represent biological objects (mRNA or protein HSPs/transcripts) and multiple edges between nodes represent known relationships between these objects derived from human expertise. 
Exogean DACMs are successively built and reduced, leading to increasingly complex objects. This process enables the production of alternative transcripts from initial HSPs. FGenesh++ FGenesh++ predictions are based on hidden Markov models and protein similarity to the NR database. For more information, see the reference below. GeneID-U12 The GeneID program, which is designed with a hierarchical structure, predicts genes in anonymous genomic sequences. In the first step, splice sites, start and stop codons are predicted and scored along the sequence using position weight arrays (PWAs). Next, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites plus the log-likelihood ratio of a Markov model for coding DNA. Finally, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. The modified version of GeneID used to generate the predictions in this track incorporates models for U12-dependent splice signals in addition to U2 splice signals. The GeneID subtrack shows all GeneID genes. Only U12 introns and their flanking exons are displayed in the GeneID U12 subtrack. Exons flanking predicted U12-dependent introns are assigned a type attribute reflecting their splice sites, displayed on the details page of the GeneID U12 subtrack as the "Alternate Name" of the item composed of the intron plus flanking exons. Jigsaw Jigsaw is a gene prediction program that determines genes based on target genomic sequence and output from a gene structure annotation database. Data downloaded from UCSC's annotation database is used as input and includes the following tracks of evidence: Known Genes, Ensembl, RefSeq, GeneID, Genscan, SGP, Twinscan, Human mRNAs, TIGR Gene Index, UniGene, Most Conserved Elements and Non-human RefSeq Genes. GlimmerHMM and GeneZilla, two open source ab initio gene-finding programs based on GHMMs, are also used. 
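The GeneID scoring and assembly steps described above (exon score = site scores plus coding log-likelihood ratio, then a gene structure that maximizes the summed exon scores) can be illustrated with a small dynamic program over non-overlapping exons. This is a simplified sketch and not GeneID's own algorithm: the `Exon` type and `best_exon_chain` function are hypothetical names, and constraints that a real gene finder would enforce, such as reading-frame compatibility and exon type ordering, are deliberately left out.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int          # first base of the exon (inclusive)
    end: int            # last base of the exon (inclusive)
    site_score: float   # sum of the scores of the defining sites
    coding_llr: float   # log-likelihood ratio from the coding Markov model

    @property
    def score(self) -> float:
        return self.site_score + self.coding_llr


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Pick non-overlapping exons that maximize the sum of exon scores."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)      # best[i]: best total using the first i exons
    taken = [False] * (n + 1)   # whether exon i is part of best[i]
    prev = [0] * (n + 1)        # index to continue the traceback from
    for i, exon in enumerate(ordered, start=1):
        # Number of exons that end strictly before this exon starts.
        compatible = bisect_left(ends, exon.start)
        with_exon = best[compatible] + exon.score
        if with_exon > best[i - 1]:
            best[i], taken[i], prev[i] = with_exon, True, compatible
        else:
            best[i] = best[i - 1]
    chain: List[Exon] = []
    i = n
    while i > 0:
        if taken[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain
```

Sorting by exon end and looking up the last compatible exon with a binary search keeps the chaining step at O(n log n) in this sketch.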
SGP2-U12 To predict genes in a genomic query, SGP2 combines GeneID predictions with tblastx comparisons of the genomic query against other genomic sequences. This modified version of SGP2 uses models for U12-dependent splice signals in addition to U2 splice signals. The reference genomic sequence for this data set is the Oct. 2004 release of mouse sequence syntenic to ENCODE regions. The SGP2 and SGP2 U12 tracks follow the same display conventions as the GeneID and GeneID U12 subtracks described above. Yale Pseudogenes For this analysis, pseudogenes were defined as genomic sequences similar to known human genes and with various disablements (premature stop codons or frameshifts) in their "putative" protein-coding regions. The protein sequences of known human genes (as annotated by ENSEMBL) were used to search for similar nongenic sequences in ENCODE regions. The matching sequences were assessed as disabled copies of genes based on the occurrences of premature stop codons or frameshifts. The intron-exon structure of the functional gene was further used to infer whether a pseudogene was duplicated or processed (a duplicated pseudogene keeps the intron-exon structure of its parent functional gene). Small pseudogene sequences were labeled as fragments or other types. All pseudogenes in this track were manually curated. In the browser, the track details page shows the pseudogene type. Credits Augustus was written by Mario Stanke at the Department of Bioinformatics of the University of Göttingen in Germany. Exogean was developed by Sarah Djebali and Hugues Roest Crollius from the Dyogen Lab, Ecole Normale Supérieure (Paris, France) and Franck Delaplace from the Laboratoire de Méthodes Informatiques (LaMI), (Evry, France). The FGenesh++ gene predictions were provided by Victor Solovyev of Softberry Inc. 
The GeneID-U12 and SGP2-U12 programs were developed by the Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM) in Barcelona. The version of GeneID on which GeneID-U12 is based (geneid_v1.2) was written by Enrique Blanco and Roderic Guigó. The parameter files were constructed by Genis Parra and Francisco Camara. Additional contributions were made by Josep F. Abril, Moises Burset and Xavier Messeguer. Modifications to GeneID that allow for the prediction of U12-dependent splice sites and incorporation of U12 introns into gene models were made by Tyler Alioto. Jigsaw was developed at The Institute for Genomic Research (TIGR) by Jonathan Allen and Steven Salzberg, with computational gene-finder contributions from Mihaela Pertea and William Majoros. Continued maintenance and development of Jigsaw will be provided by the Salzberg group at the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park. The Yale Pseudogenes were generated by the pseudogene annotation group of Mark Gerstein at Yale University. References Augustus Stanke, M. Gene prediction with a hidden Markov model. Ph.D. thesis, Universität Göttingen, Germany (2004). Stanke, M. and Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(Suppl. 2), ii215-ii225 (2003). Stanke, M., Steinkamp, R., Waack, S. and Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Res., 32, W309-W312 (2004). FGenesh++ Solovyev V.V. "Statistical approaches in Eukaryotic gene prediction". In Handbook of Statistical Genetics (eds. Balding D. et al.) (John Wiley & Sons, Inc., 2001). p. 83-127. GeneID Blanco, E., Parra, G. and Guigó, R. "Using geneid to identify genes". In Current Protocols in Bioinformatics, Unit 4.3. (ed. Baxevanis, A.D.) (John Wiley & Sons, Inc., 2002). Guigó, R. Assembling genes from predicted exons in linear time with dynamic programming. 
J Comput Biol. 5(4), 681-702 (1998). Guigó, R., Knudsen, S., Drake, N. and Smith, T. Prediction of gene structure. J Mol Biol. 226(1), 141-57 (1992). Parra, G., Blanco, E. and Guigó, R. GeneID in Drosophila. Genome Research 10(4), 511-515 (2000). Jigsaw Allen, J.E., Pertea, M. and Salzberg, S.L. Computational gene prediction using multiple sources of evidence. Genome Res., 14(1), 142-8 (2004). Allen, J.E. and Salzberg, S.L. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18), 3596-3603 (2005). SGP2 Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C.P., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003). Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W. and Guigó, R. Comparative gene prediction in human and mouse. Genome Res. 13(1), 108-17 (2003). 
encodeEgaspUpdYalePseudo Yale Pseudo Upd Yale Pseudogene Predictions ENCODE Regions and Genes encodeEgaspUpdSgp2U12 SGP2 U12 Update SGP2 U12 Intron Predictions ENCODE Regions and Genes encodeEgaspUpdSgp2 SGP2 Update SGP2 Gene Predictions ENCODE Regions and Genes encodeEgaspUpdJigsaw Jigsaw Update Jigsaw Gene Predictions ENCODE Regions and Genes encodeEgaspUpdGeneIdU12 GeneID U12 Upd GeneID U12 Intron Predictions ENCODE Regions and Genes encodeEgaspUpdGeneId GeneID Update GeneID Gene Predictions ENCODE Regions and Genes encodeEgaspUpdFgenesh FGenesh++ Upd Fgenesh++ Gene Predictions ENCODE Regions and Genes encodeEgaspUpdExogean Exogean Update Exogean Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusAny August/EST/Ms Upd Augustus + EST/Protein Evidence + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusDual August/Mouse Upd Augustus + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusEst Augustus/EST Upd Augustus + EST/Protein Evidence Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusAbinitio Augustus Update Augustus Ab Initio Gene Predictions ENCODE Regions and Genes snpRecombRate SNP Recomb Rates Recombination Rates from SNP Genotyping Variation and Repeats Description This track shows recombination rates measured in centiMorgans per Megabase. It is based on the HapMap Phase I data, release 16a, and Perlegen data (Hinds et al., 2005). Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). 
Fine-scale recombination rate estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination rate variation would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Display Conventions and Configuration This annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page. For more information, click the Graph configuration help link. Methods Fine-scale recombination rates are estimated using the reversible-jump Markov chain Monte Carlo (MCMC) method (McVean et al., 2004). This approach explores the posterior distribution of fine-scale recombination rate profiles, where the state-space considered is the distribution of piece-wise constant recombination maps. The Markov chain explores the distribution of both the number and location of change-points, in addition to the rates for each segment. A prior is set on the number of change-points that increases the smoothing effect of trans-dimensional MCMC, which is necessary because of the composite-likelihood scheme employed. This method is implemented in the package LDhat, which includes full details of installation and implementation. A block-penalty of five was used (calibrated by simulation and comparison to data from sperm-typing studies). Each region was analyzed as a single run with 10,000,000 iterations, sampling every 5000th iteration and discarding the first third of all samples as burn-in. The mean posterior rate for each SNP interval is the value reported. Because of the non-independence of the composite likelihood scheme, the quantiles of the sampling distribution do not reflect true uncertainty and are therefore not given. 
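The sampling arithmetic described above (10,000,000 iterations, every 5000th iteration kept, first third of samples discarded) and the reported posterior mean can be sketched as follows. This is an illustration only and does not call LDhat: the function names are hypothetical, and rounding the burn-in down with integer division is an assumption.

```python
from typing import List, Tuple


def sample_counts(iterations: int = 10_000_000, thin: int = 5000) -> Tuple[int, int, int]:
    """Return (total samples, burn-in samples, retained samples)."""
    total = iterations // thin
    burn_in = total // 3          # first third discarded; floor is an assumption
    return total, burn_in, total - burn_in


def posterior_mean_rates(samples: List[List[float]], burn_in: int) -> List[float]:
    """Mean rate per SNP interval over the samples kept after burn-in.

    Each element of `samples` is one MCMC sample: a list with one rate
    per SNP interval.
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    n_intervals = len(kept[0])
    return [sum(sample[j] for sample in kept) / len(kept) for j in range(n_intervals)]
```

Under these assumptions, each run yields 2,000 samples, of which 666 are discarded and 1,334 contribute to the mean.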
Estimates were generated separately from each of the four HapMap populations, and then combined to give a single figure. Differences between populations are not significant. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The HapMap data are based on HapMap release 16a; the Perlegen data are from Hinds et al. (2005). The recombination rates were ascertained by Simon Myers from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., Cox, D.R. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 307(5712), 1072-1079 (2005). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). 
Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). snpRecombRatePerlegen Perlegen Oxford Recombination Rates from Perlegen Data Variation and Repeats snpRecombRateHapmap HapMap Phase I Oxford Recombination Rates from HapMap Phase I Release 16c.1 Variation and Repeats cnp Structural Var Structural Variation Variation and Repeats Description This annotation shows regions detected as putative copy number polymorphisms (CNP) and sites of detected intermediate-sized structural variation (ISV). The CNPs and ISVs were determined by various methods, displayed in individual subtracks within the annotation: BAC microarray analysis (Sharp): 154 putative CNP regions detected by BAC microarray analysis in a population of 47 individuals comprised of 8 Chinese, 4 Japanese, 10 Czech, 2 Druze, 7 Biaka, 9 Mbuti, and 7 Amerindians. BAC microarray analysis (Iafrate): 249 putative CNP regions detected by BAC microarray analysis in a population of 55 individuals, 16 of whom had previously-characterized chromosomal abnormalities. The group consisted of 10 Caucasians, 4 Amerindians, 2 Chinese, 2 Indo-Pakistani, 2 Sub-Saharan African, and 35 of unknown ethnic origin. Representational oligonucleotide microarray analysis (ROMA) (Sebat): 72 putative CNP regions detected by ROMA in a population of 20 normal individuals comprised of 1 Biaka, 1 Mbuti, 1 Druze, 1 Melanesian, 4 French, 1 Venezuelan, 1 Cambodian, 1 Mayan and 9 of unknown ethnicity. Fosmid mapping (Tuzun): 297 ISV sites detected by mapping paired-end sequences from a human fosmid DNA library. Deletions from genotype analysis (McCarroll): 538 deletions detected by analysis of SNP genotypes, using the HapMap Phase I data, release 16a. 
Deletions from genotype analysis (Conrad): 910 deletions detected by analysis of SNP genotypes, using the HapMap Phase I data, release 16c.1, CEU and YRI samples. Deletions from haploid hybridization analysis (Hinds): 100 deletions from haploid hybridization analysis in 24 unrelated individuals from the Polymorphism Discovery Resource, selected for SNP LD study. SNP and BAC microarray analysis of HapMap data (Redon): 1,447 copy number variable regions found in the HapMap Phase II data. Display Conventions and Configuration CNP and ISV regions are indicated by solid blocks that are color-coded to indicate the type of variation detected: Green: gain (duplications) Red: loss (deletions) Blue: gain and loss (both deletion and duplication) Black: inversion Gray: gain or loss (unknown direction) Note that display IDs are not preserved between assemblies. Sharp subtrack On the details pages for elements in this subtrack, the table shows value/threshold data for each individual in the population. "Value" is defined as the log2 ratio of fluorescence intensity of test versus reference DNA. "Threshold" is defined as 2 standard deviations from the mean log2 ratio of all autosomal clones per hybridization. The "Disease Percent" value reflects the percent of the BAC that lies within a "rearrangement hotspot", as defined in Sharp et al. (2005). A rearrangement hotspot is defined by the presence of flanking intrachromosomal duplications >10 kb in length with >95% similarity and separated by 50 kb - 10 Mb of intervening sequence. Tuzun subtrack Items are labeled using the following naming convention: First letter: rearrangement type (D=deletion, I=insertion, V=inversion). Second letter: association with repeat or duplication (R=human-specific repeat, D=duplication, N=neither (unique)). Third letter: second haplotype support (N=variant site lacking support from the human genome reference, S=variant site with support from the human genome reference). 
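The three-letter Tuzun naming convention above can be decoded with a small lookup, sketched here for illustration. The function name is hypothetical, and the sketch reads only the first three characters of a label; whether real item names carry anything after those letters is not stated in the description and is treated as an open assumption.

```python
from typing import Dict

REARRANGEMENT_TYPE = {"D": "deletion", "I": "insertion", "V": "inversion"}
REPEAT_ASSOCIATION = {
    "R": "human-specific repeat",
    "D": "duplication",
    "N": "neither (unique)",
}
SECOND_HAPLOTYPE_SUPPORT = {
    "N": "variant site lacking support from the human genome reference",
    "S": "variant site with support from the human genome reference",
}


def decode_tuzun_label(label: str) -> Dict[str, str]:
    """Translate the first three letters of a Tuzun item label into plain text."""
    if len(label) < 3:
        raise ValueError("label must contain at least three letters")
    first, second, third = label[0], label[1], label[2]
    try:
        return {
            "rearrangement": REARRANGEMENT_TYPE[first],
            "association": REPEAT_ASSOCIATION[second],
            "support": SECOND_HAPLOTYPE_SUPPORT[third],
        }
    except KeyError as err:
        raise ValueError(f"unrecognized letter {err.args[0]!r} in label {label!r}") from None
```

For example, a label beginning with `VNS` would decode to an inversion in unique sequence with support from the human genome reference.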
Conrad subtrack The method used to identify these deletions approximates the breakpoints of each event; therefore, a set of minimal and maximal endpoints is associated with each deletion. Thick lines delineate the minimally deleted region; thin lines delineate the maximally deleted region. Methods Sharp BAC microarray analysis All hybridizations were performed in duplicate incorporating a dye-reversal using a custom array consisting of 2,194 end-sequence or FISH-confirmed BACs, targeted to regions of the genome flanked by segmental duplications. The false positive rate was estimated at ~3 clones per 4,000 tested. Note that CNP intervals, as detailed by Sharp et al., were converted from the July 2003 human genome assembly (NCBI Build 34) to the May 2004 assembly (NCBI Build 35) using BLAT alignments of BAC End pairs and the UCSC liftOver tool. Iafrate BAC microarray analysis All hybridizations were performed in duplicate incorporating a dye-reversal using proprietary 1 Mb GenomeChip V1.2 Human BAC Arrays consisting of 2,632 BAC clones (Spectral Genomics, Houston, TX). The false positive rate was estimated at ~1 clone per 5,264 tested. Further information is available from the Database of Genomic Variants website. Note that CNP intervals, as detailed by Iafrate et al., were converted from the July 2003 human genome assembly (NCBI Build 34) to the May 2004 assembly (NCBI Build 35) using the UCSC liftOver tool. Sebat ROMA Following digestion with BglII or HindIII, genomic DNA was hybridized to a custom array consisting of 85,000 oligonucleotide probes. The probes were selected to be free of common repeats and have unique homology within the human genome. The average resolution of the array was ~35 kb; however, only intervals in which three consecutive probes showed concordant signals were scored as CNPs. All hybridizations were performed in duplicate incorporating a dye-reversal, with the false positive rate estimated to be ~6%. 
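The ROMA scoring rule above, in which only intervals with three consecutive concordant probes count as CNPs, can be sketched as a run-length scan. This is an illustration under stated assumptions, not the authors' pipeline: the per-probe calls (+1 for gain, -1 for loss, 0 for no change) are taken as given input, and the function name is hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def concordant_runs(calls: Sequence[int], min_run: int = 3) -> List[Tuple[int, int, int]]:
    """Find runs of identical non-zero calls spanning at least `min_run` probes.

    Returns (first probe index, last probe index, call) for each run,
    with probes assumed to be listed in genomic order.
    """
    runs: List[Tuple[int, int, int]] = []
    position = 0
    for call, group in groupby(calls):
        length = sum(1 for _ in group)
        if call != 0 and length >= min_run:
            runs.append((position, position + length - 1, call))
        position += length
    return runs
```

Requiring several adjacent probes to agree trades resolution for specificity, which is consistent with the description's note that the scored intervals are coarser than single-probe spacing.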
Note that CNP intervals, as detailed by Sebat et al., were converted from the April 2003 human genome assembly (NCBI Build 33) to the July 2003 assembly (NCBI Build 34) and the May 2004 assembly (NCBI Build 35) using the UCSC liftOver tool. Tuzun fosmid mapping Paired-end sequences from a human fosmid DNA library were mapped to the assembly. The average resolution of this technique was ~8 kb, and included 56 sites of inversion not detectable by the array-based approaches. However, because of the physical constraints of fosmid insert size, this technique was unable to detect insertions greater than 40 kb in size. McCarroll genotype analysis A segregating deletion can leave "footprints" in SNP genotype data, including apparent deviations from Mendelian inheritance, apparent deviations from Hardy-Weinberg equilibrium and null genotypes. Using these clues to discover true variants is challenging, however, because the vast majority of such observations represent technical artifacts and genotyping errors. To determine whether a subset of "failed" SNP genotyping assays in the HapMap data might reflect structural variation, the authors examined whether such failures were physically clustered in a manner that is specific to individuals. Consistent with this hypothesis, the rate of Mendelian-inconsistent genotypes was elevated near other Mendelian-inconsistent genotypes in the same individual but was unrelated to Mendelian inconsistencies in other individuals. The authors systematically looked for regions of the genome in which the same failure profile appeared repeatedly at nearby markers in a manner that was statistically unexpected based on chance. A set of statistical thresholds was tailored to each mode of failure, genotyping center and genotyping platform used in the project. The same procedure could readily apply to dense SNP data from any platform or study. Note that deletions as detailed by McCarroll et al. 
were converted from the July 2003 human genome assembly (NCBI Build 34) to the May 2004 assembly (NCBI Build 35) using the UCSC liftOver tool. Conrad genotype analysis SNPs in regions that are hemizygous for a deletion are generally miscalled as homozygous for the allele that is present. Hence, when a deletion is transmitted from parent to child, the genotypes at SNPs within the deletion region will often appear to violate the rules of Mendelian transmission. The authors developed a simple algorithm for scanning trio data for unusual runs of consecutive SNPs that, in a single family, have genotype configurations consistent with the presence of a deletion. Note that deletions as detailed by Conrad et al. were converted from the July 2003 human genome assembly (NCBI Build 34) to the May 2004 assembly (NCBI Build 35) using the UCSC liftOver tool. Hinds haploid hybridization analysis Approximately 600 Mb of genomic DNA from 24 unrelated individuals were obtained from the Polymorphism Discovery Resource. Haploid hybridization was used to identify genomic intervals showing a reduced hybridization signal in comparison to the reference assembly. PCR amplification was performed on 215 candidate deletions. Of these, 100 unambiguously confirmed deletions were selected. Redon analysis of HapMap data Experiments were performed with the International HapMap DNA and cell-line collection using two technologies: comparative analysis of hybridization intensities on Affymetrix GeneChip Human Mapping 500K early access arrays (500K EA) and comparative genomic hybridization with a Whole Genome TilePath (WGTP) array. Validation McCarroll genotype analysis Four methods of validation were used: fluorescent in situ hybridization (FISH), two-color fluorescence intensity measurements, PCR amplification and quantitative PCR. The authors performed fluorescent in situ hybridization for five candidate deletions large enough to span available FISH probes.
In all five cases, FISH assays confirmed the deletions in the predicted individuals. The authors examined two-color allele-specific fluorescence data from SNP genotyping assays from a data subset available at the Broad Institute, looking for a reduction in fluorescence intensity in individuals predicted to carry a deletion. At most SNPs in the genome, fluorescence intensity measurements clustered into two or three discrete groups corresponding to homozygous and heterozygous genotypes. At 15 of 17 candidate deletion loci, fluorescence intensity data for one or more SNPs clustered into additional groups that corresponded to the predicted deletion genotypes. The authors used PCR amplification to query 60 loci for which the pattern of genotypes suggested multiple individuals with homozygous deletions. Variants were considered confirmed if the pattern of amplification success and failure matched prediction across a set of 12-24 individuals. The authors confirmed 51 of 60 candidate variants by this criterion. The authors performed quantitative PCR in all 269 HapMap DNA samples for 11 candidate deletions that overlapped the coding exons of genes and that were discovered in many individuals. At 10 of 11 loci, the authors observed three discrete clusters, identifying individuals with zero, one and two gene copies. All 60 trios displayed Mendelian inheritance for the ten deletions, as well as Hardy-Weinberg equilibrium in all four populations surveyed, and transmission rates close to 50%. This suggests that the deletions behave as stable, heritable genetic polymorphisms. Conrad genotype analysis The authors first tested 12 predicted deletions using quantitative PCR. For all 12 deletions, DNA concentrations consistent with transmission of a deletion from parent to child were observed.
To provide more extensive validation by comparative genome hybridization (CGH), the authors designed a custom oligonucleotide microarray comprising 380,000 probes that tile across all 134 candidate deletions identified in 9 HapMap offspring (8 YRI and 1 CEU). The results of this CGH analysis indicate that the majority (about 85%) of candidate deletions detected by the method are real. Redon analysis of HapMap data The authors utilized numerous quality measures, including repeated experiments on the WGTP array for 82 individuals and on the 500K EA array for 15 individuals. The average false-positive rate per experiment was held beneath 5%. Aberrant chromosomes were removed from the analysis. Further details are available in the Nature paper cited below. References Conrad, D., Andrews, T.D., Carter, N.P., Hurles, M.E., Pritchard, J.K. A high-resolution survey of deletion polymorphism in the human genome. Nature Genet 38(1), 75-81 (2006). Hinds, D., Kloek, A.P., Jen, M., Chen, X., Frazer, K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nature Genet 38(1), 82-85 (2006). Iafrate, J.A., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W. and Lee, C. Detection of large-scale variation in the human genome. Nature Genet 36(9), 949-51 (2004). McCarroll, S.A., Hadnott, T.N., Perry, G.H., Sabeti, P.C., Zody, M.C., Barrett, J.C., Dallaire, S., Gabriel, S., Lee, C., Daly, M.J., Altshuler, D.M. Common deletion polymorphisms in the human genome. Nature Genet 38(1), 86-92 (2006). Redon, R., Ishikawa, S., Fitch, K., Feuk, L., Perry, G., Andrews, T., Fiegler, H., Lee, C., Jones, K., Scherer, S., Hurles, M. et al. Global variation in copy number in the human genome. Nature 444(7118), 444-454 (2006). Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. Large-scale copy number polymorphism in the human genome. Science 305(5683), 525-8 (2004). 
Sharp, A.J., Locke, D.P., McGrath, S.D., Cheng, Z., Bailey, J.A., Samonte, R.V., Pertz, L.M., Clark, R.A., Schwartz, S., Segraves, R. et al. Segmental duplications and copy number variation in the human genome. Am J Hum Genet 77(1), 78-88 (2005). Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A., Pertz, L.M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D. et al. Fine-scale structural variation of the human genome. Nature Genet 37(7), 727-32 (2005). cnpRedon Redon CNPs Copy Number Polymorphisms from SNP and BAC microarrays (Redon) Variation and Repeats delHinds Hinds Dels Deletions from Haploid Hybridization Analysis (Hinds) Variation and Repeats delConrad Conrad Dels Deletions from Genotype Analysis (Conrad) Variation and Repeats delMccarroll McCarroll Dels Deletions from Genotype Analysis (McCarroll) Variation and Repeats cnpFosmid Tuzun Fosmids Structural Variation identified by Fosmids (Tuzun) Variation and Repeats cnpSebat Sebat CNPs Copy Number Polymorphisms from ROMA (Sebat) Variation and Repeats Description This track shows 81 regions detected as putative copy number polymorphisms by representational oligonucleotide microarray analysis (ROMA) in a population of 20 normal individuals. Methods Following digestion with BglII or HindIII, genomic DNA was hybridized to a custom array consisting of 85,000 oligonucleotide probes. The probes were selected to be free of common repeats and have unique homology within the human genome. The average resolution of the array is ~35 kb; however, only intervals in which 3 consecutive probes showed concordant signal were scored as CNPs. All hybridizations were performed in duplicate incorporating a dye-reversal, with the false positive rate estimated to be ~6%. Note that CNP intervals as detailed by Sebat et al. (2004) were converted from the April 2003 (build33) into the July 2003 (build34) assembly using liftOver. 
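The three-consecutive-probe rule described in the Methods above can be illustrated with a minimal sketch. This is not the published ROMA analysis: the function name call_cnp_intervals, the per-probe encoding (+1 for gain, -1 for loss, 0 for no change) and the input layout are assumptions made only for illustration.

```python
from itertools import groupby

def call_cnp_intervals(probes, min_run=3):
    """Return (start, end, call) for runs of concordant non-zero probe calls.

    probes  -- list of (position, call) tuples sorted by position, where call
               is +1 (gain), -1 (loss) or 0 (no change); this encoding is a
               hypothetical simplification of the array signal.
    min_run -- minimum number of consecutive concordant probes (3 in the text).
    """
    intervals = []
    # groupby collects consecutive probes that share the same call.
    for call, group in groupby(probes, key=lambda p: p[1]):
        run = list(group)
        if call != 0 and len(run) >= min_run:
            intervals.append((run[0][0], run[-1][0], call))
    return intervals

example = [(100, 0), (200, 1), (300, 1), (400, 1),
           (500, 0), (600, -1), (700, -1), (800, 0)]
cnps = call_cnp_intervals(example)
```

In this toy input the three gain probes form one interval, while the two loss probes are too few to be scored.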
References Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M (2004) Large-Scale Copy Number Polymorphism in the Human Genome. Science 305:525-528 cnpIafrate Iafrate CNPs Copy Number Polymorphisms from BAC Microarray Analysis (Iafrate) Variation and Repeats Description This track shows 255 regions detected as putative copy number polymorphisms by BAC microarray analysis in a population of 55 individuals, 16 of whom had previously characterized chromosome abnormalities. Methods Hybridizations were all performed in duplicate incorporating a dye-reversal using proprietary 1 Mb GenomeChip V1.2 Human BAC Arrays consisting of 2,632 BAC clones (Spectral Genomics, Houston, TX). The false positive rate was estimated as ~1 clone per 5,264 tested. Further information is available at http://projects.tcag.ca/variation. References Iafrate JA, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C (2004) Detection of large-scale variation in the human genome. Nature Genet 36:949-951 cnpSharp Sharp CNPs Copy Number Polymorphisms from BAC Microarray Analysis (Sharp) Variation and Repeats Description This track shows 160 regions detected as putative copy number polymorphisms by BAC microarray analysis in a population of 47 individuals, comprising 8 Chinese, 4 Japanese, 10 Czech, 2 Druze, 7 Biaka, 9 Mbuti, and 7 Amerindians. Methods Hybridizations were all performed in duplicate incorporating a dye-reversal using a custom array consisting of 2,194 end-sequence or FISH-confirmed BACs, targeted to regions of the genome flanked by segmental duplications. The false positive rate was estimated as ~3 clones per 4,000 tested. References Sharp A.J., Locke D.P., McGrath S.D., Cheng Z., Bailey J.A., Samonte R.V., Pertz L.M., Clark R.A., Schwartz S., Segraves R., Oseroff V.V., Albertson D.G., Pinkel D. 
and Eichler E.E. Segmental duplications and copy number variation in the human genome. Am J Hum Genet 77(1), 78-88 (2005). tajD Tajima's D Tajima's D Variation and Repeats Description This track shows Tajima's D (Tajima, 1989), a measure of nucleotide diversity, estimated from the Perlegen data set (Hinds et al., 2005). Tajima's D is a statistic used to compare an observed nucleotide diversity against the expected diversity under the assumptions that all polymorphisms are selectively neutral and that population size is constant. Methods Tajima's D was estimated in 100 kbp sliding windows across the autosomal genome, reporting the Tajima's D measure at the central 10 kbp of the window and stepping by 10 kbp. Thus, the Tajima's D for the window chr1:100,001-200,000 is reported at coordinates chr1:145,001-155,000, the Tajima's D for the window chr1:110,001-210,000 is reported at coordinates chr1:155,001-165,000, and so forth. The theoretical distribution of Tajima's D (95% c.i. between -2 and +2) assumes that polymorphism ascertainment is independent of allele frequency. High values of Tajima's D suggest an excess of common variation in a region, which can be consistent with balancing selection or population contraction. Negative values of Tajima's D, on the other hand, indicate an excess of rare variation, consistent with population growth or positive selection. Population admixture can lead to either high or low Tajima's D values in theory. Demographic parameters would be expected to affect the genome more evenly than selective pressures, so previous analyses have suggested that using the empiric distribution of Tajima's D from a collection of regions across the genome provides advantages in assessing whether selection or demography might explain an observed deviation from expectation. Because of the ascertainment bias toward common polymorphism in the Perlegen data set, positive Tajima's D values are difficult to interpret, and modeling ascertainment is difficult. 
However, given that the ascertainment bias raises the mean of the distribution, extreme negative values in extended regions can be useful in qualitatively identifying interesting regions for full resequencing and more rigorous theoretical analysis of nucleotide diversity. For further discussion, see Carlson et al. (2005). In full display mode, this track shows the nucleotide diversity across three human populations: 23 individuals of African American Descent (AD), 24 individuals of European Descent (ED) and 24 individuals of Chinese Descent (XD), as well as the polymorphic sites within each population used to estimate nucleotide diversity. Only SNPs observed to be polymorphic within each subpopulation were used in the Tajima's D calculation. Nucleotide diversity is shown in dense display mode using a grayscale density gradient, with light colors indicating low diversity. Credits This track was created at the University of Washington using gfetch from the Nickerson Laboratory and the R statistical software package. References Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585-595 (1989). Carlson, C.S., Thomas, D.J., Eberle, M., Livingston, R., Rieder, M., Nickerson, D.A. Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res 15, 1553-65 (2005). tajdXd Tajima's D XD Tajima's D from Chinese Descent Variation and Repeats tajdEd Tajima's D ED Tajima's D from European Descent Variation and Repeats tajdAd Tajima's D AD Tajima's D from African Descent Variation and Repeats tajdSnp Tajima's D SNPs Tajima's D SNPs Variation and Repeats Description This track shows the SNPs that were used in the calculation of Tajima's D (Tajima, 1989), a measure of nucleotide diversity, estimated from the Perlegen data set (Hinds et al., 2005). 
Tajima's D is a statistic used to compare an observed nucleotide diversity against the expected diversity under the assumptions that all polymorphisms are selectively neutral and that population size is constant. Methods See the Tajima's D track or Carlson et al. for more details on the use of this track. Credits This track was created at the University of Washington using gfetch from the Nickerson Laboratory and the R statistical software package. References Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585-595 (1989). Carlson, C.S., Thomas, D.J., Eberle, M., Livingston, R., Rieder, M., Nickerson, D.A. Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res 15, 1553-65 (2005). tajdSnpXd SNPs XD SNPs from Chinese Descent Variation and Repeats tajdSnpEd SNPs ED SNPs from European Descent Variation and Repeats tajdSnpAd SNPs AD SNPs from African Descent Variation and Repeats encodeYaleMASPlacRNATransMap Yale MAS RNA Yale Maskless Array Synthesizer, RNA Transcript Map ENCODE Transcript Levels Description This track shows the forward (+) and reverse (-) strand transcript map of intensity scores (estimating RNA abundance) for human NB4 cell total RNA, and human placental Poly(A)+ RNA, hybridized to the Yale MAS (Maskless Array Synthesizer) ENCODE oligonucleotide microarray, transcription mapping design #1. This array has 36-mer oligonucleotide probes approximately every 36 bp (i.e. end-to-end) covering all the non-repetitive DNA sequence of the ENCODE regions ENm001-ENm012. See NCBI GEO GPL2105 for details of this array design. This transcript map is a combined signal from three biological replicates, each with at least two technical replicates. Arrays were hybridized using either the standard Nimblegen protocol or the protocol described in Bertone et al. (2004). The label of each subtrack in this annotation indicates the specific protocol used for that particular data set. 
Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods A score was assigned to each oligonucleotide probe position by combining two or more technical replicates and by using a sliding window approach. Within a sliding window of 160 bp (corresponding to 5 oligos), the hybridization intensities for all replicates of each oligonucleotide probe were compared to their respective array median score. Within the window and across all the replicates, the number of probes above and below their respective median were counted. Using the sign test, a one-sided P-value was then calculated and a score defined as score=-log(P-value) was assigned to the oligo in the center of the window. Three independent biological replicates were generated and each was hybridized to at least 2 different arrays (technical replicates). Verification Reasonable correlation coefficients between replicates were ensured. Additionally, transcribed regions (TARs/transfrags) were called and compared between technical and biological replicates to ensure significant overlap. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306(5705), 2242-6 (2004). 
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308(5725), 1149-54 (2005). Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P. and Gingeras, T.R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-9 (2002). Kluger, Y., Tuck, D.P., Chang, J.T., Nakayama, Y., Poddar, R., Kohya, N., Lian, Z., Ben Nasr, A., Halaban, H.R. et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A 101(17), 6508-13 (2004). Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P. et al. The transcriptional activity of human Chromosome 22. Genes Dev 17(4), 529-40 (2003). encodeYaleRnaSuper Yale RNA Yale RNA (Neutrophil, Placenta and NB4 cells) ENCODE Transcript Levels Overview This super-track combines related tracks from Yale Transcript Map analysis. These tracks contain transcriptome data from different cell lines and biological samples as well as analysis of transcriptionally active regions (TARs). Experiments were performed with the Yale MAS (Maskless Array Synthesizer) ENCODE oligonucleotide microarray (see NCBI GEO GPL2105 for details of this array design) as well as the Affymetrix ENCODE oligonucleotide microarray. Multiple biological samples were assayed, such as total RNA from human NB4 cells. Experiments also included chemical treatments such as retinoic acid (RA) treatments. Credits Yale MAS RNA, Yale MAS TAR These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. Yale RNA, Yale TAR These data were generated and analyzed by the Yale/Affymetrix collaboration among the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University and Tom Gingeras at Affymetrix. 
Yale RACE These data were generated and analyzed by the lab of Mark Gerstein at Yale University. References Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004 Dec 24;306(5705):2242-6. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005 May 20;308(5725):1149-54. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002 May 3;296(5569):916-9. Kluger Y, Tuck DP, Chang JT, Nakayama Y, Poddar R, Kohya N, Lian Z, Ben Nasr A, Halaban HR et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A. 2004 Apr 27;101(17):6508-13. Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P et al. The transcriptional activity of human Chromosome 22. Genes Dev. 2003 Feb 15;17(4):529-40. 
encodeYaleMASPlacRNATransMapRevMless36mer36bp Yale Plc BtR RNA Yale Placenta RNA Trans Map, MAS Array, Reverse Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNATransMapFwdMless36mer36bp Yale Plc BtF RNA Yale Placenta RNA TransMap, MAS array, Forward Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTMREVMless36mer36bp Yale Plc NgR RNA Yale Placenta RNA Trans Map, MAS Array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTMFWDMless36mer36bp Yale Plc NgF RNA Yale Placenta RNA Trans Map, MAS Array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANprotTMREVMless36mer36bp Yale NB4 NgR RNA Yale NB4 RNA Trans Map, MAS Array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANprotTMFWDMless36mer36bp Yale NB4 NgF RNA Yale NB4 RNA Trans Map, MAS Array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASPlacRNATars Yale MAS TAR Yale Maskless Array Synthesizer, RNA Transcriptionally Active Regions ENCODE Transcript Levels Description This track shows the locations of forward (+) and reverse (-) strand transcriptionally-active regions (TARs)/transcribed fragments (transfrags), for human NB4 cell total RNA and for human placenta Poly(A)+ RNA, hybridized to the Yale Maskless Array Synthesizer (MAS) ENCODE oligonucleotide microarray, transcription mapping design #1. This array has 36-mer oligonucleotide probes approximately every 36 bp (i.e. end-to-end) covering all the non-repetitive DNA sequence of the ENCODE regions ENm001 - ENm012. See NCBI GEO accession GPL2105 for details of this array design. These TARs/transfrags are based on a transcript map combining hybridization intensities from three biological replicates, each with at least two technical replicates. Arrays were hybridized using either Nimblegen standard protocol, or the protocol described in Bertone et al. (2004). 
The label of each subtrack in this annotation indicates the specific protocol used for that particular data set. Methods A score was assigned to each oligonucleotide probe position by combining two or more technical replicates and by using a sliding window approach. Within a sliding window of 160 bp (corresponding to 5 oligos), the hybridization intensities for all replicates of each oligonucleotide probe were compared to their respective array median intensity. Within the window and across all the replicates, the number of probes above and below their respective median was counted. Using the sign test, a one-sided P-value was then calculated and a score defined as score=-log(p-value) was assigned to the oligo in the center of the window. Three independent biological replicates were generated, and each was hybridized to at least two different arrays (technical replicates). Transcribed regions (TARs/transfrags) were then identified using a score threshold of the 95th percentile as well as a maximum gap of 80 bp and a minimum run of 50 bp (between oligonucleotide positions), effectively allowing a gap of one oligo and requiring the TAR/transfrag to encompass at least 3 oligos. Verification Transcribed regions (TARs/transfrags), as determined by individual biological samples, were compared to ensure significant overlap. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR, Large-scale transcriptional activity in chromosomes 21 and 22, Science. 2002 May 3;296(5569):916-9. Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P, Gerstein M, Weissman S, Snyder M, The transcriptional activity of human Chromosome 22, Genes Dev, 2003 Feb 15;17(4):529-40. 
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M, Global identification of human transcribed sequences with genome tiling arrays, Science. 2004 Dec 24;306(5705):2242-6. Epub 2004 Nov 11. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR, Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution, Science. 2005 May 20;308(5725):1149-54. Epub 2005 Mar 24. encodeYaleMASPlacRNATarsRevMless36mer36bp Yale Plc BtR TAR Yale Placenta RNA TARs, MAS array, Reverse Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNATarsFwdMless36mer36bp Yale Plc BtF TAR Yale Placenta RNA TARs, MAS array, Forward Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTarsREVMless36mer36bp Yale Plc NgR TAR Yale Placenta RNA TARs, MAS array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTarsFWDMless36mer36bp Yale Plc NgF TAR Yale Placenta RNA TARs, MAS array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANProtTarsREVMless36mer36bp Yale NB4 NgR TAR Yale NB4 RNA TARs, MAS array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANProtTarsFWDMless36mer36bp Yale NB4 NgF TAR Yale NB4 RNA TARs, MAS array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeAffyEcSites Affy EC Sites Affymetrix ENCODE Extension Transcription Sites ENCODE Transcript Levels Description This track shows the location of sites showing transcription (transfrags) for chromosomes 21 and 22 for 5 cell lines and 11 tissues. 
The 5 cell lines used were: GM06990, HepG2, K562, HeLaS3 and Tert-BJ; the 11 tissues used were: cerebellum, brain frontal lobe, hippocampus, hypothalamus, fetal spleen, fetal kidney, fetal thymus, ovary, placenta, prostate and testis. Purified cytosolic polyA+ RNA from GM06990, HepG2 and Tert-BJ cell lines, as well as purified polyA+ RNA from whole-cell extracts of the remaining cell lines and tissues, were hybridized to Affymetrix Chromosome 21_22_v2 oligonucleotide tiling arrays, which have 25-mer probes spaced on average every 17 bp (center-center of each 25mer) in the non-repetitive regions of human chromosomes 21 and 22. Clustered sites are shown in separate subtracks for each cell line and tissue type. Data for all biological replicates can be downloaded from Affymetrix in wig, BED, and cel formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 330. Using two different approaches, (i) no sliding window and (ii) a sliding 51-bp window centered on each probe, an estimate of RNA abundance (signal) was computed by calculating the median of all pairwise average PM-MM values, where PM is a perfect match and MM is a mismatch. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods. The latter also describes the analytical methods. Verification Single biological replicates were generated and hybridized to duplicate arrays (two technical replicates). 
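The signal estimate described in the Methods above (the median of all pairwise average PM-MM values) can be sketched as follows. One reading of that phrase is the Hodges-Lehmann pseudomedian; that interpretation, the inclusion of each value paired with itself, and the function names pseudomedian and window_signal are assumptions for illustration rather than the Affymetrix implementation.

```python
from itertools import combinations_with_replacement
from statistics import median

def pseudomedian(values):
    """Median of all pairwise averages (including each value with itself)."""
    pairwise_means = [(a + b) / 2
                      for a, b in combinations_with_replacement(values, 2)]
    return median(pairwise_means)

def window_signal(pm, mm):
    """Estimate signal from perfect-match (pm) and mismatch (mm) intensities.

    pm and mm are equal-length lists holding the intensities of every probe
    pair that falls in the window, pooled over replicates.
    """
    differences = [p - m for p, m in zip(pm, mm)]
    return pseudomedian(differences)

# Three probe pairs inside one hypothetical window.
signal = window_signal([10, 20, 40], [5, 5, 5])
```

With no sliding window, the lists would hold only the replicate measurements of a single probe pair; with the 51-bp window, they would also include neighboring probes.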
Transcribed regions (see the Affy RNA Signal track) were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 25 basepairs and minimum run (MinRun) of 25 basepairs. Credits These data were generated and analyzed by the collaboration of the following groups: the Tom Gingeras group at Affymetrix, Roderic Guigo group at Centre de Regulacio Genomica, Alexandre Reymond group at the University of Lausanne and Stylianos Antonarakis group at the University of Geneva. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003 Jan 22;19(2):185-93. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004 Feb 20;116(4):499-509. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002 May 3;296(5569):916-9. encodeAffyEcSuper Affy EC Affymetrix ENCODE Extension Transcription ENCODE Transcript Levels Overview This super-track combines related tracks of the ENCODE Extension data generated by Affymetrix. There are two member tracks: Affymetrix ENCODE Extension Transcription Sites: the transcribed fragments (transfrags) based on the signal. Affymetrix ENCODE Extension Transcription Signal: RNA abundance signal. 
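The transfrags listed above are built by merging probe positions using a signal cutoff, a maximum gap (MaxGap) and a minimum run (MinRun), as described for the Sites track. A minimal sketch of that kind of merging follows. The function name call_transfrags, the numeric threshold argument (standing in for the cutoff derived from the 5% false positive rate in negative controls) and the choice to measure run length between the first and last probe positions are assumptions for illustration.

```python
def call_transfrags(positions, signals, threshold, max_gap=25, min_run=25):
    """Merge above-threshold probe positions into (start, end) regions.

    positions -- sorted genomic coordinates of probe centers
    signals   -- signal value at each position
    threshold -- hypothetical signal cutoff for a "positive" probe
    max_gap   -- largest distance allowed between adjacent positive probes
    min_run   -- smallest span (end - start) for a region to be reported
    """
    regions = []
    start = end = None
    for pos, sig in zip(positions, signals):
        if sig < threshold:
            continue  # only positive probes can start or extend a region
        if start is None:
            start = end = pos
        elif pos - end <= max_gap:
            end = pos  # close enough: extend the current region
        else:
            if end - start >= min_run:
                regions.append((start, end))
            start = end = pos  # gap too large: begin a new region
    if start is not None and end - start >= min_run:
        regions.append((start, end))
    return regions

positions = [0, 17, 34, 51, 120, 137, 300]
signals = [9, 9, 9, 9, 9, 9, 9]
transfrags = call_transfrags(positions, signals, threshold=3)
```

Here the first four probes merge into one region, while the later, shorter clusters fall below the minimum run and are dropped.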
Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 330. Using two different approaches, (i) no sliding window and (ii) a sliding 51-bp window centered on each probe, an estimate of RNA abundance (signal) was computed by calculating the median of all pairwise average PM-MM values, where PM is a perfect match and MM is a mismatch. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods. The latter also describes the analytical methods. Verification Single biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Transcribed regions were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 25 basepairs and minimum run (MinRun) of 25 basepairs (see the Affy TransFrags track for the merged regions). Credits These data were generated and analyzed by the collaboration of the following groups: the Tom Gingeras group at Affymetrix, Roderic Guigo group at Centre de Regulacio Genomica, Alexandre Reymond group at the University of Lausanne and Stylianos Antonarakis group at the University of Geneva. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003 Jan 22;19(2):185-93. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004 Feb 20;116(4):499-509. 
Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002 May 3;296(5569):916-9. encodeAffyEc51TertBJSites EC51 Site TertBJ Affy Ext Trans Sites (51-base window) (Tert-BJ) ENCODE Transcript Levels encodeAffyEc1TertBJSites EC1 Sites TertBJ Affy Ext Trans Sites (1-base window) (Tert-BJ) ENCODE Transcript Levels encodeAffyEc51K562Sites EC51 Site K562 Affy Ext Trans Sites (51-base window) (K562) ENCODE Transcript Levels encodeAffyEc1K562Sites EC1 Sites K562 Affy Ext Trans Sites (1-base window) (K562) ENCODE Transcript Levels encodeAffyEc51HepG2Sites EC51 Site HepG2 Affy Ext Trans Sites (51-base window) (HepG2) ENCODE Transcript Levels encodeAffyEc1HepG2Sites EC1 Sites HepG2 Affy Ext Trans Sites (1-base window) (HepG2) ENCODE Transcript Levels encodeAffyEc51GM06990Sites EC51 Site GM0699 Affy Ext Trans Sites (51-base window) (GM06990) ENCODE Transcript Levels encodeAffyEc1GM06990Sites EC1 Sites GM0699 Affy Ext Trans Sites (1-base window) (GM06990) ENCODE Transcript Levels encodeAffyEc51HeLaC1S3Sites EC51 Site HeLa Affy Ext Trans Sites (51-base window) (HeLa C1S3) ENCODE Transcript Levels encodeAffyEc1HeLaC1S3Sites EC1 Sites HeLa Affy Ext Trans Sites (1-base window) (HeLa C1S3) ENCODE Transcript Levels encodeAffyEc51OvarySites EC51 Site Ovary Affy Ext Trans Sites (51-base window) (Ovary) ENCODE Transcript Levels encodeAffyEc1OvarySites EC1 Sites Ovary Affy Ext Trans Sites (1-base window) (Ovary) ENCODE Transcript Levels encodeAffyEc51ProstateSites EC51 Site Prost Affy Ext Trans Sites (51-base window) (Prostate) ENCODE Transcript Levels encodeAffyEc1ProstateSites EC1 Sites Prost Affy Ext Trans Sites (1-base window) (Prostate) ENCODE Transcript Levels encodeAffyEc51FetalTestisSites EC51 Site FetalT Affy Ext Trans Sites (51-base window) (Fetal Testis) ENCODE Transcript Levels encodeAffyEc1FetalTestisSites EC1 Sites FetalT Affy Ext Trans Sites (1-base window) (Fetal Testis) 
ENCODE Transcript Levels encodeAffyEc51TestisSites EC51 Site Testis Affy Ext Trans Sites (51-base window) (Testis) ENCODE Transcript Levels encodeAffyEc1TestisSites EC1 Sites Testis Affy Ext Trans Sites (1-base window) (Testis) ENCODE Transcript Levels encodeAffyEc51PlacentaSites EC51 Site Placen Affy Ext Trans Sites (51-base window) (Placenta) ENCODE Transcript Levels encodeAffyEc1PlacentaSites EC1 Sites Placen Affy Ext Trans Sites (1-base window) (Placenta) ENCODE Transcript Levels encodeAffyEc51FetalSpleenSites EC51 Site Spleen Affy Ext Trans Sites (51-base window) (Fetal Spleen) ENCODE Transcript Levels encodeAffyEc1FetalSpleenSites EC1 Sites Spleen Affy Ext Trans Sites (1-base window) (Fetal Spleen) ENCODE Transcript Levels encodeAffyEc51FetalKidneySites EC51 Site FetalK Affy Ext Trans Sites (51-base window) (Fetal Kidney) ENCODE Transcript Levels encodeAffyEc1FetalKidneySites EC1 Sites FetalK Affy Ext Trans Sites (1-base window) (Fetal Kidney) ENCODE Transcript Levels encodeAffyEc51BrainHypothalamusSites EC51 Sites BrainH Affy Ext Trans Sites (51-base window) (Brain Hypothalamus) ENCODE Transcript Levels encodeAffyEc1BrainHypothalamusSites EC1 Sites BrainH Affy Ext Trans Sites (1-base window) (Brain Hypothalamus) ENCODE Transcript Levels encodeAffyEc51BrainHippocampusSites EC51 Site Hippoc Affy Ext Trans Sites (51-base window) (Brain Hippocampus) ENCODE Transcript Levels encodeAffyEc1BrainHippocampusSites EC1 Sites Hippoc Affy Ext Trans Sites (1-base window) (Brain Hippocampus) ENCODE Transcript Levels encodeAffyEc51BrainFrontalLobeSites EC51 Site BrainF Affy Ext Trans Sites (51-base window) (Brain Frontal Lobe) ENCODE Transcript Levels encodeAffyEc1BrainFrontalLobeSites EC1 Sites BrainF Affy Ext Trans Sites (1-base window) (Brain Frontal Lobe) ENCODE Transcript Levels encodeAffyEc51BrainCerebellumSites EC51 Sites BrainC Affy Ext Trans Sites (51-base window) (Brain Cerebellum) ENCODE Transcript Levels encodeAffyEc1BrainCerebellumSites EC1 Sites BrainC Affy 
Ext Trans Sites (1-base window) (Brain Cerebellum) ENCODE Transcript Levels encodeAffyEcSignal Affy EC Signal Affymetrix ENCODE Extension Transcription Signal ENCODE Transcript Levels Description This track shows an estimate of RNA abundance (transcription) for chromosomes 21 and 22 for 5 cell lines and 11 tissues. The 5 cell lines used were: GM06990, HepG2, K562, HeLaS3 and Tert-BJ; the 11 tissues used were: cerebellum, brain frontal lobe, hippocampus, hypothalamus, fetal spleen, fetal kidney, fetal thymus, ovary, placenta, prostate and testis. Purified cytosolic polyA+ RNA from GM06990, HepG2 and Tert-BJ cell lines, as well as purified polyA+ RNA from whole cell extracts of the remaining cell lines and tissues, were hybridized to Affymetrix Chromosome 21_22_v2 oligonucleotide tiling arrays, which have 25-mer probes spaced on average every 17 bp (center to center of each 25-mer) in the non-repetitive regions of human chromosomes 21 and 22. Composite signals are shown in separate subtracks for each cell line and tissue type. Data for all biological replicates can be downloaded from Affymetrix in wig, BED, and cel formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 330.
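As a rough illustration of the normalization step just described, the following Python sketch quantile-normalizes a set of replicate arrays and then rescales each one to a target median of 330. The function names and toy intensities are hypothetical, ties are broken by sort order, and this is not the Affymetrix implementation; Bolstad et al. (2003) describes the actual method.

```python
from statistics import mean, median


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted values at each rank."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean intensity across arrays at each rank.
    rank_means = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, lowest intensity first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def scale_to_median(values, target=330.0):
    """Multiply all intensities by one factor so that the array median equals `target`."""
    m = median(values)
    if m == 0:
        raise ValueError("cannot scale an array whose median is zero")
    return [v * target / m for v in values]


# Hypothetical intensities for three replicate arrays of four probes each.
raw_arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized_arrays = quantile_normalize(raw_arrays)
scaled_arrays = [scale_to_median(a) for a in normalized_arrays]
```

After this step every array shares the same set of intensity values, so differences between replicates reflect probe ordering rather than overall array brightness.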
Two approaches were used: (i) no sliding window and (ii) a sliding 51-bp window centered on each probe. In each case, an estimate of RNA abundance (signal) was computed by calculating the median of all pairwise average PM-MM values, where PM is a perfect match and MM is a mismatch. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods. The latter also describes the analytical methods. Verification Single biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Transcribed regions were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 25 basepairs and minimum run (MinRun) of 25 basepairs (see the Affy TransFrags track for the merged regions). Credits These data were generated and analyzed by the collaboration of the following groups: the Tom Gingeras group at Affymetrix, the Roderic Guigo group at Centre de Regulacio Genomica, the Alexandre Reymond group at the University of Lausanne and the Stylianos Antonarakis group at the University of Geneva. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Kapranov, P., Cawley, S. E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P., and Gingeras, T. R.
Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-919 (2002). encodeAffyEc51TertBJSignal EC51 Sgnl TertBJ Affy Ext Trans Signal (51-base window) (Tert-BJ) ENCODE Transcript Levels encodeAffyEc1TertBJSignal EC1 Sgnl TertBJ Affy Ext Trans Signal (1-base window) (Tert-BJ) ENCODE Transcript Levels encodeAffyEc51K562Signal EC51 Sgnl K562 Affy Ext Trans Signal (51-base window) (K562) ENCODE Transcript Levels encodeAffyEc1K562Signal EC1 Sgnl K562 Affy Ext Trans Signal (1-base window) (K562) ENCODE Transcript Levels encodeAffyEc51HepG2Signal EC51 Sgnl HepG2 Affy Ext Trans Signal (51-base window) (HepG2) ENCODE Transcript Levels encodeAffyEc1HepG2Signal EC1 Sgnl HepG2 Affy Ext Trans Signal (1-base window) (HepG2) ENCODE Transcript Levels encodeAffyEc51GM06990Signal EC51 Sgnl GM0699 Affy Ext Trans Signal (51-base window) (GM06990) ENCODE Transcript Levels encodeAffyEc1GM06990Signal EC1 Sgnl GM0699 Affy Ext Trans Signal (1-base window) (GM06990) ENCODE Transcript Levels encodeAffyEc51HeLaC1S3Signal EC51 Sgnl HeLa Affy Ext Trans Signal (51-base window) (HeLa C1S3) ENCODE Transcript Levels encodeAffyEc1HeLaC1S3Signal EC1 Sgnl HeLa Affy Ext Trans Signal (1-base window) (HeLa C1S3) ENCODE Transcript Levels encodeAffyEc51OvarySignal EC51 Sgnl Ovary Affy Ext Trans Signal (51-base window) (Ovary) ENCODE Transcript Levels encodeAffyEc1OvarySignal EC1 Sgnl Ovary Affy Ext Trans Signal (1-base window) (Ovary) ENCODE Transcript Levels encodeAffyEc51ProstateSignal EC51 Sgnl Prost Affy Ext Trans Signal (51-base window) (Prostate) ENCODE Transcript Levels encodeAffyEc1ProstateSignal EC1 Sgnl Prost Affy Ext Trans Signal (1-base window) (Prostate) ENCODE Transcript Levels encodeAffyEc51FetalTestisSignal EC51 Sgnl FetalT Affy Ext Trans Signal (51-base window) (Fetal Testis) ENCODE Transcript Levels encodeAffyEc1FetalTestisSignal EC1 Sgnl FetalT Affy Ext Trans Signal (1-base window) (Fetal Testis) ENCODE Transcript Levels encodeAffyEc51TestisSignal EC51 Sgnl 
Testis Affy Ext Trans Signal (51-base window) (Testis) ENCODE Transcript Levels encodeAffyEc1TestisSignal EC1 Sgnl Testis Affy Ext Trans Signal (1-base window) (Testis) ENCODE Transcript Levels encodeAffyEc51PlacentaSignal EC51 Sgnl Placen Affy Ext Trans Signal (51-base window) (Placenta) ENCODE Transcript Levels encodeAffyEc1PlacentaSignal EC1 Sgnl Placen Affy Ext Trans Signal (1-base window) (Placenta) ENCODE Transcript Levels encodeAffyEc51FetalSpleenSignal EC51 Sgnl Spleen Affy Ext Trans Signal (51-base window) (Fetal Spleen) ENCODE Transcript Levels encodeAffyEc1FetalSpleenSignal EC1 Sgnl Spleen Affy Ext Trans Signal (1-base window) (Fetal Spleen) ENCODE Transcript Levels encodeAffyEc51FetalKidneySignal EC51 Sgnl FetalK Affy Ext Trans Signal (51-base window) (Fetal Kidney) ENCODE Transcript Levels encodeAffyEc1FetalKidneySignal EC1 Sgnl FetalK Affy Ext Trans Signal (1-base window) (Fetal Kidney) ENCODE Transcript Levels encodeAffyEc51BrainHypothalamusSignal EC51 Sgnl BrainH Affy Ext Trans Signal (51-base window) (Brain Hypothalamus) ENCODE Transcript Levels encodeAffyEc1BrainHypothalamusSignal EC1 Sgnl BrainH Affy Ext Trans Signal (1-base window) (Brain Hypothalamus) ENCODE Transcript Levels encodeAffyEc51BrainHippocampusSignal EC51 Sgnl Hippoc Affy Ext Trans Signal (51-base window) (Brain Hippocampus) ENCODE Transcript Levels encodeAffyEc1BrainHippocampusSignal EC1 Sgnl Hippoc Affy Ext Trans Signal (1-base window) (Brain Hippocampus) ENCODE Transcript Levels encodeAffyEc51BrainFrontalLobeSignal EC51 Sgnl BrainF Affy Ext Trans Signal (51-base window) (Brain Frontal Lobe) ENCODE Transcript Levels encodeAffyEc1BrainFrontalLobeSignal EC1 Sgnl BrainF Affy Ext Trans Signal (1-base window) (Brain Frontal Lobe) ENCODE Transcript Levels encodeAffyEc51BrainCerebellumSignal EC51 Sgnl BrainC Affy Ext Trans Signal (51-base window) (Brain Cerebellum) ENCODE Transcript Levels encodeAffyEc1BrainCerebellumSignal EC1 Sgnl BrainC Affy Ext Trans Signal (1-base window) (Brain 
Cerebellum) ENCODE Transcript Levels acembly AceView Genes AceView Gene Models With Alt-Splicing Genes and Gene Predictions Description This track shows AceView gene models constructed from mRNA, EST and genomic evidence by Danielle and Jean Thierry-Mieg and Vahan Simonyan using the Acembly program. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. Gene models that fall into the "main" prediction class are displayed in purple; "putative" genes are displayed in pink. The track description page offers the following filter and configuration options: Gene Class filter: Select the main or putative option to filter the display by prediction class. Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Click the "Codon coloring help" link on the track description page for more information about this feature. Methods AceView attempts to find the best alignment of each mRNA/EST against the genome, and clusters the alignments into the least possible number of alternatively spliced transcripts. The reconstructed transcripts are then clustered into genes by simple transitive contact. To see the evidence that supports each transcript, click the "Outside Link" on an individual transcript's details page to access the NCBI AceView web site. Each AceView transcript model has a gene cluster designation (alternate name) that is categorized into a prediction class of either main or putative. Prediction Class: main Class of genes that includes the protein coding genes (defined here by CDS > 100 amino acids) and all genes with at least one well-defined standard intron, i.e., an intron with a GT-AG or GC-AG boundary, supported by at least one clone matching exactly, with no ambiguous bases, and the 8 bases on either side of the intron identical to the genome. 
Genes with a CDS smaller than 100 amino acids are included in this class if they meet one of the following conditions: they have an NCBI RefSeq sequence (NM_#) or an OMIM identifier, or they encode a protein with BlastP homology (< 1e-3) to a cDNA-supported nematode AceView protein. Prediction Class: putative Class of genes that have no standard intron and do not encode a CDS of more than 100 amino acids, yet may be useful enough that they should not be disregarded completely. Putative genes may be of two types: either those supported by more than six cDNA clones or those that encode a putative protein with an interesting annotation. Examples include a PFAM motif, a BlastP hit to a species other than itself (< 1e-3), a transmembrane domain or other rare and meaningful domains identified by Psort2, or a highly probable localization in a cell compartment (excluding cytoplasm and nucleus). Credits Thanks to Danielle and Jean Thierry-Mieg at NIH for providing this track. References Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006;7 Suppl 1:S12.1-14. encodeAffyChIpHl60Pval Affy pVal Affymetrix ChIP-chip (retinoic acid-treated HL-60 cells) P-Value ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of ten factors in all ENCODE regions, in retinoic acid-stimulated HL-60 cells harvested after 0, 2, 8, and 32 hours.
Median P-values are shown in separate subtracks for each of the ten antibodies: Brg1 - Brahma-related Gene 1; CEBPe - CCAAT-enhancer binding protein-epsilon; CTCF - CCCTC-binding factor; H3K27me3 (H3K27T) - Histone H3 tri-methylated lysine 27; H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine; P300 - E1A-binding protein, 300 kDa; PU1 - Spleen focus forming virus proviral integration oncogene; Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form); RARA (RARecA) - Retinoic Acid Receptor-Alpha; SIRT1 - Sirtuin-1. Retinoic acid-stimulated HL-60 cells were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment). DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Only median P-values are displayed; data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 1001 bp window centered on each probe, a signal estimator S = ln[max(PM - MM, 1)] (where PM is perfect match and MM is mismatch) was computed for each biological replicate treatment- and all replicate control-probe pairs.
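The signal estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (position, PM, MM) tuple layout, the function name, and the toy intensities are hypothetical, and the window is taken to span 500 bp on either side of the central probe.

```python
import math


def signal_estimates(probe_pairs, center, half_window=500):
    """Return S = ln(max(PM - MM, 1)) for every probe pair inside the window.

    probe_pairs: list of (position, pm, mm) tuples (hypothetical layout).
    A 1001 bp window is modeled as center +/- 500 bp.
    """
    return [
        math.log(max(pm - mm, 1.0))  # differences below 1 are floored at 1, so S >= 0
        for position, pm, mm in probe_pairs
        if abs(position - center) <= half_window
    ]


# Hypothetical probe pairs: (genomic position, PM intensity, MM intensity).
probe_pairs = [(100, 50.0, 20.0), (122, 10.0, 30.0), (144, 21.0, 20.0), (900, 400.0, 40.0)]
s_values = signal_estimates(probe_pairs, center=122)
```

Flooring PM - MM at 1 keeps the logarithm defined when the mismatch probe is brighter than the perfect match probe; such pairs contribute S = 0.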
An estimate of the significance of the enrichment of treatment signal for each replicate over control signal in each window was given by the P-value computed using the Wilcoxon Rank Sum test over each biological replicate treatment and all control signal estimates in that window. The median of the log transformed P-value (-10 log10 P) across processed replicate data is displayed. Several independent biological replicates (four each for Brg1, CEBPe, CTCF, PU1, and SIRT1; five each for H3K27me3, H4Kac4, P300, Pol2 and RARA) were generated and hybridized to duplicate arrays (two technical replicates). Reproducible enriched regions were generated from the signal by first applying a cutoff of 20 to the log transformed P-values, a maxGap and minRun of 500 and 0 basepairs respectively, to each biological replicate. Since each region or site may comprise more than one probe, a median based on the distribution of log transformed P-values was computed per site for each of the respective replicates. These seed sites were then ranked individually within each of the replicates. If a site was absent in a replicate, the maximum or worst rank of the distribution was assigned to it. The following three values were computed for each site by combining data from all biological replicates: (1) average of all ranks computed among biological replicates; (2) sum of all pairwise differences in these ranks computed among biological replicates; (3) a combined P-value, using a chi square distribution, across all replicates. The final sites were selected when all of the above three metrics were relatively low, where "low" corresponds to the top 25th percentile of the distribution. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible.
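The first two of the three per-site values described in the Methods (average rank and sum of pairwise rank differences) can be sketched as follows. This is a minimal illustration, not the original analysis code. It assumes that a higher median -10 log10 P score earns a better (lower) rank, that "worst rank" means the largest rank present in that replicate, and that pairwise differences are absolute. Site names and scores are hypothetical. The combined P-value is omitted because the text does not specify how it was computed.

```python
from itertools import combinations


def rank_sites(scores):
    """Rank the sites of one replicate; the highest score gets rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {site: rank for rank, site in enumerate(ordered, start=1)}


def combine_ranks(replicate_scores):
    """Return {site: (average rank, sum of absolute pairwise rank differences)}."""
    rankings = [rank_sites(scores) for scores in replicate_scores]
    all_sites = set().union(*rankings)
    summary = {}
    for site in all_sites:
        # Assumption: a site absent from a replicate receives that replicate's worst rank.
        ranks = [ranking.get(site, len(ranking)) for ranking in rankings]
        average_rank = sum(ranks) / len(ranks)
        rank_spread = sum(abs(a - b) for a, b in combinations(ranks, 2))
        summary[site] = (average_rank, rank_spread)
    return summary


# Hypothetical per-site median scores (-10 log10 P) in three biological replicates.
site_scores = [
    {"siteA": 45.0, "siteB": 30.0, "siteC": 22.0},
    {"siteA": 40.0, "siteB": 25.0},
    {"siteA": 38.0, "siteC": 33.0, "siteB": 21.0},
]
rank_summary = combine_ranks(site_scores)
```

A site that ranks consistently well, such as siteA here, has both a low average rank and a low spread, which is the pattern the selection step looks for.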
Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). encodeAffyChipSuper Affy ChIP Affymetrix ChIP-chip ENCODE Chromatin Immunoprecipitation Overview This super-track combines related tracks of ChIP-chip data generated by the Affymetrix/Harvard ENCODE collaboration. ChIP-chip, also known as genome-wide location analysis, is a technique for isolation and identification of DNA sequences bound by specific proteins in cells. These tracks contain ChIP-chip data of multiple transcription factors, RNA polymerase II and histones, in multiple cell lines, including HL-60 (leukemia) and ME-180 (cervical carcinoma), and at different time points after drug cell treatment. Binding was assayed on Affymetrix ENCODE tiling arrays. Data are displayed as signals, median p-values, "strict" p-values and sites. Credits These data were generated and analyzed by collaboration of the Tom Gingeras group at Affymetrix and the Kevin Struhl lab at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad BM, Irizarry RA, Astrand M, and Speed TP. 
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003 Jan 22;19(2):185-93. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004 Feb 20;116(4):499-509. encodeAffyChIpHl60PvalTfiibHr32 Affy TFIIB RA 32h Affymetrix ChIP-chip (TFIIB retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr32 Affy SIRT1 RA 32h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr08 Affy SIRT1 RA 8h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr02 Affy SIRT1 RA 2h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr00 Affy SIRT1 RA 0h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr32 Affy RARA RA 32h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr08 Affy RARA RA 8h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr02 Affy RARA RA 2h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr00 Affy RARA RA 0h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr32 Affy Pol2 RA 32h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 32hrs) P-Value ENCODE 
Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr08 Affy Pol2 RA 8h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr02 Affy Pol2 RA 2h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr00 Affy Pol2 RA 0h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr32 Affy PU1 RA 32h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr08 Affy PU1 RA 8h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr02 Affy PU1 RA 2h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr00 Affy PU1 RA 0h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr32 Affy P300 RA 32h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr08 Affy P300 RA 8h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr02 Affy P300 RA 2h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr00 Affy P300 RA 0h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr32 Affy H4Kac4 RA 32h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr08 Affy H4Kac4 RA 8h Affymetrix ChIP-chip 
(H4Kac4 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr02 Affy H4Kac4 RA 2h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr00 Affy H4Kac4 RA 0h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr32 Affy H3K27me3 RA 32h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr08 Affy H3K27me3 RA 8h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr02 Affy H3K27me3 RA 2h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr00 Affy H3K27me3 RA 0h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr32 Affy CTCF RA 32h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr08 Affy CTCF RA 8h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr02 Affy CTCF RA 2h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr00 Affy CTCF RA 0h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr32 Affy CEBPe RA 32h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr08 Affy CEBPe RA 8h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin 
Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr02 Affy CEBPe RA 2h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr00 Affy CEBPe RA 0h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr32 Affy Brg1 RA 32h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr08 Affy Brg1 RA 8h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr02 Affy Brg1 RA 2h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr00 Affy Brg1 RA 0h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60Sites Affy Sites Affymetrix ChIP-chip (retinoic acid-treated HL-60 cells) Sites ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of ten factors in all ENCODE regions, in retinoic acid-stimulated HL-60 cells harvested after 0, 2, 8, and 32 hours. Clustered sites are shown in separate subtracks for each of the ten antibodies: Brg1 - Brahma-related Gene 1; CEBPe - CCAAT-enhancer binding protein-epsilon; CTCF - CCCTC-binding factor; H3K27me3 (H3K27T) - Histone H3 tri-methylated lysine 27; H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine; P300 - E1A-binding protein, 300 kDa; PU1 - Spleen focus forming virus proviral integration oncogene; Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form); RARA (RARecA) - Retinoic Acid Receptor-Alpha; SIRT1 - Sirtuin-1. Retinoic acid-stimulated HL-60 cells were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment).
DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 1001 bp window centered on each probe, a signal estimator S = ln[max(PM - MM, 1)] (where PM is perfect match and MM is mismatch) was computed for each biological replicate treatment- and all replicate control-probe pairs. An estimate of the significance of the enrichment of treatment signal for each replicate over control signal in each window was given by the P-value computed using the Wilcoxon Rank Sum test over each biological replicate treatment and all control signal estimates in that window. The median of the log transformed P-value (-10 log10 P) across processed replicate data is displayed. Several independent biological replicates (four each for Brg1, CEBPe, CTCF, PU1, and SIRT1; five each for H3K27me3, H4Kac4, P300, Pol2 and RARA) were generated and hybridized to duplicate arrays (two technical replicates). Reproducible enriched regions were generated from the signal by first applying a cutoff of 20 to the log transformed P-values, a maxGap and minRun of 500 and 0 basepairs respectively, to each biological replicate. 
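The cutoff, maxGap and minRun step can be sketched as below. The source does not define these parameters precisely, so the sketch assumes a common reading: probes at or above the cutoff are merged into one region when neighboring passing probes lie no more than maxGap bp apart, and regions shorter than minRun bp are discarded. Region length is measured between probe positions and ignores the 25-mer probe length. All names and values are hypothetical.

```python
def call_regions(probe_scores, cutoff=20.0, max_gap=500, min_run=0):
    """Merge above-cutoff probes into candidate enriched regions.

    probe_scores: iterable of (position, score) pairs, where score = -10 * log10(P).
    Returns a list of (start, end) tuples in genomic order.
    """
    passing = sorted(position for position, score in probe_scores if score >= cutoff)
    regions = []
    for position in passing:
        if regions and position - regions[-1][1] <= max_gap:
            regions[-1][1] = position             # close enough: extend the current region
        else:
            regions.append([position, position])  # too far away: start a new region
    # Assumption: minRun is a minimum region length in base pairs.
    return [(start, end) for start, end in regions if end - start >= min_run]


# Hypothetical probe positions and log-transformed P-values for one replicate.
probe_scores = [(1000, 35.0), (1022, 12.0), (1044, 28.0), (1400, 22.0), (2500, 40.0), (2522, 5.0)]
candidate_regions = call_regions(probe_scores)
```

With minRun set to 0, as in the text, even a single passing probe forms a region; raising `min_run` would drop such isolated hits.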
Since each region or site may comprise more than one probe, a median based on the distribution of log transformed P-values was computed per site for each of the respective replicates. These seed sites were then ranked individually within each of the replicates. If a site was absent in a replicate, the maximum or worst rank of the distribution was assigned to it. The following three values were computed for each site by combining data from all biological replicates: (1) average of all ranks computed among biological replicates; (2) sum of all pairwise differences in these ranks computed among biological replicates; (3) a combined P-value, using a chi square distribution, across all replicates. The final sites were selected when all of the above three metrics were relatively low, where "low" corresponds to the top 25th percentile of the distribution. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs.
Cell 116(4), 499-509 (2004). encodeAffyChIpHl60SitesTfiibHr32 Affy TFIIB RA 32h Affymetrix ChIP-chip (TFIIB retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr32 Affy SIRT1 RA 32h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr08 Affy SIRT1 RA 8h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr02 Affy SIRT1 RA 2h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr00 Affy SIRT1 RA 0h Affymetrix ChIP-chip (SIRT1 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr32 Affy RARA RA 32h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr08 Affy RARA RA 8h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr02 Affy RARA RA 2h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr00 Affy RARA RA 0h Affymetrix ChIP-chip (RARA retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr32 Affy Pol2 RA 32h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr08 Affy Pol2 RA 8h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr02 Affy Pol2 RA 2h Affymetrix ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr00 Affy Pol2 RA 0h Affymetrix 
ChIP-chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr32 Affy PU1 RA 32h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr08 Affy PU1 RA 8h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr02 Affy PU1 RA 2h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr00 Affy PU1 RA 0h Affymetrix ChIP-chip (PU1 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr32 Affy P300 RA 32h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr08 Affy P300 RA 8h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr02 Affy P300 RA 2h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr00 Affy P300 RA 0h Affymetrix ChIP-chip (P300 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr32 Affy H4Kac4 RA 32h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr08 Affy H4Kac4 RA 8h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr02 Affy H4Kac4 RA 2h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr00 Affy H4Kac4 RA 0h Affymetrix ChIP-chip (H4Kac4 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr32 
Affy H3K27me3 RA 32h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr08 Affy H3K27me3 RA 8h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr02 Affy H3K27me3 RA 2h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr00 Affy H3K27me3 RA 0h Affymetrix ChIP-chip (H3K27me3 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr32 Affy CTCF RA 32h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr08 Affy CTCF RA 8h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr02 Affy CTCF RA 2h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr00 Affy CTCF RA 0h Affymetrix ChIP-chip (CTCF retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr32 Affy CEBPe RA 32h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr08 Affy CEBPe RA 8h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr02 Affy CEBPe RA 2h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr00 Affy CEBPe RA 0h Affymetrix ChIP-chip (CEBPe retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr32 Affy Brg1 RA 32h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 32hrs) Sites ENCODE 
Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr08 Affy Brg1 RA 8h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr02 Affy Brg1 RA 2h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr00 Affy Brg1 RA 0h Affymetrix ChIP-chip (Brg1 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrict Affy Strict pVal Affymetrix ChIP-chip (HL-60 and ME-180 cells) Strict P-Value ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of 4 factors in all ENCODE regions, in retinoic-acid stimulated HL-60 (leukemia) cells harvested after 0, 2, 8, and 32 hours, and in a fifth factor tested in ME-180 cervical carcinoma cells. The median of the transformed P-value (-10 log[10] P) across processed replicate data is displayed as separate subtracks for each antibody: H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine; H3K9K14ac2 (H3K9K14D) - Histone H3 K9 K14 Di-Acetylated; Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form); p63_ActD - p63, in actinomycin-D treated ME-180 cells; p63_mActD - p63, in untreated ME-180 cells. Retinoic acid-stimulated HL-60 cells and ME-180 cells (actinomycin-D treated or untreated) were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment). DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Only the median of the transformed P-value (-10 log[10] P) is displayed; data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. 
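The displayed score described above can be illustrated with a short Python sketch. It only shows the arithmetic of the -10 log[10] P transform followed by a median across replicates; the function names and the floor used to avoid taking the logarithm of zero are assumptions made for this example, not part of the Affymetrix pipeline.

```python
import math
from statistics import median

def transform_p(p, floor=1e-300):
    """Return -10 * log10(p). The floor is a hypothetical guard against p == 0."""
    return -10.0 * math.log10(max(p, floor))

def median_transformed(p_values):
    """Median of the transformed P-values for one probe across replicates."""
    return median(transform_p(p) for p in p_values)

# Example: one probe position with P-values from three replicates.
score = median_transformed([0.1, 0.01, 0.001])
```

On this scale a P-value of 0.01 maps to 20 and a P-value of 0.001 maps to 30, so larger displayed values indicate stronger evidence of enrichment.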
Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 1001 bp window centered on each probe, a signal estimator S = ln[max(PM - MM, 1)] (where PM is perfect match and MM is mismatch) was computed for each biological replicate treatment- and all replicate control-probe pairs. An estimate of the significance of the enrichment of treatment signal for each replicate over control signal in each window was given by the P-value computed using the Wilcoxon Rank Sum test over each biological replicate treatment and all control signal estimates in that window. The median of the transformed P-value (-10 log[10] P) across processed replicate data is displayed. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. 
A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Yang A, Zhu Z, Kapranov P, McKeon F, Church GM, Gingeras TR, Struhl K. Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Mol. Cell. 24(4), 593-602 (2006). encodeAffyChIpHl60PvalStrictp63_mActD Affy p63 ME-180 Affymetrix ChIP-chip (p63, ME-180) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictp63_ActD Affy p63 ME-180+ Affymetrix ChIP-chip (p63, actinomycin-D treated ME-180) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictPol2Hr32 Affy Pol2 32h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 32hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictPol2Hr08 Affy Pol2 8h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 8hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictPol2Hr02 Affy Pol2 2h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 2hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictPol2Hr00 Affy Pol2 0h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 0hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictHisH4Hr32 Affy H4Kac4 32h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 32hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictHisH4Hr08 Affy H4Kac4 8h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 8hrs) Strict P-Value ENCODE 
Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictHisH4Hr02 Affy H4Kac4 2h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 2hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictHisH4Hr00 Affy H4Kac4 0h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 0hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictH3K9K14DHr32 Affy H3K9ac2 32h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 32hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictH3K9K14DHr08 Affy H3K9ac2 8h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 8hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictH3K9K14DHr02 Affy H3K9ac2 2h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 2hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalStrictH3K9K14DHr00 Affy H3K9ac2 0h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 0hrs) Strict P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrict Affy Strict Sig Affymetrix ChIP-chip (HL-60 and ME-180 cells) Strict Signal ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of 4 factors in all ENCODE regions, in retinoic-acid stimulated HL-60 (leukemia) cells harvested after 0, 2, 8, and 32 hours, and in a fifth factor tested in ME-180 cervical carcinoma cells. 
The median of the signal estimate across processed replicate data is displayed as separate subtracks for each antibody: H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine; H3K9K14ac2 (H3K9K14D) - Histone H3 K9 K14 Di-Acetylated; Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form); p63_ActD - p63, in actinomycin-D treated ME-180 cells; p63_mActD - p63, in untreated ME-180 cells. Retinoic acid-stimulated HL-60 cells and ME-180 cells (actinomycin-D treated or untreated) were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment). DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Only the median of the signal estimate across processed replicate data is displayed; data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 1001 bp window centered on each probe, a signal estimator S = ln[max(PM - MM, 1)] (where PM is perfect match and MM is mismatch) was computed for each biological replicate treatment- and all replicate control-probe pairs. 
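As a minimal sketch of the estimator just described, the following Python computes S = ln[max(PM - MM, 1)] for a probe pair and gathers the values that fall inside a 1001 bp window (the center probe plus 500 bp on each side). The helper names, the sorted-position assumption, and the example intensities are hypothetical; the actual Affymetrix software is not shown in this description.

```python
import math
from bisect import bisect_left, bisect_right

def signal_estimate(pm, mm):
    """S = ln[max(PM - MM, 1)]; the max() keeps the logarithm defined when MM >= PM."""
    return math.log(max(pm - mm, 1.0))

def window_values(positions, values, center, half_width=500):
    """Return the values whose probe position lies within center +/- half_width.

    Assumes `positions` is sorted ascending and parallel to `values`.
    """
    lo = bisect_left(positions, center - half_width)
    hi = bisect_right(positions, center + half_width)
    return values[lo:hi]

# Example with made-up probe positions (bp) and (PM, MM) intensities.
positions = [0, 22, 44, 600, 1200]
pairs = [(120.0, 20.0), (15.0, 30.0), (60.0, 10.0), (500.0, 100.0), (40.0, 39.5)]
s_values = [signal_estimate(pm, mm) for pm, mm in pairs]
in_window = window_values(positions, s_values, center=44)
```

Values collected this way for the treatment and control arrays are the kind of inputs that the window-level test in the next sentence compares.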
An estimate of the significance of the enrichment of treatment signal for each replicate over control signal in each window was given by the P-value computed using the Wilcoxon Rank Sum test over each biological replicate treatment and all control signal estimates in that window. The median of the signal estimate across processed replicate data is displayed. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Yang A, Zhu Z, Kapranov P, McKeon F, Church GM, Gingeras TR, Struhl K. Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Mol. Cell. 24(4), 593-602 (2006). 
encodeAffyChIpHl60SignalStrictp63_mActD Affy p63 ME-180 Affymetrix ChIP-chip (p63, ME-180) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictp63_ActD Affy p63 ME-180+ Affymetrix ChIP-chip (p63, actinomycin-D treated ME-180) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictPol2Hr32 Affy Pol2 32h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 32hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictPol2Hr08 Affy Pol2 8h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 8hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictPol2Hr02 Affy Pol2 2h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 2hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictPol2Hr00 Affy Pol2 0h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 0hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictHisH4Hr32 Affy H4Kac4 32h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 32hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictHisH4Hr08 Affy H4Kac4 8h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 8hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictHisH4Hr02 Affy H4Kac4 2h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 2hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictHisH4Hr00 Affy H4Kac4 0h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 0hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictH3K9K14DHr32 Affy H3K9ac2 32h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 32hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictH3K9K14DHr08 Affy H3K9ac2 8h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 8hrs) Strict Signal ENCODE Chromatin 
Immunoprecipitation encodeAffyChIpHl60SignalStrictH3K9K14DHr02 Affy H3K9ac2 2h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 2hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SignalStrictH3K9K14DHr00 Affy H3K9ac2 0h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 0hrs) Strict Signal ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrict Affy Strict Sites Affymetrix ChIP-chip (HL-60 and ME-180 cells) Strict Sites ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of 4 factors in all ENCODE regions, in retinoic-acid stimulated HL-60 (leukemia) cells harvested after 0, 2, 8, and 32 hours, and in a fifth factor tested in ME-180 cervical carcinoma cells. Clustered sites are shown in separate subtracks for each antibody: H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine; H3K9K14ac2 (H3K9K14D) - Histone H3 K9 K14 Di-Acetylated; Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form); p63_ActD - p63, in actinomycin-D treated ME-180 cells; p63_mActD - p63, in untreated ME-180 cells. Retinoic acid-stimulated HL-60 cells and ME-180 cells (actinomycin-D treated or untreated) were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment). DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. 
For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. Methods Three independent biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Reproducible enriched regions were generated from the signal by first applying, to each biological replicate, a cutoff of 0.693 (ln 2) to the signal estimate, together with a maxgap of 500 base pairs and a minrun of 0 base pairs. Since each region or site may comprise more than one probe, a median based on the distribution of log-transformed P-values was computed per site for each of the respective replicates. These seed sites were then ranked individually within each of the replicates. If a site was absent in a replicate, the maximum (worst) rank of the distribution was assigned to it. The following three values were computed for each site by combining data from all biological replicates: (1) the average of all ranks computed among biological replicates; (2) the sum of all pairwise differences in these ranks computed among biological replicates; and (3) a combined P-value, using a chi-square distribution, across all replicates. A final filter based on the signal estimate was applied: only sites with a median signal estimate of at least 0.693/(total number of individual replicates) were considered. This was to ensure that if a site was not detected consistently in all replicates but was detected at a significant signal level in a subset of the replicates, its detection level would be weighted accordingly in the final selection of sites. The final sites were selected when all three of the above metrics were relatively low, where "low" corresponds to the top 25th percentile of the distribution. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. 
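A pairwise rank correlation check of this kind can be sketched in plain Python as follows. The description does not say which rank correlation coefficient was used; Spearman's coefficient (Pearson correlation of average ranks) is assumed here purely for illustration, and the replicate names and P-values are made up.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation (assumed here; the source names no coefficient)."""
    return pearson(average_ranks(x), average_ranks(y))

def pairwise_spearman(replicates):
    """Map each pair of replicate names to the rank correlation of their P-values."""
    return {(a, b): spearman(replicates[a], replicates[b])
            for a, b in combinations(sorted(replicates), 2)}

# Hypothetical per-probe P-values for three biological replicates.
example = {
    "rep1": [0.01, 0.20, 0.50, 0.90],
    "rep2": [0.02, 0.10, 0.60, 0.80],
    "rep3": [0.05, 0.30, 0.40, 0.95],
}
correlations = pairwise_spearman(example)
```

With three replicates this yields three coefficients, one per pair, which can then be inspected for consistency and sign.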
Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Yang A, Zhu Z, Kapranov P, McKeon F, Church GM, Gingeras TR, Struhl K. Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Mol. Cell. 24(4), 593-602 (2006). 
encodeAffyChIpHl60SitesStrictP63_mActD Affy p63 ME-180 Affymetrix ChIP-chip (p63, ME-180) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictP63_ActD Affy p63 ME-180+ Affymetrix ChIP-chip (p63, actinomycin-D treated ME-180) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictRnapHr32 Affy Pol2 32h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 32hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictRnapHr08 Affy Pol2 8h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 8hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictRnapHr02 Affy Pol2 2h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 2hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictRnapHr00 Affy Pol2 0h Affymetrix ChIP-chip (Pol2, retinoic acid-treated HL-60, 0hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictHisH4Hr32 Affy H4Kac4 32h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 32hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictHisH4Hr08 Affy H4Kac4 8h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 8hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictHisH4Hr02 Affy H4Kac4 2h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 2hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictHisH4Hr00 Affy H4Kac4 0h Affymetrix ChIP-chip (H4Kac4, retinoic acid-treated HL-60, 0hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictH3K9K14DHr32 Affy H3K9ac2 32h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 32hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictH3K9K14DHr08 Affy H3K9ac2 8h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 8hrs) Strict Sites ENCODE Chromatin Immunoprecipitation 
encodeAffyChIpHl60SitesStrictH3K9K14DHr02 Affy H3K9ac2 2h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 2hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesStrictH3K9K14DHr00 Affy H3K9ac2 0h Affymetrix ChIP-chip (H3K9K14ac2, retinoic acid-treated HL-60, 0hrs) Strict Sites ENCODE Chromatin Immunoprecipitation encodeLIChIP LI ChIP Various Ludwig Institute/UCSD ChIP-chip: Pol2 8WG16, TAF1, H3ac, H3K4me2, H3K27me3 antibodies ENCODE Chromatin Immunoprecipitation Description ENCODE region-wide location analyses were conducted of binding to the initiation-complex form of RNA polymerase II (Pol2), TATA-associated factor (TAF1), acetylated histone H3 (H3ac), lysine-4-dimethylated H3 (H3K4me2), suppressor of zeste 12 protein homolog (SUZ12), and lysine-27-tri-methylated H3 (H3K27me3). The analyses used chromatin extracted from IMR90 (lung fibroblast), HCT116 (colon epithelial carcinoma), HeLa (cervix epithelial adenocarcinoma), and THP1 (blood monocyte leukemia) cells. The initiation-complex form of Pol2 is associated with the transcription start site, as is TAF1. Both H3ac and H3K4me2 are associated with transcriptionally-active "open" chromatin. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. Data for each antibody/cell line pair is displayed in a separate subtrack. See the top of the track description page for a complete list of the subtracks available for this annotation. The subtracks may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by the list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. 
Methods Chromatin from each of the four cell lines was separately cross-linked, precipitated with antibody to one of the six proteins, sheared, amplified and hybridized to a PCR DNA tiling array produced at the Ren Lab at UC San Diego. The array was composed of 24,537 non-repetitive sequences within the 44 ENCODE regions. For each marker, there were three biological replicates. Each experiment was normalized using the median values. The P-value and R-value were calculated using the modified single array error model (Li, Z. et al., 2003). The P-value and R-value were then derived from the weighted average results of the replicates. The displayed values were scaled to 0 - 16, corresponding to negative log base 10 of the P-value. Verification Each of the experiments has three biological replicates. The array platform, the raw and normalized data for each experiment, and the image files have all been deposited at the NCBI GEO Microarray Database. Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. References Kim, T., Barrera, L.O., Qu, C., van Calcar, S., Trinklein, N., Cooper, S., Luna, R., Glass, C.K., Rosenfeld, M.G., Myers, R., Ren, B. Direct isolation and identification of promoters in the human genome. Genome Research 15, 830-839 (2005). Li, Z., Van Calcar, S., Qu, C., Cavenee, W.K., Zhang, M.Q., and Ren, B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc. Natl. Acad. Sci. 100(14), 8164-8169 (2003). Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson, C., Bell, S. P. and Young, R. A. Genome-wide location and function of DNA-associated proteins. Science 290(5500), 2306-2309 (2000). 
encodeUcsdChipSuper LI/UCSD ChIP Ludwig Institute/UC San Diego ChIP-chip ENCODE Chromatin Immunoprecipitation Overview This super-track combines related tracks of ChIP-chip data generated by the Ludwig Institute/UCSD ENCODE group. ChIP-chip, also known as genome-wide location analysis, is a technique for the isolation and identification of DNA sequences bound by specific proteins in cells, including histones. Histone methylation and acetylation serve as a stable genomic imprint that regulates gene expression and other epigenetic phenomena. These histones are found in transcriptionally active domains called euchromatin. These tracks contain ChIP-chip data for components of the transcription initiation complex (such as Pol2 and TAF1) and for histones H3 and H4 in multiple cell lines, including HeLa (cervical carcinoma), IMR90 (human fibroblast), and HCT116 (colon epithelial carcinoma), with some experiments including interferon-gamma induction. Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. References Kim TH, Barrera LO, Qu C, Van Calcar S, Trinklein ND, Cooper SJ, Luna RM, Glass CK, Rosenfeld MG, Myers RM, Ren B. Direct isolation and identification of promoters in the human genome. Genome Res. 2005 Jun;15(6):830-9. Li Z, Van Calcar S, Qu C, Cavenee WK, Zhang MQ, Ren B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8164-9. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E et al. Genome-wide location and function of DNA-associated proteins. Science. 2000 Dec 22;290(5500):2306-9. Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, Wu Y, Green RD, Ren B. A high-resolution map of active promoters in the human genome. Nature. 2005 Aug 11;436(7052):876-80. 
encodeUcsdChipH3K27me3 LI H3K27me3 HeLa Ludwig Institute ChIP-chip: H3K27me3 ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipH3K27me3Suz12 LI SUZ12 HeLa Ludwig Institute ChIP-chip: SUZ12 protein ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipMeh3k4Imr90_f LI H3K4me2 IMR90 Ludwig Institute ChIP-chip: H3K4me2 ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipAch3Imr90_f LI H3ac IMR90 Ludwig Institute ChIP-chip: H3ac ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Hct116_f LI TAF1 HCT116 Ludwig Institute ChIP-chip: TAF1 ab, HCT116 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Imr90_f LI TAF1 IMR90 Ludwig Institute ChIP-chip: TAF1 ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Thp1_f LI TAF1 THP1 Ludwig Institute ChIP-chip: TAF1 ab, THP1 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Hela_f LI TAF1 HeLa Ludwig Institute ChIP-chip: TAF1 ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapHct116_f LI Pol2 HCT116 Ludwig Institute ChIP-chip: Pol2 8WG16 ab, HCT116 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapImr90_f LI Pol2 IMR90 Ludwig Institute ChIP-chip: Pol2 8WG16 ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapThp1_f LI Pol2 THP1 Ludwig Institute ChIP-chip: Pol2 8WG16 ab, THP1 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapHela_f LI Pol2 HeLa Ludwig Institute ChIP-chip: Pol2 8WG16 ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeLIChIPgIF LI gIF ChIP Ludwig Institute/UCSD ChIP-chip - Gamma Interferon Experiments ENCODE Chromatin Immunoprecipitation Description ENCODE region-wide location analysis of histones H3 and H4 with antibodies H3K4me2, H3K4me3, H3ac, H4ac, STAT1, RNA polymerase II and TAF1 was conducted with ChIP-chip, using chromatin extracted from HeLa cells induced for 30 min with interferon-gamma as well as uninduced cells. 
The H3K4me2, H3K4me3, H3ac form of histone H3, and H4ac form of histone H4 are associated with up-regulation of gene expression. STAT1 (signal transducer and activator of transcription) binds to DNA and activates transcription in response to various cytokines, including interferon-gamma. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin from both induced and uninduced cells was separately cross-linked, precipitated with the antibodies, sheared, amplified and hybridized to a PCR DNA tiling array produced at the Ren Lab at UC San Diego. The array was composed of 24,537 non-repetitive sequences within the 44 ENCODE regions. Each state had three or more biological replicates. Each experiment was loess-normalized using R. The P-value and R-value were calculated using the modified single array error model (Li, Z. et al., 2003). The P-value and R-value were then derived from the weighted average results of the replicates. The displayed values were scaled to 0 - 16, corresponding to negative log base 10 of the P-value. Verification Each of the two experiments has three biological replicates. The array platform, the raw and normalized data for each experiment, and the image files have all been deposited at the NCBI GEO Microarray Database (pending approval). Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. 
References Kim, T., Barrera, L.O., Qu, C., van Calcar, S., Trinklein, N., Cooper, S., Luna, R., Glass, C.K., Rosenfeld, M.G., Myers, R., Ren, B. Direct isolation and identification of promoters in the human genome. Genome Research 15, 830-839 (2005). Li, Z., Van Calcar, S., Qu, C., Cavenee, W.K., Zhang, M.Z., and Ren, B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc. Natl. Acad. Sci. 100(14), 8164-8169 (2003). Ren, B., Robert, F., Wyrick, J. W., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson, C., Bell, S. P. and Young, R. A. Genome-wide location and function of DNA-associated proteins. Science 290(5500), 2306-2309 (2000). encodeUcsdChipHeLaH3H4TAF250_p30 LI TAF1 +gIF Ludwig Institute ChIP-chip: TAF1, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4TAF250_p0 LI TAF1 -gIF Ludwig Institute ChIP-chip: TAF1, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4RNAP_p30 LI Pol2 +gIF Ludwig Institute ChIP-chip: RNA Pol2, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4RNAP_p0 LI Pol2 -gIF Ludwig Institute ChIP-chip: RNA Pol2, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4stat1_p30 LI STAT1 +gIF Ludwig Institute ChIP-chip: STAT1 ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4stat1_p0 LI STAT1 -gIF Ludwig Institute ChIP-chip: STAT1 ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH4_p30 LI H4ac +gIF Ludwig Institute ChIP-chip: H4ac ab, HeLa cells, 30 min. 
after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH4_p0 LI H4ac -gIF Ludwig Institute ChIP-chip: H4ac ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH3_p30 LI H3ac +gIF Ludwig Institute ChIP-chip: H3ac ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH3_p0 LI H3ac -gIF Ludwig Institute ChIP-chip: H3ac ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4tmH3K4_p30 LI H3K4me3 +gIF Ludwig Institute ChIP-chip: H3K4me3 ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4tmH3K4_p0 LI H3K4me3 -gIF Ludwig Institute ChIP-chip: H3K4me3 ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4dmH3K4_p30 LI H3K4me2 +gIF Ludwig Institute ChIP-chip: H3K4me2 ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4dmH3K4_p0 LI H3K4me2 -gIF Ludwig Institute ChIP-chip: H3K4me2 ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeStanfordChip Stanf ChIP Stanford ChIP-chip (HCT116, Jurkat, K562 cells; Sp1, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation Description This track displays regions bound by Sp1 and Sp3, in the following three cell lines, assayed by ChIP and microarray hybridization:
Cell Line | Classification | Isolated From
HCT 116 | colorectal carcinoma | colon
Jurkat, Clone E6-1 | acute T cell leukemia | T lymphocyte
K-562 | chronic myelogenous leukemia (CML) | bone marrow
Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. 
To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin IP was performed as described in Trinklein et al. (2004). Amplified and labeled ChIP DNA was hybridized to oligo tiling arrays produced by NimbleGen, along with a total genomic reference sample. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The value given for each probe is the transformed mean ratio of ChIP DNA:Total DNA. Verification Three biological replicates and two technical replicates were performed. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Trinklein, N.D., Chen, W.C., Kingston, R.E. and Myers, R.M. The role of heat shock transcription factor 1 in the genome-wide regulation of the mammalian heat shock response. Mol. Biol. Cell 15(3), 1254-61 (2004). encodeStanfordChipSuper Stanf ChIP Stanford ChIP-chip ENCODE Chromatin Immunoprecipitation Overview This super-track combines related tracks of ChIP-chip data generated by the Stanford ENCODE group. ChIP-chip, also known as genome-wide location analysis, is a technique for isolation and identification of DNA sequences bound by specific proteins in cells. These tracks contain data for the Sp1 and Sp3 transcription factors in multiple cell lines, including HCT116 (colon epithelial carcinoma), Jurkat (T-cell lymphoblast), and K562 (myeloid leukemia). Credits The Sp1 and Sp3 data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP et al. 
Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007 Aug 2;448(7153):553-60. Trinklein ND, Murray JI, Hartman SJ, Botstein D, Myers RM. The role of heat shock transcription factor 1 in the genome-wide regulation of the mammalian heat shock response. Mol. Biol. Cell. 2004 Mar;15(3):1254-61. encodeStanfordChipK562Sp3 Stan K562 Sp3 Stanford ChIP-chip (K562 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipK562Sp1 Stan K562 Sp1 Stanford ChIP-chip (K562 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipJurkatSp3 Stan Jurkat Sp3 Stanford ChIP-chip (Jurkat cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipJurkatSp1 Stan Jurkat Sp1 Stanford ChIP-chip (Jurkat cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipHCT116Sp3 Stan HCT116 Sp3 Stanford ChIP-chip (HCT116 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipHCT116Sp1 Stan HCT116 Sp1 Stanford ChIP-chip (HCT116 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothed Stanf ChIP Score Stanford ChIP-chip Smoothed Score ENCODE Chromatin Immunoprecipitation Description This track displays smoothed (sliding-window mean) scores for regions bound by Sp1 and Sp3 in the following three cell lines, assayed by ChIP and microarray hybridization:
Cell Line | Classification | Isolated From
HCT 116 | colorectal carcinoma | colon
Jurkat, Clone E6-1 | acute T cell leukemia | T lymphocyte
K-562 | chronic myelogenous leukemia (CML) | bone marrow
Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. 
For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin IP was performed as described in Trinklein et al. (2004). Amplified and labeled ChIP DNA was hybridized to oligo tiling arrays produced by NimbleGen along with a total genomic reference sample. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The transformed mean ratios of ChIP DNA:Total DNA for all probes were then smoothed by calculating a sliding-window mean. Windows of six neighboring probes (sliding two probes at a time) were used; within each window, the highest and lowest value were dropped, and the remaining 4 values were averaged. To increase the contrast between high and low values for visual display, the average was converted to a score by the formula: score = 8^(average) * 10. These scores are for visualization purposes; for all analyses, the raw ratios, which are available in the Stanf ChIP track, should be used. Verification Three biological replicates and two technical replicates were performed. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Trinklein, N.D., Chen, W.C., Kingston, R.E. and Myers, R.M. The role of heat shock transcription factor 1 in the genome-wide regulation of the mammalian heat shock response. Mol. Biol. Cell 15(3), 1254-61 (2004). 
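The normalization and smoothing steps in the Methods above can be sketched as follows. This is a minimal illustration, not the Myers lab's code: the function names are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator was used), and windows that would run past the last probe are simply not emitted.

```python
import statistics


def normalize(log2_ratios):
    """Median-subtract the log2 ratios, then divide by the standard deviation.

    Assumption: the sample standard deviation is used.
    """
    median = statistics.median(log2_ratios)
    centered = [r - median for r in log2_ratios]
    sd = statistics.stdev(centered)
    return [c / sd for c in centered]


def smoothed_scores(ratios, window=6, step=2):
    """Trimmed sliding-window mean followed by the display transform.

    For each window of `window` neighboring probes (advancing `step` probes
    at a time), drop the highest and lowest value, average the rest, and
    convert the average to a score with score = 8 ** average * 10.
    """
    scores = []
    for start in range(0, len(ratios) - window + 1, step):
        ordered = sorted(ratios[start:start + window])
        trimmed = ordered[1:-1]  # drop the lowest and the highest value
        average = sum(trimmed) / len(trimmed)
        scores.append(8 ** average * 10)
    return scores


if __name__ == "__main__":
    print(smoothed_scores([-5, 1, 1, 1, 1, 9, 1, 1]))
```

Because the transform is exponential, an average of 0 maps to a score of 10 and an average of 1 maps to 80, which is what increases the visual contrast between high and low values.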
encodeStanfordChipSmoothedK562Sp3 Stan Sc K562 Sp3 Stanford ChIP-chip Smoothed Score (K562 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedK562Sp1 Stan Sc K562 Sp1 Stanford ChIP-chip Smoothed Score (K562 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedJurkatSp3 Stan Sc Jurkat Sp3 Stanford ChIP-chip Smoothed Score (Jurkat cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedJurkatSp1 Stan Sc Jurkat Sp1 Stanford ChIP-chip Smoothed Score (Jurkat cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedHCT116Sp3 Stan Sc HCT116 Sp3 Stanford ChIP-chip Smoothed Score (HCT116 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedHCT116Sp1 Stan Sc HCT116 Sp1 Stanford ChIP-chip Smoothed Score (HCT116 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeUtexChip UT-Austin ChIP University of Texas, Austin ChIP-chip ENCODE Chromatin Immunoprecipitation Description ChIP-chip analysis of c-Myc and E2F4 was performed using 2091 foreskin fibroblasts and HeLa cells. ChIP was carried out from normally-growing HeLa cells and from 2091 quiescent (0.1% serum FBS), as well as serum-stimulated (10% FBS, 4hrs), fibroblasts. Microarray hybridizations were performed using NimbleGen ENCODE arrays and protocols. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. 
Methods Chromatin from each cell line under a given condition was cross-linked with 1% formaldehyde, sheared, precipitated with antibody, and reverse cross-linked to obtain enriched DNA fragments. ChIP material was amplified and hybridized to a NimbleGen ENCODE region array. The raw and processed files reflect fold enrichment over the mock ChIP sample, which was used as a reference in the hybridization. Verification Each of the four experiments has three independent biological replicates. Data from all three replicates were averaged to generate a single data file. The NimbleGen method for hit identification was used to generate the peaks at a false positive rate of <= 0.05. Credits These data were contributed by Jonghwan Kim, Akshay Bhinge, and Vishy Iyer from the Iyer lab at the University of Texas at Austin, in collaboration with Mike Singer, Nan Jiang, and Roland Green of NimbleGen Systems, Inc. Reference Kim, J., Bhinge, A., Morgan, X.C. and Iyer, V.R. Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nature Methods 2, 47-53 (2005). encodeUtexChipSuper UT-Austin ChIP University of Texas, Austin ChIP-chip and STAGE ENCODE Chromatin Immunoprecipitation Overview This super-track combines related tracks of ChIP data generated by the Iyer laboratory at The University of Texas at Austin. Two technologies are presented in this super-track: ChIP-chip and ChIP-STAGE. ChIP-chip, also known as genome-wide location analysis, is a technique for isolation and identification of DNA sequences bound by specific proteins in cells. Instead of detecting bound fragments by microarray, ChIP-STAGE uses Sequence Tag Analysis of Genomic Enrichment, or STAGE, technology by cloning STAGE tags, sequencing and mapping to the human genome. These tracks contain ChIP data for several transcription factors, including c-Myc, E2F4 and STAT1, in cell lines including 2091 (foreskin fibroblast) and HeLa (cervical carcinoma). 
Credits ChIP-chip data were contributed by Jonghwan Kim, Akshay Bhinge, and Vishy Iyer from the Iyer lab at The University of Texas at Austin, in collaboration with Mike Singer, Nan Jiang, and Roland Green of NimbleGen Systems, Inc. ChIP-STAGE data were contributed by Jonghwan Kim, Akshay Bhinge, and Vishy Iyer from the Iyer lab, and by Ghia Euskirchen and Michael Snyder of the Snyder lab at Yale University. References Bhinge AA, Kim J, Euskirchen G, Snyder M, Iyer VR. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res. 2007 Jun;17(6):910-6. Kim J, Bhinge A, Morgan XC, Iyer VR. Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nat Methods. 2005 Jan;2(1):47-53. encodeUtexChip2091fibE2F4Peaks UT E2F4 st-Fb Pk University of Texas, Austin ChIP-chip (E2F4, 2091 fibroblasts) Peaks ENCODE Chromatin Immunoprecipitation encodeUtexChip2091fibMycStimPeaks UT Myc st-Fb Pk University of Texas, Austin ChIP-chip (c-Myc, FBS-stimulated 2091 fibroblasts) Peaks ENCODE Chromatin Immunoprecipitation encodeUtexChip2091fibMycPeaks UT Myc Fb Pk University of Texas, Austin ChIP-chip (c-Myc, 2091 fibroblasts) Peaks ENCODE Chromatin Immunoprecipitation encodeUtexChipHeLaMycPeaks UT Myc HeLa Pk University of Texas, Austin ChIP-chip (c-Myc, HeLa) Peaks ENCODE Chromatin Immunoprecipitation encodeUtexChip2091fibE2F4Raw UT E2F4 Fb University of Texas, Austin ChIP-chip (E2F4, 2091 fibroblasts) ENCODE Chromatin Immunoprecipitation encodeUtexChip2091fibMycStimRaw UT Myc st-Fb University of Texas, Austin ChIP-chip (c-Myc, FBS-stimulated 2091 fibroblasts) ENCODE Chromatin Immunoprecipitation encodeUtexChip2091fibMycRaw UT Myc Fb University of Texas, Austin ChIP-chip (c-Myc, 2091 fibroblasts) ENCODE Chromatin Immunoprecipitation encodeUtexChipHeLaMycRaw UT Myc HeLa University of Texas, Austin ChIP-chip (c-Myc, HeLa) ENCODE Chromatin Immunoprecipitation encodeUtexStage UT-Austin STAGE University of 
Texas, Austin STAGE (Sequence Tag Analysis of Genomic Enrichment) ENCODE Chromatin Immunoprecipitation Description This track shows putative binding loci of c-Myc and STAT1 as determined by Sequence Tag Analysis of Genomic Enrichment (STAGE). The c-Myc (cellular myelocytomatosis) protein is a transcription factor associated with cell proliferation, differentiation, and neoplastic disease. STAT1 is a signal transducer and transcription factor that binds to IFN-gamma activating sequence. STAGE was performed in HeLa cells under normal growth conditions (10% Fetal Bovine Serum) with anti-Myc, or in IFN-gamma stimulated cells with anti-STAT1 antibody. Cloned STAGE tags were sequenced and mapped to the human genome as described in Kim et al. (2005), referenced below. The Tags subtrack shows all STAGE tags within the ENCODE region and thus represents the raw data. The Peaks subtrack shows high confidence c-Myc binding regions derived from the STAGE tags. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. To display only one of the subtracks, uncheck the boxes next to the track you wish to hide. Methods Each tag was assigned a probability of enrichment calculated from the frequency of occurrence of the tag in the STAGE sequencing pool and the number of times the tag is present in the genome, assuming a binomial distribution. Generally, tags that have a low frequency of occurrence in the sequencing pool and a high genomic frequency were assigned low probabilities of enrichment. Peaks were determined by using a 500 bp window to scan across each chromosome. Each window was assigned a probability based on the tags mapped within that window as described in Bhinge et al. referenced below. Verification For c-Myc, scores generated from the real data were compared to simulations where similar numbers of tags were randomly sampled from the genome. 
With probabilities calculated as above, a probability cut-off of 0.8 gave a false positive rate of less than 0.05. For STAT1, scores generated from the real data were compared to simulations where similar numbers of tags were randomly sampled from the genome. With probabilities calculated as described, a probability cut-off of 0.95 gave a false positive rate of less than 0.01. Additionally, 10 STAGE-detected STAT1 binding sites were assayed by qPCR analysis and 9 out of 10 were confirmed as true positives, so the false positive rate is estimated at 10%. Credits These data were contributed by Jonghwan Kim, Akshay Bhinge, and Vishy Iyer from the Iyer lab at the University of Texas at Austin, and by Ghia Euskirchen and Michael Snyder of the Snyder lab at Yale University. References Kim, J., Bhinge, A., Morgan, X.C. and Iyer, V.R. Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nature Methods 2, 47-53 (2005). Bhinge, A.A., Kim, J., Euskirchen, G., Snyder, M. and Iyer, V.R. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Research 17(6), 910-916 (2007). encodeUtexStageMycHelaPeaks UT Myc HeLa Pk University of Texas, Austin STAGE (c-Myc, HeLa) Peaks ENCODE Chromatin Immunoprecipitation encodeUtexStageCMycHelaTags UT Myc HeLa Tags University of Texas, Austin STAGE (c-Myc, HeLa) Tags ENCODE Chromatin Immunoprecipitation encodeUtexStageStat1HelaTags UT STAT1 HeLa Tags University of Texas, Austin STAGE (STAT1, HeLa) Tags ENCODE Chromatin Immunoprecipitation encodeUvaDnaRep UVa DNA Rep University of Virginia Temporal Profiling of DNA Replication ENCODE Chromosome, Chromatin and DNA Structure Description The five subtracks in this annotation correspond to five different time points relative to the start of the DNA synthesis phase (S-phase) of the cell cycle. Display Conventions and Configuration Regions that are replicated during the given time interval are shown in green. 
Varying shades of green are used to distinguish one subtrack from another. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods The experimental strategy adopted to map this profile involved isolation of replication products from HeLa cells synchronized at the G1-S boundary by thymidine-aphidicolin double block. Cells released from the block were labeled with BrdU at two-hour intervals throughout the 10 hours of S-phase, and DNA was isolated from them. The heavy-light (H/L) DNA representing the pool of DNA replicated during each two-hour labeling period was separated from the unlabeled DNA by double cesium chloride density gradient centrifugation. The purified heavy-light DNA was then hybridized to a high-density genome-tiling Affymetrix array comprised of all unique probes within the ENCODE regions. The raw data generated by the microarray experiments were processed by computing the enrichment of signal in a particular part of the S-phase relative to the entirety of the S-phase (10 hours). High confidence regions (P-value = 1E-04) of replication were mapped by applying the Wilcoxon Rank Sum test in a sliding window of size 10 kb using the standard Affymetrix data analysis tools and the April 2003 (hg15) version of the human genome assembly. These coordinates were then mapped to the July 2003 (hg17) assembly by UCSC using the liftOver tool. Verification The submitted data are from two biological experimental sets. Regions of significant enrichment were included from both of the biological replicates. Credits Data generation and analysis for this track were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. 
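To make the window-level significance test in the Methods above more concrete, here is a small standard-library sketch of a one-sided Wilcoxon rank-sum test applied to the probe signals that fall in one window. It is not the Affymetrix implementation: the function name `rank_sum_p` is hypothetical, the p-value uses the normal approximation, and no tie or continuity correction is applied to the variance.

```python
import math


def rank_sum_p(treatment, control):
    """One-sided Wilcoxon rank-sum p-value for treatment > control.

    Assumptions: normal approximation to the rank-sum distribution,
    average ranks for ties, no tie correction of the variance.
    """
    pooled = sorted([(v, 0) for v in treatment] + [(v, 1) for v in control])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(treatment), len(control)
    rank_sum = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


if __name__ == "__main__":
    enriched = list(range(10, 20))  # illustrative probe signals in one window
    background = list(range(0, 10))
    print(rank_sum_p(enriched, background))
```

In a sliding-window scan, a function like this would be called once per window position, and windows whose p-value falls below the chosen threshold would be reported as enriched.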
References Jeon, Y., Bekiranov, S., Karnani, N., Kapranov, P., Ghosh, S., MacAlpine, D., Lee, C., Hwang, D.S., Gingeras, T.R. and Dutta, A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A 102(18), 6419-24 (2005). encodeUvaDnaRepSuper UVa DNA Rep University of Virginia DNA Replication Timing and Origins ENCODE Chromosome, Chromatin and DNA Structure Overview This super-track combines related tracks of DNA replication data from the University of Virginia. DNA replication is carefully coordinated, both across the genome and with respect to development. Earlier replication in S-phase is broadly correlated with gene density and transcriptional activity. These tracks contain temporal profiling of DNA replication and origins of DNA replication in multiple cell lines, such as HeLa cells (cervix carcinoma). Replication timing was measured by analyzing BrdU-labeled fractions from synchronized cells on tiling arrays. Credits Data generation and analysis for this track were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Giacca M, Pelizon C, Falaschi A. Mapping replication origins by quantifying relative abundance of nascent DNA strands using competitive polymerase chain reaction. Methods. 1997 Nov;13(3):301-12. Mesner LD, Crawford EL, Hamlin JL. Isolating apparently pure libraries of replication origins from complex genomes. Mol Cell. 2006 Mar 3;21(5):719-26. Jeon Y, Bekiranov S, Karnani N, Kapranov P, Ghosh S, MacAlpine D, Lee C, Hwang DS, Gingeras TR, Dutta A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A. 2005 May 3;102(18):6419-24. 
encodeUvaDnaRep8 UVa DNA Rep 8h University of Virginia Temporal Profiling of DNA Replication (8-10 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep6 UVa DNA Rep 6h University of Virginia Temporal Profiling of DNA Replication (6-8 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep4 UVa DNA Rep 4h University of Virginia Temporal Profiling of DNA Replication (4-6 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep2 UVa DNA Rep 2h University of Virginia Temporal Profiling of DNA Replication (2-4 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep0 UVa DNA Rep 0h University of Virginia Temporal Profiling of DNA Replication (0-2 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepSeg UVa DNA Rep Seg University of Virginia DNA Replication Temporal Segmentation ENCODE Chromosome, Chromatin and DNA Structure Description The four subtracks in this annotation correspond to replication timing categories for DNA synthesis. Replication is segregated into early specific (Early), mid specific (Mid), late specific (Late), and non-specific (PanS). The first three categories correspond to regions that replicated in a time point-specific manner; the last category encompasses regions that replicated in a temporally non-specific manner. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods The experimental strategy adopted to map this profile involved isolation of replication products from HeLa cells synchronized at the G1-S boundary by thymidine-aphidicolin double block. Cells released from the block were labeled with BrdU at every two-hour interval of S-phase and DNA was isolated from them. 
The heavy-light (H/L) DNA representing the pool of DNA replicated during each two-hour labeling period was separated from unlabeled DNA by double cesium chloride density gradient centrifugation. The purified H/L DNA was then hybridized to a high-density genome-tiling Affymetrix array comprised of all unique probes within the ENCODE regions. The time of replication of 50% (TR50) of each microarray probe was calculated by accumulating the sum over the five time points and linearly interpolating the time when 50% was reached. Each probe was also classified as temporally specific or non-specific based on whether or not at least 50% of the accumulated signal appeared in a single time point. The TR50 data was then analyzed within a 20 kb sliding window to classify regions as specific versus non-specific based on the ratio of specific to non-specific probes within the window. Specific regions were further classified as early, mid, or late replicating based on the average TR50 of specific probes within the window. The resulting regions form a non-overlapping segregation of the replication data into the four given categories of replication timing. Verification The replication experiments were completed for two biological sets in the HeLa-adherent cell line. Credits Data generation and analysis for this track were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Jeon, Y., Bekiranov, S., Karnani, N., Kapranov, P., Ghosh, S., MacAlpine, D., Lee, C., Hwang, D.S., Gingeras, T.R. and Dutta, A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A 102(18), 6419-24 (2005). 
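The TR50 calculation and the specific versus non-specific probe classification described in the Methods above can be illustrated with a short sketch. The function names are hypothetical, and two assumptions are made that the text does not state: signal is treated as accumulating linearly within each two-hour interval, and per-interval signals are taken to be non-negative.

```python
def tr50(signals, interval_hours=2.0):
    """Time (in hours) at which 50% of a probe's total signal is reached.

    `signals` holds one value per labeling interval (five in the track).
    Assumption: signal accumulates linearly within each interval.
    """
    total = sum(signals)
    if total <= 0:
        raise ValueError("total signal must be positive")
    half = total / 2
    cumulative = 0.0
    for index, signal in enumerate(signals):
        if cumulative + signal >= half:
            # linear interpolation inside the interval that crosses 50%
            fraction = (half - cumulative) / signal
            return (index + fraction) * interval_hours
        cumulative += signal
    return len(signals) * interval_hours  # only reachable through rounding


def is_temporally_specific(signals):
    """True if at least 50% of the accumulated signal is in one time point."""
    return max(signals) >= 0.5 * sum(signals)


if __name__ == "__main__":
    probe = [1, 1, 1, 1, 1]
    print(tr50(probe), is_temporally_specific(probe))
```

A probe with equal signal in all five intervals reaches 50% at the midpoint of S-phase (5 hours) and is classified as non-specific, while a probe whose signal is concentrated in the first interval has a TR50 within the first two hours and is classified as specific.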
encodeUvaDnaRepPanS UVa DNA Rep PanS University of Virginia Temporal Profiling of DNA Replication (PanS) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepLate UVa DNA Rep Late University of Virginia Temporal Profiling of DNA Replication (Late) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepMid UVa DNA Rep Mid University of Virginia Temporal Profiling of DNA Replication (Mid) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepEarly UVa DNA Rep Early University of Virginia Temporal Profiling of DNA Replication (Early) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepOrigins UVa DNA Rep Ori University of Virginia DNA Replication Origins ENCODE Chromosome, Chromatin and DNA Structure Description The subtracks within this annotation show replication origins identified using the nascent strand method (Ori-NS), the bubble trapping method (Ori-Bubble) and the TR50 local minima method (Ori-TR50). Tracks are available for HeLa cells (cervix carcinoma) for all methods and GM06990 cells (lymphoblastoid) for Ori-NS. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. To show only selected subtracks within this annotation, uncheck the boxes next to the tracks you wish to hide. Nascent Strand Method (Ori-NS) Description ENCODE region-wide mapping of replication origins was performed. Origin-centered nascent-strands purified from HeLa and GM06990 cell lines were hybridized to Affymetrix ENCODE tiling arrays. Methods Cells in their exponential stage of growth were labeled, in culture, with bromodeoxyuridine (BrdU) for 30 mins. DNA was then isolated from the cells. Nascent strands of 0.5-2.5 kb synthesized with incorporation of BrdU, representing the replication origins, were purified using a sucrose gradient followed by immunoprecipitation with BrdU antibody (Giacca et al., 1997). 
The purified nascent strands were amplified and then hybridized to Affymetrix ENCODE tiling arrays, which have 25-mer probes tiled every 22 bp, on average, in the non-repetitive sequence of the ENCODE regions. As an experimental control, genomic DNA was hybridized to arrays independently. Replication origins were identified by estimating the significance of the enrichment of nascent strands DNA (treatment) signal over genomic DNA (control) signal in a sliding window of 1000 bp. An estimate of significance in the window was calculated by computing the p-value using the Wilcoxon Rank-Sum test over all three biological replicates and control signal estimates in that window. The origins (Ori-NS) represented in the subtrack are the genomic regions that showed a signal enrichment pValue Verification The origin mapping experiments were completed for three biological sets. Credits Data generation and analysis for the subtracks using the Ori-NS method were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Ankit Malhotra, Gabe Robins and Anindya Dutta. Christopher Taylor and Neerja Karnani prepared the data for presentation in the UCSC Genome Browser. References Giacca M, Pelizon C, Falaschi A. Mapping replication origins by quantifying relative abundance of nascent DNA strands using competitive polymerase chain reaction. Methods. 1997;13(3):301-12. Bubble Trapping Method (Ori-Bubble) Description ENCODE region-wide mapping of replication origins in HeLa cells was performed by the bubble trapping method. Replication origins were identified by hybridization to Affymetrix ENCODE tiling arrays. Methods The bubble trapping method works on the principle that circular plasmids can be trapped in gelling agarose followed by the application of electrical current for a prolonged period of time (see Mesner et al. 2006 for more details). 
Entrapment occurs by an apparent physical linkage of the circular DNA with the agarose matrix. The circular bubble component of the DNA replication intermediates was therefore enriched by agarose trapping. After recovery from the agarose gel, a library of the entrapped DNA was formed by DNA cloning. Subsequently, DNA from the library was labeled and hybridized to Affymetrix ENCODE tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. As an experimental control, genomic DNA was hybridized to arrays independently. Replication origins were identified by estimating the significance of the enrichment of the bubble-trapped DNA (treatment) signal over genomic DNA (control) signal in a sliding window of 10,000 bp. An estimate of significance in the window was calculated by computing the p-value using the Wilcoxon Rank-Sum test over all three biological replicates and the control signal estimates in that window. The origins (Ori-Bubble) hence represented in the UCSC browser track are the genomic regions that showed a signal enrichment pValue Verification The origin mapping experiments were completed for two biological sets. Credits Data generation and analysis for the subtrack using the Ori-bubble method were performed by the DNA replication group in the Dutta Lab and Hamlin Lab at the University of Virginia: Neerja Karnani, Larry Mesner, Christopher Taylor, Ankit Malhotra, Gabe Robins, Anindya Dutta and Joyce Hamlin. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Mesner LD, Crawford EL, Hamlin JL. Isolating apparently pure libraries of replication origins from complex genomes. Mol Cell. 2006 Mar 3;21(5):719-26. TR50 local minima method (Ori-TR50) Description ENCODE region-wide mapping of replication origins in HeLa cells was performed by the TR50 local minima method. Replication origins were identified by hybridization to Affymetrix ENCODE tiling arrays. 
Methods The experimental strategy adopted to map this profile involved isolation of replication products from HeLa cells synchronized at the G1-S boundary by thymidine-aphidicolin double block. Cells released from the block were labeled with BrdU at every two-hour interval of the 10 hours of S-phase. Subsequently, DNA was isolated from the cells. The heavy-light (H/L) DNA representing the pool of DNA replicated during each two-hour labeling period was separated from the unlabeled DNA by double cesium chloride density gradient centrifugation. The purified H/L DNA was then hybridized to a high-density genome-tiling Affymetrix array comprised of all unique probes within the ENCODE regions. The time of replication of 50% (TR50) of each microarray probe was calculated by accumulating the sum over the five time points and linearly interpolating the time when 50% was reached. Each probe was also classified as showing temporally specific replication (all alleles replicating together within a two-hour window) or temporally non-specific replication (at least one allele replicating apart from the others by at least a two hour difference). The TR50 data for the temporally specific probes was then smoothed within a 60 kb window using lowess smoothing. Local minima (within a 30 kb window) on the smoothed TR50 curve were identified which had at least 30 probes in the window on both sides of the minimum to locate possible origins of replication. A confidence value was calculated for each site as the average difference from the value of the local minimum of all TR50 values falling into the 30 kb window. Verification The replication experiments were completed for two biological sets and a technical replicate in the HeLa adherent cell line. 
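The TR50 calculation described above (accumulate the signal over the five labeling intervals, then linearly interpolate the time at which 50% is reached) can be sketched as follows. This is an illustrative sketch only, not the Dutta Lab code: the function name `tr50`, the assumption that signals are non-negative, and the choice to interpolate linearly within the two-hour interval where the halfway point falls are all assumptions made for the example.

```python
def tr50(signals, interval_hours=2.0):
    """Estimate the time (hours into S-phase) at which 50% of a probe's
    cumulative signal has been reached.

    signals: per-interval signal for consecutive labeling windows
             (0-2 h, 2-4 h, ...); assumed non-negative (an assumption).
    """
    total = sum(signals)
    if total <= 0:
        raise ValueError("probe has no positive signal")
    half = 0.5 * total
    cumulative = 0.0
    for i, s in enumerate(signals):
        if cumulative + s >= half:
            # Fraction of this interval needed to reach the 50% mark.
            frac = (half - cumulative) / s if s > 0 else 0.0
            return (i + frac) * interval_hours
        cumulative += s
    # Only reachable through floating-point rounding.
    return len(signals) * interval_hours


# Five two-hour intervals with equal signal: halfway point is mid S-phase.
print(tr50([1, 1, 1, 1, 1]))
```

A probe whose signal is concentrated in the first interval gets a small TR50 (early replication), and one concentrated in the last interval gets a TR50 near 10 hours.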
Credits Data generation and analysis for the subtrack using the Ori-TR50 method were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Jeon Y, Bekiranov S, Karnani N, Kapranov P, Ghosh S, MacAlpine D, Lee C, Hwang DS, Gingeras TR, Dutta A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A. 2005 May 3;102(18):6419-24. encodeUvaDnaRepOriginsTR50Hela UVa Ori-TR50 HeLa University of Virginia DNA Replication Origins, Ori-TR50, HeLa ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepOriginsBubbleHela UVa Ori-Bubble HeLa University of Virginia DNA Replication Origins, Ori-Bubble, HeLa ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepOriginsNSHela UVa Ori-NS HeLa University of Virginia DNA Replication Origins, Ori-NS, HeLa ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepOriginsNSGM UVa Ori-NS GM University of Virginia DNA Replication Origins, Ori-NS, GM06990 ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRepTr50 UVa DNA Rep TR50 University of Virginia DNA Smoothed Timing at 50% Replication ENCODE Chromosome, Chromatin and DNA Structure Description This annotation shows smoothed replication timing for DNA synthesis as the time of 50% replication (TR50). Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. 
For more information about the graphical configuration options, click the Graph configuration help link. Methods The experimental strategy adopted to map this profile involved isolation of replication products from HeLa cells synchronized at the G1-S boundary by thymidine-aphidicolin double block. Cells released from the block were labeled with BrdU at every two-hour interval of the 10 hours of S-phase and DNA was isolated from them. The heavy-light (H/L) DNA representing the pool of DNA replicated during each two-hour labeling period was separated from the unlabeled DNA by double cesium chloride density gradient centrifugation. The purified H/L DNA was then hybridized to a high-density genome-tiling Affymetrix array comprised of all unique probes within the ENCODE regions. The time of replication of 50% (TR50) of each microarray probe was calculated by accumulating the sum over the five time points and linearly interpolating the time when 50% was reached. Each probe was also classified as temporally specific or non-specific based on whether at least 50% of the accumulated signal appeared in a single time point or not. The TR50 data for all specific probes were then lowess-smoothed within a 60 kb window to provide the profile displayed in the annotation. Verification The replication experiments were completed for two biological sets in the HeLa adherent cell line. Credits Data generation and analysis for this track were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Jeon, Y., Bekiranov, S., Karnani, N., Kapranov, P., Ghosh, S., MacAlpine, D., Lee, C., Hwang, D.S., Gingeras, T.R. and Dutta, A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A 102(18), 6419-24 (2005). 
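The temporal-specificity rule described in the Methods of the TR50 track above (a probe is specific when at least 50% of its accumulated signal appears in a single time point) can be illustrated with a short sketch. The function name and the handling of probes with no signal are assumptions made for this example, not part of the published analysis.

```python
def is_temporally_specific(signals, threshold=0.5):
    """Return True when one labeling interval carries at least `threshold`
    of the probe's accumulated signal.

    signals: per-interval signal values for one probe (assumed non-negative).
    """
    total = sum(signals)
    if total <= 0:
        # No usable signal: treated as non-specific here (an assumption).
        return False
    return max(signals) / total >= threshold


print(is_temporally_specific([6, 1, 1, 1, 1]))  # one interval dominates
print(is_temporally_specific([1, 1, 1, 1, 1]))  # signal spread evenly
```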
regPotential5X 5x Reg Potential 5-Way Regulatory Potential - Human, Chimp, Dog, Mouse, Rat Regulation Description This track displays regulatory potential (RP) scores computed from alignments of human (hg17), chimpanzee (panTro1), mouse (mm5), rat (rn3), and dog (canFam1). RP scores compare frequencies of short alignment patterns between known regulatory elements and neutral DNA. The sensitivity and specificity of RP scores were calibrated on the hemoglobin beta gene cluster. These calibration results suggest a threshold of ~0.00 for the identification of new putative regulatory elements. The default viewing range for this track is from 0.00 to 0.01. Score values below the 0.00 default lower limit indicate resemblance to alignment patterns typical of neutral DNA, while score values above the 0.01 default upper limit indicate very marked resemblance to alignment patterns typical of regulatory elements in the training set. The range of RP scores from 0.00 to 0.01 contains the prediction threshold suggested by calibration studies, and provides an effective visualization of the score for most genomic loci. However, the user can specify different viewing ranges if desired. Note: Absence of a score value at a given location indicates lack of sufficient alignment -- scores are computed for all regions of the reference genome in which no region of more than 100 bases lacks alignment in at least three non-human species. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Methods The comparison employs log-ratios of transition probabilities from two variable-order Markov models.
Training the score entails selecting an appropriate alphabet (alignment column symbols) and maximal order (length of the longest patterns = order + 1) for the Markov models, and estimating their transition probabilities, based on alignment data from known regulatory elements and ancestral repeats. The scores in this track are computed using a maximal order of 2. In the track, score values are displayed using a system of overlapping windows of size 100 bp along sufficiently alignable portions of the human sequence. Log-ratios are added over positions in a window, and the sum is normalized for length. Credits Work on RP scores is performed by members of the Comparative Genomics and Bioinformatics Center at Penn State University. More information on this research and the collection of known regulatory elements used in training the score can be found at this site. References Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res. 2005 Aug;15(8):1051-60. Kolbe D, Taylor J, Elnitski L, Eswara P, Li J, Miller W, Hardison RC, Chiaromonte F. Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res. 2004 Apr;14(4):700-7. regPotential7X 7X Reg Potential ESPERR Regulatory Potential (7 Species) Regulation Description This track displays regulatory potential (RP) scores computed from alignments of human, chimpanzee (panTro2), macaque (rheMac2), mouse (mm8), rat (rn4), dog (canFam2), and cow (bosTau2). RP scores compare frequencies of short alignment patterns between known regulatory elements and neutral DNA.
The sensitivity and specificity of RP scores were calibrated on the hemoglobin beta gene cluster. These calibration results suggest a threshold of ~0.00 for the identification of new putative regulatory elements. The default viewing range for this track is from 0.0 to 0.1. Score values below the 0.0 default lower limit indicate resemblance to alignment patterns typical of neutral DNA, while score values above the 0.1 default upper limit indicate very marked resemblance to alignment patterns typical of regulatory elements in the training set. The range of RP scores from 0.0 to 0.1 contains the prediction threshold suggested by calibration studies, and provides an effective visualization of the score for most genomic loci. However, the user can specify different viewing ranges if desired. Note: Absence of a score value at a given location indicates lack of sufficient alignment -- scores are computed for all regions of the reference genome in which no region of more than 100 bases lacks alignment in at least three non-human species. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Methods The comparison employs log-ratios of transition probabilities from two variable-order Markov models. Training the score entails selecting an appropriate alphabet (alignment column symbols) and maximal order (length of the longest patterns = order + 1) for the Markov models, and estimating their transition probabilities, based on alignment data from known regulatory elements and ancestral repeats. The scores in this track are computed using a maximal order of 2. In the track, score values are displayed using a system of overlapping windows of size 100 bp along sufficiently alignable portions of the human sequence. Log-ratios are added over positions in a window, and the sum is normalized for length.
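The window score described in the Methods above (log-ratios of transition probabilities summed over the positions in a window, then normalized for length) can be sketched roughly as below. This is a simplified illustration and not the Penn State implementation: it uses a fixed-order model rather than a variable-order one, the probability tables `reg_probs` and `neutral_probs` are hypothetical toy inputs, and dividing by the number of scored positions is one assumed reading of "normalized for length".

```python
import math


def rp_window_score(symbols, reg_probs, neutral_probs, order=2):
    """Length-normalized sum of log-ratios over one window.

    symbols:       sequence of alignment-column symbols for the window.
    reg_probs:     dict mapping (context tuple, symbol) -> transition
                   probability under the regulatory-element model.
    neutral_probs: same mapping under the neutral-DNA model.
    order:         fixed context length (a simplification of the
                   variable-order models named in the text).
    """
    total = 0.0
    scored = 0
    for i in range(order, len(symbols)):
        key = (tuple(symbols[i - order:i]), symbols[i])
        total += math.log(reg_probs[key] / neutral_probs[key])
        scored += 1
    return total / scored if scored else 0.0


# Toy order-1 models over a two-symbol alphabet (illustrative values only).
reg = {(("a",), "b"): 0.8, (("b",), "a"): 0.5}
neutral = {(("a",), "b"): 0.4, (("b",), "a"): 0.5}
print(rp_window_score(["a", "b", "a", "b"], reg, neutral, order=1))
```

With this construction, a positive score means the window's patterns are more probable under the regulatory model than under the neutral model, and identical models give a score of zero.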
Credits Work on RP scores is performed by members of the Comparative Genomics and Bioinformatics Center at Penn State University. More information on this research and the collection of known regulatory elements used in training the score can be found at this site. References King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res. 2005 Aug;15(8):1051-60. Kolbe D, Taylor J, Elnitski L, Eswara P, Li J, Miller W, Hardison RC, Chiaromonte F. Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res. 2004 Apr;14(4):700-7. acescan ACEScan ACEScan Alternative Conserved Human-Mouse Exon Predictions Genes and Gene Predictions Description This track shows Alternative Conserved Exons (human-mouse conservation) predicted by ACEScan. These are exons that are present in some transcripts, but skipped by alternative splicing in other transcripts in both human and mouse. Alternate use of skipped exons has important consequences during gene expression and in disease. Methods Putative alternative conserved exons on mRNAs were identified using a machine-learning algorithm, Regularized Least-Squares Classification. Characteristics of known exons that have been skipped in both human and mouse mRNAs were determined by considering factors such as exon and intron length, splice-site strength, sequence conservation, and region-specific oligonucleotide composition. A training set was made by comparing known exons that are skipped in some transcripts to exons never skipped. These characteristics were then applied to the whole genome to predict skipped exons in other transcripts. This track displays exons with positive ACEScan scores. For further details of the method used to generate this annotation, please refer to Yeo et al. (2005).
Credits Thanks to Gene Yeo at the Crick-Jacobs Center, Salk Institute and Christopher Burge, MIT, for providing this annotation. For additional information on ACEScan predictions please contact geneyeo@salk.edu or cburge@mit.edu. References Yeo GW, Van Nostrand E, Holste D, Poggio T, Burge CB. Identification and analysis of alternative splicing events conserved in human and mouse. Proc Natl Acad Sci U S A. 2005 Feb 22;102(8):2850-5. affyHumanExon Affy All Exon Affymetrix All Exon Chips Expression Methods RNA (from a commercial source) from 11 tissues was hybridized to Affymetrix Human Exon 1.0 ST arrays. For each tissue, 3 replicate experiments were done for a total of 33 arrays. The arrays' raw signal intensity was normalized with a quantile normalization method, then run through the PLIER algorithm. The normalized data were then converted to log-ratios, which are displayed as green for negative log-ratios (underexpression), and red for positive (overexpression). The probe set for this microarray track can be displayed by turning on the Affy HuEx 1.0 track. Credits The data for this track were provided and analyzed by Chuck Sugnet at Affymetrix. Links Affymetrix Human Exon 1.0 ST array web page Affymetrix Human Exon data PLIER algorithm documentation (PDF). affyGnf1h Affy GNF1H Alignments of Affymetrix Consensus/Exemplars from GNF1H Expression Description This track shows the location of the sequences used for the selection of probes on the Affymetrix GNF1H chips. This contains 11406 predicted genes that do not overlap with the Affy U133A chip. Methods The sequences were mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.3 -minAli=0.95 -nearTop=0.005 Credits Thanks to the Genomics Institute of the Novartis Research Foundation (GNF) for the data underlying this track. References Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al.
A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7. PMID: 15075390; PMC: PMC395923 affyHuEx1 Affy HuEx 1.0 Affymetrix Human Exon 1.0 Probe Sets Expression Description The Human Exon 1.0 ST GeneChip contains over 1.4 million probe sets designed to interrogate individual exons rather than the 3' ends of transcripts as in traditional GeneChips. Exons were derived from a variety of annotations that have been divided into the classes Core, Extended and Full. Core: RefSeq transcripts, full-length GenBank mRNAs Extended: dbEst alignments, Ensembl annotations, syntenic mRNA from rat and mouse, microRNA annotations, MITOMAP annotations, Vega genes, Vega pseudogenes Full: Geneid genes, Genscan genes, Genscan Subopt, Exoniphy, RNA genes, SGP genes, Twinscan genes Probe sets are colored by class with the Core probe sets being the darkest and the Full being the lightest color. Additionally, probe sets that do not overlap the exons of a transcript cluster, but fall inside of its introns, are considered bounded by that transcript cluster and are colored slightly lighter. Probe sets that overlap the coding portion of the Core class are colored slightly darker. The microarray track using this probe set can be displayed by turning on the Affy All Exon track. Credits and References The exons interrogated by the probe sets displayed in this track are from the Affymetrix Human Exon 1.0 GeneChip and were derived from a number of sources. In addition to the millions of cDNA sequences contributed to the GenBank, dbEst and RefSeq databases by individual labs and scientists, the following annotations were used: Ensembl: Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al. The Ensembl genome database project. Nucleic Acids Research. 2002 Jan 1;30(1):38-41. Exoniphy: Siepel, A., Haussler, D. Computational identification of evolutionarily conserved exons. Proc. 8th Int'l Conf.
on Research in Computational Molecular Biology, 177-186 (2004). Geneid Genes: Parra, G., Blanco, E., Guigo, R. Geneid in Drosophila. Genome Res. 10(4), 511-515 (2000). Genscan Genes: Burge, C., Karlin, S. Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268(1), 78-94 (1997). microRNA: Griffiths-Jones, S. The microRNA Registry. Nucl. Acids Res. 32, D109-D111 (2004). MITOMAP: Brandon, M. C., Lott, M. T., Nguyen, K. C., Spolim, S., Navathe, S. B., Baldi, P. & Wallace, D. C. MITOMAP: a human mitochondrial genome database--2004 update Nucl. Acids Res. 33(Database Issue):D611-613 (2005). RNA Genes: Lowe, T. M., Eddy, S. R. tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. Nucleic Acids Res., 25(5), 955-964 (1997). SGP Genes: Wiehe, T., Gebauer-Jung, S., Mitchell-Olds, T., Guigo, R. SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res., 11(9), 1574-83 (2001). Twinscan Genes: Korf, I., Flicek, P., Duan, D., Brent, M.R. Integrating genomic homology into gene structure prediction. Bioinformatics 17, S140-148 (2001). Vega Genes and Pseudogenes: The HAVANA group, Wellcome Trust Sanger Institute. encodeAffyRnaSignal Affy RNA Signal Affymetrix PolyA+ RNA Signal ENCODE Transcript Levels Description This track shows an estimate of RNA abundance (transcription) for all ENCODE regions for several cell types. Retinoic acid-stimulated HL-60 cells were harvested after 0, 2, 8, and 32 hours. Purified cytosolic polyA+ RNA from unstimulated GM06990 and HeLa cells, as well as purified polyA+ RNA from the RA-stimulated HL-60 samples, was hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Composite signals are shown in separate subtracks for each cell type and for each of the four timepoints for RA-stimulated HL-60. 
Data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different cell types and timepoints. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 101 bp window centered on each probe, an estimate of RNA abundance (signal) was found by calculating the median of all pairwise average PM-MM values, where PM is a perfect match and MM is a mismatch. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods; Cawley et al. also describes the analytical methods. Verification Three independent biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Transcribed regions were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 50 base-pairs and minimum run (MinRun) of 40 base-pairs (see the Affy TransFrags track for the merged regions). A random subset of transfrags were verified by RACE where the RACE primers were designed based on the sequences of the transfrags. 
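The MaxGap/MinRun merging step described in the Verification text above can be sketched as follows. This is a hedged illustration, not the Affymetrix pipeline: the function name, the use of a single signal `threshold` standing in for the 5% false-positive cutoff, and the choice to measure gaps and run lengths between probe start coordinates are assumptions made for the example.

```python
def call_transfrags(positions, signals, threshold, max_gap=50, min_run=40):
    """Merge positive probes into transcribed regions.

    positions: sorted probe start coordinates.
    signals:   signal estimate at each position.
    threshold: signal cutoff for calling a probe positive (hypothetical
               stand-in for the false-positive-rate cutoff).
    Returns a list of (start, end) tuples in probe-start coordinates.
    """
    frags = []
    start = end = None
    for pos, sig in zip(positions, signals):
        if sig < threshold:
            continue  # negative probe: may be bridged if the gap is small
        if start is None:
            start = end = pos
        elif pos - end <= max_gap:
            end = pos  # extend the current run across the gap
        else:
            if end - start >= min_run:
                frags.append((start, end))
            start = end = pos
    if start is not None and end - start >= min_run:
        frags.append((start, end))
    return frags


# Probes every 22 bp; an isolated pair of positives far downstream is dropped.
print(call_transfrags([0, 22, 44, 66, 200, 222], [10] * 6, threshold=5))
```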
Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and the Kevin Struhl group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Kapranov, P., Cawley, S. E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P., and Gingeras, T. R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-919 (2002). 
encodeAffyRnaHl60SignalHr32 Affy RNA RA 32h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 32hrs) Signal ENCODE Transcript Levels encodeAffyRnaHl60SignalHr08 Affy RNA RA 8h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 8hrs) Signal ENCODE Transcript Levels encodeAffyRnaHl60SignalHr02 Affy RNA RA 2h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 2hrs) Signal ENCODE Transcript Levels encodeAffyRnaHl60SignalHr00 Affy RNA RA 0h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 0hrs) Signal ENCODE Transcript Levels encodeAffyRnaHeLaSignal Affy RNA HeLa Affymetrix PolyA+ RNA (HeLa) Signal ENCODE Transcript Levels encodeAffyRnaGm06990Signal Affy RNA GM06990 Affymetrix PolyA+ RNA (GM06990) Signal ENCODE Transcript Levels encodeAffyRnaTransfrags Affy Transfrags Affymetrix PolyA+ RNA Transfrags ENCODE Transcript Levels Description This track shows the location of sites showing transcription for all ENCODE regions in several cell types, using Affymetrix arrays. Retinoic acid-stimulated HL-60 cells were harvested after 0, 2, 8, and 32 hours. Purified cytosolic polyA+ RNA from unstimulated GM06990 and HeLa cells, as well as purified polyA+ RNA from the RA-stimulated HL-60 samples, was hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Clustered sites are shown in separate subtracks for each cell type and for each of the four timepoints for RA-stimulated HL-60. Data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different cell types and timepoints. 
Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 101 bp window centered on each probe, an estimate of RNA abundance (signal) was found by calculating the median of all pairwise average PM-MM values, where PM is a perfect match and MM is a mismatch. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods; Cawley et al. also describes the analytical methods. Verification Three independent biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Transcribed regions (see the Affy RNA Signal track) were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 50 base-pairs and minimum run (MinRun) of 40 base-pairs. A random subset of transfrags were verified by RACE where the RACE primers were designed based on the sequences of the transfrags. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and the Kevin Struhl group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Kapranov, P., Cawley, S. 
E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P., and Gingeras, T. R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-919 (2002). encodeAffyRnaHl60SitesHr32 Affy RNA RA 32h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 32hrs) Sites ENCODE Transcript Levels encodeAffyRnaHl60SitesHr08 Affy RNA RA 8h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 8hrs) Sites ENCODE Transcript Levels encodeAffyRnaHl60SitesHr02 Affy RNA RA 2h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 2hrs) Sites ENCODE Transcript Levels encodeAffyRnaHl60SitesHr00 Affy RNA RA 0h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 0hrs) Sites ENCODE Transcript Levels encodeAffyRnaHeLaSites Affy RNA HeLa Affymetrix PolyA+ RNA (HeLa) Sites ENCODE Transcript Levels encodeAffyRnaGm06990Sites Affy RNA GM06990 Affymetrix PolyA+ RNA (GM06990) Sites ENCODE Transcript Levels affyTxnPhase3FragsL Affy Tx lRNA Reg Affymetrix Transcriptome Phase 3 Long RNA Fragments Expression Description This track displays transcriptome data from tiling GeneChips produced by Affymetrix. For the complete non-repetitive portion of the human genome, more than 256 million probe pairs were tiled every 5 bp in non-repeat-masked areas and hybridized to cytosolic polyA+ long RNA (>200 nucleotides) from 8 different cell lines. Note that the female cell lines HeLa, SK-N-AS, and U87MG do not contain data for chrY. For the HeLa and HepG2 cell types, samples were also produced for nuclear RNA in addition to cytosolic polyA+ RNA. For experimental details and results, see Kapranov et al. in the References section below. This track shows the coordinates of transcribed fragments (transfrags) representing long RNAs. A separate track -- "Affy Tx lRNA Sig" -- contains signals (PM-MM values) for each probe pair plotted against its genomic coordinates (see Kapranov et al. in the References section below). 
Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods For each data point, probes within 30 bp on either side were used to improve the estimate of expression level for a particular probe. This helped to smooth the data and produce a more robust estimate of the transcription level at a particular genomic location. The following analysis steps were used: Replicate arrays were quantile-normalized and the median intensity (using both PM and MM intensities) of each array was scaled to a target value of 50. The expression level was estimated for each mapped probe position by collecting all the probe pairs that fell within a window of ± 30 bp, calculating all non-redundant pairwise averages of PM - MM values of all probe pairs in the window, and taking the median of all resulting pairwise averages. The resulting signal value is the Hodges-Lehmann estimator associated with the Wilcoxon signed-rank statistic of the PM - MM values that lie within ± 30 bp of the sliding window centered at every genomic coordinate. These data are displayed in the track "Affy Tx lRNA Sig". Transfrags were determined by connecting adjacent positive probes with a maximum allowable gap of 11 nucleotides and a minimum run of 49 nucleotides (at least 11 probes with at most one negative probe between two positive probes). Transfrags were discarded if they overlapped pseudogenes or harbored low-complexity or repetitive sequences. 
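The signal estimate described in the analysis steps above (the median of all pairwise averages of PM - MM values within ± 30 bp, that is, the Hodges-Lehmann estimator) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and pairing each value with itself as well as with every other value is the standard one-sample form of the estimator, which may or may not match what "non-redundant pairwise averages" means in the original pipeline.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """Median of all pairwise (Walsh) averages, including self-pairs."""
    walsh = [(a + b) / 2.0 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


def window_signal(positions, pm_minus_mm, center, half_width=30):
    """Signal estimate at `center` from probe pairs within +/- half_width bp.

    positions:   probe coordinates.
    pm_minus_mm: PM - MM value for each probe pair.
    Returns None when no probe falls inside the window.
    """
    in_window = [v for p, v in zip(positions, pm_minus_mm)
                 if abs(p - center) <= half_width]
    if not in_window:
        return None
    return hodges_lehmann(in_window)


# The outlying probe at position 100 lies outside the window and is ignored.
print(window_signal([0, 5, 10, 100], [1, 2, 3, 1000], center=5))
```

Because it is a median of averages, this estimator is less sensitive to a single aberrant probe than a plain mean over the window, which fits the stated goal of a more robust transcription estimate.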
Credits Data generation and analysis were performed by the transcriptome group at Affymetrix with assistance from colleagues at the University of Leipzig, Fraunhofer Institute, and University of Vienna: P. Kapranov, J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel, J. Hackermüller, I.L. Hofacker, I. Bell, E. Cheung, J. Drenkow, E. Dumais, S. Patel, G. Helt, M. Ganesh, S. Ghosh, A. Piccolboni, V. Sementchenko, H. Tammana, T.R. Gingeras. Questions or comments about this annotation? Email genome@soe.ucsc.edu. References Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007 Jun 8;316(5830):1484-8. affyTxnPhase3Super Affy Txn Affy Transcriptome Phase 3 Expression Overview This super-track combines related tracks of genome-wide Transcriptome Phase 3 data generated by Affymetrix. There are four member tracks: Affymetrix Transcriptome Phase 3 Long RNA Fragments - the transcribed fragments (transfrags) representing long RNAs. Affymetrix Transcriptome Phase 3 Long RNA Signal - the signal level estimated for each mapped probe position after hybridization with long RNA samples. Affymetrix Transcriptome Phase 3 Short RNA Fragments - the transcribed fragments (transfrags) representing short RNAs. Affymetrix Transcriptome Phase 3 Short RNA Signal - the signal level estimated for each mapped probe position after hybridization with short RNA samples. Credits Data generation and analysis were performed by the transcriptome group at Affymetrix with assistance from colleagues at the University of Leipzig, Fraunhofer Institute, and University of Vienna: P. Kapranov, J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel, J. Hackermüller, I.L. Hofacker, I. Bell, E. Cheung, J. Drenkow, E. Dumais, S. Patel, G. Helt, M. Ganesh, S. Ghosh, A. Piccolboni, V. 
Sementchenko, H. Tammana, T.R. Gingeras. Questions or comments about this annotation? Email genome@soe.ucsc.edu. References Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007 Jun 8;316(5830):1484-8. affyTxnPhase3FragsU87MG U87MG lRNA Affymetrix U87MG Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsSK_N_AS SK-N-AS lRNA Affymetrix SK-N-AS Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsPC3 PC3 lRNA Affymetrix PC3 Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsNCCIT NCCIT lRNA Affymetrix NCCIT Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsJurkat Jurkat lRNA Affymetrix Jurkat Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsHepG2Nuclear HepG2 Nucl lRNA Affymetrix HepG2 Long RNA (Nuclear) Fragments Expression affyTxnPhase3FragsHepG2Cyto HepG2 Cyto lRNA Affymetrix HepG2 Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsHeLaNuclear HeLa Nucl lRNA Affymetrix HeLa Long RNA (Nuclear) Fragments Expression affyTxnPhase3FragsHeLaCyto HeLa Cyto lRNA Affymetrix HeLa Long RNA (Cytosolic) Fragments Expression affyTxnPhase3FragsHDF HDF lRNA Affymetrix HDF Long RNA (Cytosolic) Fragments Expression affyTxnPhase3L Affy Tx lRNA Sig Affymetrix Transcriptome Phase 3 Long RNA Signal Expression Description This track displays transcriptome data from tiling GeneChips produced by Affymetrix. For the complete non-repetitive portion of the human genome, more than 256 million probe pairs were tiled every 5 bp in non-repeat-masked areas and hybridized to cytosolic polyA+ long RNA (>200 nucleotides) from 8 different cell lines. Note that the female cell lines HeLa, SK-N-AS, and U87MG do not contain data for chrY. For HeLa and HepG2, samples were also produced for nuclear polyA+ RNA in addition to cytosolic polyA+ RNA. 
For experimental details and results, see Kapranov et al. in the References section below. This track contains the signal level estimated for each mapped probe position after hybridization with long RNA samples. A separate track -- "Affy Tx lRNA Reg" -- contains transcribed fragments (transfrags) representing long RNAs (see Kapranov et al. in the References section below). While the raw data are based on perfect match minus mismatch (PM - MM) probe values and may contain negative values, the track has a minimum value of zero for visualization purposes. Likewise, probes with high signal values are clipped at an intensity of 150 by default, so they appear to have similar magnitude. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods For each data point, probes within 30 bp on either side were used to improve the estimate of expression level for a particular probe. This helped to smooth the data and produce a more robust estimate of the transcription level at a particular genomic location. The following analysis steps were performed: Replicate arrays were quantile-normalized and the median intensity (using both PM and MM intensities) of each array was scaled to a target value of 50. 
The expression level was estimated for each mapped probe position by collecting all the probe pairs that fell within a window of ± 30 bp, calculating all non-redundant pairwise averages of PM - MM values of all probe pairs in the window, and taking the median of all resulting pairwise averages. The resulting signal value is the Hodges-Lehmann estimator associated with the Wilcoxon signed-rank statistic of the PM - MM values that lie within ± 30 bp of the sliding window centered at every genomic coordinate. Transfrags in the track labeled "Affy Tx lRNA Reg" were determined by connecting adjacent positive probes with a maximum allowable gap of 11 nucleotides and a minimum run of 49 nucleotides (at least 11 probes with at most one negative probe between two positive probes). Transfrags were discarded if they overlapped pseudogenes or harbored low-complexity or repetitive sequences. Credits Data generation and analysis were performed by the transcriptome group at Affymetrix with assistance from colleagues at the University of Leipzig, Fraunhofer Institute, and University of Vienna: P. Kapranov, J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel, J. Hackermüller, I.L. Hofacker, I. Bell, E. Cheung, J. Drenkow, E. Dumais, S. Patel, G. Helt, M. Ganesh, S. Ghosh, A. Piccolboni, V. Sementchenko, H. Tammana, T.R. Gingeras. Questions or comments about this annotation? Email genome@soe.ucsc.edu. References Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007 Jun 8;316(5830):1484-8. 
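The transfrag rule in the Methods above (join adjacent positive probes, allow a maximum gap of 11 nucleotides, require a minimum run of 49 nucleotides) can be sketched as follows. This is an illustrative sketch only, not the Affymetrix pipeline. The function name call_transfrags is hypothetical, and two definitions are assumptions: the gap is measured as the difference between consecutive positive probe coordinates, and the run as the last positive coordinate minus the first. Under those assumptions and an exact 5 bp spacing, a 49-nucleotide run needs 11 probes and one negative probe between two positives is tolerated, which matches the parenthetical in the text. The pseudogene and repeat filters are not modeled.

```python
def call_transfrags(positions, is_positive, max_gap=11, min_run=49):
    """Join positive probes into transfrags (illustrative sketch).

    positions   -- probe coordinates, sorted ascending
    is_positive -- parallel list of booleans (probe called positive or not)
    max_gap     -- assumed: largest allowed coordinate difference between
                   two consecutive positive probes in one transfrag
    min_run     -- assumed: smallest allowed (last - first) coordinate span
    Returns a list of (start, end) coordinate tuples.
    """
    transfrags = []
    start = last = None
    for pos, positive in zip(positions, is_positive):
        if not positive:
            continue  # negative probes only matter through the gap they leave
        if start is None:
            start = last = pos
        elif pos - last <= max_gap:
            last = pos  # close enough: extend the current run
        else:
            if last - start >= min_run:
                transfrags.append((start, last))
            start = last = pos  # gap too large: begin a new run
    if start is not None and last - start >= min_run:
        transfrags.append((start, last))
    return transfrags


# Toy example: 20 probes tiled every 5 bp.
positions = list(range(0, 100, 5))
flags = [True] * 5 + [False] + [True] * 5 + [False] * 2 + [True] * 7
print(call_transfrags(positions, flags))
```

In the toy example the first run spans coordinates 0 to 50 with a single negative probe inside it and is kept. The second run spans only 30 nucleotides and is dropped.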
affyTxnPhase3U87MG U87MG lRNA Affymetrix U87MG Long RNA (Cytosolic) Signal Expression affyTxnPhase3SK_N_AS SK-N-AS lRNA Affymetrix SK-N-AS Long RNA (Cytosolic) Signal Expression affyTxnPhase3PC3 PC3 lRNA Affymetrix PC3 Long RNA (Cytosolic) Signal Expression affyTxnPhase3NCCIT NCCIT lRNA Affymetrix NCCIT Long RNA (Cytosolic) Signal Expression affyTxnPhase3Jurkat Jurkat lRNA Affymetrix Jurkat Long RNA (Cytosolic) Signal Expression affyTxnPhase3HepG2Nuclear HepG2 Nucl lRNA Affymetrix HepG2 Long RNA (Nuclear) Signal Expression affyTxnPhase3HepG2Cyto HepG2 Cyto lRNA Affymetrix HepG2 Long RNA (Cytosolic) Signal Expression affyTxnPhase3HeLaNuclear HeLa Nucl lRNA Affymetrix HeLa Long RNA (Nuclear) Signal Expression affyTxnPhase3HeLaCyto HeLa Cyto lRNA Affymetrix HeLa Long RNA (Cytosolic) Signal Expression affyTxnPhase3HDF HDF lRNA Affymetrix HDF Long RNA (Cytosolic) Signal Expression affyTxnPhase3FragsS Affy Tx sRNA Reg Affymetrix Transcriptome Phase 3 Short RNA Fragments Expression Description This track displays transcriptome data from tiling GeneChips produced by Affymetrix. For the complete non-repetitive portion of the human genome, more than 256 million probe pairs were tiled every 5 bp in non-repeat-masked areas and hybridized to whole-cell short RNA (<200 nucleotides) from two different cell lines. Note that the female cell lines HeLa, SK-N-AS, and U87MG do not contain data for chrY. Data were collected using a strand-specific methodology that allows separate measurements for top strand (sense) and bottom strand (antisense) RNA signals. For experimental details and results, see Kapranov et al. in the References section below. This track shows the coordinates of transcribed fragments (transfrags) representing short RNAs. A separate track -- "Affy Tx sRNA Sig" -- contains signals (PM-MM values) for each probe pair plotted against its genomic coordinates (see Kapranov et al. in the References section below). 
Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods Genomic regions corresponding to the sRNA transcripts were mapped separately for the plus and minus strands of the genome in each of the HepG2 and HeLa cell lines. The following analysis steps were performed: Replicate arrays were quantile-normalized and the median intensity (using both PM and MM intensities) of each array was scaled to a target value of 20. The expression level was estimated for each mapped probe position without application of a smoothing window. These data are displayed in the track "Affy Tx sRNA Sig". This was followed by generation of transfrags as follows: Probes with intensities corresponding to the 98th percentile of all the PM-MM probe intensities were identified. Transfrags were determined by connecting adjacent positive probes with a maximum allowable gap of four nucleotides and a minimum run of seven nucleotides (at least two consecutive positive probes with no negative probes between), and had to be called positive reproducibly in both biological replicates. To further prioritize the transfrag maps, the intensities of all transfrags were calculated based on the composite signals of the two biological replicates, and transfrags in the top quartile (top 25%) were used for further analysis. In addition, several successive steps were performed to avoid potentially cross-hybridizing sequences and/or those that mapped to potentially non-unique regions. 
Transfrags were discarded if they overlapped pseudogenes or harbored low-complexity or repetitive sequences. The resulting maps are believed to represent a very conservative view of the sRNA population in the cell. Credits Data generation and analysis were performed by the transcriptome group at Affymetrix with assistance from colleagues at the University of Leipzig, Fraunhofer Institute, and University of Vienna: P. Kapranov, J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel, J. Hackermüller, I.L. Hofacker, I. Bell, E. Cheung, J. Drenkow, E. Dumais, S. Patel, G. Helt, M. Ganesh, S. Ghosh, A. Piccolboni, V. Sementchenko, H. Tammana, T.R. Gingeras. Questions or comments about this annotation? Email genome@soe.ucsc.edu. References Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007 Jun 8;316(5830):1484-8. affyTxnPhase3FragsHepG2BottomStrand HepG2 - sRNA Affymetrix HepG2 Minus Strand Short RNA (Whole Cell) Fragments Expression affyTxnPhase3FragsHepG2TopStrand HepG2 + sRNA Affymetrix HepG2 Plus Strand Short RNA (Whole Cell) Fragments Expression affyTxnPhase3FragsHeLaBottomStrand HeLa - sRNA Affymetrix HeLa Minus Strand Short RNA (Whole Cell) Fragments Expression affyTxnPhase3FragsHeLaTopStrand HeLa + sRNA Affymetrix HeLa Plus Strand Short RNA (Whole Cell) Fragments Expression affyTxnPhase3S Affy Tx sRNA Sig Affymetrix Transcriptome Phase 3 Short RNA Signal Expression Description This track displays transcriptome data from tiling GeneChips produced by Affymetrix. For the complete non-repetitive portion of the human genome, more than 256 million probe pairs were tiled every 5 bp in non-repeat-masked areas and hybridized to whole-cell short RNA (<200 nucleotides) from two different cell lines. 
Note that the female cell lines HeLa, SK-N-AS, and U87MG do not contain data for chrY. Data were collected using a strand-specific methodology that allows separate measurements for top strand (sense) and bottom strand (antisense) RNA signals. For experimental details and results, see Kapranov et al. in the References section below. This track contains the signal level estimated for each mapped probe position after hybridization with short RNA samples. A separate track named "Affy Tx sRNA Reg" contains transcribed fragments (transfrags) representing short RNAs (see Kapranov et al. in the References section below). While the raw data are based on perfect match minus mismatch (PM - MM) probe values and may contain negative values, the track has a minimum value of zero for visualization purposes. Likewise, the probes with high signal values will be cut off at the intensity of 150 by default and will appear to have similar magnitude. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods Genomic regions corresponding to the sRNA transcripts were mapped for the plus or the minus strands of the genome and for HepG2 and HeLa cell lines respectively. The following analysis steps were performed: Replicate arrays were quantile-normalized and the median intensity (using both PM and MM intensities) of each array was scaled to a target value of 20. The expression level was estimated for each mapped probe position without application of a smoothing window. 
This was followed by the generation of transfrags in the track labeled "Affy Tx sRNA Reg" as follows: Probes with intensities corresponding to the 98th percentile of all the PM-MM probe intensities were identified. Transfrags were determined by connecting adjacent positive probes with a maximum allowable gap of four nucleotides and a minimum run of seven nucleotides (at least two consecutive positive probes with no negative probes between), and had to be called positive reproducibly in both biological replicates. To further prioritize the transfrag maps, the intensities of all the transfrags were calculated based on the composite signals of the two biological replicates, and transfrags in the top quartile (top 25%) were used for further analysis. In addition, several successive steps were performed to avoid potentially cross-hybridizing sequences and/or those that mapped to potentially non-unique regions. Transfrags were discarded if they overlapped pseudogenes or harbored low-complexity or repetitive sequences. These maps are believed to represent a very conservative view of the sRNA population in the cell. Credits Data generation and analysis were performed by the transcriptome group at Affymetrix with assistance from colleagues at the University of Leipzig, Fraunhofer Institute, and University of Vienna: P. Kapranov, J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel, J. Hackermüller, I.L. Hofacker, I. Bell, E. Cheung, J. Drenkow, E. Dumais, S. Patel, G. Helt, M. Ganesh, S. Ghosh, A. Piccolboni, V. Sementchenko, H. Tammana, T.R. Gingeras. Questions or comments about this annotation? Email genome@soe.ucsc.edu. References Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007 Jun 8;316(5830):1484-8. 
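The first analysis step in the Methods above (quantile normalization of replicate arrays, then scaling each array's median intensity to a target value, 20 for this track) can be illustrated with a minimal sketch. The function names quantile_normalize and scale_to_target_median are hypothetical, each array is assumed to be a flat list of intensities with PM and MM values pooled, and ties are broken by input order. The actual Affymetrix implementation is not described in the text and may differ.

```python
from statistics import mean, median


def quantile_normalize(arrays):
    """Give every array the same intensity distribution (sketch).

    arrays -- list of equal-length lists of intensities, one per replicate.
    The value at each rank is replaced by the mean, across arrays, of the
    values at that rank.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean([s[i] for s in sorted_arrays]) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def scale_to_target_median(intensities, target):
    """Multiply all intensities so that their median equals target."""
    m = median(intensities)
    if m == 0:
        raise ValueError("cannot scale an array whose median is zero")
    factor = target / m
    return [v * factor for v in intensities]


replicates = [[5, 2, 3], [4, 1, 6]]
normalized = quantile_normalize(replicates)
scaled = [scale_to_target_median(a, 20) for a in normalized]
print(normalized)
print(scaled)
```

After the second step every replicate has a median of 20 (up to floating-point rounding), so signal values from different arrays are on a comparable scale.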
affyTxnPhase3HepG2BottomStrand HepG2 - sRNA Affymetrix HepG2 Minus Strand Short RNA (Whole Cell) Signal Expression affyTxnPhase3HepG2TopStrand HepG2 + sRNA Affymetrix HepG2 Plus Strand Short RNA (Whole Cell) Signal Expression affyTxnPhase3HeLaBottomStrand HeLa - sRNA Affymetrix Hela Minus Strand Short RNA (Whole Cell) Signal Expression affyTxnPhase3HeLaTopStrand HeLa + sRNA Affymetrix Hela Plus Strand Short RNA (Whole Cell) Signal Expression affyTxnPhase2b Affy Txn Phase2 Affymetrix Transcriptome Project Phase 2 Expression Description This track displays transcriptome data from tiling GeneChips produced by Affymetrix. For the ten chromosomes 6, 7, 13, 14, 19, 20, 21, 22, X, and Y, more than 74 million probes were tiled every 5 bp in non-repeat-masked areas and hybridized to mRNA from 11 different cell lines (some cell lines were female and contain no data for chrY). For HepG2, some samples were depleted of polyA transcripts rather than enriched. For experimental details and results, see Cheng et al. in the References section below. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Each subtrack is colored blue in areas that are thought to be transcribed at a statistically significant level as described in the accompanying transfrags (transcribed fragments) track. Transfrags that have a significant blat hit elsewhere in the genome are displayed in a lighter shade of blue, and transfrags that overlap putative pseudogenes are colored an even lighter shade of blue. 
All other regions of the track are colored brown. While the raw data are based on perfect match minus mismatch (PM - MM) probe values and may contain negative values, the track has a minimum value of zero for visualization purposes. Methods For each data point, probes within 30 bp on either side were used to improve the estimate of expression level for a particular probe. This helped to smooth the data and produce a more robust estimate of the transcription level at a particular genomic location. The following analysis method was used: Replicate arrays were quantile-normalized and the median intensity (using both PM and MM intensities) of each array was scaled to a target value of 44. The expression level was estimated for each mapped probe position by collecting all the probe pairs that fell within a window of ± 30 bp, calculating all non-redundant pairwise averages of PM - MM values of all probe pairs in the window, and taking the median of all resulting pairwise averages. The resulting signal value is the Hodges-Lehmann estimator associated with the Wilcoxon signed-rank statistic of the PM - MM values that lie within ± 30 bp of the sliding window centered at every genomic coordinate. Credits Data generation and analysis were performed by the transcriptome group at Affymetrix: Bekiranov S, Brubaker S, Cheng J, Dike S, Drenkow J, Ghosh S, Gingeras T, Helt G, Kampa D, Kapranov P, Long J, Madhavan G, Manak J, Patel S, Piccolboni A, Sementchenko V, and Tammana H. UCSC annotation performed by Chuck Sugnet. References Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005 May 20;308(5725):1149-54. 
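The windowed signal estimate described in the Methods above can be sketched in a few lines. The Hodges-Lehmann estimate is taken here as the median of all pairwise averages of the PM - MM values in the window, including each value paired with itself (the usual Walsh averages). Whether the original pipeline included the self-pairs is an assumption, as are the function names hodges_lehmann and windowed_signal and the use of a single coordinate per probe pair.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """Median of all pairwise (Walsh) averages (x_i + x_j) / 2, i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


def windowed_signal(positions, pm_minus_mm, half_window=30):
    """Estimate a signal for every probe position (illustrative sketch).

    positions    -- genomic coordinate of each probe pair
    pm_minus_mm  -- PM - MM value of each probe pair (may be negative)
    half_window  -- probes within +/- this many bp contribute to the estimate
    """
    signals = []
    for center in positions:
        in_window = [
            value
            for pos, value in zip(positions, pm_minus_mm)
            if abs(pos - center) <= half_window
        ]
        signals.append(hodges_lehmann(in_window))
    return signals


print(hodges_lehmann([1, 2, 3]))                      # 2.0
print(windowed_signal([0, 5, 100], [10, 20, -4]))     # [15.0, 15.0, -4.0]
```

The quadratic scan over positions keeps the sketch short. A production version would slide the window over sorted coordinates instead. The displayed track additionally floors values at zero, which is not done here.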
affyTxnPhase2Frag Transfrags Affymetrix Transcriptome Project Phase 2 Expression U87CytosolicPolyAPlusTnFg U87 TnFg U87 Cytosolic polyA+, Affy Transfrags Expression SKNASCytosolicPolyAPlusTnFg SK-N-AS TnFg SK-N-AS Cytosolic polyA+, Affy Transfrags Expression PC3CytosolicPolyAPlusTnFg PC3 TnFg PC3 Cytosolic polyA+, Affy Transfrags Expression NCCITCytosolicPolyAPlusTnFg NCCIT TnFg NCCIT Cytosolic polyA+, Affy Transfrags Expression JurkatCytosolicPolyAPlusTnFg Jurkat TnFg Jurkat Cytosolic polyA+, Affy Transfrags Expression HepG2NuclearPolyAMinusTnFg HepG2- Nuc TnFg HepG2 Nuclear polyA-, Affy Transfrags Expression HepG2CytosolicPolyAMinusTnFg HepG2- Cyto TnFg HepG2 Cytosolic polyA-, Affy Transfrags Expression HepG2NuclearPolyAPlusTnFg HepG2+ Nuc TnFg HepG2 Nuclear polyA+, Affy Transfrags Expression HepG2CytosolicPolyAPlusTnFg HepG2+ Cyto TnFg HepG2 Cytosolic polyA+, Affy Transfrags Expression FHs738LuCytosolicPolyAPlusTnFg FHs738Lu TnFg FHs738Lu Cytosolic polyA+, Affy Transfrags Expression A375CytosolicPolyAPlusTnFg A375 TnFg A375 Cytosolic polyA+, Affy Transfrags Expression affyTxnPhase2Tome Transcriptome Affymetrix Transcriptome Project Phase 2 Expression U87CytosolicPolyAPlusTxn U87 Txn U87 Cytosolic polyA+, Affy Transcriptome Expression SKNASCytosolicPolyAPlusTxn SK-N-AS Txn SK-N-AS Cytosolic polyA+, Affy Transcriptome Expression PC3CytosolicPolyAPlusTxn PC3 Txn PC3 Cytosolic polyA+, Affy Transcriptome Expression NCCITCytosolicPolyAPlusTxn NCCIT Txn NCCIT Cytosolic polyA+, Affy Transcriptome Expression JurkatCytosolicPolyAPlusTxn Jurkat Txn Jurkat Cytosolic polyA+, Affy Transcriptome Expression HepG2NuclearPolyAMinusTxn HepG2- Nuc Txn HepG2 Nuclear polyA-, Affy Transcriptome Expression HepG2CytosolicPolyAMinusTxn HepG2- Cyto Txn HepG2 Cytosolic polyA-, Affy Transcriptome Expression HepG2NuclearPolyAPlusTxn HepG2+ Nuc Txn HepG2 Nuclear polyA+, Affy Transcriptome Expression HepG2CytosolicPolyAPlusTxn HepG2+ Cyto Txn HepG2 Cytosolic polyA+, Affy Transcriptome 
Expression FHs738LuCytosolicPolyAPlusTxn FHs738Lu Txn FHs738Lu Cytosolic polyA+, Affy Transcriptome Expression A375CytosolicPolyAPlusTxn A375 Txn A375 Cytosolic polyA+, Affy Transcriptome Expression affyU133 Affy U133 Alignments of Affymetrix Consensus/Exemplars from HG-U133 Expression Description This track shows the location of the consensus and exemplar sequences used for the selection of probes on the Affymetrix HG-U133A and HG-U133B chips. Methods Consensus and exemplar sequences were downloaded from the Affymetrix Product Support and mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.5 -minAli=0.97 -nearTop=0.005 Credits Thanks to Affymetrix for the data underlying this track. affyU133Plus2 Affy U133Plus2 Alignments of Affymetrix Consensus/Exemplars from HG-U133 Plus 2.0 Expression Description This track shows the location of the consensus and exemplar sequences used for the selection of probes on the Affymetrix HG-U133 Plus 2.0 chip. Methods Consensus and exemplar sequences were downloaded from the Affymetrix Product Support and mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.3 -minAli=0.95 -nearTop=0.005 Credits Thanks to Affymetrix for the data underlying this track. affyU95 Affy U95 Alignments of Affymetrix Consensus/Exemplars from HG-U95 Expression Description This track shows the location of the consensus and exemplar sequences used for the selection of probes on the Affymetrix HG-U95Av2 chip. For this chip, probes are predominantly designed from consensus sequences. Methods Consensus and exemplar sequences were downloaded from the Affymetrix Product Support and mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.3 -minAli=0.95 -nearTop=0.005 Credits Thanks to Affymetrix for the data underlying this track. 
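The blat and pslReps step named in the Methods of the three Affymetrix chip tracks above can be assembled as command lines along the following lines. This is a sketch under stated assumptions. The positional argument order (blat: database, query, output; pslReps: input .psl, output .psl, output .psr) is assumed to be the usual one for these tools, all file names are hypothetical, and the helper names blat_command and psl_reps_command are introduced only for illustration. The option values are the ones quoted in the text (0.3 / 0.95 / 0.005 for HG-U95 and HG-U133 Plus 2.0, 0.5 / 0.97 / 0.005 for HG-U133).

```python
def blat_command(genome, query_fa, out_psl):
    """Build an argv list for aligning sequences to the genome with blat."""
    return ["blat", genome, query_fa, out_psl]


def psl_reps_command(in_psl, out_psl, out_psr, min_cover, min_ali, near_top):
    """Build an argv list for filtering blat output with pslReps."""
    return [
        "pslReps",
        f"-minCover={min_cover}",
        f"-minAli={min_ali}",
        f"-nearTop={near_top}",
        in_psl,
        out_psl,
        out_psr,
    ]


# Hypothetical file names; parameter values as given for the HG-U95 track.
align = blat_command("genome.2bit", "consensus.fa", "raw.psl")
filt = psl_reps_command("raw.psl", "filtered.psl", "filtered.psr", 0.3, 0.95, 0.005)
print(" ".join(align))
print(" ".join(filt))
# The lists can be passed to subprocess.run(..., check=True) where the
# tools are installed.
```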
encodeMsaAlignComp Align Agree MSA Alignment Agreement ENCODE Comparative Genomics Description This track shows the level of agreement between the multiple sequence alignments in ENCODE regions generated by the programs MAVID, MLAGAN, and TBA (v2). The alignments were taken from the Sep. 2005 ENCODE MSA freeze. Each subtrack in the annotation shows the base-by-base agreement between alignments generated by two of the three programs. An additional subtrack shows the overall mean agreement among all three alignment pairs. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods The agreement for a given human base was computed as the fraction of species other than human in which the alignments from the two programs exactly agree. Two alignments can agree by aligning the human to the same base in another species or by both aligning the human base to a gap in the other species. Note that when two programs both align a human base to gaps in all other species, there is perfect agreement. This is somewhat misleading, because the agreement in the alignment may be simply an artifact of missing sequence in the other species. To facilitate analysis in these instances, the MSA Alignment Gaps (Align Gap) annotation track has been provided to show the number of ungapped species for every base in every alignment. Credits The agreement tracks were generated by Lior Pachter and Ariel Schwartz of UC Berkeley. References Schwartz AS, Myers EW, Pachter L. Alignment metric accuracy. 
Submitted. 2007. encodeMsaAlignMlaganTba MLAGAN-TBA Agreement MLAGAN-TBA ENCODE Comparative Genomics encodeMsaAlignMavidTba MAVID-TBA Agreement MAVID-TBA ENCODE Comparative Genomics encodeMsaAlignMavidMlagan MAVID-MLAGAN Agreement MAVID-MLAGAN ENCODE Comparative Genomics encodeMsaAlignMean Mean Mean Agreement MAVID-MLAGAN-TBA ENCODE Comparative Genomics encodeMsaAlignUngapped Align Gaps MSA Alignment Gaps (#Ungapped Species) ENCODE Comparative Genomics Description This track shows the number of species aligned to the human genome in the ENCODE regions in which the alignment does not contain a gap at a particular human base position. The multiple species alignments, which were generated by the programs MAVID, MLAGAN and TBA (v2), are taken from the Sep. 2005 ENCODE MSA freeze. This track, which complements the MSA Alignment Agreement track (Align Agree), is useful for distinguishing between agreements in which both programs align the human base to the same base in another species and those instances in which both align the human base to a gap in the other species. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods See the description page for the Align Agree track for a discussion of the methods used to generate these data. Credits The agreement tracks were generated by Lior Pachter and Ariel Schwartz of UC Berkeley. References Schwartz AS, Myers EW, Pachter L. Alignment metric accuracy. Submitted. 2007. 
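The per-base agreement score of the Align Agree track and the ungapped-species count shown in this track can be illustrated with a small sketch. The data representation is an assumption made for the example: for one human base, each alignment is a dictionary mapping a non-human species name to the aligned coordinate in that species, or to None when the human base is aligned to a gap. The function names base_agreement and ungapped_species are hypothetical.

```python
def base_agreement(aln_a, aln_b):
    """Fraction of non-human species in which two alignments agree.

    Two alignments agree for a species when both align the human base to
    the same coordinate, or both align it to a gap (None).
    """
    if set(aln_a) != set(aln_b):
        raise ValueError("both alignments must cover the same species")
    if not aln_a:
        raise ValueError("at least one non-human species is required")
    agreeing = sum(1 for species in aln_a if aln_a[species] == aln_b[species])
    return agreeing / len(aln_a)


def ungapped_species(aln):
    """Number of species in which the human base is not aligned to a gap."""
    return sum(1 for coord in aln.values() if coord is not None)


# One human base as placed by two alignment programs (toy coordinates).
program_a = {"mouse": 100, "rat": None, "dog": 7}
program_b = {"mouse": 100, "rat": None, "dog": 8}
all_gaps = {"mouse": None, "rat": None, "dog": None}

print(base_agreement(program_a, program_b))  # agree in mouse and rat only
print(base_agreement(all_gaps, all_gaps))    # perfect agreement, all gaps
print(ungapped_species(program_a), ungapped_species(all_gaps))
```

The last two calls show why the two tracks are meant to be read together: an agreement of 1.0 with an ungapped count of 0 reflects only shared gaps, not a shared placement of the human base.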
encodeMsaAlignTbaUngapped TBA Ungapped Species in TBA ENCODE Comparative Genomics encodeMsaAlignMlaganUngapped MLAGAN Ungapped Species in MLAGAN ENCODE Comparative Genomics encodeMsaAlignMavidUngapped MAVID Ungapped Species in MAVID ENCODE Comparative Genomics allenBrainAli Allen Brain Allen Brain Atlas Probes Expression Description This track provides a link into the Allen Brain Atlas (ABA) images for this probe. The ABA is an extensive database of high resolution in-situ hybridization images of adult male mouse brains covering the majority of genes. Methods The ABA created a platform for high-throughput in situ hybridization (ISH) that allows a highly systematic approach to analyzing gene expression in the brain. ISH is a technique that allows the cellular localization of mRNA transcripts for specific genes. Labeled antisense probes, specific to a particular gene, are hybridized to cellular (sense) transcripts and subsequent detection of the bound probe produces specific labeling in those cells expressing the particular gene. This method involves tagged nucleotides detected by colorimetric methods. The platform used for the ABA utilizes this non-isotopic approach, with digoxigenin-labeled nucleotides incorporated into a riboprobe produced by in vitro transcription. This method produces a label that fills the cell body, in contrast to autoradiography that produces scattered silver grains surrounding each labeled cell. To enhance the ability to detect low level expression, the ABA has incorporated a tyramide signal amplification step into the protocol that greatly increases sensitivity. The specific methodology is described in detail within the ABA Data Production Processes document. Credits Thanks to the Allen Institute for Brain Science in general, and Susan Sunkin in particular, for coordinating with UCSC on this annotation. 
altLocations Alt Haplotypes Alternate Haplotypes to Reference Sequence Correspondence Mapping and Sequencing Description This track shows the correspondence between the alternate sequences/haplotypes and their locations on the assembled reference genome. altGraphX Alt-Splicing Alternative Splicing from ESTs/mRNAs mRNA and EST Description This track summarizes alternative splicing shown in the mRNA and EST tracks. The blocks represent exons; lines indicate possible splice junctions. The graphical display is drawn such that no exons overlap, making alternative events easier to view when the track is in full display mode and the resolution is set to approximately gene-level. To help reduce the noise present in the EST libraries, exons and splice junctions are filtered based on orthologous mouse transcripts and the frequency with which an exon or intron appears in human transcript libraries. Only those exons and splice junctions that have an orthologous exon or splice junction in the mouse transcriptome or are present three or more times in the human transcriptome are kept. Transcripts labeled as mRNA in GenBank are weighted more heavily, reflecting their typically higher quality. This process is similar to that presented in Sugnet, C.W. et al., Transcriptome and genome conservation of alternative splicing events in humans and mice. Pacific Symposium on Biocomputing (PSB) 2004 Online Proceedings. Methods The splicing graphs for each genome were generated separately from their native EST and mRNA transcripts using the following process: The mRNAs and ESTs were aligned to the genomic sequence using blat. A near-best-in-genome filter was applied such that only alignments with 97% identity over 90% of the transcript and with a score no more than 0.5% lower than the best score were kept. The ESTs and mRNAs were oriented by examining the splice sites used in the genomic sequence. 
Only consensus splice sites, GT->AG and the less common GC->AG, were used to orient the transcripts. Alignments were clustered together by sequence overlap in exons. As new splice sites were discovered, they were entered into the graphs as vertices; the exons, introns, and splice junctions connecting them were recorded as edges. Each graph was considered to be a single locus, although they may be fragments of an actual gene structure. The supporting mRNA and EST accessions for each edge were also stored. Truncated transcripts were extended by overlap with other transcripts to the next consensus splice site. This prevented the retention of vertices in the graph that were not true splice sites. After the splicing graphs were constructed independently for both human and mouse, they were mapped to each other using the entire set of genome mouse net alignments (viewable on the browser as the Mouse Net track). Only those exons and splice junctions that were common to both or occurred three or more times in the human transcript were kept in the splicing graph. When counting the number of times an exon or splice junction was included in the human transcripts, those designated as mRNA were weighted more heavily than those designated as EST. References For more information on the mouse net alignments, see Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Credits This annotation was generated by Chuck Sugnet of the UCSC Genome Bioinformatics Group. gold Assembly Assembly from Fragments Mapping and Sequencing Description This track shows the draft assembly of the human genome. This assembly merges contigs from overlapping drafts and finished clones into longer sequence contigs. 
The sequence contigs are ordered and oriented when possible by mRNA, EST, paired plasmid reads (from the SNP Consortium) and BAC end sequence pairs. In dense mode, this track depicts the path through the draft and finished clones (aka the golden path) used to create the assembled sequence. Clone boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist in the path, spaces are shown between the gold and brown blocks. If the relative order and orientation of the contigs between the two blocks is known, a line is drawn to bridge the blocks. Clone Type Key: F - Finished A - In Active Finishing D - Draft P - Pre-Draft O - Other Sequence augustus Augustus Augustus Gene Predictions Genes and Gene Predictions Description This track shows predictions of AUGUSTUS. AUGUSTUS is available through the GOBICS web server. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. This track also follows the display conventions for gene prediction tracks. This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. Click the Help on codon coloring link for more information about this feature. Methods AUGUSTUS uses a generalized hidden Markov model (GHMM) that models coding and non-coding sequence, splice sites, the branch point region, translation start and end, and lengths of exons and introns. This version has been trained on a set of 1284 human genes. 
Augustus Gene Predictions Using Hints This subtrack was made using hints from several other tracks: Human mRNAs and Spliced Human ESTs aligned with BLAT TransMapped RefSeq genes from mouse, rat, cow, chicken and dog Retroposed Genes Exoniphy and PhastCons Conserved Elements (from human assembly hg18) Augustus De Novo Gene Predictions This subtrack was made using only the target genome sequence and evolutionary conservation. The conservation information was extracted from the Exoniphy track and the PhastCons Conserved Elements track. Further, hints about retroposed genes were used, that are based only on previous de novo predictions of AUGUSTUS. No transcribed sequences were used for this track. Credits The Augustus subtracks were created by Mario Stanke. The TransMap track was created by Mark Diekhans, the Retroposed Genes tracks by Robert Baertsch, and the Exoniphy and PhastCons Conserved Elements tracks by Adam Siepel's group. References Stanke M. Gene prediction with a hidden Markov model. Ph.D. thesis. Universität Göttingen, Germany. 2004. Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucl Acids Res. 2004 Jul 1;32(Web Server Issue):W309-12. Stanke M, Tzvetkova A, Morgenstern B. AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biology. 2006;7(Suppl 1):S11. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Sep;19(Suppl. 2):ii215-25. augustusXRA Augustus De Novo Augustus De Novo Gene Predictions Genes and Gene Predictions augustusHints Augustus Hints Augustus Gene Predictions Using Hints Genes and Gene Predictions bacEndPairs BAC End Pairs BAC End Pairs Mapping and Sequencing Description Bacterial artificial chromosomes (BACs) are a key part of many large-scale sequencing projects. A BAC typically consists of 25 - 350 kb of DNA. 
During the early phase of a sequencing project, it is common to sequence a single read (approximately 500 bases) off each end of a large number of BACs. Later on in the project, these BAC end reads can be mapped to the genome sequence. This track shows these mappings in cases where both ends could be mapped. These BAC end pairs can be useful for validating the assembly over relatively long ranges. In some cases, the BACs are useful biological reagents. This track can also be used for determining which BAC contains a given gene, useful information for certain wet lab experiments. A valid pair of BAC end sequences must be at least 25 kb but no more than 350 kb away from each other. The orientation of the first BAC end sequence must be "+" and the orientation of the second BAC end sequence must be "-". The scoring scheme used for this annotation assigns 1000 to an alignment when the BAC end pair aligns to only one location in the genome (after filtering). When a BAC end pair or clone aligns to multiple locations, the score is calculated as 1500/(number of alignments). Methods BAC end sequences are placed on the assembled sequence using Jim Kent's blat program. Credits Additional information about the clone, including how it can be obtained, may be found at the NCBI Clone Registry. To view the registry entry for a specific clone, open the details page for the clone and click on its name at the top of the page. Some BAC library clones (RPCI-11, and others) can be ordered from BACPAC Genomics, RIKEN, or from Thermofisher and possibly other companies. yaleBertoneTars Bertone Yale TAR Yale Transcriptionally Active Regions (TARs) (Bertone data) Expression Description This track shows the locations of transcriptionally active regions (TARs)/transcribed fragments (transfrags) hybridized to an oligonucleotide microarray with a design based on human assembly hg13 (NCBI Build 31) (Bertone et al., 2004). 
Methods Microarrays were designed using sequence from the human hg13 assembly. The genome sequence was screened for repetitive elements and low-complexity DNA using RepeatMasker in the sensitive mode. Additional low-complexity filtering was performed using the NSEG (segment sequence(s) by local complexity) program using a minimum segment length of 21 nucleotides to determine low complexity segments of lowest probability. After filtering, 1.5 Gb of nonrepetitive DNA remained and microarray probes were chosen using the NASA Oligonucleotide Probe Selection Algorithm (NOPSA). NOPSA is designed to find the optimal probes for hybridization. A database of the frequency of every 18-mer in the genome was created using a hash algorithm. Chaining was used to resolve collisions. Average frequencies of 36-mers in the genome were determined from the frequencies of each 18-mer subsequence in the 36-mer and its reverse complement. 36-mer oligonucleotides with a frequency equal to one were selected as potential probes for the microarray (from supporting online material for Stolc et al., 2004). This resulted in probe selection based on several criteria: Each selected 36-mer is unique in the genome. Sequences that could form a loop with a stem of > 7 bp were excluded. Factors such as sequence length, extent of complementarity and base composition were also considered. A total of 51,874,388 36-mer oligonucleotide probes were selected from both the sense and antisense strands at an average resolution of 46 bp to cover the non-repetitive sequence from the whole genome. Probes were spaced every 10 nucleotides on average. The probes were synthesized via maskless photolithography at a feature density of approximately 390,000 probes per slide. Biological samples that were hybridized to the arrays consisted of triple-selected human liver poly(A)+ RNA pooled from several individuals (supplied by Ambion). One biological replicate was carried out.
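The 18-mer/36-mer frequency screen described in the Methods above can be illustrated with a minimal sketch. This is not the NOPSA implementation: the function names are hypothetical, a Python Counter stands in for the hash table, and counting k-mers on both strands is an assumption made here so that a probe occurring once in the genome averages to exactly one.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(genome, k):
    """Count every k-mer on both strands (assumption: both strands are indexed)."""
    counts = Counter()
    for strand in (genome, reverse_complement(genome)):
        counts.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return counts


def average_frequency(probe, counts, k):
    """Mean genome frequency of the k-mers in a probe and its reverse complement."""
    freqs = [
        counts[s[i:i + k]]
        for s in (probe, reverse_complement(probe))
        for i in range(len(s) - k + 1)
    ]
    return sum(freqs) / len(freqs)


def candidate_probes(genome, probe_len=36, k=18):
    """Return (start, probe) pairs whose average k-mer frequency equals one."""
    counts = kmer_counts(genome, k)
    selected = []
    for start in range(len(genome) - probe_len + 1):
        probe = genome[start:start + probe_len]
        if average_frequency(probe, counts, k) == 1:
            selected.append((start, probe))
    return selected
```

With small toy values such as probe_len=4 and k=3, any probe that overlaps a repeated 3-mer averages above one and is dropped. The hairpin, length, and base-composition criteria mentioned in the text are not modeled here.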
See this NCBI GEO accession for details of experimental protocols. The TARs identified for hg13 (NCBI Build 31) were mapped to this assembly using Blat. The program pslCDnaFilter was used to filter alignments using the parameters -minId=0.96, -minCover=0.25, -localNearBest=0.001, -minQSize=20, -minNonRepSize=16, -ignoreNs, -bestOverlap. Display Conventions TARs are represented by blocks in the graphical display. The numeric part of the ID displayed when the track has pack or full visibility is the ID used by the Yale Database for Active Regions with Tools (DART). A link to this database is provided on the details page for each TAR. Data Analysis Two groups of TARs were identified: Normal and Poly(A)-associated. Normal TARs: Clusters of transcription units were identified that consisted of at least five consecutive probes with fluorescence intensities in the top 90th intensity percentile and with genomic coordinates within a 250-nt window. After collecting these regions genome-wide, their locations were compared to those of annotated components of genes. As a result, a total of 13,889 transcription units, ranging in size from 209 to 3,438 nucleotides, were identified. Under the null hypothesis of zero transcription, only 400 were expected to be found. Of those regions identified, one-third (4,931) correspond to previously annotated exons while the other 8,958 are new transcribed sequences that are referred to as TARs. Poly(A)-associated TARs: Another set of criteria was used to find TARs in which the probe hybridization intensities were correlated with the presence of a polyadenylation signal 3' to the TAR. Transcription units were defined as five consecutive probes with fluorescence intensities in the top 80th intensity percentile within a window of 250 nucleotides. The 3' region also must contain or be close to a polyadenylation signal.
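The probe-run criterion used for both TAR groups (five or more consecutive probes above an intensity percentile, within a 250-nt window) might be sketched as follows. The function names are hypothetical and the details are assumptions rather than the published procedure: the cutoff uses a nearest-rank percentile, and a run of bright probes is kept when at least one stretch of five consecutive probes starts within 250 nt.

```python
import math
from itertools import groupby


def percentile_cutoff(intensities, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(intensities)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def find_transcription_units(probes, cutoff, min_probes=5, window=250):
    """Find runs of consecutive bright probes.

    probes: list of (start, intensity) tuples sorted by genomic start.
    Returns (first_start, last_start) for each qualifying run.
    """
    units = []
    for bright, group in groupby(probes, key=lambda p: p[1] >= cutoff):
        if not bright:
            continue
        starts = [start for start, _ in group]
        # Assumption: one stretch of min_probes probes inside the window is enough.
        dense = any(
            starts[i + min_probes - 1] - starts[i] <= window
            for i in range(len(starts) - min_probes + 1)
        )
        if dense:
            units.append((starts[0], starts[-1]))
    return units
```

The polyadenylation-signal test for the second group is not modeled; it would be an additional filter on the sequence 3' of each returned unit.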
Transcription units with an associated polyadenylation signal of "AATAAA" were assigned to a type I group, while those with "ATTAAA" were type II. Only 100 of these should occur at random in the genome under the null hypothesis of zero transcription. The majority (1,991) were found to be within annotated exons, and 952 were located more than 10 kb from an annotated gene. A total of 1,371 type I and 674 type II poly(A) sequences were identified within exons of known genes. 1,289 (94%) of type I and 607 (90%) of type II instances were found to be in the 3' exon of the gene. Verification The TARs were validated using RT-PCR on human liver poly(A)+ RNA. Forty-eight poly(A)-associated and 48 non-poly(A)-associated TARs were investigated. In 94% (90/96) of cases, the PCR products were found to be of the expected size in a single-pass assay. Credits These data were generated and analyzed by a collaboration between the labs of Michael Snyder, Mark Gerstein, and Sherman Weissman at Yale University and with NASA Ames Research Center (Moffett Field, California) and Eloret Corporation (Sunnyvale, California). References Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004 Dec 24;306(5705):2242-6. Stolc V, Gauhar Z, Mason C, Halasz G, van Batenburg MF, Rifkin SA, Hua S, Herreman T, Tongprasit W, Barbano PE et al. A gene expression map for the euchromatic genome of Drosophila melanogaster. Science. 2004 Oct 22;306(5696):655-60. encodeBuFirstExon BU First Exon Boston University First Exon Activity ENCODE Transcript Levels Description This track displays expression levels of computationally identified first exons and a constitutive exon of genes in ENCODE regions, based on the real competitive Polymerase Chain Reaction (rcPCR) technique described in Ding et al. (2003).
Expression levels are indicated by color, ranging from black (no expression) to red (high expression). Experiments were performed on total RNA samples of ten normal human tissues purchased from Clontech (Palo Alto, CA): cerebral cortex, colon, heart, kidney, liver, lung, skeletal muscle, spleen, stomach, and testis. The name for each alternative transcript starts with the gene name, followed by an identifier for the alternative first exon or the constitutive exon. For example, for gene CAV1, there are three alternative first exons (CAV1-E1A, CAV1-E1B, and CAV1-E1C) and the third exon is chosen as the constitutively expressed exon (CAV1-E3). Methods Alternative transcription start sites (TSS) for 20 ENCODE genes were predicted using PromoSer, an in-house computational tool. PromoSer computationally identifies the TSS by considering alignments of a large number of partial and full-length mRNA sequences and ESTs to genomic DNA, with provision for alternative promoters. In PromoSer, the treatment of alternative first exons (or the resulting TSSs) is as follows: all transcripts (mRNA, full-length mRNA and EST) from the same gene cluster are examined; individual ESTs are not considered for alternative TSSs, and only the 5'-most positions from all ESTs in the cluster are considered a potential TSS; if multiple 5'-end positions are more than 20 bp apart, they are reported as alternative TSSs. For each gene, all alternative first exons were identified based on manual selection of PromoSer predictions. An exon that is shared by all transcripts (called the constitutive exon) was also selected. The selection process involved visually examining the structure of the cluster, preferably using the latest data available on UCSC, to identify distinct first exons that were well formed (having multiple supporting sequences) and had no evidence (especially from newer sequences) of additional sequence that made them internal exons.
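The 20-bp separation rule that PromoSer applies to 5' ends, described above, might be sketched as follows. This is an illustration, not PromoSer code: the function name is hypothetical, coordinates are assumed to be on the plus strand (so the 5'-most position is the smallest), and positions are chained to their nearest neighbor, which is one of several possible readings of "more than 20 bp apart".

```python
def alternative_tss(five_prime_ends, min_separation=20):
    """Group 5'-end positions and report the 5'-most position of each group.

    A new group starts whenever a position lies more than min_separation bp
    downstream of the previous position (plus-strand coordinates assumed).
    """
    tss = []
    previous = None
    for pos in sorted(set(five_prime_ends)):
        if previous is None or pos - previous > min_separation:
            tss.append(pos)  # first, i.e. 5'-most, position of a new group
        previous = pos
    return tss
```

For example, 5' ends at 100, 105, 118, 200, 210 and 500 would be reported as three alternative TSSs at 100, 200 and 500. For a minus-strand gene the same logic would run over descending coordinates.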
After the first exon was identified, a subsequence (between 100-300 bases) was selected for use in the experiment. The selection process avoided repeat sequences as much as possible and if the two first exons partially overlapped, the non-overlapping region was selected. If those conditions caused the remaining sequence to be too short (or the first exon itself was too short), a junction with the second exon was used. A constitutive exon was also selected that was included in all (or most) of the alternative transcripts and suitable sequences were then extracted as above (no exon junctions are used). The absolute expression levels of all exons were individually quantified by rcPCR by designing four assays with PCR amplicons corresponding to each exon. Amplicons were designed according to transcript sequences and can span a large distance on the genomic sequence. In addition, some amplicons were designed across the junctions between first exons and the constitutive second exons, and thus these amplicons may overlap with the amplicons that correspond to the constitutive second exons. The rcPCR technique combined competitive PCR and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) for gene expression analysis. To measure the expression level of a gene, an oligonucleotide standard (60-80 bases) of known concentration, complementary to the target sequence with a single base mismatch in the middle, was added as the competitor for PCR. The gene of interest and the oligonucleotide standard resembled two alleles of a heterozygous locus in an allele frequency analysis experiment, and thus could be quantified by the high-throughput MALDI-TOF MS based MassARRAY system (Sequenom Inc.). After PCR, a base extension reaction was carried out with an extension primer, a ThermoSequenase and a mixture of ddNTPs/dNTP (for example, a mixture of ddA, ddC, ddT, and dG). 
The extension primer annealed to the sequence immediately 5’ upstream of the mismatch position. Depending on the nature of the mismatch and the mixture composition of ddNTPs/dNTP, one or two bases were added to the extension primer, producing two extension products with one base-length difference. These two extension products were then detected and quantified by MALDI-TOF MS. Expression ratios (e.g. CAV1-E1A/CAV1-E3, CAV1-E1B/CAV1-E3, CAV1-E1C/CAV1-E3) indicate the relative abundance of alternative first exons. 18S rRNA was used for exon absolute expression normalization among different tissues. Values shown on this track represent the relative abundance of the alternative first exons with respect to the 18S rRNA. The raw values have been log10 transformed and scaled to show graded colors on the browser. Verification One biological replicate was performed for each gene. Two to four competitor concentrations were used to detect the expression level of each exon. Two to six technical replicates were performed for each competitor concentration. One more biological replicate will be performed in the future. Credits Data generation and analysis for this track were performed by ZLAB at Boston University. The following people contributed: Shengnan Jin, Anason Halees, Heather Burden, Yutao Fu, Ulas Karaoz, Yong Yu, Chunming Ding, Charles R. Cantor, and Zhiping Weng. References Ding, C. and Cantor, C.R. A high-throughput gene expression analysis technique using competitive PCR and matrix-assisted laser desorption ionization time-of-flight MS. Proc Natl Acad Sci U S A 100(6), 3059-64 (2003). Ding, C. and Cantor, C.R. Direct molecular haplotyping of long-range genomic DNA with M1-PCR. Proc Natl Acad Sci U S A 100(13), 7449-53 (2003). Halees, A.S., Leyfer, D. and Weng, Z. PromoSer: A large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res. 31(13), 3554-9 (2003). Halees, A.S. and Weng, Z.
PromoSer: improvements to the algorithm, visualization and accessibility. Nucleic Acids Res., 32, W191-W194 (2004). encodeBuFirstExonTestis BU Testis Boston University First Exon Activity in Testis ENCODE Transcript Levels encodeBuFirstExonStomach BU Stomach Boston University First Exon Activity in Stomach ENCODE Transcript Levels encodeBuFirstExonSpleen BU Spleen Boston University First Exon Activity in Spleen ENCODE Transcript Levels encodeBuFirstExonSkMuscle BU Skel. Muscle Boston University First Exon Activity in Skeletal Muscle ENCODE Transcript Levels encodeBuFirstExonLung BU Lung Boston University First Exon Activity in Lung ENCODE Transcript Levels encodeBuFirstExonLiver BU Liver Boston University First Exon Activity in Liver ENCODE Transcript Levels encodeBuFirstExonKidney BU Kidney Boston University First Exon Activity in Kidney ENCODE Transcript Levels encodeBuFirstExonHeart BU Heart Boston University First Exon Activity in Heart ENCODE Transcript Levels encodeBuFirstExonColon BU Colon Boston University First Exon Activity in Colon ENCODE Transcript Levels encodeBuFirstExonCerebrum BU Cere. Cortex Boston University First Exon Activity in Cerebral Cortex ENCODE Transcript Levels encodeBu_ORChID1 BU ORChID Boston University ORChID (OH Radical Cleavage Intensity Database) ENCODE Chromosome, Chromatin and DNA Structure Description This track displays the predicted hydroxyl radical cleavage intensity on naked DNA for each nucleotide in the ENCODE regions. Because the hydroxyl radical cleavage intensity is proportional to the solvent accessible surface area of the deoxyribose hydrogen atoms (Balasubramanian et al., 1998), this track represents a structural profile of the DNA in the ENCODE regions. Please visit the ORChID website maintained by the Tullius group for access to experimental hydroxyl radical cleavage data, and to a server which can be used to predict the cleavage pattern for any input sequence. 
Display Conventions and Configuration This track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page. For more information, click the Graph configuration help link. Methods Hydroxyl radical cleavage intensity predictions were performed using an in-house sliding trimer window (STW) algorithm. This algorithm draws data from the ·OH Radical Cleavage Intensity Database (ORChID), which contains more than 150 experimentally determined cleavage patterns. These predictions are fairly accurate, with a Pearson coefficient of ~0.85 between the predicted and experimentally determined cleavage intensities. For more details on the hydroxyl radical cleavage method, see the References section below. Verification The STW algorithm has been cross-validated by removing each test sequence from the training set and performing a prediction. The mean correlation coefficient (between predicted and experimental cleavage patterns) from this study was 0.85. Credits These data were generated through the combined effort of Bo Pang at MIT and Jason Greenbaum, Steve Parker, and Tom Tullius of Boston University. References Balasubramanian, B., Pogozelski, W.K., and Tullius, T.D. DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone. Proc. Natl. Acad. Sci. USA 95(17), 9738-9743 (1998). Price, M. A., and Tullius, T. D. Using the Hydroxyl Radical to Probe DNA Structure. Meth. Enzymol. 212, 194-219 (1992). Tullius, T. D. Probing DNA Structure with Hydroxyl Radicals. In Current Protocols in Nucleic Acid Chemistry, (eds. Beaucage, S.L., Bergstrom, D.E., Glick, G.D. and Jones, R.A.) (Wiley, 2001), pp. 6.7.1-6.7.8. ccdsGene CCDS Consensus CDS Genes and Gene Predictions Description This track shows human genome high-confidence gene annotations from the Consensus Coding Sequence (CCDS) project. 
This project is a collaborative effort to identify a core set of human protein-coding regions that are consistently annotated and of high quality. The long-term goal is to support convergence towards a standard set of gene annotations on the human genome. Collaborators include: European Bioinformatics Institute (EBI), National Center for Biotechnology Information (NCBI), University of California, Santa Cruz (UCSC), and Wellcome Trust Sanger Institute (WTSI). For more information on the different gene tracks, see our Genes FAQ. Methods CDS annotations of the human genome were obtained from two sources: NCBI RefSeq and a union of the gene annotations from Ensembl and Vega, collectively known as Hinxton. Genes with identical CDS genomic coordinates in both sets become CCDS candidates. The genes undergo a quality evaluation, which must be approved by all collaborators. The following criteria are currently used to assess each gene: an initiating ATG (Exception: a non-ATG translation start codon is annotated if it has sufficient experimental support), a valid stop codon, and no in-frame stop codons (Exception: selenoproteins, which contain a TGA codon that is known to be translated to a selenocysteine instead of functioning as a stop codon); ability to be translated from the genome reference sequence without frameshifts; recognizable splicing sites; no intersection with putative pseudogene predictions; supporting transcripts and protein homology; conservation evidence with other species. A unique CCDS ID is assigned to the CCDS, which links together all gene annotations with the same CDS. CCDS gene annotations are under continuous review, with periodic updates to this track. Credits This track was produced at UCSC from data downloaded from the CCDS project web site. References Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al. The Ensembl genome database project. Nucleic Acids Res. 2002 Jan 1;30(1):38-41.
PMID: 11752248; PMC: PMC99161 Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009 Jul;19(7):1316-23. PMID: 19498102; PMC: PMC2704439 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 cytoBand Chromosome Band Chromosome Bands Localized by FISH Mapping Clones Mapping and Sequencing Description The chromosome band track represents the approximate location of bands seen on Giemsa-stained chromosomes. Chromosomes are displayed in the browser with the short arm first. Cytologically identified bands on the chromosome are numbered outward from the centromere on the short (p) and long (q) arms. At low resolution, bands are classified using the nomenclature [chromosome][arm][band], where band is a single digit. Examples of bands on chromosome 3 include 3p2, 3p1, cen, 3q1, and 3q2. At a finer resolution, some of the bands are subdivided into sub-bands, adding a second digit to the band number, e.g. 3p26. This resolution produces about 500 bands. A final subdivision into a total of 862 sub-bands is made by adding a period and another digit to the band, resulting in 3p26.3, 3p26.2, etc. Methods A full description of the method by which the chromosome band locations are estimated can be found in Furey and Haussler, 2003. Barbara Trask, Vivian Cheung, Norma Nowak and others in the BAC Resource Consortium used fluorescent in-situ hybridization (FISH) to determine a cytogenetic location for large genomic clones on the chromosomes. The results from these experiments are the primary source of information used in estimating the chromosome band locations. 
For more information about the process, see Cheung et al., 2001, and the accompanying web site, Human BAC Resource. BAC clone placements in the human sequence are determined at UCSC using a combination of full BAC clone sequence, BAC end sequence, and STS marker information. Credits We would like to thank all the labs that have contributed to this resource: Fred Hutchinson Cancer Research Center (FHCRC), National Cancer Institute (NCI), Roswell Park Cancer Institute (RPCI), The Wellcome Trust Sanger Institute (SC), Cedars-Sinai Medical Center (CSMC), Los Alamos National Laboratory (LANL), and UC San Francisco Cancer Center (UCSF). References Cheung VG, Nowak N, Jang W, Kirsch IR, Zhao S, Chen XN, Furey TS, Kim UJ, Kuo WL, Olivier M et al. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001 Feb 15;409(6822):953-8. PMID: 11237021 Furey TS, Haussler D. Integration of the cytogenetic map with the draft human genome sequence. Hum Mol Genet. 2003 May 1;12(9):1037-44. PMID: 12700172 cytoBandIdeo Chromosome Band (Ideogram) Chromosome Bands Localized by FISH Mapping Clones (for Ideogram) Mapping and Sequencing encodeMsaElements Consens Elements MSA Consensus Constrained Elements ENCODE Comparative Genomics Description The consensus elements in this track were generated by the ENCODE Multi-Species Analysis group from the nine different combinations of three conservation algorithms (phastCons, binCons, and GERP) and three sequence alignment methods (TBA, MLAGAN and MAVID) applied to the ENCODE region sequences of 28 vertebrate species as defined in the September 2005 ENCODE MSA sequence freeze and the MSA species guide tree. Three different stringencies were used. The loose set of constrained sequences represents bases identified as being constrained by any conservation algorithm on any alignment.
The moderate set of constrained sequences is derived from bases shown to be constrained by at least two of the three conservation algorithms on at least two of the three alignments. Finally, the strict set of constrained sequences represents only those bases that were constrained using all three conservation programs on all three multi-sequence alignments. Display Conventions and Configuration The locations of constrained elements are indicated by blocks in the graphical display. To show only selected subtracks within this annotation, uncheck the boxes next to the tracks you wish to hide. Methods See the description pages for the TBA Elements, MLAGAN Elements and MAVID Elements for additional information about methods used to generate these data. Verification See the description pages for the TBA Elements, MLAGAN Elements and MAVID Elements for information about verification techniques used to generate these data. Credits The strict, moderate, and loose data shown in these subtracks were contributed by Elliott Margulies of NHGRI representing the ENCODE Multi-Species Analysis group. Conservation Scoring PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler Lab at UCSC. BinCons was developed by Elliott Margulies, while at the Eric Green lab. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). Sequence Alignment TBA and Blastz were developed by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. MLAGAN, Shuffle-LAGAN, and SuperMap were written by Mike Brudno while at the Batzoglou lab. MUSCLE was authored by Bob Edgar. AB-BLAST was provided by the Gish lab at the School of Medicine, Washington University in St. Louis.
Mercator was written by Colin Dewey and Lior Pachter at the Pachter Lab Comparative Genomics Group at UC Berkeley. MAVID was authored by Nicholas Bray and Lior Pachter. The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community. References Blanchette, M., Kent, W.J., Reimer, C., Elnitski, L., Smit, A., Roskin, K., Baertsch, R., Rosenbloom, K.R., Clawson, H. et al. Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res 14(4), 708-15 (2004). Bray, N. and Pachter, L. MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Res 14(4), 693-99 (2004). Brudno, M., Do, C., Cooper, G., Kim, M.F., Davydov, E., NISC Comparative Sequencing Program, Green, E.D., Sidow, A. and Batzoglou, S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13(4), 721-31 (2003). Brudno, M., Malde, S., Poliakov, A., Do, C., Couronne, O., Dubchak, I. and Batzoglou, S. Global alignment: finding rearrangements during alignment. Bioinformatics 19(Suppl. 1), i54-i62 (2003). Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268(1), 78-94 (1997). Chiaromonte, F., Yap, V.B., and Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput, 115-26 (2002). Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S. and Sidow, A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 15(7), 901-13 (2005). Dewey, C.N. and Pachter, L. Mercator: multiple whole-genome orthology map construction. In preparation. Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res 32(5), 1792-97 (2004). Kent, W.J. BLAT-the BLAST-like alignment tool. Genome Res 12(4), 656-664 (2002). Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C. and Salzberg, S.L. 
Versatile and open software for comparing large genomes. Genome Biol 5(2), R12 (2004). Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. and Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res 13(12), 2507-18 (2003). Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D. and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res 13(1), 103-7 (2003). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15(8), 1034-50 (2005). encodeMsaElIntersect Strict MSA Consensus Strict Constrained Elements ENCODE Comparative Genomics encodeMsaElModerate Moderate MSA Consensus Moderate Constrained Elements ENCODE Comparative Genomics encodeMsaElUnion Loose MSA Consensus Loose Constrained Elements ENCODE Comparative Genomics clonePos Coverage Clone Coverage Mapping and Sequencing Description This track shows the coverage level of the genome and, in full display mode, the position of each clone that aligns to the genome sequence. For more information on the track display, see the Display Conventions section. Each clone is assigned one of three coverage levels, which may be viewed in the "Status" field on the details page for the clone: Finished - less than one error in 10,000 bases. Draft - more than 4x coverage. Predraft - less than 4x coverage. Display Conventions and Configuration In dense display mode, this track shows the coverage level of the genome. Finished regions are depicted in black. Draft regions are shown in various shades of gray that correspond to the level of coverage. At high resolution these usually resolve into gaps between clones. 
In full display mode, this track shows the position of each clone that aligns to the genome sequence. Where fragments of a clone do not form a continuous alignment, individual fragments are displayed against a gray background and numbered. Methods Clones from the National Center for Biotechnology Information (NCBI) were aligned to the genome using psLayout. Where a clone record contained unordered fragments, fragments were individually aligned, then clustered about a median value and displayed together if they align with gaps no greater than 60 kb. Some fragments in an original clone record may not be displayed. Credits This annotation was devised by Jim Kent, with further engineering by Hiram Clawson. encodeDless DLESS Detection of LinEage Specific Selection (DLESS) ENCODE Comparative Genomics Description This track shows elements that are predicted to be under lineage-specific selection, according to the DLESS (Detection of LinEage Specific Selection) program. Three types of elements are identified: elements "conserved" across all species, elements "gained" (i.e., that have come under selection) on some branch of the phylogeny, and elements "lost" (i.e., released from selection) on some branch of the phylogeny. Currently, DLESS allows for negative selection only and permits at most one gain or loss event per element. Thus, sequences that are conserved (relative to a model of neutral evolution) in some subtree of the phylogeny, but are not especially conserved in the complementary "supertree," are predicted as "gains," and sequences that are conserved in some supertree but not in the complementary subtree are predicted as "losses." Sequences that are conserved across the whole tree are simply labeled "conserved." Display Conventions Predicted conserved elements are shown in black, gains are shown in green, and losses are shown in red. 
Gains and losses are labeled with two species names indicating the branch of the phylogeny on which the event in question is predicted to have occurred. For example, a gain labeled "rat-mouse" is predicted to have occurred on the branch above the most recent common ancestor of rat and mouse (i.e., it is a rodent-specific conserved element). By clicking on an element in "pack" or "full" mode you can obtain a details page summarizing the support for the prediction. This page includes various statistics and p-values computed by the phyloP program. Methods DLESS was run on the NHGRI/PSU TBA alignments of the sequences from the September 2005 ENCODE data freeze. Only the 17 mammals that are well represented across ENCODE targets were included. The non-mammalian vertebrates were excluded because they can only be aligned in conserved regions. The program was given a phylogeny and model of neutral substitution estimated from fourfold degenerate sites in coding regions, using the phyloFit program. (The tree topology was held fixed; we assumed the same topology as for the other ENCODE analyses.) A rendering of the estimated neutral phylogeny, showing the 17 species at the leaves and estimated branch lengths in expected substitutions per site, is available here. The parameters that define DLESS's HMM and indel model were estimated by maximum likelihood. The following values were used: --expected-length 20 --target-coverage 0.055 --phi 0.261 --indel-model 0.0334,0.0533,0.0529,0.0117,0.0206,0.0654 After predicted elements were obtained using DLESS, p-values for each element were computed using phyloP. Only predictions with p-values of less than 0.05 were included in the track (conditional p-values in the case of lineage-specific elements; see Siepel et al., 2006). DLESS is based on a phylo-HMM with states for neutrally evolving and conserved regions, and for gains and losses on each branch of the tree. 
It uses insertions and deletions as well as substitutions in identifying elements under selection. PhyloP computes p-values based on prior and posterior distributions of the number of substitutions, as implied by a model of neutral substitution. These distributions are obtained using a dynamic programming algorithm. Details are given in Siepel et al. (2006). Credits DLESS and phyloP were written by Adam Siepel, based on ideas worked out in collaboration with Katie Pollard and David Haussler. Thanks to Elliott Margulies for providing the model of neutral substitution, and to Brian Raney for preprocessing the alignments to distinguish between indels and missing data. References Siepel, A., Pollard, K.S. and Haussler, D. New methods for detecting lineage-specific selection. In Proc. 10th Int'l Conf. on Research in Computational Molecular Biology (RECOMB '06) (2006). encodeNhgriDukeDnaseHs Duke/NHGRI DNase Duke/NHGRI DNaseI-Hypersensitivity ENCODE Chromosome, Chromatin and DNA Structure Description This track displays DNaseI-hypersensitive sites identified using two methods (DNase-chip and MPSS sequencing) in seven human cell types: primary unactivated and activated CD4+ T cells GM06990 lymphoblastoid HeLa S3 cervical carcinoma (Puck et al., 1956) HepG2 liver carcinoma H9 human undifferentiated embryonic stem (ES) (Thomson et al., 1998) IMR90 human fibroblast K562 myeloid leukemia-derived (Klein et al., 1976) DNaseI-hypersensitive sites are associated with all types of gene regulatory regions, including promoters, enhancers, silencers, insulators, and locus control regions. Display Conventions and Configuration The subtracks within this track are grouped into three sections: Raw subtracks display log2 ratio data averaged from three biological replicates and three DNase concentrations. Pval subtracks show significant regions that likely represent valid DNaseI-hypersensitive sites based on the raw data. 
The higher the score for the region, the more likely the site is to be hypersensitive. Regions have unique identifiers that are prefixed with the cell type. For display purposes, the p value scores were mapped to integer scores in the range 0-1000. Regions are displayed in a range of light gray to black, based on score. MPSS subtracks show hypersensitive sites determined by massively parallel signature sequencing (MPSS). Each cluster has a unique identifier. The last digit of each identifier represents the number of sequences that map within that particular cluster. The sequence number is also reflected in the score, e.g. a cluster of two sequences scores 500, three sequences scores 750 and four or more sequences scores 1000. Sites are displayed in a range of light gray to black, based on score. The "Raw" and "Pval" subtracks are displayed by default. Use the checkboxes on the Track Settings page to change the subtracks displayed. Methods DNase-Chip DNaseI hypersensitive sites were isolated using a method called DNase-chip (Crawford et al., 2006). Briefly, DNaseI digested ends from intact chromatin were captured using three different DNase concentrations as well as three biological replicates. This material was amplified, labeled, and hybridized to NimbleGen ENCODE tiled microarrays. H9 human ES cells (Thomson et al., 1998) were cultured on a feeder layer of mitotically inactivated mouse embryo fibroblasts. For analysis, human ES cell colonies were separated away from the feeder layer and processed for DNaseI hypersensitive site mapping. Cultures were routinely inspected by immunohistochemistry, flow cytometry, and microarray to ensure that the human ES cells were in the undifferentiated state. For the DNase-chip experiments, the raw data were averaged from nine hybridizations per cell type. The Pval scores represent -log10 p values as determined by the ACME (Algorithm for Capturing Microarray Enrichment) program (Scacheri et al., 2006). 
Only regions that had p value < 0.001 were included. For display in the Genome Browser, the p value scores were mapped to integer scores in the range 0-1000 using the following formula: score = (pVal * 35) + 100, where pVal is the -log10 p value. The -log10 p values can be viewed using the Table Browser. MPSS Sequencing Primary human CD4+ T cells were activated by incubation with anti-CD3 and anti-CD28 antibodies for 24 hours. DNaseI-hypersensitive sites were cloned from the cells before and after activation, and sequenced using massively parallel signature sequencing (Brenner et al., 2000; Crawford et al., 2006). Only those clusters of multiple DNaseI library sequences that map within 500 bases of each other are displayed. Verification DNase-Chip A real-time PCR assay (McArthur et al., 2001; Crawford et al., 2004) was used to validate a randomly selected subset of DNase-chip regions. For the new DNase-chip, sensitivity was determined to be > 86% and specificity > 97%. Approximately 20-30% of regions detected in only a single DNase concentration are valid. 50-80% of regions detected in two out of three DNase concentrations are valid (the exact percentage depends on which two DNase concentrations had significant signal). 90% of regions detected in all three DNase concentrations are valid. This data set includes elements for all 44 ENCODE regions. MPSS Sequencing Real-time PCR was used to verify valid DNaseI-hypersensitive sites. Approximately 50% of clusters of two sequences are valid. These clusters are shown in light gray. 80% of clusters of three sequences are valid, indicated by dark gray. 100% of clusters of four or more sequences are valid, shown in black. This data set includes confirmed elements for 35 of the 44 ENCODE regions. It is estimated that these data identify 10-20% of all hypersensitive sites within CD4+ T cells. Further sequencing will be required to identify additional sites.
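The two score mappings described for this track (the DNase-chip formula score = (pVal * 35) + 100 and the MPSS cluster scores of 500, 750 and 1000) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the rounding and the clamping to the 0-1000 range are assumptions, since the text states the formula and the range but not how out-of-range values were handled.

```python
def pval_to_score(neg_log10_p: float) -> int:
    """Map a -log10 p value to a display score using score = (pVal * 35) + 100.

    Rounding to the nearest integer and clamping to 0-1000 are assumptions.
    """
    raw = neg_log10_p * 35 + 100
    return max(0, min(1000, int(round(raw))))


def mpss_cluster_score(n_sequences: int) -> int:
    """Score an MPSS cluster: 2 sequences -> 500, 3 -> 750, 4 or more -> 1000."""
    if n_sequences < 2:
        # Only clusters of multiple sequences are displayed in the track.
        raise ValueError("a cluster needs at least two sequences")
    return {2: 500, 3: 750}.get(n_sequences, 1000)


if __name__ == "__main__":
    # p = 0.001 is the inclusion cutoff, i.e. -log10 p = 3.
    print(pval_to_score(3.0))
    print(mpss_cluster_score(4))
```

Under this mapping the inclusion cutoff of p < 0.001 corresponds to scores just above 205, which is why the lightest displayed regions are not close to zero.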
MPSS data from the whole genome can be found in the Expression and Regulation track group (NHGRI DNaseI-HS track). Credits These data were produced at the Crawford Lab at Duke University, and at the Collins Lab at NHGRI. Thanks to Gregory E. Crawford and Francis S. Collins for supplying the information for this track. H9 cells were grown in collaboration with Ron McKay and Paul Tesar at the National Institute of Neurological Disorders and Stroke (NINDS)—an institute of the National Institutes of Health (NIH). References Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol. 2000 Jun;18(6):630-4. Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS. DNase-chip: A high resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nature Methods. 2006 Jul;3(7):503-9. Crawford GE, Holt IE, Mullikin JC, Tai D, Blakesley R, Bouffard G, Young A, Masiello C, Green ED, Wolfsberg TG et al. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc Natl Acad Sci USA. 2004 Jan 27;101(4):992-7. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31. (See also NHGRI's data site for the project.) Klein E, Ben-Bassat H, Neumann H, Ralph P, Zeuthen J, Polliack A, Vanky F. Properties of the K562 cell line, derived from a patient with chronic myeloid leukemia. Int J Cancer. 1976 Oct 15;18(4):421-31. McArthur M, Gerum S, Stamatoyannopoulos G. Quantification of DNaseI-sensitivity by real-time PCR: quantitative analysis of DNaseI-hypersensitivity of the mouse beta-globin LCR . J Mol Biol. 2001 Oct 12;313(1):27-34. 
Puck TT, Marcus PI, Cieciura SJ. Clonal growth of mammalian cells in vitro: growth characteristics of colonies from single HeLa cells with and without a "feeder" layer. J Exp Med. 1956 Feb 1;103(2):273-83. Scacheri PC, Crawford GE, Davis S. Statistics for ChIP-chip and DNase hypersensitivity experiments on NimbleGen arrays. Methods Enzymol. 2006;411:270-82. Thomson JA, Itskovitz-Eldor J, Shapiro SS, Waknitz MA, Swiergiel JJ, Marshall VS, Jones JM. Embryonic stem cell lines derived from human blastocysts. Science. 1998 Nov 6;282(5391):1145-7. encodeNhgriDnaseHsMpssCd4Act DNase CD4-act MS Duke/NHGRI DNaseI-Hypersensitive Sites (CD4+ T-Cells Activated, MPSS method) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsMpssCd4 DNase CD4 MS Duke/NHGRI DNaseI-Hypersensitive Sites (CD4+ T-Cells, MPSS method) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalK562 DNase K562 Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (K562) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalImr90 DNase IMR90 Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (IMR90) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalH9 DNase H9 Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (H9) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalHepG2 DNase HepG2 Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (HepG2) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalHela DNase HeLa Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (HeLaS3) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalCd4 DNase CD4 Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (CD4+) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipPvalGm06990 DNase GM069 Pval Duke/NHGRI DNaseI-Hypersensitivity P-Value (GM06990) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipRawK562 DNase K562 Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (K562) ENCODE Chromosome, 
Chromatin and DNA Structure encodeNhgriDnaseHsChipRawImr90 DNase IMR90 Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (IMR90) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipRawH9 DNase H9 Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (H9) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipRawHepG2 DNase HepG2 Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (HepG2) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipRawHela DNase HeLa Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (HeLaS3) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipRawCd4 DNase CD4 Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (CD4+) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsChipRawGm06990 DNase GM069 Raw Duke/NHGRI DNaseI-Hypersensitivity Raw (GM06990) ENCODE Chromosome, Chromatin and DNA Structure ECgene ECgene Genes ECgene v1.2 Gene Predictions with Alt-Splicing Genes and Gene Predictions Description ECgene (gene prediction by EST clustering) predicts genes by combining genome-based EST clustering and a transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The positions of splice sites (i.e. exon-intron boundaries) in the genome map are utilized as critical information in the whole procedure. Sequences that share splice sites in the genomic alignment are grouped together to define an EST cluster. Transcript assembly, based on graph theory, produces gene models and clone evidence, which is essentially identical to sub-clustering according to splice variants. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. 
Click the "Codon coloring help" link on the track description page for more information about this feature. Methods The following is a brief summary of the ECgene algorithm: Genomic alignment of mRNA and ESTs: Input sequences were aligned against the genome using the Blat program developed by Jim Kent. Blat alignments were corrected for valid splice sites, and the SIM4 program was used for suspicious alignments if necessary. Sequences that share more than one splice site were grouped together to define an EST cluster in a similar manner to the genome-based version of the UniGene algorithm. The exon-connectivity in each cluster was represented as a directed acyclic graph (DAG). Distinct paths along exons were obtained by the depth-first-search (DFS) method. They correspond to possible gene models encompassing all alternative splicing events. EST sequences in each cluster were sub-clustered further according to the compatibility of each splice variant with gene structure, and they can be regarded as clone evidence for the corresponding isoform. Gene models without sufficient evidence were discarded at this stage. The presence of polyA tails, detected from careful analysis of genomic alignment of mRNA and EST sequences, was specifically used to determine the gene boundary. Finally, unspliced sequences were added without altering the exon-intron boundaries of existing gene models. Coding potential of gene models: Peptide sequences are available only for those gene models judged to have good coding potential. ORF and CDS were determined based on the number of exons, the ORF length, the presence of the start codon (Met), and the CDS length. ORFs (defined as the region between two adjacent stop codons) were classified into four groups: spliced ORFs with Met spliced ORFs without Met single-exon ORFs with Met single-exon ORFs without Met Initially, the first group was searched for the ORF with the longest CDS. 
Coding sequences were accepted if they were longer than 30 amino acids (93 bp) or they were identical to a SwissProt protein, excluding fragmented entries. If such an ORF could not be identified in the first group, the other groups were examined sequentially for the presence of an ORF using the same criteria. Genes lacking an apparent ORF were defined as non-coding RNA genes. Credits This algorithm and the predictions for this track were developed by Professor Sanghyuk Lee's Lab of Bioinformatics at Ewha Womans University, Seoul, Korea. encodeRegions ENCODE Regions Encyclopedia of DNA Elements (ENCODE) Regions ENCODE Regions and Genes Description This track depicts target regions for the NHGRI ENCODE project. The long-term goal of this project is to identify all functional elements in the human genome sequence to facilitate a better understanding of human biology and disease. During the pilot phase, 44 regions comprising 30 Mb (approximately 1% of the human genome) have been selected for intensive study to identify, locate and analyze functional elements within the regions. These targets are being studied by a diverse public research consortium to test and evaluate the efficacy of various methods, technologies, and strategies for locating genomic features. The outcome of this initial phase will form the basis for a larger-scale effort to analyze the entire human genome. See the NHGRI target selection process web page for a description of how the target regions were selected. To open a UCSC Genome Browser with a menu for selecting ENCODE regions on the human genome, use ENCODE Regions in the UCSC Browser. The UCSC resources provided for the ENCODE project are described on the UCSC ENCODE Portal. Credits Thanks to the NHGRI ENCODE project for providing this initial set of data. ensGene Ensembl Genes Ensembl Genes Genes and Gene Predictions Description These gene predictions were generated by Ensembl. 
For more information on the different gene tracks, see our Genes FAQ. Methods For a description of the methods used in Ensembl gene predictions, please refer to Hubbard et al. (2002), also listed in the References section below. Data access Ensembl Gene data can be explored interactively using the Table Browser or the Data Integrator. For local downloads, the genePred format files for hg17 are available in our downloads directory as ensGene.txt.gz or in our genes download directory in GTF format. For programmatic access, the data can be queried from the REST API or directly from our public MySQL servers. Instructions on this method are available on our MySQL help page and on our blog. Previous versions of this track can be found on our archive download server. Credits We would like to thank Ensembl for providing these gene annotations. For more information, please see Ensembl's genome annotation page. References Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al. The Ensembl genome database project. Nucleic Acids Res. 2002 Jan 1;30(1):38-41. PMID: 11752248; PMC: PMC99161 eponine Eponine TSS Eponine Predicted Transcription Start Sites Regulation Description The Eponine program provides a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Methods Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the TSS. Eponine has been tested by comparing the output with annotated mRNAs from human chromosome 22. From this work, we estimate that using the default threshold (0.999) it detects >50% of transcription start sites with approximately 70% specificity. However, it does not always predict the direction of transcription correctly—an effect that seems to be common among computational TSS finders. 
Credits Thanks to Thomas Down at the Sanger Institute for providing the Eponine program (version 2, March 6, 2002) which was run at UCSC to produce this track. References Down TA, Hubbard TJP. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 2002 Mar;12(3):458-61. evofold EvoFold EvoFold Predictions of RNA Secondary Structure Genes and Gene Predictions Description This track shows RNA secondary structure predictions made with the EvoFold program, a comparative method that exploits the evolutionary signal of genomic multiple-sequence alignments for identifying conserved functional RNA structures. Display Conventions and Configuration Track elements are labeled using the convention ID_strand_score. When zoomed out beyond the base level, secondary structure prediction regions are indicated by blocks, with the stem-pairing regions shown in a darker shade than unpaired regions. Arrows indicate the predicted strand. When zoomed in to the base level, the specific secondary structure predictions are shown in parenthesis format. The confidence score for each position is indicated in grayscale, with darker shades corresponding to higher scores. The details page for each track element shows the predicted secondary structure (labeled SS anno), together with details of the multiple species alignments at that location. Substitutions relative to the human sequence are color-coded according to their compatibility with the predicted secondary structure (see the color legend on the details page). Each prediction is assigned an overall score and a sequence of position-specific scores. The overall score measures evidence for any functional RNA structures in the given region, while the position-specific scores (0 - 9) measure the confidence of the base-specific annotations. Base-pairing positions are annotated with the same pair symbol. 
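The parenthesis format mentioned above lends itself to a small worked example: matching opening and closing symbols recovers which positions are predicted to pair. The sketch below assumes the common dot-bracket convention, with "(" and ")" for paired positions and any other character for unpaired ones; the exact symbols used on the details page are not specified here, and the function name is hypothetical.

```python
def paired_positions(structure: str) -> list[tuple[int, int]]:
    """Return sorted (i, j) index pairs for a parenthesis-format structure.

    Assumes "(" opens a pair and ")" closes the most recent open one;
    every other character is treated as unpaired.
    """
    open_stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_stack.append(index)
        elif symbol == ")":
            if not open_stack:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((open_stack.pop(), index))
    if open_stack:
        raise ValueError(f"unmatched '(' at position {open_stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # A short hairpin: two stacked base pairs enclosing a two-base loop.
    print(paired_positions("((..))"))
```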
The offsets are provided to ease visual navigation of the alignment in terms of the human sequence. The offset is calculated (in units of ten) from the start position of the element on the positive strand or from the end position when on the negative strand. The graphical display may be filtered to show only those track elements with scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods EvoFold makes use of phylogenetic stochastic context-free grammars (phylo-SCFGs), which are combined probabilistic models of RNA secondary structure and primary sequence evolution. The predictions consist of both a specific RNA secondary structure and an overall score. The overall score is essentially a log-odds score between a phylo-SCFG modeling the constrained evolution of stem-pairing regions and one which only models unpaired regions. The predictions for this track were based on the conserved elements of an 8-way vertebrate alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebrafish, and Fugu assemblies. NOTE: These predictions were originally computed on the hg17 (May 2004) human assembly, from which the hg16 (July 2003), hg18 (May 2006), and hg19 (Feb 2009) predictions were lifted. As a result, the multiple alignments shown on the track details pages may differ from the 8-way alignments used for their prediction. Additionally, some weak predictions have been eliminated from the set displayed on hg18 and hg19. The hg17 prediction set corresponds exactly to the set analyzed in the EvoFold paper referenced below. Credits The EvoFold program and browser track were developed by Jakob Skou Pedersen of the UCSC Genome Bioinformatics Group, now at Aarhus University, Denmark. The RNA secondary structure is rendered using the VARNA Java applet. References EvoFold Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. 
Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006 Apr;2(4):e33. PMID: 16628248; PMC: PMC1440920 Phylo-SCFGs Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999 Jun;15(6):446-54. PMID: 10383470 Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res. 2004;32(16):4925-36. PMID: 15448187; PMC: PMC519121 PhastCons Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 exoniphy Exoniphy Exoniphy Human/Mouse/Rat/Dog Genes and Gene Predictions Description The exoniphy program identifies evolutionarily conserved protein-coding exons in a multiple alignment using a phylogenetic hidden Markov model (phylo-HMM), a kind of statistical model that simultaneously describes exon structure and exon evolution. This track shows exoniphy predictions for the human May 2004 (hg17), mouse Aug. 2005 (mm7), rat Jun. 2003 (rn3), and dog May 2005 (canFam2) genomes, as aligned by the multiz program. For this track, only alignments on the "syntenic net" between human and each other species were considered. Methods For a description of exoniphy, see Siepel et al. (2004). Multiz is described in Blanchette et al. (2004). The alignment chaining methods behind the "syntenic net" are described in Kent et al. (2003). References Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708-715 (2004). Kent, W.J. et al. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. P. Natl. Acad. Sci. USA 100(20), 11484-11489 (2003). 
Siepel, A. and Haussler, D. Computational identification of evolutionarily conserved exons. RECOMB '04 (2004). exonWalk ExonWalk ExonWalk Alt-Splicing Transcripts Genes and Gene Predictions Description The ExonWalk program merges cDNA evidence together to predict full length isoforms, including alternative transcripts. To predict transcripts that are biologically functional, rather than the result of technical or biological noise, ExonWalk requires that every intron and exon meet at least one of the following criteria: 1) it is present in cDNA libraries of another organism (i.e. also present in mouse), 2) it has three separate cDNA GenBank entries supporting it, or 3) it is evolving like a coding exon as determined by Exoniphy. Once the transcripts are predicted, an ORF finder (BESTORF from Softberry) is used to find the best open reading frame. By default, transcripts that are targets for nonsense-mediated decay (NMD) are filtered out as they are less likely to be translated into proteins. Methods The input to the ExonWalk program is the AltSplice track, from which exons and introns that meet none of the following criteria have been filtered out: 1) present in cDNA libraries of another organism (i.e. also present in mouse), 2) supported by three separate cDNA GenBank entries, or 3) evolving like a coding exon as determined by Exoniphy. The ExonWalk algorithm takes these filtered sequences and constructs a graph where the exons are the nodes and the introns are the edges. The goal of the program is to produce all full length transcripts implied by the transcripts. Full length transcripts are defined as transcripts that are not subsequences of another transcript. The stages of the algorithm can be divided into three steps as illustrated in Figure 1 below: Detection and connection of compatible transcripts (Figure 1B). Merging of vertices that are identical in terms of splicing (Figure 1C). Exploration of all paths in the resulting graph (Figure 1D). Figure 1. Different stages of the ExonWalk Program. A. 
Different transcripts for a particular gene have been aligned to the genome to give an order and orientation. B. Exons in the overlapping section of compatible transcripts are joined to form new edges. C. Vertices which are redundant are pruned from the graph, being replaced by edges from other, equivalent, vertices. This simplifies the initial graph and yet retains splicing-specific information. D. The maximal paths through the graph are explored to produce a set of maximal (full length) transcripts. Initially, each transcript is an independent sub-graph in the exon graph. Individual transcripts are then compared pairwise to determine if they are compatible. If they are compatible, an edge is created between exons of the overlap, called a compatibility edge. This results in a directed graph where overlapping exons are connected together, and thus compatible transcripts have been connected as well (Figure 1B). The algorithm then makes use of the implicit order provided by the genome sequence and the fact that splicing occurs in order to explore all of the paths present in the graph. Comments/Questions? Email sugnet@soe.ucsc.edu firstEF FirstEF FirstEF: First-Exon and Promoter Prediction Regulation Description This track shows predictions from the FirstEF (First Exon Finder) program. Three types of predictions are displayed: exon, promoter and CpG window. If two consecutive predictions are separated by less than 1000 bp, FirstEF treats them as one cluster of alternative first exons that may belong to the same gene. The cluster number is displayed in the parentheses of each item. For example, "exon(405-)" represents the exon prediction in cluster number 405 on the minus strand. The exon, promoter and CpG-window are interconnected by this cluster number. Alternative predictions within the same cluster are denoted by "#N" where "N" is the serial number of an alternative prediction in the cluster. 
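The FirstEF clustering rule and item labels described above can be illustrated with a short sketch. This is not FirstEF's own code: the function names are hypothetical, predictions are assumed to be (start, end) intervals from a single strand, and the separation is assumed to be measured from the end of one prediction to the start of the next.

```python
def cluster_predictions(predictions, max_gap=1000):
    """Group predictions separated by less than max_gap bp into clusters.

    predictions: iterable of (start, end) tuples on one strand (assumption).
    Returns a list of clusters, each a list of (start, end) tuples.
    """
    clusters = []
    cluster_end = None  # furthest end coordinate seen in the current cluster
    for start, end in sorted(predictions):
        if clusters and start - cluster_end < max_gap:
            clusters[-1].append((start, end))
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([(start, end)])
            cluster_end = end
    return clusters


def item_label(kind, cluster_number, strand):
    """Build a label such as exon(405-) from its three parts."""
    return f"{kind}({cluster_number}{strand})"


if __name__ == "__main__":
    hits = [(100, 200), (900, 1000), (5000, 5100)]
    print(cluster_predictions(hits))
    print(item_label("exon", 405, "-"))
```

In the example, the first two predictions are 700 bp apart and fall into one cluster, while the third is 4000 bp further on and starts a new one.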
Each predicted exon is either CpG-related or non-CpG-related, based on a score of the frequency of CpG dinucleotides. An exon is classified as CpG-related if the CpG score is greater than a threshold value, and non-CpG-related if less than the threshold. If an exon is CpG-related, its associated CpG-window is displayed. The browser displays features with higher scores in darker shades of gray/black. Method FirstEF is a 5' terminal exon and promoter prediction program. It consists of different discriminant functions structured as a decision tree. The probabilistic models are optimized to find potential first donor sites and CpG-related and non-CpG-related promoter regions based on discriminant analysis. For every potential first donor site (GT) and an upstream promoter region, FirstEF decides whether or not the intermediate region can be a potential first exon, based on a set of quadratic discriminant functions. FirstEF calculates the a posteriori probabilities of exon, donor, and promoter for a given GT and an upstream window of length 570 bp. For a description of the FirstEF program and the underlying classification models, refer to Davuluri et al., 2001. Credits The predictions for this track are produced by Ramana V. Davuluri of Ohio State University and Ivo Grosse and Michael Q. Zhang of Cold Spring Harbor Lab. References Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001 Dec;29(4):412-7. fishClones FISH Clones Clones Placed on Cytogenetic Map Using FISH Mapping and Sequencing Description This track shows the location of fluorescent in situ hybridization (FISH)-mapped clones along the assembly sequence. The locations of these clones were obtained from the NCBI Human BAC Resource here. Earlier versions of this track obtained this information directly from the paper Cheung, et al. (2001). 
More information about the BAC clones, including how they may be obtained, can be found at the Human BAC Resource and the Clone Registry web sites hosted by NCBI. To view Clone Registry information for a clone, click on the clone name at the top of the details page for that item. Using the Filter This track has a filter that can be used to change the color or include/exclude the display of a dataset from an individual lab. This is helpful when many items are shown in the track display, especially when only some are relevant to the current task. The filter is located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. To use the filter: In the pulldown menu, select the lab whose data you would like to highlight or exclude in the display. Choose the color or display characteristic that will be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display clones from the lab selected in the pulldown list. If "include" is selected, the browser will display clones only from the selected lab. When you have finished configuring the filter, click the Submit button. Credits We would like to thank all of the labs that have contributed to this resource: Fred Hutchinson Cancer Research Center (FHCRC) National Cancer Institute (NCI) Roswell Park Cancer Institute (RPCI) The Wellcome Trust Sanger Institute (SC) Cedars-Sinai Medical Center (CSMC) Los Alamos National Laboratory (LANL) UC San Francisco Cancer Center (UCSF) References Cheung VG, Nowak N, Jang W, Kirsch IR, Zhao S, Chen XN, Furey TS, Kim UJ, Kuo WL, Olivier M et al. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001 Feb 15;409(6822):953-8. 
PMID: 11237021 fosEndPairs Fosmid End Pairs Fosmid End Pairs Mapping and Sequencing Description A valid pair of fosmid end sequences must be at least 30 kb but no more than 50 kb away from each other. The orientation of the first fosmid end sequence must be "+" and the orientation of the second fosmid end sequence must be "-". Note: For hg19 and hg18 assemblies, the Fosmid End Pairs track is a main track under the "Mapping and Sequencing" track category. On the hg38 assembly, the FOSMID End Pairs track is a subtrack within the Clone Ends track under the "Mapping and Sequencing" track category. Under the list of subtracks on the Clone Ends Track Settings page, the FOSMID End Pairs track is now named "WIBR-2 Fosmid library." With the WIBR-2 Fosmid library track setting on full, individual clone end mapping items are listed in the browser; click into any item to see details from NCBI. Methods End sequences were trimmed at the NCBI using ssahaCLIP written by Jim Mullikin. Trimmed fosmid end sequences were placed on the assembled sequence using Jim Kent's blat program. Credits Sequencing of the fosmid ends was done at the Eli & Edythe L. Broad Institute of MIT and Harvard University. Clones are available through the BACPAC Resources Center at Children's Hospital Oakland Research Institute (CHORI). fox2ClipSeqComp FOX2 CLIP-seq FOX2 Adaptor-trimmed CLIP-seq reads Regulation Description The FOX2 CLIP-seq track shows adaptor-trimmed CLIP-seq reads that mapped uniquely to the repeat-masked human genome (hg17). The reads were converted to hg18 coordinates using the UCSC LiftOver tool. Reads on the forward strand are displayed in blue; those on the reverse strand are shown in red. Methods Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-seq) of cell type-specific splicing regulator FOX2 (also known as RBM9) was performed in human embryonic stem cells. MosaikAligner was utilized to align the reads to the repeat-masked genome. 
Briefly, HUES6 human embryonic stem cells were treated with UV irradiation to stabilize in vivo protein-RNA interactions, followed by antibody-mediated precipitation of specific RNA-protein complexes. SDS-PAGE was then utilized to isolate protein-RNA adducts after RNA trimming with nuclease, 3' RNA linkers were ligated, and nucleotides were 5' end labeled with γ-32P-ATP. Recovered RNA was ligated to a 5' linker before amplification by RT-PCR. Both linkers were designed to be compatible with Illumina 1G genome analyzer sequencing. Approximately 4 million reads were uniquely mapped to the repeat-masked human genome by MosaikAligner. To identify CLIP clusters, we performed the following steps: (i) CLIP reads were associated with protein-coding genes as defined by the region from the annotated transcriptional start to the end of each gene locus. (ii) CLIP reads were separated into the categories of sense or antisense to the transcriptional direction of the gene. (iii) Sense CLIP reads were extended by 100 nt in the 5'-to-3' direction. The height of each nucleotide position is the number of reads that overlap that position. (iv) The count distribution of heights 1, 2, ..., h, ..., H-1, H is {n_1, n_2, ..., n_h, ..., n_{H-1}, n_H}, with N = Σ n_i (i = 1:H). For a particular height, h, the associated probability of observing a height of at least h is P_h = Σ n_i (i = h:H) / N. (v) We computed the background frequency after randomly placing the same number of extended reads within the gene for 100 iterations. This controls for the length of the gene and the number of reads. For each iteration, the count distribution and probabilities for the randomly placed reads (P_{h,random}) were generated as in step (iv). (vi) Our modified FDR for a peak height was computed as FDR(h) = (μ_h + σ_h)/P_h, where μ_h and σ_h are the average and s.d., respectively, of P_{h,random} across the 100 iterations. For each gene locus, we chose a threshold peak height h* as the smallest height for which FDR(h*) < 0.001.
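Steps (iii) through (vi) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Yeo et al.: reads are reduced to their 5' start positions within one gene, each extended read is treated as covering `span` nucleotides, the s.d. is the population s.d., and all function and parameter names are hypothetical.

```python
import random
from statistics import mean, pstdev


def heights(starts, gene_len, span=100):
    """Step (iii): per-nucleotide height = number of extended reads covering it."""
    cov = [0] * gene_len
    for s in starts:
        for pos in range(s, min(s + span, gene_len)):
            cov[pos] += 1
    return cov


def tail_probs(cov, max_h):
    """Step (iv): P_h = (positions with height >= h) / (positions with height >= 1)."""
    counts = [0] * (max_h + 1)
    for c in cov:
        if c >= 1:
            counts[min(c, max_h)] += 1  # heights above max_h still count toward every tail
    n = sum(counts)
    probs, running = {}, 0
    for h in range(max_h, 0, -1):
        running += counts[h]
        probs[h] = running / n if n else 0.0
    return probs


def threshold_height(starts, gene_len, span=100, iterations=100, cutoff=0.001, seed=0):
    """Steps (v)-(vi): smallest h with FDR(h) = (mu_h + sigma_h) / P_h below cutoff."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    observed = heights(starts, gene_len, span)
    max_h = max(observed, default=0)
    if max_h == 0:
        return None
    p_obs = tail_probs(observed, max_h)
    p_rand = {h: [] for h in range(1, max_h + 1)}
    for _ in range(iterations):
        # Place the same number of reads uniformly at random within the gene.
        shuffled = [rng.randrange(gene_len) for _ in starts]
        p = tail_probs(heights(shuffled, gene_len, span), max_h)
        for h in p_rand:
            p_rand[h].append(p[h])
    for h in range(1, max_h + 1):
        fdr = (mean(p_rand[h]) + pstdev(p_rand[h])) / p_obs[h]
        if fdr < cutoff:
            return h
    return None


# Hypothetical gene of 2,000 nt with a pile of 30 reads and three scattered reads.
example_starts = [500] * 30 + [100, 1200, 1700]
h_star = threshold_height(example_starts, gene_len=2000)
```

Because the random placement is repeated per gene, the threshold adapts to gene length and read count, which is the control described in step (v).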
We identified FOX2 binding clusters by grouping nucleotide positions that satisfied h > h* and occurred within 50 nt of each other. For further details of the method used to generate this annotation, please refer to Yeo et al. (2009). Credits Thanks to Gene Yeo at the University of California, San Diego for providing this annotation. For additional information on FOX2 CLIP-seq reads, please contact geneyeo@ucsd.edu directly. References Yeo GW, Coufal NG, Liang YL, Peng GE, Fu XD, Gage FH. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat. Struct. Mol. Biol. 2009 Jan 11;16:130-137. fox2ClipSeqCompViewreads Reads FOX2 Adaptor-trimmed CLIP-seq reads Regulation fox2ClipSeq FOX2 CLIP-seq FOX2 Adaptor-trimmed CLIP-seq Reads Regulation fox2ClipSeqCompViewdensity Density FOX2 Adaptor-trimmed CLIP-seq reads Regulation fox2ClipSeqDensityReverseStrand Density Reverse FOX2 Adaptor-trimmed CLIP-seq Density Reverse Strand Regulation fox2ClipSeqDensityForwardStrand Density Forward FOX2 Adaptor-trimmed CLIP-seq Density Forward Strand Regulation gap Gap Gap Locations Mapping and Sequencing Description This track depicts gaps in the assembly. Most gaps - with the exception of intractable heterochromatic, centromeric, telomeric, and short-arm gaps - have been closed during the finishing process, although a small number still remain. This assembly contains the following types of gaps: Fragment - gaps between the contigs of a draft clone (size varies). (In this context, a contig is a set of overlapping sequence reads.) Clone - gaps between clones in the same map contig (size varies). Contig - gaps between map contigs (size varies). Heterochromatin - gaps from large blocks of heterochromatin (size varies). Centromere - gaps from centromeres (3,000,000 Ns). Short_arm - large gaps in the short (p) arm (size varies). Telomere - gaps from telomeres (size varies).
Display Conventions and Configuration Gaps are represented as black boxes in this track. If the relative order and orientation of the contigs on either side of the gap are known, it is a bridged gap. In this case, a white line is drawn through the black box representing the gap and the gap is labeled "yes". gc5Base GC Percent GC Percent in 5-Base Windows Mapping and Sequencing Description The GC percent track shows the percentage of G (guanine) and C (cytosine) bases in 5-base windows. High GC content is typically associated with gene-rich areas. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options. Credits The data and presentation of this graph were prepared by Hiram Clawson. rnaCluster Gene Bounds Gene Boundaries as Defined by RNA and Spliced EST Clusters mRNA and EST Description This track shows the boundaries of genes and the direction of transcription as deduced from clustering spliced ESTs and mRNAs against the genome. When many spliced variants of the same gene exist, this track shows the variant that spans the greatest distance in the genome. Method ESTs and mRNAs from GenBank were aligned against the genome using BLAT. Alignments with less than 97.5% base identity within the aligning blocks were filtered out. When multiple alignments occurred, only those alignments with a percentage identity within 0.2% of the best alignment were kept. The following alignments were also discarded: ESTs that aligned without any introns, blocks smaller than 10 bases, and blocks smaller than 130 bases that were not located next to an intron. The orientations of the ESTs and mRNAs were deduced from the GT/AG splice sites at the introns; ESTs and mRNAs with overlapping blocks on the same strand were merged into clusters. Only the extent and orientation of the clusters are shown in this track.
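The identity-based filtering described in the Method paragraph above can be illustrated with a short Python sketch. It is not the UCSC pipeline: the `Alignment` record and its field names are hypothetical, the block-size rules are left out for brevity, and applying the near-best test after the other filters is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query: str         # accession of the EST or mRNA (hypothetical field)
    identity: float    # percent base identity within the aligning blocks
    is_est: bool       # True for ESTs, False for mRNAs
    intron_count: int  # number of introns implied by the alignment


def filter_alignments(alignments, min_identity=97.5, near_best=0.2):
    """Keep alignments that pass the identity and intron rules from the text."""
    by_query = {}
    for a in alignments:
        if a.identity < min_identity:
            continue  # below 97.5% base identity
        if a.is_est and a.intron_count == 0:
            continue  # ESTs that aligned without any introns
        by_query.setdefault(a.query, []).append(a)

    kept = []
    for hits in by_query.values():
        best = max(h.identity for h in hits)
        # Keep only alignments within 0.2 percentage points of the best one.
        kept.extend(h for h in hits if best - h.identity <= near_best)
    return kept


example = [
    Alignment("m1", 99.5, False, 3),
    Alignment("m1", 99.4, False, 3),
    Alignment("m1", 98.0, False, 2),  # more than 0.2 below the best hit
    Alignment("e1", 99.0, True, 0),   # intronless EST
    Alignment("e2", 97.0, True, 2),   # below the identity floor
]
kept = filter_alignments(example)
```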
Scores for individual gene boundaries were assigned based on the number of cDNA alignments used: 300 (a single cDNA alignment); 600 (two alignments); 900 (three alignments); 1000 (four or more alignments). Credits This track, which was originally developed by Jim Kent, was generated at UCSC and uses data submitted to GenBank by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32:D23-6. Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. geneid Geneid Genes Geneid Gene Predictions Genes and Gene Predictions Description This track shows gene predictions from the geneid program developed by Roderic Guigó's Computational Biology of RNA Processing group which is part of the Centre de Regulació Genòmica (CRG) in Barcelona, Catalunya, Spain. Methods Geneid is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure. In the first step, splice sites, start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs). Next, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA. Finally, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. Credits Thanks to Computational Biology of RNA Processing for providing these data. References Blanco E, Parra G, Guigó R. Using geneid to identify genes. Curr Protoc Bioinformatics. 2007 Jun;Chapter 4:Unit 4.3. PMID: 18428791 Parra G, Blanco E, Guigó R. GeneID in Drosophila. Genome Res. 2000 Apr;10(4):511-5. PMID: 10779490; PMC: PMC310871 genscan Genscan Genes Genscan Gene Predictions Genes and Gene Predictions Description This track shows predictions from the Genscan program written by Chris Burge.
The predictions are based on transcriptional, translational and donor/acceptor splicing signals as well as the length and compositional distributions of exons, introns and intergenic regions. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods For a description of the Genscan program and the model that underlies it, refer to Burge and Karlin (1997) in the References section below. The splice site models used are described in more detail in Burge (1998) below. Credits Thanks to Chris Burge for providing the Genscan program. References Burge C. Modeling Dependencies in Pre-mRNA Splicing Signals. In: Salzberg S, Searls D, Kasif S, editors. Computational Methods in Molecular Biology. Amsterdam: Elsevier Science; 1998. p. 127-163. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997 Apr 25;268(1):78-94. PMID: 9149143 encodeGisChipPetAll GIS ChIP-PET GIS ChIP-PET ENCODE Chromatin Immunoprecipitation Description This track shows binding sites for p53, STAT1, and c-Myc, as determined by chromatin immunoprecipitation (ChIP) and paired-end di-tag (PET) sequencing. The p53 and c-Myc site data is genome-wide; data for STAT1 is restricted to the ENCODE regions. The p53 protein is a transcription factor involved in the control of cell growth that is often expressed at high levels in cancer cells. STAT1 is a signal transducer and transcription factor that binds to gamma interferon activating sequence. 
The c-Myc (cellular myelocytomatosis) protein is a transcription factor associated with cell proliferation, differentiation, and neoplastic disease. The PET sequences in this track are derived from individual ChIP fragments as follows:
Factor | Fragments | Cell line | Treatment
p53 | 65,572 | HCT116 | 6 hrs 5-fluorouracil (5FU)
STAT1 | 263,901 | HeLa | none
STAT1 | 327,838 | HeLa | gamma interferon (gIFN)
c-Myc | 273,566 | P493 B cell with tetracycline-repressible c-Myc transgene | none
For the STAT1 experiments, a total of 4,007 of the PETs from the stimulated cells and 3,180 PETs from unstimulated cells were mapped to the ENCODE regions. The data from the unstimulated cells were used as the negative control. Only STAT1 PETs mapped to the ENCODE regions are shown in this track. Display Conventions and Configuration In the graphical display, PET sequences are shown as two blocks, representing the ends of the pair, connected by a thin arrowed line. Overlapping PET clusters (PET fragments that overlap one another) originating from the ChIP enrichment process define the genomic loci that are potential transcription factor binding sites (TFBSs). PET singletons, from non-specific ChIP fragments that did not cluster, are not shown. In full and packed display modes, the arrowheads on the horizontal line represent the orientation of the PET sequence, and an ID of the format XXXXX-M is shown to the left of each PET, where X is the unique ID for each PET and M is the number of PET sequences at this location. The track coloring reflects the value of M: light gray indicates one or two sequences (score = 333), dark gray is used for three sequences (score = 800) and black indicates four or more PET sequences (score = 1000) at the location. Methods The cross-linked chromatin was sheared and precipitated with a high affinity antibody. The DNA fragments were end-polished and cloned into the plasmid vector, pGIS3.
pGIS3 contains two MmeI recognition sites that flank the cloning site, which were used to produce a 36 bp PET from the original ChIP DNA fragments (18 bp from each of the 5' and 3' ends). Multiple 36 bp PETs were concatenated and cloned into pZero-1 for sequencing, where each sequence read can generate 10-15 PETs. The PET sequences were extracted from raw sequence reads and mapped to the genome, defining the boundaries of each ChIP DNA fragment. The following specific mapping criteria were used: both 5' and 3' signatures must be present on the same chromosome; their 5' to 3' orientation must be correct; a minimal 17 bp match must exist for each 18 bp 5' and 3' signature; the tags must have genomic alignments within 7 Kb of each other. Due to the known possibility of MmeI slippage (+/- 1 bp) that leads to ambiguities at the PET signature boundaries, a minimal 17 bp match was set for each 18 bp signature. The total count of PET sequences mapped to the same locus but with slight nucleotide differences may reflect the expression level of the transcripts. Only PETs with specific mapping (one location) to the genome were considered. PETs that mapped to multiple locations may represent low complexity or repetitive sequences, and therefore were not included for further analysis. Verification Statistical and experimental verification exercises have shown that the overlapping PET clusters result from ChIP enrichment events. P53 HCT116 Monte Carlo simulation using the p53 ChIP-PET data estimated that about 27% of PET-2 clusters (PET clusters with two overlapping members), 3% of the PET clusters with 3 overlapping members (PET-3 clusters), and less than 0.0001% of PET clusters with more than 3 overlapping members were due to random chance. This suggests that the PET clusters most likely represent the real enrichment events by ChIP and that a higher number of overlapping fragments correlates to a higher probability of a real ChIP enrichment event.
Furthermore, based on goodness-of-fit analysis for assessing the reliability of PET clusters, it was estimated that less than 36% of the PET-2 clusters and over 99% of the PET-3+ clusters (clusters with three or more overlapping members) are true enrichment ChIP sites. Thus, the verification rate is nearly 100% for PET-3+ ChIP clusters, and the PET-2 clusters contain significant noise. In addition to these statistical analyses, 40 genomic locations identified by PET-3+ clusters were randomly selected and analyzed by quantitative real-time PCR. The relative enrichment of candidate regions compared to control GST ChIP DNA was determined and all 40 regions (100%) were confirmed to have significant enrichment of p53 ChIP clusters. STAT1 HeLa Monte Carlo simulation using the STAT1 ChIP-PET data from interferon gamma-stimulated dataset estimated that random chance accounted for about 58% of PET-3 clusters (maximal numbers of PETs within the overlap region of any cluster), 21% of the PET clusters with 4 overlapping members (PET-4 clusters), and less than 0.5% of PET clusters with more than 5 overlapping members. This suggests that the PET-5+ clusters represent the real enrichment events by ChIP and that a higher number of overlapping fragments correlates to a higher probability of a real ChIP enrichment event. Furthermore, based on goodness-of-fit analysis for assessing the reliability of PET clusters, it was estimated that less than 30% of the PET-4 clusters and over 90% of the PET-5+ clusters (clusters with five or more overlapping members) are true enrichment ChIP sites. In addition to these statistical analyses, 9 out of 14 genomic locations (64%) identified by PET-5+ clusters in the ENCODE regions were supported by ChIP-chip data from Yale using the same ChIP DNA as hybridization material. 
c-Myc P493 Monte Carlo simulation using the c-Myc ChIP-PET data estimated that about 32% of PET-3 clusters (maximal numbers of PETs within the overlap region of any cluster) and 4% of the PET clusters with 4 or more overlapping members (PET-4+ clusters) were due to random chance. This suggests that ~70% of PET-3+ clusters represent the real enrichment events by ChIP and that a higher number of overlapping fragments correlates to a higher probability of a real ChIP enrichment event. In addition to these statistical analyses, 29 genomic locations identified by PET-3+ clusters and 19 genomic locations defined by PET-2 clusters were randomly selected and subjected to quantitative real-time PCR analyses. The relative enrichment of candidate regions compared to control GST ChIP DNA was determined, and all 29 PET-3+ regions (100%) and 47% of the 19 PET-2 regions were confirmed to show significant enrichment of c-Myc ChIP, indicating that all of the PET-3+ and 47% of the PET-2 cluster-defined regions are true c-Myc bound targets. Credits The ChIP-PET library and sequence data were produced at the Genome Institute of Singapore. The data were mapped and analyzed by scientists from the Genome Institute of Singapore, the Bioinformatics Institute, Singapore, and Boston University. The STAT1 ChIP fragment prep was provided by Ghia Euskirchen from the Snyder lab at Yale. The c-Myc ChIP fragment prep was provided by Karen Zeller from the Dang lab at Johns Hopkins University. References Ng P, Wei CL, Sung WK, Chiu KP, Lipovich L, Ang CC, Gupta S, Shahab A, Ridwan A, Wong CH et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods. 2005 Feb;2(2):105-11. Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z et al. A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome. Cell 2006 Jan 13;124(1):207-19. Chiu KP, Wong CH, Chen Q, Ariyaratne P, Ooi HS, Wei CL, Sung WK, Ruan Y.
PET-Tool: a software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data. BMC Bioinformatics. 2006 Aug 25;7:390. Zeller KI, Zhao X, Lee CW, Chiu KP, Yao F, Yustein JT, Ooi HS, Orlov YL, Shahab A, Yong HC et al. Global mapping of c-Myc binding sites and target gene networks in human B cells. Proc Natl Acad Sci U S A. 2006 Nov 21;103(47):17834-9. encodeGisChipPetMycP493 cMyc P493 GIS ChIP-PET: c-Myc Ab on P493 B cells ENCODE Chromatin Immunoprecipitation encodeGisChipPetStat1NoGif STAT1 HeLa -gIF GIS ChIP-PET: STAT1 Ab on untreated HeLa cells ENCODE Chromatin Immunoprecipitation encodeGisChipPetStat1Gif STAT1 HeLa +gIF GIS ChIP-PET: STAT1 Ab on gIF treated HeLa cells ENCODE Chromatin Immunoprecipitation encodeGisChipPet p53 HCT116 +5FU GIS ChIP-PET: p53 Ab on 5FU treated HCT116 cells ENCODE Chromatin Immunoprecipitation Description This track shows genome-wide p53 binding sites as determined by chromatin immunoprecipitation (ChIP) and paired-end di-tag (PET) sequencing. The p53 protein is a transcription factor involved in the control of cell growth that is often expressed at high levels in cancer cells. See the Methods section below for more information about ChIP and PET. The PET sequences in this track are derived from 65,572 individual p53 ChIP fragments of 5-fluorouracil (5FU) stimulated HCT116 cells. More datasets will be submitted in the future, including STAT1, TAF250, and E2F1. Display Conventions and Configuration In the graphical display, PET sequences are shown as two blocks, representing the ends of the pair, connected by a thin arrowed line. Overlapping PET clusters (PET fragments that overlap one another) originating from the ChIP enrichment process define the genomic loci that are potential transcription factor binding sites (TFBSs). PET singletons, from non-specific ChIP fragments that did not cluster, are not shown. 
In full and packed display modes, the arrowheads on the horizontal line represent the orientation of the PET sequence, and an ID of the format XXXXX-M is shown to the left of each PET, where X is the unique ID for each PET and M is the number of PET sequences at this location. The track coloring reflects the value of M: light gray indicates one or two sequences (score = 333), dark gray is used for three sequences (score = 800) and black indicates four or more PET sequences (score = 1000) at the location. Methods HCT116 cells were treated with 5FU for six hours. The cross-linked chromatin was sheared and precipitated with a high affinity antibody. The DNA fragments were end-polished and cloned into the plasmid vector, pGIS3. pGIS3 contains two MmeI recognition sites that flank the cloning site, which were used to produce a 36 bp PET from the original ChIP DNA fragments (18 bp from each of the 5' and 3' ends). Multiple 36 bp PETs were concatenated and cloned into pZero-1 for sequencing, where each sequence read can generate 10-15 PETs. The PET sequences were extracted from raw sequence reads and mapped to the genome, defining the boundaries of each ChIP DNA fragment. The following specific mapping criteria were used: both 5' and 3' signatures must be present on the same chromosome; their 5' to 3' orientation must be correct; a minimal 17 bp match must exist for each 18 bp 5' and 3' signature; the tags must have genomic alignments within 4 Kb of each other. Due to the known possibility of MmeI slippage (+/- 1 bp) that leads to ambiguities at the PET signature boundaries, a minimal 17 bp match was set for each 18 bp signature. The total count of PET sequences mapped to the same locus but with slight nucleotide differences may reflect the expression level of the transcripts. Only PETs with specific mapping (one location) to the genome were considered.
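The mapping criteria listed above translate naturally into a small predicate. The Python sketch below is only an illustration, not the GIS mapping software: the `TagHit` record is hypothetical, 4 Kb is taken as 4,000 bp, and "correct orientation" is interpreted as both signatures lying on the same strand with the 5' signature upstream of the 3' signature.

```python
from dataclasses import dataclass


@dataclass
class TagHit:
    chrom: str       # chromosome of the genomic hit
    start: int       # start coordinate of the 18 bp signature
    strand: str      # "+" or "-"
    matched_bp: int  # matching bases out of the 18 bp signature


def is_valid_pet(five, three, max_span=4000, min_match=17):
    """Return True if a 5'/3' signature pair satisfies the listed mapping criteria."""
    if five.chrom != three.chrom:
        return False  # both signatures must be on the same chromosome
    if five.strand != three.strand:
        return False  # assumed reading of the orientation criterion
    if five.matched_bp < min_match or three.matched_bp < min_match:
        return False  # 17 of 18 bp allows for +/- 1 bp MmeI slippage
    # Distance measured in the direction of the strand, 5' signature first.
    if five.strand == "+":
        distance = three.start - five.start
    else:
        distance = five.start - three.start
    return 0 <= distance <= max_span


ok = is_valid_pet(TagHit("chr1", 1000, "+", 18), TagHit("chr1", 2500, "+", 17))
```

Passing `max_span=7000` would correspond to the 7 Kb limit quoted for the combined GIS ChIP-PET track.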
PETs that mapped to multiple locations may represent low complexity or repetitive sequences, and therefore were not included for further analysis. Verification Statistical and experimental verification exercises have shown that the overlapping PET clusters result from ChIP enrichment events. Monte Carlo simulation using the p53 ChIP-PET data estimated that about 27% of PET-2 clusters (PET clusters with two overlapping members), 3% of the PET clusters with 3 overlapping members (PET-3 clusters), and less than 0.0001% of PET clusters with more than 3 overlapping members were due to random chance. This suggests that the PET clusters most likely represent the real enrichment events by ChIP and that a higher number of overlapping fragments correlates to a higher probability of a real ChIP enrichment event. Furthermore, based on goodness-of-fit analysis for assessing the reliability of PET clusters, it was estimated that less than 36% of the PET-2 clusters and over 99% of the PET-3+ clusters (clusters with three or more overlapping members) are true enrichment ChIP sites. Thus, the verification rate is nearly 100% for PET-3+ ChIP clusters, and the PET-2 clusters contain significant noise. In addition to these statistical analyses, 40 genomic locations identified by PET-3+ clusters were randomly selected and analyzed by quantitative real-time PCR. The relative enrichment of candidate regions compared to control GST ChIP DNA was determined and all 40 regions (100%) were confirmed to have significant enrichment of p53 ChIP clusters. Credits The p53 ChIP-PET library and sequence data were produced at the Genome Institute of Singapore. The data were mapped and analyzed by scientists from the Genome Institute of Singapore, the Bioinformatics Institute, Singapore, and Boston University. References Ng P, Wei CL, Sung WK, Chiu KP, Lipovich L, Ang CC, Gupta S, Shahab A, Ridwan A, Wong CH, et al. 
Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods. 2005 Feb;2(2):105-11. Epub 2005 Jan 9. encodeGisRnaPet GIS-PET RNA Gene Identification Signature Paired-End Tags of PolyA+ RNA ENCODE Regions and Genes Description This track shows the starts and ends of mRNA transcripts determined by paired-end ditag (PET) sequencing. PETs are composed of 18 bases from either end of a cDNA; 36 bp PETs from many clones were concatenated together and cloned into pZero-1 for efficient sequencing. See the Methods and References sections below for more details on PET sequencing. The PET sequences in this track are full-length transcripts derived from two cell lines with differing treatments: the log phase of MCF7 cells; MCF7 cells treated with estrogen (10 nM beta-estradiol) for 12 hours; HCT116 cells treated with 5FU (5-fluorouracil) for 6 hours. In total, 584,624 PETs were generated for the log phase MCF7 cells, 153,179 PETs were generated for the estrogen-treated MCF7 cells, and 280,340 PETs were generated for the HCT116 cells. More than 80% of the PETs in the HCT116 and log phase MCF7 cells were mapped to the genome. The 474,278 log phase MCF7 PETs and 223,261 HCT116 PETs that mapped with single and multiple (up to ten) matches in the genome are shown in the two subtracks. For the estrogen-treated MCF7 cells, only those PETs mapped to the ENCODE regions with the above match criteria (4,881 total) are displayed. In the graphical display, the ends are represented by blocks connected by a horizontal line. In full and packed display modes, the arrowheads on the horizontal line represent the direction of transcription, and an ID of the format XXXXX-N-M is shown to the left of each PET, where X is the unique ID for each PET, N indicates the number of mapping locations in the genome (1 for a single mapping location, 2 for two mapping locations, and so forth), and M is the number of PET sequences at this location.
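For scripted use of this track's item names, the XXXXX-N-M label can be split into its three parts. The following Python sketch assumes the label is a plain string with exactly that layout; the `PetId` type and the example label are hypothetical.

```python
from typing import NamedTuple


class PetId(NamedTuple):
    pet_id: str             # XXXXX: unique ID of the PET
    mapping_locations: int  # N: number of mapping locations in the genome
    pet_count: int          # M: number of PET sequences at this location

    @property
    def is_unique(self) -> bool:
        """True when the PET maps to a single genomic location."""
        return self.mapping_locations == 1


def parse_pet_id(label: str) -> PetId:
    """Split an 'XXXXX-N-M' label; splitting from the right keeps any '-' inside the ID."""
    pet_id, n, m = label.rsplit("-", 2)
    return PetId(pet_id, int(n), int(m))


parsed = parse_pet_id("12345-1-3")  # hypothetical label
```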
The total count of PET sequences mapped to the same locus but with slight nucleotide differences may reflect the expression level of the transcripts. PETs that mapped to multiple locations may represent low complexity or repetitive sequences. The graphical display also uses color coding to reflect the uniqueness and expression level of each PET:
Color | Mapping | PETs observed at location
dark blue | unique | 2 or more
light blue | unique | 1
medium brown | multiple | 2 or more
light brown | multiple | 1
Methods PolyA+ RNA was isolated from the cells. A full-length cDNA library was constructed and converted into a PET library for Gene Identification Signature analysis (Ng et al., 2005). Generation of PET sequences involved cloning of cDNA sequences into the plasmid vector, pGIS3. pGIS3 contains two MmeI recognition sites that flank the cloning site, which were used to produce a 36 bp PET. Each 36 bp PET sequence contains 18 bp from each of the 5' and 3' ends of the original full-length cDNA clone. The 18 bp 3' signature contains 16 bp 3'-specific nucleotides and an AA residual of the polyA tail to indicate the sequence orientation. PET sequences were mapped to the genome using the following specific criteria: a minimal continuous 16 bp match must exist for the 5' signature; the 3' signature must have a minimal continuous 14 bp match; both 5' and 3' signatures must be present on the same chromosome; their 5' to 3' orientation must be correct; the maximal genomic span of a PET genomic alignment must be less than one million bp. Most of the PET sequences (more than 90%) were mapped to specific locations (single mapping loci). PETs mapping to 2 - 10 locations are also included and may represent duplicated genes or pseudogenes in the genome. Verification To assess overall PET quality and mapping specificity, the top ten most abundant PET clusters that mapped to well-characterized known genes were examined.
Over 99% of the PETs represented full-length transcripts, and the majority fell within ten bp of the known 5' and 3' boundaries of these transcripts. The PET mapping was further verified by confirming the existence of physical cDNA clones represented by the ditags. PCR primers were designed based on the PET sequences and amplified the corresponding cDNA inserts from the parental GIS flcDNA library for sequencing analysis. In a set of 86 arbitrarily-selected PETs representing a wide range of annotation categories — including known genes (38 PETs), predicted genes (2 PETs), and novel transcripts (46 PETs) — 84 (97.7%) confirmed the existence of bona fide transcripts. Credits The GIS-PET libraries and sequence data for transcriptome analysis were produced at the Genome Institute of Singapore. The data were mapped and analyzed by scientists from the Genome Institute of Singapore and the Bioinformatics Institute of Singapore. References Ng P, Wei CL, Sung WK, Chiu KP, Lipovich L, Ang CC, Gupta S, Shahab A, Ridwan A, Wong CH, et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods. 2005 Feb;2(2):105-11. Epub 2005 Jan 9. encodeGisRnaPetHCT116 GIS RNA HCT116 Gene Identification Signature Paired-End Tags of PolyA+ RNA (5FU-stim HCT116) ENCODE Regions and Genes encodeGisRnaPetMCF7Estr GIS RNA MCF7 Est Gene Identification Signature Paired-End Tags of PolyA+ RNA (estrogen-stim MCF7) ENCODE Regions and Genes encodeGisRnaPetMCF7 GIS RNA MCF7 Gene Identification Signature Paired-End Tags of PolyA+ RNA (log phase MCF7) ENCODE Regions and Genes gnfAtlas2 GNF Atlas 2 GNF Expression Atlas 2 Expression Description This track shows expression data from the GNF Gene Expression Atlas 2. This contains two replicates each of 79 human tissues run over Affymetrix microarrays. By default, averages of related tissues are shown. 
Display all tissues by selecting "All Arrays" from the "Combine arrays" menu on the track settings page. As is standard with microarray data red indicates overexpression in the tissue, and green indicates underexpression. You may want to view gene expression with the Gene Sorter as well as the Genome Browser. Credits Thanks to the Genomics Institute of the Novartis Research Foundation (GNF) for the data underlying this track. References Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7. PMID: 15075390; PMC: PMC395923 affyRatio GNF Ratio GNF Gene Expression Atlas Ratios Using Affymetrix GeneChips Expression Description This track shows expression data from GNF (The Genomics Institute of the Novartis Research Foundation) using Affymetrix GeneChips. The chip types, chip IDs or tissue averages associated with experiments can be displayed by selecting the appropriate option from the Experiment Display menu on the track description page. For more information, see the Track Configuration section. Methods For detailed information about the experiments, see Su et al. 2002 in the References section below. Alignments displayed on the track correspond to the target sequences used by Affymetrix to choose probes. In dense display mode, the track color denotes the average signal over all experiments on a log base 2 scale. Lighter colors correspond to lower signals and darker colors correspond to higher signals. In full display mode, the color of each item represents the log base 2 ratio of the signal of that particular experiment to the median signal of all experiments for that probe. More information about individual probes and probe sets is available on the Affymetrix website. 
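The full-mode coloring rule described above (log base 2 of an experiment's signal over the median signal of all experiments for the probe) is easy to reproduce. The Python sketch below is only an illustration: the tissue names and signal values are made up, and signals are assumed to be strictly positive.

```python
import math
from statistics import median


def log2_ratios(signals):
    """Map each experiment to log2(signal / median signal across experiments)."""
    med = median(signals.values())
    return {name: math.log2(value / med) for name, value in signals.items()}


# Hypothetical signals for one probe across three experiments.
ratios = log2_ratios({"liver": 400.0, "heart": 100.0, "brain": 200.0})
```

A positive ratio means the probe's signal in that experiment is above its median across experiments, and a negative ratio means it is below.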
Track Configuration This track may be configured to change the display mode and colors or vary the type of experiment information shown. The configuration controls are located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or the link on the track's control menu. Display mode: To change the display mode for the track, select the desired display setting from the Display Mode pulldown list. Combine Arrays: All arrays may be displayed with either the chip ID or the tissue type as the label. Replicate arrays may also be combined by expression medians. When you have finished making changes, click the Submit button to commit your changes and return to the Genome Browser tracks display. Credits Thanks to GNF for providing these data. References Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A. et al. Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA 99(7), 4465-70 (2002). HInvGeneMrna H-Inv H-Invitational Genes mRNA Alignments mRNA and EST Description This track shows alignments of full-length cDNAs that were used as the basis of the H-Invitational Gene Database (HInv-DB). The HInv-DB is a human gene database containing human-curated annotation of 41,118 full-length cDNA clones representing 21,037 cDNA clusters. The project was initiated in 2002 and the database became publicly available in April 2004. 
HInv-DB entries describe the following entities: gene structures; functions; novel alternative splicing isoforms; non-coding functional RNAs; functional domains; sub-cellular localizations; metabolic pathways; predictions of protein 3D structure; mapping of SNPs and microsatellite repeat motifs in relation to orphan diseases; gene expression profiling; and comparative results with mouse full-length cDNAs. Methods To cluster redundant cDNAs and alternative splicing variants within the H-Inv cDNAs, a total of 41,118 H-Inv cDNAs were mapped to the human genome using the mapping pipeline developed by the Japan Biological Information Research Center (JBIRC). The mapping yielded 40,140 cDNAs that were aligned against the genome using the stringent criteria of at least 95% identity and 90% length coverage. These 40,140 cDNAs were clustered to 20,190 loci, resulting in an average of 2.0 cDNAs per locus. For the remaining 978 unmapped cDNAs, cDNA-based clustering was applied, yielding 847 clusters. In total, 21,037 clusters (20,190 mapped and 847 unmapped) were identified and integrated into H-InvDB. H-Inv cluster IDs (e.g. HIX0000001) were assigned to these clusters. A representative sequence was selected from each cluster and used for further analyses and annotation. A full description of the construction of the HInv-DB is contained in the report by the H-Inv Consortium (see References section). Credits The H-InvDB is hosted at the JBIRC. The human-curated annotations were produced during invitational annotation meetings held in Japan during the summer of 2002, with a follow-up meeting in November 2004. Participants included 158 scientists representing 67 institutions from 12 countries. The full-length cDNA clones and sequences were produced by the Chinese National Human Genome Center (CHGC), the Deutsches Krebsforschungszentrum (DKFZ/MIPS), Helix Research Institute, Inc.
(HRI), the Institute of Medical Science in the University of Tokyo (IMSUT), the Kazusa DNA Research Institute (KDRI), the Mammalian Gene Collection (MGC/NIH) and the Full-Length Long Japan (FLJ) project. References Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M et al. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004 Jun;2(6):e162. PMID: 15103394; PMC: PMC393292 encodeHapMapCov HapMap Coverage ENCODE HapMap (16c.1) Resequencing Coverage ENCODE Variation Description This track shows the depth of sequencing coverage for the four HapMap populations in the ten ENCODE regions that have been resequenced for variation. The data for each population is shown in a separate subtrack: HapMap Allele Frequencies (CEU): Utah residents with ancestry from northern and western Europe HapMap Allele Frequencies (CHB): Han Chinese in Beijing, China HapMap Allele Frequencies (JPT): Japanese in Tokyo, Japan HapMap Allele Frequencies (YRI): Yoruba in Ibadan, Nigeria The ENCODE regions targeted in this annotation include: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) Display Conventions and Configuration The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Each data value represents the number of sequencing traces that covered the nucleotide. See the International HapMap Project website for information about how these data were collected and analyzed.
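The per-nucleotide value described in the Methods above, the number of sequencing traces covering each base, can be illustrated with a short sketch. This is not the HapMap pipeline; the function name is hypothetical, the trace coordinates are invented, and traces are assumed to be given as half-open (start, end) intervals.

```python
def coverage_depth(traces, region_start, region_end):
    """Count how many traces cover each base in [region_start, region_end).

    `traces` is a list of half-open (start, end) intervals in the same
    coordinate system as the region.
    """
    depth = [0] * (region_end - region_start)
    for start, end in traces:
        # Clip each trace to the region before counting.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


# Three hypothetical traces over an 8-base region.
traces = [(0, 5), (3, 8), (4, 6)]
depth = coverage_depth(traces, 0, 8)
```

For real region sizes, a difference array or an interval library would be a more efficient choice than the per-base loop used here for clarity.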
Credits These data were obtained from HapMap public release 16c.1. Thanks to the International HapMap Project for making this information available. encodeHapMapCovYRI HapMap Cov YRI HapMap Resequencing Coverage Yoruban (YRI) ENCODE Variation encodeHapMapCovJPT HapMap Cov JPT HapMap Resequencing Coverage Japanese (JPT) ENCODE Variation encodeHapMapCovCHB HapMap Cov CHB HapMap Resequencing Coverage Chinese (CHB) ENCODE Variation encodeHapMapCovCEU HapMap Cov CEU HapMap Resequencing Coverage CEPH (CEU) ENCODE Variation encodeHapMapAlleleFreq HapMap SNPs ENCODE HapMap (16c.1) Allele Frequencies ENCODE Variation Description This track shows allele frequencies for the four HapMap populations in the ten ENCODE regions that have been resequenced for variation. The data for each population is shown in a separate subtrack: HapMap Allele Frequencies (CEU): Utah residents with ancestry from northern and western Europe HapMap Allele Frequencies (CHB): Han Chinese in Beijing, China HapMap Allele Frequencies (JPT): Japanese in Tokyo, Japan HapMap Allele Frequencies (YRI): Yoruba in Ibadan, Nigeria The ENCODE regions targeted in this annotation include: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) See the Methods section for a discussion of the scoring method used in this annotation. The data set combines SNPs from the HapMap resequencing project, in addition to SNPs discovered previously. Display Conventions and Configuration The complete list of subtracks available in this annotation is shown at the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Allele locations are indicated by tickmarks using a grayscale coloring scheme based on score, where darker shading indicates a higher score. 
A lower score indicates little or no variation; a higher score indicates a split between the reference and variant observations in the population. The track details page for an individual allele displays the variant and reference sequences, the allele frequencies, the origination of the data, and the total sample count. Methods See the International HapMap Project website for information about how these data were collected and analyzed. The score calculation in this annotation is a function of the minor allele frequency (maf), which varies from 0.0 to 0.5. The score has been normalized to a range of 500 to 1000 using the formula score = 500 + (maf * 1000). Thus, a score of 500 indicates no variation; a score of 1000 indicates an even split between reference and variant observations in the population. Credits These data were obtained from HapMap public release 16c.1. Thanks to the International HapMap Project for making this information available. encodeHapMapAlleleFreqYRI Allele Freq YRI HapMap Minor Allele Frequencies Yoruban (YRI) ENCODE Variation encodeHapMapAlleleFreqJPT Allele Freq JPT HapMap Minor Allele Frequencies Japanese (JPT) ENCODE Variation encodeHapMapAlleleFreqCHB Allele Freq CHB HapMap Minor Allele Frequencies Chinese (CHB) ENCODE Variation encodeHapMapAlleleFreqCEU Allele Freq CEU HapMap Minor Allele Frequencies CEPH (CEU) ENCODE Variation est Human ESTs Human ESTs Including Unspliced mRNA and EST Description This track shows alignments between human expressed sequence tags (ESTs) in GenBank and the genome. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. NOTE: As of April, 2007, we no longer include GenBank sequences that contain the following URL as part of the record: http://fulllength.invitrogen.com Some of these entries are the result of alignment to pseudogenes, followed by "correction" of the EST to match the genomic sequence. 
It is therefore not the sequence of the actual EST and makes it appear that the EST is transcribed. Invitrogen no longer sells the clones. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. 
This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, click here. Several types of alignment gap may also be colored; for more information, click here. Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, human ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. 
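The alignment selection rule in the last sentences of the Methods above can be sketched as follows. This is an illustration only, not the UCSC pipeline: the function name and the example alignments are hypothetical, and "within 0.5% of the best" is interpreted here as 0.5 percentage points of base identity, which is an assumption.

```python
def keep_alignments(alignments, near_best=0.5, min_identity=96.0):
    """Keep alignments near the best identity and above a minimum identity.

    `alignments` is a list of (locus, percent_identity) tuples for one EST.
    An alignment is kept if its identity is within `near_best` percentage
    points of the best alignment and is at least `min_identity`.
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [
        (locus, identity)
        for locus, identity in alignments
        if identity >= best - near_best and identity >= min_identity
    ]


# Hypothetical alignments of a single EST to four loci.
est_hits = [("chr1", 99.2), ("chr5", 98.9), ("chr7", 97.0), ("chrX", 95.0)]
kept = keep_alignments(est_hits)
```

In this example the chr7 hit passes the 96% floor but is dropped because it is more than 0.5 points below the best hit, which shows that both conditions must hold.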
Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002 Apr;12(4):656-64. encodeRna Known+Pred RNA Known and Predicted RNA Transcription in the ENCODE Regions ENCODE Regions and Genes Description This track shows the locations of known and predicted non-protein-coding RNA genes and pseudogenes that fall within the ENCODE regions. It contains all information in Sean Eddy's RNA Genes track for these regions, combined with computational predictions generated by Jakob Skou Pedersen's EvoFold algorithm. In addition to the fields contained in the RNA Genes track, this track also includes ENCODE-related fields describing overlap with transcribed regions and repeats. Feature types in this annotation include: tRNA: transfer RNA (or pseudogene) rRNA: ribosomal RNA (or pseudogene) scRNA: small cytoplasmic RNA (or pseudogene) snRNA: small nuclear RNA (or pseudogene) snoRNA: small nucleolar RNA (or pseudogene) miRNA: microRNA (or pseudogene) misc_RNA: miscellaneous other RNA, such as Xist (or pseudogene) "-": unknown RNA Display Conventions and Configuration The locations of the RNA genes and pseudogenes are represented by blocks in the graphical display, color-coded as follows: Black: region is Repeatmasked. Green: region is transcribed. Red: region is from the RNA Genes track and is not transcribed. Blue: region is an EvoFold prediction and is not transcribed. The display may be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page.
Methods The RNA Genes track was supplemented with EvoFold predictions and filtered to include only those items that lie within the ENCODE regions. Regions that are at least 10 percent Repeatmasked are flagged because no transcriptional data is available for them. A region is considered transcribed if at least 10 percent overlaps with any Affymetrix transcribed fragment (transfrag), derived from six microarray experiments, or Yale transcriptionally-active region (TAR), derived from 15 microarray experiments. In these cases, each array from which the overlapped transfrags and TARs were derived is listed. EvoFold is a comparative method that exploits the evolutionary signal of genomic multiple-sequence alignments for identifying conserved functional RNA structures. The method makes use of phylogenetic stochastic context-free grammars (phylo-SCFGs), which are combined probabilistic models of RNA secondary structure and primary sequence evolution. The predictions consist both of a specific RNA secondary structure and an overall score. The overall score is essentially a log-odds score between a phylo-SCFG modeling the constrained evolution of stem-pairing regions and one which only models unpaired regions. Two sets of EvoFold predictions are included in this track. The first, labeled EvoFold, contains predictions based on the conserved elements of an 8-way vertebrate alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebrafish, and Fugu assemblies. The second set of predictions, TBA23_EvoFold, was based on the conserved elements of the 23-way TBA alignments present in the ENCODE regions. When a pair of these predictions overlap, only the EvoFold prediction is shown. Credits These data were kindly provided by Sean Eddy at Washington University, Jakob Skou Pedersen at UC Santa Cruz, and The Encode Consortium. This annotation track was generated by Matt Weirauch. References Knudsen, B. and J.J. Hein.
RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6), 446-54 (1999). Pedersen, J.S., Bejerano, G. and Haussler, D. Identification and classification of conserved RNA secondary structures in the human genome. (In preparation). encodeUcsdNgGif LI Ng gIF ChIP Ludwig Institute/UCSD ChIP-chip NimbleGen - Gamma Interferon Experiments ENCODE Chromatin Immunoprecipitation Description This track displays results of the following ChIP-chip (NimbleGen) gamma interferon experiments on HeLa cells: anti-H3K4me2, no gamma interferon anti-H3K4me2, 30 minutes after gamma interferon anti-H3K4me3, no gamma interferon anti-H3K4me3, 30 minutes after gamma interferon anti-H3ac, no gamma interferon anti-H3ac, 30 minutes after gamma interferon anti-H4ac, no gamma interferon anti-STAT1, 30 minutes after gamma interferon anti-RNA Pol2 in initiation complex, no gamma interferon anti-RNA Pol2 in initiation complex, 30 minutes after gamma interferon ENCODE region-wide location analysis of dimethylated K4 histone H3 (H3K4me2 or diMeH3K4), trimethylated K4 histone H3 (H3K4me3 or triMeH3K4), RNA polymerase II, acetylated histone H3 (H3ac or AcH3), acetylated histone H4 (H4ac or AcH4) and STAT1 was conducted with ChIP-chip using chromatin extracted from HeLa cells induced for 30 minutes with gamma interferon as well as uninduced cells. Methods Chromatin from both induced and uninduced HeLa cells was separately cross-linked, precipitated with different antibodies, sheared, amplified and hybridized to an oligonucleotide tiling array produced by NimbleGen Systems. The array includes non-repetitive sequences within the 44 ENCODE regions tiled from NCBI Build 35 (UCSC hg17) with 50-mer probes at 38 bp intervals. For H3K4me3 and Pol2, intensity values for biological replicate arrays were combined after quantile normalization using R.
The averages of the quantile normalized intensity values for each probe were then median-scaled and Loess-normalized using R to obtain the adjusted logR-values. For all the other markers, each replicate was Loess-normalized and combined after intensity-based quantile normalization. The average log ratio for each probe was derived using linear model fitting with R. The peak positions were identified using the Mpeak program. Verification Three biological replicates were used to generate the track for each factor at each time point with the exception of RNA Pol2 uninduced, where only two biological replicates were used. Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. encodeUcsdNgHeLaStat1_p30_peak LI STAT1 +gIF Pk Ludwig Institute/UCSD ChIP-chip Ng Peak: HeLa, STAT1, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaAcH4_p0_peak LI H4ac -gIF Pk Ludwig Institute/UCSD ChIP-chip Ng Peak: HeLa, H4ac, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaAcH3_p30_peak LI H3ac +gIF Pk Ludwig Institute/UCSD ChIP-chip Ng Peak: HeLa, H3ac, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaAcH3_p0_peak LI H3ac -gIF Pk Ludwig Institute/UCSD ChIP-chip Ng Peak: HeLa, H3ac, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaDmH3K4_p30_peak LI H3K4m2 +IF Pk Ludwig Institute/UCSD ChIP-chip Ng Peak: HeLa, H3K4me2, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaDmH3K4_p0_peak LI H3K4m2 -IF Pk Ludwig Institute/UCSD ChIP-chip Ng Peak: HeLa, H3K4me2, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaRnap_p30 LI Pol2 +gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, Pol2, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaRnap_p0 LI Pol2 -gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, Pol2, no gamma
interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaStat1_p30 LI STAT1 +gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, STAT1, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaAcH4_p0 LI H4ac -gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H4ac, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaAcH3_p30 LI H3ac +gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H3ac, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaAcH3_p0 LI H3ac -gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H3ac, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaH3K4me3_p30 LI H3K4m3 +gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H3K4me3, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaH3K4me3_p0 LI H3K4me3 -gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H3K4me3, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaDmH3K4_p30 LI H3K4me2 +gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H3K4me2, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaDmH3K4_p0 LI H3K4me2 -gIF Ludwig Institute/UCSD ChIP-chip Ng: HeLa, H3K4me2, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgChip LI Ng TAF1 IMR90 Ludwig Institute NimbleGen ChIP-chip: TAF1 antibody, IMR90 cells ENCODE Chromatin Immunoprecipitation Description This genome-wide track shows likely TAF1 binding sites in fibroblastoid (IMR90) cells as assayed by ChIP-chip using a NimbleGen microarray. The two subtracks displayed below the main track show known TAF1 binding sites and additional novel sites where, according to the data in the first track, TAF1 is most likely to bind. TAF1, a protein found at the start of transcribed genes, is a general transcription factor that is a key part of the pre-initiation complex found on the promoter. 
It is more fully known as TBP-associated factor 1 of the TFIID complex or by its molecular weight as TAF250. To survey the entire human genome in an unbiased fashion, a total of 38 high-density oligonucleotide arrays (NimbleGen platform) were fabricated, representing approximately 1.45 billion base pairs of non-repetitive DNA with 50-mer oligonucleotides positioned at every 100 base pairs throughout the human genome (UCSC hg16). Using this array, genome-wide location analysis of TAF1 was conducted employing ChIP-chip using chromatin extracted from primary fibroblast IMR90 cells. Methods Chromatin from IMR90 cell lines was cross-linked, precipitated with TAF1 antibody (sc-735, Santa Cruz), sheared, amplified and hybridized to 38 high-density oligonucleotide arrays (NimbleGen). These arrays contain a total of 14,535,659 50-mer oligonucleotides positioned at every 100 base pairs through the human genome (UCSC hg16). Using this set of arrays, a total of 9,966 clusters of TFIID binding sites were identified. To verify the binding of TFIID to these sequences, a condensed array was designed containing a total of 379,521 oligonucleotides to represent the 9,966 putative TFIID binding sequences plus 29 control genomic loci at 100 bp resolution. Using these condensed arrays, two independent chromatin immunoprecipitation (ChIP) experiments were performed with the antibodies against TAF1, RNA polymerase II, acetylated histone 3 and dimethylated K4 histone 3. A total of 8,597 TFIID binding regions, ranging in size from 400 bp to 9.8 Kbp, were confirmed by the TAF1 replicate experiments. The verification data can be viewed in the LI/Ng Validation track. To further define the sites of TFIID binding within the identified regions, a model-based peak-finding algorithm was developed that estimates the most likely TFIID binding sites based on the hybridization intensity of probes within each fragment.
The signals from a set of consecutive significantly-enriched probes were collectively used to assign the most likely TFIID binding site to the probe with the peak signal. The algorithm predicted a total of 12,150 TFIID binding sites within the 8,597 confirmed TFIID binding fragments. The locations of the 12,150 peaks were compared to the annotated 5' end of transcripts from RefSeq, GenBank and DBTSS, using a cutoff of 2.5 Kbp. It was found that 10,504 peaks corresponding to 9,281 non-redundant transcripts were within 2.5 Kbp of the annotated 5' end. 47 of the remaining peaks were within 2.5 Kbp of Ensembl genes, resulting in a total of 9,328 known non-redundant promoters. The remaining peaks were further filtered using Acembly annotation and H3ac, RNAP and MeH3K4 ChIP-chip data. The total number of novel peaks was 1,239. The raw data have been deposited in GEO (GSE2672) and will be released following the publication of the paper. Verification The peaks from genome scan experiments were verified using condensed arrays, as described in the Methods section. The verification data may be viewed in the LI/Ng Validation track. References Kim, T.H., Barrera, L.O., Zheng, M., Qu, C., Singer, M.A., Richmond, T.A., Wu, Y., Green, R.D. and Ren, B. A high-resolution map of active promoters in the human genome. Nature 436, 876-880 (2005).
encodeUcsdNgChipNovelSites TAF1 Novel Sites Ludwig Institute TAF1 Sites Matching No Known Transcripts ENCODE Chromatin Immunoprecipitation encodeUcsdNgChipKnownSites TAF1 Known Sites Ludwig Institute TAF1 Sites Matching to Known Transcripts ENCODE Chromatin Immunoprecipitation encodeUcsdNgChipSignal LI Ng TAF1 IMR90 Ludwig Institute NimbleGen ChIP-chip: TAF1 antibody, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdNgValChip LI Ng Validation Ludwig Institute ChIP-chip Validation: IMR90 cells ENCODE Chromatin Immunoprecipitation Description This track displays validation data from ChIP-chip experiments on four factors in IMR90 cells using a condensed array covering putative TAF1 binding sites. This track may be used to validate the whole genome scan shown in the LI/Ng TAF1 IMR90 track (Ludwig Institute, UCSD ChIP-chip (TAF1) genome scan). All four factors (Pol2, H3ac, H3K4me2, and TAF1 itself) are associated with the start of transcribed genes. Thus, there should be a very strong correlation between the signals shown in this track and the LI/Ng TAF1 IMR90 track. TAF1 is a component of TFIID, which is itself a component of the pre-initiation complex that assembles on promoter regions. Pol2, more fully known as RNA Polymerase II, is the enzyme responsible for transcription of mRNA. The antibody against Pol2, 8WG16 from the Abcam catalog, binds specifically to the non-phosphorylated form of Pol2, which is associated with the pre-initiation complex. H3ac and H3K4me2 are forms of histone H3 that are associated with transcriptionally-active chromatin. Methods For the whole genome scan, chromatin from IMR90 cell lines was cross-linked, precipitated with TAF1 antibody (sc-735, Santa Cruz), sheared, amplified and hybridized to 38 high-density oligonucleotide arrays (NimbleGen). These arrays contain a total of 14,535,659 50-mer oligonucleotides, positioned at every 100 base pairs throughout the human genome (UCSC hg16).
Using this set of arrays, a total of 9,966 clusters of TAF1 binding sites were identified. The whole genome scan data can be viewed in the LI/Ng TAF1 IMR90 track (Ludwig Institute, UCSD ChIP-chip (TAF1) genome scan). To verify the binding of TAF1 to these sequences, a condensed array was designed containing a total of 379,521 oligonucleotides to represent the 9,966 putative TAF1 binding sequences plus 29 control genomic loci at 100 bp resolution. Using these condensed arrays, two independent chromatin immunoprecipitation (ChIP) experiments were performed with the antibodies against TAF1, Pol2, acetylated histone 3 and dimethylated K4 histone 3. A total of 8,597 TAF1 binding regions, ranging in size from 400 bp to 9.8 Kbp, were confirmed by the TAF1 replicate experiments. The raw data have been deposited in GEO (GSE2672) and will be released following publication of the paper. Verification The peaks from genome scan experiments were verified using condensed arrays, as described in the Methods section. References Kim, T.H., Barrera, L.O., Zheng, M., Qu, C., Singer, M.A., Richmond, T.A., Wu, Y., Green, R.D. and Ren, B. A high-resolution map of active promoters in the human genome. Nature 436, 876-880 (2005). encodeUcsdNgValChipTaf LI Ng Val TAF1 Ludwig Institute ChIP-chip Validation: TAF1 antibody, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdNgValChipRnap LI Ng Val Pol2 Ludwig Institute ChIP-chip Validation: Pol2 8WG16 antibody, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdNgValChipH3K4me LI Ng Val H3K4m2 Ludwig Institute ChIP-chip Validation: H3K4me2 antibody, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdNgValChipH3ac LI Ng Val H3ac Ludwig Institute ChIP-chip Validation: H3ac antibody, IMR90 cells ENCODE Chromatin Immunoprecipitation ctgPos Map Contigs Physical Map Contigs Mapping and Sequencing Description This track shows the locations of human contigs on the physical map. 
The underlying data is derived from the NCBI seq_contig.md file that accompanies this assembly. All contigs are "+" oriented in the assembly. Methods For human genome reference sequences dated April 2003 and later, the individual chromosome sequencing centers are responsible for preparing the assembly of their chromosomes in AGP format. The files provided by these centers are checked and validated at NCBI, and form the basis for the seq_contig.md file that defines the physical map contigs. For more information on the human genome assembly process, see The NCBI Handbook. encodeMavidAlign MAVID Alignment MAVID Alignments ENCODE Comparative Genomics Description This track displays human-centric multiple sequence alignments in the ENCODE regions for the 28 vertebrates included in the September 2005 ENCODE MSA freeze, based on comparative sequence data generated for the ENCODE project as well as whole-genome assemblies residing at UCSC, as listed: human (May 2004, hg17) armadillo (NISC and May 2005 Broad Assisted Assembly v 1.0) baboon (NISC) chicken (Feb 2004, galGal2) chimp (Nov 2003, panTro1) colobus_monkey (NISC) cow (BCM) dog (July 2004, canFam1) dusky_titi (NISC) elephant (NISC and May 2005 Broad Assisted Assembly v 1.0) fugu (Aug 2002, fr1) galago (NISC) hedgehog (NISC) macaque (Jan 2005, rheMac1) marmoset (NISC) monodelphis (Oct 2004, monDom1) mouse (Mar 2005, mm6) mouse_lemur (NISC) owl_monkey (NISC) platypus (NISC and Aug 2005 Mullikin Phusion Assembly of WUGSC Traces) rabbit (NISC and May 2005 Broad Assisted Assembly v 1.0) rat (June 2003, rn3) rfbat (NISC) shrew (NISC and Sep 2005 Mullikin Phusion Assembly of Broad Traces) tenrec (Apr 2005 Mullikin Phusion Assembly of Broad Traces) tetraodon (Feb 2004, tetNig1) xenopus (Oct 2004, xenTro1) zebrafish (June 2004, danRer2) The alignments in this track were generated using the Mercator orthology mapping program and the MAVID multiple global alignment program. 
The Genome Browser companion tracks, MAVID Cons and MAVID Elements, display conservation scoring and conserved elements for these alignments based on various conservation methods. Display Conventions and Configuration In full display mode, this track shows pairwise alignments of each species aligned to the human genome. In dense mode, the alignments are depicted using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display. When zoomed-in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the "Gaps" line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Methods Mercator was first used to identify the colinear and orthologous segments in the sequences given for each ENCODE region. Input to Mercator was generated by using Genscan to predict genes in all sequences, Blat to compare predicted coding exons, and MUMmer to identify non-coding exact matches between all pairs of sequences. The output of Mercator was a small-scale one-to-one orthology map for each ENCODE region, as well as a set of alignment constraints based on matched landmarks (e.g., exons and long non-coding exact matches). MAVID was then used to construct a global multiple alignment of each colinear orthologous segment set specified in the orthology map. As part of its input, MAVID used a phylogenetic tree determined from alignments of four-fold degenerate sites in the ENCODE regions. 
Credits Generation of the MAVID alignments was engineered by Colin Dewey at the Pachter Lab Comparative Genomics Group at UC Berkeley. Mercator was written by Colin Dewey and Lior Pachter. MAVID was authored by Nicholas Bray and Lior Pachter. The phylogenetic tree is based on Murphy et al. (2001). References Bray, N. and Pachter, L. MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Res 14(4), 693-699 (2004). Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268(1), 78-94 (1997). Dewey, C.N. and Pachter, L. Mercator: multiple whole-genome orthology map construction. In preparation. Kent, W.J. BLAT-the BLAST-like alignment tool. Genome Res 12(4), 656-664 (2002). Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C. and Salzberg, S.L. Versatile and open software for comparing large genomes. Genome Biol 5(2), R12 (2004). Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001). encodeMavidCons MAVID Cons MAVID Conservation ENCODE Comparative Genomics Description This track displays different measurements of conservation based on the MAVID multiple sequence alignments of ENCODE regions shown in the MAVID Alignment track. Two programs — phastCons (phylogenetic hidden-Markov model method) and GERP (Genomic Evolutionary Rate Profiling) — were used to generate the conservation scoring shown in this track. A related track, MAVID Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track. For details on the conservation scores generated by each program, refer to the individual Methods subsections. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. 
The graphical configuration options are shown at the top of the track description page, followed by a list of the subtracks. A subtrack may be hidden from view by unchecking the box to the left of the track name in the list. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation scoring methods. See the Methods section for display information specific to each subtrack. Methods The methods used to create the MAVID alignments in the ENCODE regions are described in the MAVID Alignment track description. PhastCons The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model. These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1). For determining the conservation for the ENCODE alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit. The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions. PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.01 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%. The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM.
It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation. More details on phastCons can be found in Siepel et al. (2005) cited below. GERP The GERP score is the expected substitution rate minus the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA. The scores are both positive and negative, with negative values (i.e. observed > expected) corresponding to neutral or unconstrained sites and positive values (i.e. observed < expected) corresponding to constrained or slowly evolving sites. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites. The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base. Species that have gaps at a particular column are not considered in the scoring for that column. Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e. bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns. Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment. Credits PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler Lab at UCSC.
GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). The GERP data for this track was generated by Greg Cooper. The PhastCons data was generated by Elliott Margulies, with assistance from Adam Siepel. References Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. and Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res 13(12), 2507-18 (2003). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15(8), 1034-50 (2005). References for the MAVID alignment tools can be found on the MAVID Alignment track description page. encodeMavidGerpCons MAVID GERP Cons MAVID GERP Conservation ENCODE Comparative Genomics encodeMavidPhastCons MAVID PhastCons MAVID PhastCons Conservation ENCODE Comparative Genomics encodeMavidElements MAVID Elements MAVID Conserved Elements ENCODE Comparative Genomics Description This track displays multi-species conserved sequences (MCSs) derived from phastCons, binCons, and GERP (Genomic Evolutionary Rate Profiling), conservation scoring of human ENCODE genomic DNA alignments to 27 other vertebrates using the MAVID alignment package. The multiple sequence alignments may be viewed in the MAVID Alignment track. Another related track, MAVID Cons, shows the conservation scoring. The descriptions accompanying these tracks detail the methods used to create the alignments and conservation. Display Conventions and Configuration The locations of conserved elements are indicated by blocks in the graphical display. 
This composite annotation track consists of several subtracks that show conserved elements derived by the methods listed above. To show only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Display characteristics specific to certain subtracks are described in the respective Methods sections below. Methods PhastCons-based Elements The predicted MCSs are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM, i.e. maximal segments in which the maximum-likelihood (Viterbi) path remains in the conserved state. BinCons-based Elements The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate sites. Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was instead used to weight the individual species binomial scores; this latter weighting scheme was found to closely match 4D weights. The negative log of these p-values was then averaged across all human-referenced pairwise combinations, and the highest scoring overlapping 25 bp window for each base was the resulting score. This track shows the plotting of a ranked percentile score normalized between 0 and 1 across all ENCODE regions, such that the top 5% most conserved sequence across all ENCODE regions have a score of 0.95 or greater, the top 10% have a score of 0.9 or greater, and so on. 
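The ranked percentile normalization described above can be sketched as follows. This is only an illustration, not the binCons implementation: the function name percentile_scores is hypothetical, and the handling of tied raw scores (broken by input order) is an assumption, since the text does not specify it.

```python
from typing import List, Sequence


def percentile_scores(raw: Sequence[float]) -> List[float]:
    """Map raw conservation scores to ranked percentiles in (0, 1].

    The highest raw score maps to 1.0, so that the top 5% of positions
    receive a normalized score of 0.95 or greater, the top 10% receive
    0.9 or greater, and so on.
    """
    n = len(raw)
    # Indices sorted from lowest to highest raw score (ties keep input order).
    order = sorted(range(n), key=lambda i: raw[i])
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank / n
    return out


if __name__ == "__main__":
    print(percentile_scores([3.0, 1.0, 2.0, 4.0]))
```

Because the ranking is done across all ENCODE regions at once, a normalized score is comparable between regions even where raw scores are not.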
For each ENCODE target, a conservation score threshold was picked to match the number of conserved bases predicted by phastCons, an alternative method for measuring conservation. This latter method has been found slightly more reliable for predicting the expected fraction of conserved sequence in each target. Clusters of bases that exceeded the given conservation score threshold were designated as MCSs. The minimum length of an MCS is 25 bases. Strict cutoffs were used: if even one base fell below the conservation score threshold, it separated an MCS into two distinct regions. More details on binCons can be found in Margulies et al. (2003) cited below. GERP-based Elements GERP constrained elements exhibit significant evidence of the effects of purifying selection. Elements are scored according to the inferred intensity of purifying selection and are measured as "rejected substitutions" (RSs). RSs capture the magnitude of difference between the number of "observed" substitutions (estimated using maximum likelihood) and the number that would be "expected" under a neutral model of evolution. The RS is displayed as part of the item name. Items with higher RSs are displayed in a darker shade of blue. The score shown on the details page, which has been scaled by 300 for display purposes, is generally not as accurate as the RS count that is part of the item name. "Constrained elements" are identified as those groups of consecutive human bases that have an observed rate of evolution that is smaller than the expected rate. These groups of columns are merged if they are less than a few nucleotides apart and are scored according to the sum of the site-by-site difference between observed and expected rates (RS). Permutations of the actual alignments were analyzed, and the "constrained elements" identified in these permuted alignments were treated as "false positives".
Subsequently, an RS threshold was picked such that the total length of "false positive" constrained elements (identified in the permuted alignments) was less than 5% of the length of constrained elements identified in the actual alignment. Thus, all annotated constrained elements are significant at better than 95% confidence, and the total fraction of the ENCODE regions annotated as constrained is 5-7%. Credits PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler Lab at UCSC. BinCons was developed by Elliott Margulies of NHGRI, while at the Eric Green lab. BinCons and phastCons MCS data were contributed by Elliott Margulies, with assistance from Adam Siepel of UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). References See the MAVID Alignment and MAVID Cons tracks for references. encodeMavidGerpEl MAVID GERP MAVID GERP Conserved Elements ENCODE Comparative Genomics encodeMavidBinConsEl MAVID BinCons MAVID BinCons Conserved Elements ENCODE Comparative Genomics encodeMavidPhastConsEl MAVID PhastCons MAVID PhastCons Conserved Elements ENCODE Comparative Genomics mgcFullMrna MGC Genes Mammalian Gene Collection Full ORF mRNAs Genes and Gene Predictions Description This track shows alignments of human mRNAs from the Mammalian Gene Collection (MGC) having full-length open reading frames (ORFs) to the genome. The goal of the Mammalian Gene Collection is to provide researchers with unrestricted access to sequence-validated full-length protein-coding cDNA clones for human, mouse, and rat genes. Display Conventions and Configuration The track follows the display conventions for gene prediction tracks. 
An optional codon coloring feature is available for quick validation and comparison of gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Methods GenBank human MGC mRNAs identified as having full-length ORFs were aligned against the genome using blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 1% of the best and at least 95% base identity with the genomic sequence were kept. Credits The human MGC full-length mRNA track was produced at UCSC from mRNA sequence data submitted to GenBank by the Mammalian Gene Collection project. References Mammalian Gene Collection project references. Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 encodeMlaganAlign MLAGAN Alignment MLAGAN Alignments ENCODE Comparative Genomics Description This track displays human-centric multiple sequence alignments in the ENCODE regions for the 28 vertebrates included in the September 2005 ENCODE MSA freeze, based on comparative sequence data generated for the ENCODE project as well as whole-genome assemblies residing at UCSC, as listed: human (May 2004, hg17) chimp (Nov 2003, panTro1) colobus_monkey (NISC) baboon (NISC) macaque (Jan 2005, rheMac1) dusky_titi (NISC) owl_monkey (NISC) marmoset (NISC) mouse_lemur (NISC) galago (NISC) rat (June 2003, rn3) mouse (Mar 2005, mm6) rabbit (NISC and May 2005 Broad Assisted Assembly v 1.0) cow (BCM) dog (July 2004, canFam1) rfbat (NISC) hedgehog (NISC) shrew (NISC and Sep 2005 Mullikin Phusion Assembly of Broad Traces) armadillo (NISC and May 2005 Broad Assisted Assembly v 1.0) elephant (NISC and May 2005 Broad Assisted Assembly v 1.0) tenrec (Apr 2005 Mullikin Phusion Assembly of Broad Traces) monodelphis 
(Oct 2004, monDom1) platypus (NISC and Aug 2005 Mullikin Phusion Assembly of WUGSC Traces) chicken (Feb 2004, galGal2) xenopus (Oct 2004, xenTro1) tetraodon (Feb 2004, tetNig1) fugu (Aug 2002, fr1) zebrafish (June 2004, danRer2) The alignments in this track were generated using the LAGAN Alignment Toolkit. The Genome Browser companion tracks, MLAGAN Cons and MLAGAN Elements, display conservation scoring and conserved elements for these alignments based on various conservation methods. Display Conventions and Configuration In full display mode, this track shows pairwise alignments of each species aligned to the human genome. In dense mode, the alignments are depicted using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display. When zoomed-in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the "Gaps" line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Methods MLAGAN alignments were produced by a pipeline specifically designed for ENCODE. First, AB-BLAST was used to find local similarities (anchors) between the human sequence and the sequence of every other species. Then, Shuffle-LAGAN was used to calculate the highest-scoring human-monotonic chain of these local similarities (according to a scoring scheme that penalized evolutionary rearrangements), and — with the help of a utility called SuperMap — produce a map of orthologous segments, in increasing human coordinates. 
This map was used to undo the genomic rearrangements of the other sequence and convert it to a form that was directly alignable to the human sequence. The new humanized sequences, together with the human sequence, were then multiply aligned using MLAGAN. The resulting alignments were subsequently refined using MUSCLE, which processed small non-overlapping alignment windows and realigned them in an iterative fashion, keeping the refined alignment if it had a better sum-of-pairs score than the original. Finally, a pairwise refinement round was performed, during which the pieces that had very low identity (in the induced pairwise alignments between human and each species) were removed from the alignment. Credits The MLAGAN alignments were generated by George Asimenos from Stanford's ENCODE group. Shuffle-LAGAN, SuperMap and MLAGAN were written by Mike Brudno. MUSCLE was authored by Bob Edgar. WU-BLAST was provided by the Gish lab at the School of Medicine, Washington University in St. Louis. The phylogenetic tree is based on Murphy et al. (2001). References Brudno M, Do C, Cooper G, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13(4):721-31. Brudno M, Malde S, Poliakov A, Do C, Couronne O, Dubchak I, Batzoglou S. Global alignment: finding rearrangements during alignment. Bioinformatics. 2003;19(Suppl. 1):i54-i62. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res. 2004;32(5):1792-7. Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294(5550):2348-51. 
encodeMlaganCons MLAGAN Cons MLAGAN Conservation ENCODE Comparative Genomics Description This track displays different measurements of conservation based on the MLAGAN multiple sequence alignments of ENCODE regions shown in the MLAGAN Alignment track. Two programs, phastCons (phylogenetic hidden-Markov model method) and GERP (Genomic Evolutionary Rate Profiling), generated the conservation scoring used to create this track. A related track, MLAGAN Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track. For details on the conservation scores generated by each program, refer to the individual Methods subsections. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of the subtracks. A subtrack may be hidden from view by unchecking the box to the left of the track name in the list. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation scoring methods. See the Methods section for display information specific to each subtrack. Methods The methods used to create the MLAGAN alignments in the ENCODE regions are described in the MLAGAN Alignment track description. PhastCons The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model. These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1). 
For determining the conservation for the ENCODE alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit. The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions. PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.01 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%. The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM. It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation. More details on phastCons can be found in Siepel et al. (2005) cited below. GERP The GERP score is the expected substitution rate minus the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA. The scores are both positive and negative, with negative values (i.e. observed > expected) corresponding to neutral or unconstrained sites and positive values (i.e. observed < expected) corresponding to constrained or slowly evolving sites. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites. The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base. 
Species that have gaps at a particular column are not considered in the scoring for that column. Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e. bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns. Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment. Credits PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler Lab at UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). The GERP data for this track were generated by Greg Cooper. The PhastCons data were generated by Elliott Margulies, with assistance from Adam Siepel. References Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S. and Sidow, A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 15(7), 901-13 (2005). Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. and Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res 13(12), 2507-18 (2003). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15(8), 1034-50 (2005). References for the MLAGAN alignment tools can be found on the MLAGAN Alignment track description page. 
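The sign convention of the GERP score described in the Methods above can be made concrete with a minimal sketch. It is not the GERP program itself: the function names are hypothetical, the expected and observed rates are taken as given numbers rather than estimated on a tree, and treating a score of exactly 0 as unconstrained is an assumption, since the text only covers the positive and negative cases.

```python
def gerp_score(expected_rate: float, observed_rate: float) -> float:
    """GERP score for one alignment column: expected minus observed rate."""
    return expected_rate - observed_rate


def interpret(score: float) -> str:
    """Label a column according to the sign of its GERP score."""
    if score > 0:
        # observed < expected: fewer substitutions than under neutrality
        return "constrained or slowly evolving"
    # observed >= expected (zero is grouped here by assumption)
    return "neutral or unconstrained"


if __name__ == "__main__":
    for expected, observed in [(2.0, 0.5), (2.0, 2.75)]:
        s = gerp_score(expected, observed)
        print(expected, observed, s, interpret(s))
```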
encodeMlaganGerpCons MLAGAN GERP Cons MLAGAN GERP Conservation ENCODE Comparative Genomics encodeMlaganPhastCons MLAGAN PhastCons MLAGAN PhastCons Conservation ENCODE Comparative Genomics encodeMlaganElements MLAGAN Elements MLAGAN Conserved Elements ENCODE Comparative Genomics Description This track displays multi-species conserved sequences (MCSs) derived from phastCons, binCons, and GERP (Genomic Evolutionary Rate Profiling), conservation scoring of human ENCODE genomic DNA alignments to 27 other vertebrates using the MLAGAN alignment package. The multiple sequence alignments may be viewed in the MLAGAN Alignments track. Another related track, MLAGAN Cons, shows the conservation scoring. The descriptions accompanying these tracks detail the methods used to create the alignments and conservation. Display Conventions and Configuration The locations of conserved elements are indicated by blocks in the graphical display. This composite annotation track consists of several subtracks that show conserved elements derived by the methods listed above. To view only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Display characteristics specific to certain subtracks are described in the respective Methods sections below. Methods PhastCons-based Elements The predicted MCSs are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM, i.e. maximal segments in which the maximum-likelihood (Viterbi) path remains in the conserved state. 
BinCons-based Elements The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate sites. Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was instead used to weight the individual species binomial scores; this latter weighting scheme was found to closely match 4D weights. The negative log of these p-values was then averaged across all human-referenced pairwise combinations, and the highest scoring overlapping 25 bp window for each base was the resulting score. This track shows the plotting of a ranked percentile score normalized between 0 and 1 across all ENCODE regions, such that the top 5% most conserved sequence across all ENCODE regions have a score of 0.95 or greater, the top 10% have a score of 0.9 or greater, and so on. For each ENCODE target, a conservation score threshold was picked to match the number of conserved bases predicted by phastCons, an alternative method for measuring conservation. This latter method has been found slightly more reliable for predicting the expected fraction of conserved sequence in each target. Clusters of bases that exceeded the given conservation score threshold were designated as MCSs. The minimum length of an MCS is 25 bases. Strict cutoffs were used: if even one base fell below the conservation score threshold, it separated an MCS into two distinct regions. More details on binCons can be found in Margulies et al. (2003) cited below. GERP-based Elements GERP constrained elements exhibit significant evidence of the effects of purifying selection. Elements are scored according to the inferred intensity of purifying selection and are measured as "rejected substitutions" (RSs). 
RSs capture the magnitude of difference between the number of "observed" substitutions (estimated using maximum likelihood) and the number that would be "expected" under a neutral model of evolution. The RS is displayed as part of the item name. Items with higher RSs are displayed in a darker shade of blue. The score shown on the details page, which has been scaled by 300 for display purposes, is generally not as accurate as the RS count that is part of the item name. "Constrained elements" are identified as those groups of consecutive human bases that have an observed rate of evolution that is smaller than the expected rate. These groups of columns are merged if they are less than a few nucleotides apart and are scored according to the sum of the site-by-site difference between observed and expected rates (RS). Permutations of the actual alignments were analyzed, and the "constrained elements" identified in these permuted alignments were treated as "false positives". Subsequently, an RS threshold was picked such that the total length of "false positive" constrained elements (identified in the permuted alignments) was less than 5% of the length of constrained elements identified in the actual alignment. Thus, all annotated constrained elements are significant at better than 95% confidence, and the total fraction of the ENCODE regions annotated as constrained is 5-7%. Credits PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler Lab at UCSC. BinCons was developed by Elliott Margulies of NHGRI, while at the Eric Green lab. BinCons and phastCons MCS data were contributed by Elliott Margulies, with assistance from Adam Siepel of UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. 
of Computer Science, Stanford). References See the MLAGAN Alignment and MLAGAN Cons tracks for references. encodeMlaganGerpEl MLAGAN GERP MLAGAN GERP Conserved Elements ENCODE Comparative Genomics encodeMlaganBinConsEl MLAGAN BinCons MLAGAN BinCons Conserved Elements ENCODE Comparative Genomics encodeMlaganPhastConsEl MLAGAN PhastCons MLAGAN PhastCons Conserved Elements ENCODE Comparative Genomics nscan N-SCAN N-SCAN Gene Predictions Genes and Gene Predictions Description N-SCAN and N-SCAN EST gene predictions. Methods N-SCAN N-SCAN combines biological-signal modeling in the target genome sequence along with information from a multiple-genome alignment to generate de novo gene predictions. It extends the TWINSCAN target-informant genome pair to allow for an arbitrary number of informant sequences as well as richer models of sequence evolution. It models the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, insertions, and deletions. Human N-SCAN uses mouse (mm5) as the informant and iterative pseudogene masking. N-SCAN EST N-SCAN EST combines EST alignments into N-SCAN. Similar to the conservation sequence models in TWINSCAN, separate probability models are developed for EST alignments to genomic sequence in exons, introns, splice sites and UTRs, reflecting the EST alignment patterns in these regions. N-SCAN EST is more accurate than N-SCAN while retaining the ability to discover novel genes to which no ESTs align. Human N-SCAN EST uses mouse (mm5), rat (rn3), and chicken (galGal2) as informants. Credits Thanks to Michael Brent's Computational Genomics Group at Washington University St. Louis for providing this data. Special thanks for this implementation of N-SCAN to Aaron Tenney in the Brent lab, and Robert Zimmermann, currently at Max F. Perutz Laboratories in Vienna, Austria. References Gross SS, Brent MR. Using multiple alignments to improve gene prediction. In Proc. 9th Int'l Conf. 
on Research in Computational Molecular Biology (RECOMB '05):374-388 and J Comput Biol. 2006 Mar;13(2):379-93. Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001 Jun 1;17(90001)S140-8. van Baren MJ, Brent MR. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 2006 May;16(5):678-85. nscanGene N-SCAN N-SCAN Gene Predictions Genes and Gene Predictions Description This track shows gene predictions using the N-SCAN gene structure prediction software provided by the Computational Genomics Lab at Washington University in St. Louis, MO, USA. Methods N-SCAN combines biological-signal modeling in the target genome sequence along with information from a multiple-genome alignment to generate de novo gene predictions. It extends the TWINSCAN target-informant genome pair to allow for an arbitrary number of informant sequences as well as richer models of sequence evolution. N-SCAN models the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, insertions, and deletions. Human N-SCAN uses mouse (mm5) as the informant and iterative pseudogene masking. Credits Thanks to Michael Brent's Computational Genomics Group at Washington University St. Louis for providing this data. Special thanks for this implementation of N-SCAN to Aaron Tenney in the Brent lab, and Robert Zimmermann, currently at Max F. Perutz Laboratories in Vienna, Austria. References Gross SS, Brent MR. Using multiple alignments to improve gene prediction. J Comput Biol. 2006 Mar;13(2):379-93. PMID: 16597247 Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66. PMID: 14500829; PMC: PMC206470 Korf I, Flicek P, Duan D, Brent MR. 
Integrating genomic homology into gene structure prediction. Bioinformatics. 2001;17 Suppl 1:S140-8. PMID: 11473003 van Baren MJ, Brent MR. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 2006 May;16(5):678-85. PMID: 16651666; PMC: PMC1457044 nscanEstGene N-SCAN+EST N-SCAN+EST Gene Predictions Genes and Gene Predictions encodeIndels NHGRI DIPs NHGRI Deletion/Insertion Polymorphisms in ENCODE regions ENCODE Variation Description This track shows deletion/insertion polymorphisms (DIPs). In pack and full modes, the sequence variation is shown to the left of the DIP. The naming convention "-/sequence" is used for deletions; "sequence/-" is used for insertions. The details page shows the name of the trace used to define the polymorphism, the quality score, and the strand on which the trace aligns to the reference sequence. The quality score reflects the minimum PHRED quality value over the entire range of the DIP within the trace, plus 5 flanking bases. PHRED quality scores are expressed as log probabilities using the formula: Q = -10 * log10(Pe) where Pe is the estimated probability of an error at that base. PHRED quality scores typically vary from 0 to 40, where 0 indicates complete uncertainty about the base and 40 implies odds of 10,000 to 1 that the base is correct. Sometimes a PHRED value of 50 or higher is used to denote finished sequence. A color gradient is used to distinguish quality scores in the browser display: brighter shading indicates higher scores. The "Trace Pos" value on the details page indicates the 3' position of the DIP within the trace. The alleles are reported relative to the "+" strand of the reference sequence; however, the trace may actually align to the "-" strand. When viewing the chromatogram using the URL provided, if the trace aligned to the "-" strand, the DIP bases in the trace will be the reverse complement of the variant allele given.
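As a rough illustration of the PHRED formula and the DIP quality score described above, the following Python sketch converts between Q and Pe and takes the minimum quality over a DIP plus its flanking bases. The function names, the half-open [start, end) trace coordinates, and the clipping at the ends of the trace are assumptions made for this example; they are not taken from the software used to build the track.

```python
import math


def phred_to_error_prob(q):
    """Return Pe, the estimated error probability, for PHRED score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(pe):
    """Return Q = -10 * log10(Pe) for an error probability 0 < pe <= 1."""
    return -10 * math.log10(pe)


def dip_quality(quals, start, end, flank=5):
    """Minimum PHRED value over trace positions [start, end) plus flanks.

    quals : per-base PHRED values for the trace (hypothetical input layout)
    start, end : half-open range of the DIP within the trace (assumption)
    flank : number of flanking bases considered on each side
    """
    lo = max(0, start - flank)            # clip at the start of the trace
    hi = min(len(quals), end + flank)     # clip at the end of the trace
    return min(quals[lo:hi])


# Hypothetical trace: uniformly Q40 except one low-quality base at index 10.
trace_quals = [40] * 10 + [12] + [40] * 10
score = dip_quality(trace_quals, 12, 14)
print(score, phred_to_error_prob(score))
```

In this made-up trace the low-quality base lies two positions upstream of the DIP, so it falls inside the 5-base flank and determines the reported score.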
Methods All human trace data from NCBI's trace archive were aligned to hg17 with ssahaSNP, followed by ssahaDIP post-processing to detect deletion/insertion polymorphisms. DIPs within ENCODE regions were extracted. Verification For verification, 500k traces from the mouse whole genome shotgun (WGS) sequencing effort were compared to mm6 using ssahaSNP and ssahaDIP. Because mm6 and these traces are from the same mouse strain, C57BL/6J, the DIP rate should be very low. Applying a quality threshold of Q23, the detected DIP rate was one DIP per 140k Neighborhood Quality Standard (NQS) bases. This level was ten-fold lower than the SNP rate for the same data set using ssahaSNP, which has been validated as having a 5% false positive rate. The detected DIP rate for human traces against hg17 is one DIP per 12k NQS bases, indicating a false positive rate of 12k/140k, or about 8%. Further validation experiments are in progress. Credits All analyses were performed by Jim Mullikin using ssahaSNP and ssahaDIP. The trace data were contributed to the trace archive by many sequencing centers. References Ning Z, Cox AJ, Mullikin JC. SSAHA: A fast search method for large DNA databases. Genome Res. 2001 Oct;11(10):1725-9. The International SNP Map Working Group. A map of human genome sequence variation containing 1.4 million single nucleotide polymorphisms. Nature. 2001 Feb 15;409(6822):928-33. orfeomeMrna ORFeome Clones ORFeome Collaboration Gene Clones Genes and Gene Predictions Description This track shows alignments of human clones from the ORFeome Collaboration. The goal of the project is to be an "unrestricted source of fully sequence-validated full-ORF human cDNA clones in a format allowing easy transfer of the ORF sequences into virtually any type of expression vector. A major goal is to provide at least one fully-sequenced full-ORF clone for each human gene." This track is updated automatically as new clones become available. 
Display Conventions and Configuration The track follows the display conventions for gene prediction tracks. Methods ORFeome human clones were obtained from GenBank and aligned against the genome using the blat program. When a single clone aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits and References Visit the ORFeome Collaboration members page for a list of credits and references. xenoEst Other ESTs Non-Human ESTs from GenBank mRNA and EST Description This track displays translated blat alignments of expressed sequence tags (ESTs) in GenBank from organisms other than human. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) for this track is in two parts. The first + or - indicates the orientation of the query sequence whose translated protein produced the match. The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --, nor is +- the same as -+. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. 
For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods To generate this track, the ESTs were aligned against the genome using blat. When a single EST aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. 
References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoMrna Other mRNAs Non-Human mRNAs from GenBank mRNA and EST Description This track displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank from organisms other than human. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) for this track is in two parts. The first + indicates the orientation of the query sequence whose translated protein produced the match (here always 5' to 3', hence +). The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --, nor is +- the same as -+. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display. For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. 
For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods The mRNAs were aligned against the human genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. 
BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoRefGene Other RefSeq Non-Human RefSeq Genes Genes and Gene Predictions Description This track shows known protein-coding and non-protein-coding genes for organisms other than human, taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods The RNAs were aligned against the human genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 25% base identity with the genomic sequence were kept. 
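The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the UCSC pipeline code: the dictionary layout, the function name, and the reading of "within 0.5% of the best" as an absolute difference in percent identity are all assumptions.

```python
def filter_alignments(alignments, near_best=0.5, min_identity=25.0):
    """Keep alignments of one RNA that are near the best and above a floor.

    alignments   : list of dicts with an "identity" key holding percent base
                   identity (hypothetical layout chosen for this example)
    near_best    : maximum allowed gap, in percentage points, below the best
                   identity seen for this RNA (assumed interpretation)
    min_identity : minimum percent base identity with the genomic sequence
    """
    if not alignments:
        return []
    best = max(a["identity"] for a in alignments)
    return [
        a for a in alignments
        if a["identity"] >= best - near_best and a["identity"] >= min_identity
    ]


# Hypothetical alignments of a single RNA to four genomic locations.
hits = [
    {"chrom": "chr1", "identity": 98.0},
    {"chrom": "chr7", "identity": 97.6},
    {"chrom": "chr9", "identity": 97.4},
    {"chrom": "chrX", "identity": 20.0},
]
kept = filter_alignments(hits)
print([h["chrom"] for h in kept])
```

Both thresholds are left as parameters because other tracks in this collection describe the same rule with different cutoffs.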
Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 picTar PicTar miRNA MicroRNA Target Sites in 3' UTRs Predicted by PicTar Regulation Description This track shows microRNA target sites in 3' UTRs as predicted by PicTar, based on the RefSeq annotation of 3' UTRs. Methods The original PicTar algorithm was published in Krek et al., 2005. The annotations displayed in this track are updated predictions as published in Lall et al., 2006. PicTar is a hidden Markov model that assigns probabilities to 3' UTR subsequences as a binding site for a microRNA, considers all possible ways the 3' UTR could be bound by microRNAs, and then uses a maximum likelihood method to compute the optimal likelihood under which the 3' UTR could be explained by microRNAs and background. The score is this likelihood divided by background, i.e., the local base composition of each 3' UTR is taken into account. To fit the track conventions of the UCSC browser (integers), all scores were scaled by the maximum score of all microRNA 3'-UTR scores observed. Note that the PicTar algorithm scores any 3' UTR that has at least one aligned conserved predicted binding site for a microRNA, but then incorporates all possible binding sites into the score, even if they appear to be non-conserved. 
Because the score for a 3' UTR is a "phylo" average over all orthologous 3' UTRs used, "scattered" sites that appear in many species may boost the score, and individual sites shown in the display may not be aligned and conserved in all species under consideration. Two levels of conservation can be chosen: -- conservation among four vertebrates: human, mouse, rat, and dog -- conservation among five vertebrates: human, mouse, rat, dog, and chicken The latter setting gives higher-quality predictions but lower sensitivity. For a detailed analysis of signal-to-noise ratios and sensitivity, please refer to Lall et al., 2006. Credits Thanks to Dominic Grün, Yi-Lu Wang, and Nikolaus Rajewsky for providing this annotation. More detailed information about individual predictions, including links to other databases, can be found on the PicTar website, a project of the Rajewsky lab while at the New York University Center for Comparative Functional Genomics. References Krek A, Grun D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, Rajewsky N. Combinatorial microRNA target predictions. Nat Genet. 2005 May;37(5):495-500. Lall S, Grun D, Krek A, Chen K, Wang YL, Dewey CN, Sood P, Colombo T, Bray N, Macmenamin P, Kao HL, Gunsalus KC, Pachter L, Piano F, Rajewsky N. A genome-wide map of conserved microRNA targets in C. elegans. Curr Biol. 2006 Mar 7;16(5):460-71. 
picTarMiRNA5Way PicTar 5 Species PicTar microRNA Sites, 5 species Conservation Constraint: Human/Mouse/Rat/Dog/Chicken Regulation picTarMiRNA4Way PicTar 4 Species PicTar microRNA Sites, 4 Species Conservation Constraint: Human/Mouse/Rat/Dog Regulation encodePseudogene Pseudogenes ENCODE Pseudogene Predictions - All ENCODE Regions ENCODE Regions and Genes Description This track shows the pseudogenes located in ENCODE regions generated by five different methods (Yale Pipeline, GenCode manual annotation, two different UCSC methods, and Gene Identification Signature (GIS)), as well as a consensus pseudogenes subtrack based on the pseudogenes from all five methods. Datasets are displayed in separate subtracks within the annotation and are individually described below. The annotations are colored as follows: Type Color Description Processed_pseudogene pink Pseudogenes arising via retrotransposition (exon structure of parent gene lost) Unprocessed_pseudogene blue Pseudogenes arising via gene duplication (exon structure of parent gene retained) Pseudogene_fragment light blue Pseudogene sequences that are single-exon and cannot be confidently assigned to either the processed or the duplicated category Undefined gray Consensus Pseudogenes Description This subtrack shows pseudogenes derived from a consensus of the five methods listed above. In the pseudogene.org data freeze dated 6 Jan. 2006, 201 consensus pseudogenes were found. Here, pseudogenes are defined as genomic sequences that are similar to known genes but exhibit various inactivating disablements (e.g. premature stop codons or frameshifts) in their putative protein-coding regions and are flagged as either recently-processed or non-processed. Methods The pseudogene sets were processed as follows: Step I: The four data sets were filtered to remove pseudogenes that overlap with current Gencode coding exons/loci. Pseudogenes overlapping with introns or noncoding genes were kept.
Subsequent filtering of pseudogene sets, excluding the Havana set, removed pseudogenes overlapping with exons of UCSC Known Genes. Step II: A union of the pseudogenes from each filtered set was created. If a pseudogenic region was annotated by more than one group, the lowest starting coordinate and highest ending coordinate were used as the boundaries. Step III: A parent protein for each pseudogene in the union was assigned using a protein set from UniProt. Pseudogenes without a matching protein were excluded. Step IV: Each pseudogene was realigned to its parent protein. Step V: The consensus list of pseudogenes was updated with boundaries derived from the alignment in Step IV. Step VI: The consensus list of pseudogenes was updated with the assigned parent proteins and new classifications (processed or non-processed). Verification of the Consensus Pseudogenes All pseudogenes in the list have been extensively curated by Adam Frankish and Jennifer Harrow at the Wellcome Trust Sanger Institute. References More information about this data set is available from pseudogene.org/ENCODE. Havana-Gencode Annotated Pseudogenes and Immunoglobulin Segments Description This track shows pseudogenes annotated by the HAVANA group at the Wellcome Trust Sanger Institute. Pseudogenes have homology to protein sequences but generally have a disrupted CDS. For all annotated pseudogenes, an active homologous gene (the parent) can be identified elsewhere in the genome. Pseudogenes are classified as processed or unprocessed. Methods Prior to manual annotation, finished sequence is submitted to an automated analysis pipeline for similarity searches and ab initio gene predictions. The searches are run on a computer farm and stored in an Ensembl MySQL database using the Ensembl analysis pipeline system (Searle et al., 2004, Harrow et al., 2006).
A pseudogene is annotated where the total length of the protein homology to the genomic sequence is >20% of the length of the parent protein or >100 aa in length, whichever is shortest. If a gene structure has an ORF but has lost the structure of the parent gene, a pseudogene is annotated provided there is no evidence of transcription from the pseudogene locus. When an open but truncated reading frame is present, other evidence is used (for example, 3' genomic polyA tract) to allow classification as a pseudogene. When a parent gene has only a single coding exon (e.g. olfactory receptors), a small 5' or 3' truncation to the CDS at the pseudogene locus (compared to other family members) is sufficient to confirm pseudogene status where the truncation is predicted to significantly affect secondary structure by the literature and/or expert community. Processed and unprocessed pseudogenes are distinguished on the basis of structure and genomic context. Processed pseudogenes, which arise via retrotransposition, lose the intron-exon structure of the parent gene, often have an A-rich tract indicative of the insertion site at their 3' end, and are flanked by different genomic sequence to the parent gene. Unprocessed pseudogenes, which arise via gene duplication, share both the intron-exon structure and flanking genomic sequence with the parent gene. Transcribed pseudogenes are indicated by the annotation of a pseudogene and transcript variant alongside each other. References Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, et al. GENCODE: Producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9. Searle SM, Gilbert J, Iyer V, Clamp M. The otter annotation system. Genome Res. 2004 May;14(5):963-70. Yale Pseudogenes Description This subtrack shows pseudogenes in the ENCODE regions identified by the Yale Pseudogene Pipeline. 
In this analysis, pseudogenes are defined as genomic sequences that are similar to known genes with various inactivating disablements (e.g. premature stop codons or frameshifts) in their putative protein-coding regions. Pseudogenes are flagged as recently processed, recently duplicated, or of uncertain origin (either ancient fragments or resulting from a single-exon parent). Methods Step I: Repeat-masked human genome sequence was used as the target for a six-frame TBLASTN where the query was the nonredundant human proteome set (European Bioinformatics Institute). Only high-quality human protein sequences from SWISS-PROT and TrEMBL were used, because this set included processed or duplicated pseudogenes. Step II: BLAST hits that had a significant overlap with annotated multiple-exon Ensembl genes were removed from consideration. Step III: The set of BLAST hits was reduced by selecting hits in decreasing significance level and removing matches that overlapped by more than 10 amino acids or 30 bp with a picked match. Step IV: Adjacent matches on a chromosome were merged together if they were thought to belong to the same pseudogene locus. Merged matches were extended on both sides to include the length of the query protein to which they matched along with an extra 30 bp buffer on either side. Step V: The FASTA program was used to re-align these extended hits to the genome. Redundant hits were removed and hits with gaps greater than 60 bp were split into two alignments. Step VI: Alignments with possible artifactual frameshifts or stop codons introduced by the alignment process were closely inspected. Step VII: False positives (E-value less than 10^-10 or amino acid sequence of less than 40% identity) and sequences matching protein queries containing repeats or low-complexity regions were removed. Potential functional genes were also removed.
These were defined as having no frameshift disruptions, less than 95% sequence identity to the query protein, and translatable to a protein sequence longer than 95% of the length of the query protein. Step VIII: The remaining putative pseudogene sequences were classified based on several criteria. The intron-exon structure of the functional gene was further used to infer whether a pseudogene was recently duplicated or processed. A duplicated pseudogene retains the intron-exon structure of its parent functional gene, whereas a processed pseudogene shows evidence that this structure has been spliced out. Those sequences where the insertions were 50% or more repeats (as detected by RepeatMasker) are "Disrupted" processed pseudogenes. Small pseudogene sequences that cannot be confidently assigned to either the processed or duplicated category may be ancient fragments. Further details can be found in the references below. Verification of Yale Pseudogenes All pseudogenes in the list have been manually checked. References Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 2003 Dec;13(12):2541-58. Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M. Integrated pseudogene annotation for human chromosome 22: evidence for transcription. J Mol Biol. 2005 May 27;349(1):27-45. UCSC Retrogene Predictions Description The Retrogene subtrack shows processed mRNAs that have been inserted back into the genome since the mouse/human split. Retrogenes can be functional genes that have acquired a promoter from a neighboring gene, non-functional pseudogenes, or transcribed pseudogenes. Methods Step I: All GenBank mRNAs for a particular species were aligned to the genome using blastz. Step II: mRNAs that aligned twice in the genome (once with introns and once without introns) were initially screened. 
Step III: A series of features were scored to determine candidates for retrotransposition events. These features included position and length of the polyA tail, degree of synteny with mouse, coverage of repetitive elements, number of exons that can still be aligned to the retroGene, and degree of divergence from the parent gene. Retrogenes are classified using a threshold score function that is a linear combination of this set of features. Retrogenes in the final set have a score threshold greater than 425 based on a ROC plot against the Vega annotated pseudogenes. The "type" field has four possible values: singleExon: the parent gene is a single exon gene mrna: the parent gene is a spliced mRNA that has no annotation in NCBI refSeq, UCSC knownGene or Mammalian Gene Collection (MGC) annotated: the parent gene has been annotated by one of refSeq, knownGene or MGC expressed: an mRNA overlaps the retrogene, indicating probable transcription These features can be downloaded from the table pseudoGeneLink in many formats using the Table Browser option on the menubar. References Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 2003 Sep 30;100(20):11484-9. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. UCSC Pseudogene Predictions Methods Step I: A set of pre-aligned human known genes was mapped across the human genome through the human Blastz Self Alignment using HomoMap (homologous mapping method). The fragments identified by HomoMap are homologs of genes from the Known Genes set. Step II: Each homologous fragment was compared with its known reference gene and a set of features was then collected. The features included sequence identity, Ka/Ks ratio (nonsynonymous substitution per codon vs.
synonymous substitution per codon), splicing sites, and the number of premature stop codons. These homologous fragments are either genes or pseudogenes. Step III: Homologous fragments that overlapped known reference genes were labeled as positive samples; those overlapping known pseudogenes were labeled as negative samples. Step IV: These positive and negative sets were used to train support vector machines (SVMs) to separate coding fragments from pseudo fragments. The trained SVMs were used to classify all homologous fragments into potential coding elements or potential pseudo elements. Step V: Finally, a heuristic filter was used to correct some misclassified fragments and to generate the final potential pseudogene set. GIS-PET Pseudogene Predictions Description This subtrack shows retrotransposed pseudogenes predicted by multiple mapped GIS-PETs (gene identification signature paired-end ditags) collected from two different cancer cell lines, HCT116 and MCF7. A total of 49 non-redundant processed pseudogenes predicted in the ENCODE regions are presented in this dataset. Each pseudogene is labeled with an ID of the format AAA-GISPgene-XX, where "AAA" indicates the parental gene name, "GISPgene" is the GIS pseudogene, and "XX" is the unique ID for each pseudogene. Methods PETs were generated from full-length transcripts and computationally mapped onto the human genome to demarcate the transcript start and end positions. The PETs that mapped to multiple genome locations were grouped into PET-based gene families that include parent gene and pseudogenes. A representative member (the shortest PET as defined by genomic coordinates) was selected from each family. This representative PET was aligned to the hg17 genome using in order to identify all the putative pseudogenes at the whole genome level. All hits with an identity >=70% and coverage >=50% within ENCODE regions were reported. In this context, "coverage" refers to alignment coverage of the query sequence, i.e.
a measure of how complete the predicted pseudogene is relative to the query sequence. Verification of GIS-PET Pseudogene Predictions Pseudogenes were verified by manual examination. Credits These data were generated by the ENCODE Pseudogene Annotation group: Jennifer Harrow, Wei Chia-Lin, Siew Woh Choo, Adam Frankish, Robert Baertsch, France Denoeud, Deyou Zheng, Yontao Lu, Alexandre Reymond, Roderic Guigo Serra, Tom Gingeras, Suganthi Balasubramanian and Mark Gerstein. encodePseudogeneGIS GIS Pseudogenes Genome Institute of Singapore (GIS) Pseudogenes ENCODE Regions and Genes encodePseudogeneUcsc2 UCSC Pseudogenes UCSC Pseudogene Predictions ENCODE Regions and Genes encodePseudogeneUcsc UCSC Retrogenes UCSC Retrogene Predictions ENCODE Regions and Genes encodePseudogeneYale Yale Pseudogenes Yale Pseudogene Predictions ENCODE Regions and Genes encodePseudogeneHavana Havana-Gencode Pseudogenes Havana-Gencode Annotated Pseudogenes and Immunoglobulin Segments ENCODE Regions and Genes encodePseudogeneConsensus Consensus Pseudogenes Consensus of Yale, Havana-Gencode, UCSC and GIS ENCODE Pseudogenes ENCODE Regions and Genes recombRate Recomb Rate Recombination Rate from deCODE, Marshfield, or Genethon Maps (deCODE default) Mapping and Sequencing Description The recombination rate track represents calculated sex-averaged rates of recombination based on either the deCODE, Marshfield, or Genethon genetic maps. By default, the deCODE map rates are displayed. Female- and male-specific recombination rates, as well as rates from the Marshfield and Genethon maps, can also be displayed by choosing the appropriate filter option on the track description page. Methods The deCODE genetic map was created at deCODE Genetics and is based on 5,136 microsatellite markers for 146 families with a total of 1,257 meiotic events. For more information on this map, see Kong, et al., 2002. 
The Marshfield genetic map was created at the Center for Medical Genetics and is based on 8,325 short tandem repeat polymorphisms (STRPs) for 8 CEPH families consisting of 134 individuals with 186 meioses. For more information on this map, see Broman et al., 1998. The Genethon genetic map was created at Genethon and is based on 5,264 microsatellites for 8 CEPH families consisting of 134 individuals with 186 meioses. For more information on this map, see Dib et al., 1996. Each base is assigned the recombination rate calculated by assuming a linear genetic distance across the immediately flanking genetic markers. The recombination rate assigned to each 1 Mb window is the average recombination rate of the bases contained within the window. Using the Filter This track has a filter that can be used to change the map or gender-specific rate displayed. The filter is located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. To view a particular map or gender-specific rate, select the corresponding option from the "Map Distances" pulldown list. By default, the browser displays the deCODE sex-averaged distances. When you have finished configuring the filter, click the Submit button. Credits This track was produced at UCSC using data that are freely available for the Genethon, Marshfield, and deCODE genetic maps (see above links). Thanks to all who played a part in the creation of these maps. References Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet. 1998 Sep;63(3):861-9. PMID: 9718341; PMC: PMC1377399 Dib C, Fauré S, Fizames C, Samson D, Drouot N, Vignal A, Millasseau P, Marc S, Hazan J, Seboun E et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. 1996 Mar 14;380(6570):152-4. 
PMID: 8600387 Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G et al. A high-resolution recombination map of the human genome. Nat Genet. 2002 Jul;31(3):241-7. PMID: 12053178 rgdQtl RGD QTL Quantitative Trait Locus (from RGD) Phenotype and Disease Associations Description A quantitative trait locus (QTL) is a polymorphic locus that contains alleles which differentially affect the expression of a continuously distributed phenotypic trait. Usually a QTL is a marker described by statistical association to quantitative variation in the particular phenotypic trait that is thought to be controlled by the cumulative action of alleles at multiple loci. For a comprehensive review of QTL mapping techniques in the rat, see Rapp, J. (2000). Genetic Analysis of Inherited Hypertension in the Rat. Physiol. Rev., 90:135-172. Methods The annotation data file, human_QTL.gff, was downloaded from:        ftp://rgd.mcw.edu/pub/RGD_genome_annotations/human/archive/human_QTL.gff.020805 and processed to create two UCSC Genome Browser tables — rgdQtl and rgdQtlLink — to enable this track. Credits Thanks to the RGD for providing this annotation. RGD is funded by grant HL64541 entitled "Rat Genome Database", awarded to Dr. Howard J Jacob, Medical College of Wisconsin, from the National Heart Lung and Blood Institute (NHLBI) of the National Institutes of Health (NIH). encodeRikenCage Riken CAGE Riken CAGE - Predicted Gene Start Sites ENCODE Transcript Levels Description This track shows the number of 5' cap analysis gene expression (CAGE) tags that map to the genome on the "plus" and "minus" strands at a specific location. For clarity, only the first 5' nucleotide in the tag (relative to the transcript direction) is considered. Areas in which many tags map to the same region may indicate a significant transcription start site. 
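The counting rule in the Description above (only the first 5' nucleotide of each tag is considered, per strand) can be sketched as follows. This is a minimal illustration and not the RIKEN mapping pipeline: the input tuple layout, the 0-based half-open coordinates, and the function name cage_start_counts are assumptions made for the example.

```python
from collections import Counter


def cage_start_counts(tags):
    """Count CAGE tags by the position of their first 5' nucleotide.

    `tags` is an iterable of (chrom, start, end, strand) tuples in 0-based,
    half-open coordinates. This input layout is an assumption for the sketch.
    """
    counts = Counter()
    for chrom, start, end, strand in tags:
        # On the plus strand the 5' end is the leftmost aligned base;
        # on the minus strand it is the rightmost aligned base.
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, strand, five_prime)] += 1
    return counts


# Hypothetical tags: two plus-strand tags share a 5' end at position 100.
tags = [
    ("chr1", 100, 120, "+"),
    ("chr1", 100, 121, "+"),
    ("chr1", 80, 101, "-"),
]
counts = cage_start_counts(tags)
```

Positions with high counts correspond to the tall blocks in the display and are the candidates for significant transcription start sites.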
Display Conventions and Configuration The position of the first 5' nucleotide in the tag is represented by a solid block. The height of the block indicates the number of 5' cDNA starts that map at that location. This composite annotation track contains multiple subtracks that may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods The CAGE tags are sequenced from the 5' ends of full-length cDNAs produced using RIKEN full-length cDNA technology. To create the tag, a linker was attached to the 5' end of full-length cDNAs which were selected by cap trapping. The first 20 bp of the cDNA were cleaved using class II restriction enzymes, followed by PCR amplification and then concatamers of the resulting 32 bp tags were formed for more efficient sequencing. For more information on CAGE analysis, see Shiraki et al. (2003) below. Refer to the RIKEN website for information about RIKEN full-length cDNA technologies. The mapping methodology employed in this annotation will be described in upcoming publications. Verification The techniques used to verify these data will be described in upcoming publications. Credits These data were contributed by the Functional Annotation of Mouse (FANTOM) Consortium, RIKEN Genome Science Laboratory and RIKEN Genome Exploration Research Group (Genome Network Project Core Group). FANTOM Consortium: P. Carninci, T. Kasukawa, S. Katayama, Gough, M. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner, S. Batalov, A. R. R. Forrest, M. Zavolan, M. J. Davis, L. G. Wilming, V. Aidinis, J. Allen, A. Ambesi-Impiombato, R. Apweiler, R. N. 
Aturaliya, T. L. Bailey, M. Bansal, K. W. Beisel, T. Bersano, H. Bono, A. M. Chalk, K. P. Chiu, V. Choudhary, A. Christoffels, D. R. Clutterbuck, M. L. Crowe, E. Dalla, B. P. Dalrymple, B. de Bono, G. Della Gatta, D. di Bernardo, T. Down, P. Engstrom, M. Fagiolini, G. Faulkner, C. F. Fletcher, T. Fukushima, M. Furuno, S. Futaki, M. Gariboldi, P. Georgii-Hemming, T. R. Gingeras, T. Gojobori, R. E. Green, S. Gustincich, M. Harbers, V. Harokopos, Y. Hayashi, S. Henning, T. K. Hensch, N. Hirokawa, D. Hill, L. Huminiecki, M. Iacono, K. Ikeo, A. Iwama, T. Ishikawa, M. Jakt, A. Kanapin, M. Katoh, Y. Kawasawa, J. Kelso, H. Kitamura, H. Kitano, G. Kollias, S. P. T. Krishnan, A.F. Kruger, K. Kummerfeld, I. V. Kurochkin, L. F. Lareau, L. Lipovich, J. Liu, S. Liuni, S. McWilliam, M. Madan Babu, M. Madera, L. Marchionni, H. Matsuda, S. Matsuzawa, H. Miki, F. Mignone, S. Miyake, K. Morris, S. Mottagui-Tabar, N. Mulder, N. Nakano, H. Nakauchi, P. Ng, R. Nilsson, S. Nishiguchi, S. Nishikawa, F. Nori, O. Ohara, Y. Okazaki, V. Orlando, K. C. Pang, W. J. Pavan, G. Pavesi, G. Pesole, N. Petrovsky, S. Piazza, W. Qu, J. Reed, J. F. Reid, B. Z. Ring, M. Ringwald, B. Rost, Y. Ruan, S. Salzberg, A. Sandelin, C. Schneider, C. Schoenbach, K. Sekiguchi, C. A. M. Semple, S. Seno, L. Sessa, Y. Sheng, Y. Shibata, H. Shimada, K. Shimada, B. Sinclair, S. Sperling, E. Stupka, K. Sugiura, R. Sultana, Y. Takenaka, K. Taki, K. Tammoja, S. L. Tan, S. Tang, M. S. Taylor, J. Tegner, S. A. Teichmann, H. R. Ueda, E. van Nimwegene, R. Verardo, C. L. Wei, K. Yagi, H. Yamanishi, E. Zabarovsky, S. Zhu, A. Zimmer, W. Hide, C. Bult, S. M. Grimmond, R. D. Teasdale, E. T. Liu, V. Brusic, J. Quackenbush, C. Wahlestedt, J. Mattick, D. Hume. RIKEN Genome Exploration Research Group: C. Kai, D. Sasaki, Y. Tomaru, S. Fukuda, M. Kanamori-Katayama, M. Suzuki, J. Aoki, T. Arakawa, J. Iida, K. Imamura, M. Itoh, T. Kato, H. Kawaji, N. Kawagashira, T. Kawashima, M. Kojima, S. Kondo, H. Konno, K. Nakano, N. Ninomiya, T. 
Nishio, M. Okada, C. Plessy, K. Shibata, T. Shiraki, S. Suzuki, M. Tagami, K Waki, A. Watahiki, Y. Okamura-Oho, H. Suzuki, J. Kawai. General Organizer: Y. Hayashizaki References Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodzius, R., Watahiki, A., Nakamura, M. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 100(26), 15776-81 (2003). encodeRikenCageMinus Riken CAGE - Riken CAGE Minus Strand - Predicted Gene Start Sites ENCODE Transcript Levels encodeRikenCagePlus Riken CAGE + Riken CAGE Plus Strand - Predicted Gene Start Sites ENCODE Transcript Levels encodeSangerGenoExprAssociation Sanger Assoc Sanger Genotype-Expression Association ENCODE Variation Description This track displays associations among gene expression data from the 60 unrelated Centre d'Etude du Polymorphisme Humain (CEPH) individuals of the International HapMap Project with SNPs genotyped by HapMap. The CEPH population is composed of Utah residents with ancestry from northern and western Europe. The expression data were generated with the Illumina platform at the Wellcome Trust Sanger Institute. Display Conventions and Configuration In the graphical display, an association is displayed as a block drawn at the location of the associated SNP. In pack or full modes, the name of the associated gene is drawn to the left of the block. The shading of the block indicates the strength of the association: light gray indicates a (-log10) P-value close to 0 and black indicates a P-value of 2 or more. Methods An association analysis was performed for each ENCODE RefSeq gene with the genotypes of SNPs in the same ENCODE region (cis). Expression values were initially log2 transformed and subsequently normalized with quantile normalization to ensure homogeneous levels between arrays. 
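The log2 transformation and quantile normalization mentioned above can be sketched in a few lines. This is an illustrative sketch, not the code used at the Sanger Institute: the function name quantile_normalize and the example values are hypothetical, and ties are broken by position, which is a simplification.

```python
import math


def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of expression values.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across all arrays.
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        result.append(normalized)
    return result


# Hypothetical raw intensities for two arrays and three genes.
raw = [[4.0, 16.0, 8.0], [2.0, 8.0, 32.0]]
logged = [[math.log2(v) for v in a] for a in raw]
normalized = quantile_normalize(logged)
```

After normalization every array contains the same set of values, which is what makes expression levels comparable between arrays.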
Analysis of variance (ANOVA) was then performed with 1 or 2 degrees of freedom (depending on whether only two or all three genotypes in the population were available), using the genotype as a categorical variable and the normalized/transformed expression values as the response. The values presented here are -log10 P-values. Verification There were six technical replicates for each sample; the average values from these were used for the ANOVA. Credits The following people contributed to this analysis: Barbara Stranger, Matthew Forrest, Panos Deloukas, and Manolis Dermitzakis from Wellcome Trust Sanger Institute and Simon Tavare from Cambridge University. References Dausset, J., Cann, H., Cohen, D., Lathrop, M., Lalouel, J.M. and White, R. Centre d'Etude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6(3), 575-7 (1990). encodeSangerChip Sanger ChIP Sanger ChIP-chip (histones H3,H4 ab in GM06990, K562, HeLa, HFL-1, MOLT4, and PTR8 cells) ENCODE Chromatin Immunoprecipitation Description ENCODE region-wide location analysis of H3 and H4 histones was conducted employing ChIP-chip using chromatin extracted from GM06990 (lymphoblastoid), K562 (myeloid leukemia-derived), HeLaS3 (cervix carcinoma), HFL-1 (embryonic lung fibroblast), MOLT-4 (lymphoblastic leukemia), and PTR8 cells. Experiments were conducted with antibodies to the following histone modifications: H3K4me1, H3K4me2, H3K4me3, H3K9me3, H3K27me3, H3K36me3, H3K79me3, H3ac, and H4ac, as well as to CTCF. Histone methylation and acetylation serve as a stable genomic imprint that regulates gene expression and other epigenetic phenomena. These histones are found in transcriptionally active domains called euchromatin. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. 
The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin from the cell line was cross-linked with 1% formaldehyde, precipitated with antibody binding to the histone, and sheared and hybridized to the Sanger ENCODE3.1.1 DNA microarray. DNA was not amplified prior to hybridization. The raw and transformed data files reflect fold enrichment over background, averaged over six replicates. Verification There are six replicates: two technical replicates (immunoprecipitations) for each of the three biological replicates (cell cultures). Raw and transformed (averaged) data can be downloaded from the Wellcome Trust Sanger Institute via the ENCODE data access web site or the ENCODE FTP site. Credits The data for this track were generated by the ENCODE investigators at the Wellcome Trust Sanger Institute, Hinxton, UK. 
encodeSangerChipH3K4me3Ptr8 SI H3K4me3 PTR8 Sanger Institute ChIP-chip (H3K4me3 ab, PTR8 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2Ptr8 SI H3K4me2 PTR8 Sanger Institute ChIP-chip (H3K4me2 ab, PTR8 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me1Ptr8 SI H3K4me1 PTR8 Sanger Institute ChIP-chip (H3K4me1 ab, PTR8 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH4acMolt4 SI H4ac MOLT4 Sanger Institute ChIP-chip (H4ac ab, MOLT4 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3acMolt4 SI H3ac MOLT4 Sanger Institute ChIP-chip (H3ac ab, MOLT4 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3Molt4 SI H3K4me3 MOLT4 Sanger Institute ChIP-chip (H3K4me3 ab, MOLT4 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2Molt4 SI H3K4me2 MOLT4 Sanger Institute ChIP-chip (H3K4me2 ab, MOLT4 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me1Molt4 SI H3K4me1 MOLT4 Sanger Institute ChIP-chip (H3K4me1 ab, MOLT4 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH4acHFL1 SI H4ac HFL-1 Sanger Institute ChIP-chip (H4ac ab, HFL-1 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3acHFL1 SI H3ac HFL-1 Sanger Institute ChIP-chip (H3ac ab, HFL-1 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3HFL1 SI H3K4me3 HFL-1 Sanger Institute ChIP-chip (H3K4me3 ab, HFL-1 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2HFL1 SI H3K4me2 HFL-1 Sanger Institute ChIP-chip (H3K4me2 ab, HFL-1 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me1HFL1 SI H3K4me1 HFL-1 Sanger Institute ChIP-chip (H3K4me1 ab, HFL-1 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH4acHeLa SI H4ac HeLa Sanger Institute ChIP-chip (H4ac ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3acHeLa SI H3ac HeLa Sanger Institute ChIP-chip (H3ac ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3HeLa SI 
H3K4me3 HeLa Sanger Institute ChIP-chip (H3K4me3 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2HeLa SI H3K4me2 HeLa Sanger Institute ChIP-chip (H3K4me2 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me1HeLa SI H3K4me1 HeLa Sanger Institute ChIP-chip (H3K4me1 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH4acK562 SI H4ac K562 Sanger Institute ChIP-chip (H4ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3acK562 SI H3ac K562 Sanger Institute ChIP-chip (H3ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3K562 SI H3K4me3 K562 Sanger Institute ChIP-chip (H3K4me3 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2K562 SI H3K4me2 K562 Sanger Institute ChIP-chip (H3K4me2 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCTCF SI CTCF GM06990 Sanger Institute ChIP-chip (CTCF ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K79me3 SI H3K79me3 GM06990 Sanger Institute ChIP-chip (H3K79me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K36me3 SI H3K36me3 GM06990 Sanger Institute ChIP-chip (H3K36me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K27me3 SI H3K27me3 GM06990 Sanger Institute ChIP-chip (H3K27me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K9me3 SI H3K9me3 GM06990 Sanger Institute ChIP-chip (H3K9me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH4ac SI H4ac GM06990 Sanger Institute ChIP-chip (H4ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3ac SI H3ac GM06990 Sanger Institute ChIP-chip (H3ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3 SI H3K4m3 GM6990 Sanger Institute ChIP-chip (H3K4me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2 SI H3K4m2 GM6990 Sanger Institute 
ChIP-chip (H3K4me2 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me1 SI H3K4m1 GM6990 Sanger Institute ChIP-chip (H3K4me1 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHits Sanger ChIP Hits Sanger ChIP-chip Hits and Peak Centers ENCODE Chromatin Immunoprecipitation Description This track displays hit regions and peak centers for Sanger ChIP-chip data, as identified by hidden Markov model (HMM) analysis. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Data for each replicate were normalized with the Tukey-Biweight Method using R (as recommended by NimbleGen). The log base 2 ratio of the normalized intensities was used for downstream data processing. A two-state HMM was used to analyze the data. The states of the HMM represent regions of the tile path corresponding to antibody binding locations. State emission probabilities were determined by comparing the cumulative distribution of the experimental data for each replicate on each ENCODE region to a fitted cumulative normal distribution. The fitted distribution was calculated using the Levenberg-Marquardt curve-fitting technique and six fitting points ranging from 0.05 to 0.45 of the cumulative distribution. Initial fitting parameters were set from the experimental data. This model is robust through a range of sensible transition probabilities. 
Bound regions were identified by finding the optimal state sequence from the HMM using the Viterbi algorithm, and the resulting region data was post-processed to develop the hit list. Hits were defined as contiguous portions of the tile path identified as bound by the HMM. The score of a hit was determined by taking the summation of the median enrichment values of the tiles in the contiguous portions (i.e. the area under the peak). For the purpose of this analysis, hits that were within 1000 base pairs of adjacent hits were combined into hit regions. The start position of the oligo with the highest enrichment value in the hit region was deemed the center of the peak. The ranking of hits was based on the total score of all hits in a hit region. It is recommended that analysis based on this data use the peak centers expanded to a convenient size for the analysis. Credits The ChIP-chip data were generated by Ian Dunham's lab at the Sanger Institute. Contacts: Ian Dunham and Christoph Koch. The HMM analysis was performed at the EBI by Paul Flicek. Raw data may be downloaded from the Sanger Institute website at ftp://ftp.sanger.ac.uk/pub/encode. 
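The post-processing described in the Methods above (scoring hits, combining hits within 1000 base pairs into hit regions, choosing a peak center, and ranking) can be sketched as follows. This is a minimal sketch under stated assumptions, not the EBI implementation: the input layout (start, end, list of (tile_start, median_enrichment) pairs), the inclusive treatment of the 1000 bp gap, and the name build_hit_regions are all hypothetical.

```python
def build_hit_regions(hits, max_gap=1000):
    """Combine nearby hits into hit regions, then score and rank them.

    `hits` is a list of (start, end, tiles), where `tiles` is a list of
    (tile_start, median_enrichment) pairs. This layout is an assumption.
    """
    regions = []
    for start, end, tiles in sorted(hits, key=lambda h: h[0]):
        # Hits within max_gap bp of the previous region are merged into it.
        if regions and start - regions[-1]["end"] <= max_gap:
            region = regions[-1]
            region["end"] = max(region["end"], end)
            region["tiles"].extend(tiles)
        else:
            regions.append({"start": start, "end": end, "tiles": list(tiles)})
    for region in regions:
        # Total score: sum of the median enrichment values of all tiles.
        region["score"] = sum(e for _, e in region["tiles"])
        # Peak center: start of the tile with the highest enrichment.
        region["center"] = max(region["tiles"], key=lambda t: t[1])[0]
    # Rank regions by total score, highest first.
    return sorted(regions, key=lambda r: r["score"], reverse=True)


# Hypothetical hits: the first two lie 750 bp apart and are combined.
hits = [
    (1000, 1150, [(1000, 1.2), (1050, 2.5), (1100, 1.0)]),
    (1900, 2000, [(1900, 0.8), (1950, 0.9)]),
    (5000, 5100, [(5000, 3.0), (5050, 1.5)]),
]
regions = build_hit_regions(hits)
```

For downstream analysis, the "center" values can then be expanded to a window of whatever size suits the analysis, as recommended above.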
encodeSangerChipCenterH4acHeLa SI H4ac HeLa Sanger Institute ChIP-chip Peak Centers (H4ac ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3acHeLa SI H3ac HeLa Sanger Institute ChIP-chip Peak Centers (H3ac ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me3HeLa SI H3K4me3 HeLa Sanger Institute ChIP-chip Peak Centers (H3K4me3 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me2HeLa SI H3K4me2 HeLa Sanger Institute ChIP-chip Peak Centers (H3K4me2 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me1HeLa SI H3K4me1 HeLa Sanger Institute ChIP-chip Peak Centers (H3K4me1 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH4acK562 SI H4ac K562 Sanger Institute ChIP-chip Peak Centers (H4ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3acK562 SI H3ac K562 Sanger Institute ChIP-chip Peak Centers (H3ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me3K562 SI H3K4me3 K562 Sanger Institute ChIP-chip Peak Centers (H3K4me3 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me2K562 SI H3K4me2 K562 Sanger Institute ChIP-chip Peak Centers (H3K4me2 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH4acGM06990 SI H4ac GM06990 Sanger Institute ChIP-chip Peak Centers (H4ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3acGM06990 SI H3ac GM06990 Sanger Institute ChIP-chip Peak Centers (H3ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me3GM06990 SI H3K4m3 GM6990 Sanger Institute ChIP-chip Peak Centers(H3K4me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me2GM06990 SI H3K4m2 GM6990 Sanger Institute ChIP-chip Peak Centers(H3K4me2 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipCenterH3K4me1GM06990 SI H3K4m1 
GM6990 Sanger Institute ChIP-chip Peak Centers (H3K4me1 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH4acHeLa SI H4ac HeLa Sanger Institute ChIP-chip Hits (H4ac ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3acHeLa SI H3ac HeLa Sanger Institute ChIP-chip Hits (H3ac ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me3HeLa SI H3K4me3 HeLa Sanger Institute ChIP-chip Hits (H3K4me3 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me2HeLa SI H3K4me2 HeLa Sanger Institute ChIP-chip Hits (H3K4me2 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me1HeLa SI H3K4me1 HeLa Sanger Institute ChIP-chip Hits (H3K4me1 ab, HeLa cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH4acK562 SI H4ac K562 Sanger Institute ChIP-chip Hits (H4ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3acK562 SI H3ac K562 Sanger Institute ChIP-chip Hits (H3ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me3K562 SI H3K4me3 K562 Sanger Institute ChIP-chip Hits (H3K4me3 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me2K562 SI H3K4me2 K562 Sanger Institute ChIP-chip Hits (H3K4me2 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH4acGM06990 SI H4ac GM06990 Sanger Institute ChIP-chip Hits (H4ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3acGM06990 SI H3ac GM06990 Sanger Institute ChIP-chip Hits (H3ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me3GM06990 SI H3K4m3 GM6990 Sanger Institute ChIP-chip (H3K4me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me2GM06990 SI H3K4m2 GM6990 Sanger Institute ChIP-chip Hits (H3K4me2 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipHitH3K4me1GM06990 SI H3K4m1 GM6990 Sanger Institute ChIP-chip 
Hits (H3K4me1 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation chainSelf Self Chain Human Chained Self Alignments Repeats Description This track shows alignments of the human genome with itself, using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. The system can also tolerate gaps in both sets of sequence simultaneously. After filtering out the "trivial" alignments produced when identical locations of the genome map to one another (e.g. chrN mapping to chrN), the remaining alignments point out areas of duplication within the human genome. The pseudoautosomal regions of chrX and chrY are an exception: in this assembly, these regions have been copied from chrX into chrY, resulting in a large amount of self chains aligning in these positions on both chromosomes. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the query assembly or an insertion in the target assembly. Double lines represent more complex gaps that involve substantial sequence in both the query and target assemblies. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one of the assemblies. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. Chains have both a score, and a normalized score. The score is derived by comparing sequence similarity, while penalizing both mismatches and gaps in a per base fashion. This leads to longer chains having greater scores, even if a smaller chain provides a better match. The normalized score divides the score by the length of the alignment, providing a more comparable score value not dependent on the match length. 
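The difference between the raw and normalized chain scores can be illustrated with a short calculation. The function name and the example numbers below are hypothetical; the sketch only assumes, as stated above, that the normalized score is the score divided by the length of the alignment.

```python
def normalized_score(score, aligned_bases):
    """Return the chain score per aligned base."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return score / aligned_bases


# Hypothetical chains: the long chain has the higher raw score,
# but the short chain is the better match per aligned base.
long_chain = normalized_score(90000, 3000)
short_chain = normalized_score(45000, 500)
```

Here the long chain scores 30.0 per base and the short chain 90.0, so ranking by normalized score reverses the ranking by raw score.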
Display Conventions and Configuration By default, the chains are colored by the normalized score. This can be changed to color based on which chromosome they map to in the aligning organism. There is also an option to color all the chains black. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to Filter by chromosome. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Methods The genome was aligned to itself using blastz. Trivial alignments were filtered out, and the remaining alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single target chromosome and a single query chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. Chains scoring below a threshold were discarded; the remaining chains are displayed in this track. Credits Blastz was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains were generated by Robert Baertsch and Jim Kent. References Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 
2003 Sep 30;100(20):11484-9. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. sangamoDnaseHs SGMO/EIO CD34 HS Sangamo - Eur. Inst. Oncology DNase Hypersensitive Sites Regulation Description Genes in metazoa are controlled by a complex array of cis-regulatory elements that include core and distal promoters, enhancers, insulators, silencers etc (Levine and Tjian, 2003). A unifying feature of such elements active in cell nuclei is a chromatin-based epigenetic signature known as a nuclease hypersensitive site (Elgin, 1988; Gross and Garrard, 1988; Wolffe, 1998). This track presents the results of a collaboration between Sangamo BioSciences, Inc. and the European Institute of Oncology to isolate such regulatory elements from human CD34+ hematopoietic stem cells (Urnov et al., submitted). This effort made use of a method developed at Sangamo BioSciences to isolate such nuclease hypersensitive sites from living cells with minimal, if any, contamination from bulk DNA (Urnov et al., submitted; US patent pending). Display Conventions The track annotates the location of 3,314 cis-regulatory elements in human CD34+ cells in the human genome in the form of 40 bp tags. Note the method identifies a specific position in chromatin that is hypersensitive to nuclease, but does not map the boundaries of the regulatory element per se. A conservative estimate of element size would be that occupied by one nucleosome, i.e., 180 - 200 bp surrounding the tag, although there is precedent in the literature for nuclease hypersensitive sites that span more than the length of one nucleosome (Turner, 2001; Wolffe, 1998). Methods CD34+ cells (enriched in hematopoietic stem cells) were prepared from healthy donors following guidelines established by the Ethics Committee of the European Institute of Oncology (IEO), Milan. 
Mobilization of CD34+ cells to the peripheral blood was stimulated by G-CSF treatment according to standard procedures. After mobilization, donors were subjected to leukapheresis, and 95% of the total cells. Cells were immediately used for the isolation of cis-regulatory DNA elements using the nuclease hypersensitive site isolation protocol developed at Sangamo (Urnov et al., submitted). Verification In collaboration with scientists at the J. Craig Venter Institute and the European Institute of Oncology, the method was initially validated on human tissue culture cells by examining the colocalization of DNA fragments isolated from cells with experimentally determined nuclease hypersensitive sites in chromatin as mapped by indirect end-labeling and Southern blotting (Nedospasov and Georgiev, 1980; Wu, 1980). Nineteen out of nineteen randomly chosen clones from those libraries represented bona fide DNase I hypersensitive sites in chromatin (Urnov et al., submitted). These data confirmed that the method yields very high-content libraries of active cis-regulatory DNA elements, supporting its application to human CD34+ cells. Analysis of libraries of cis-regulatory elements prepared using this method from CD34+ cells showed that 50 out of 55 randomly chosen clones (91%) coincided with DNase I hypersensitive sites (Urnov et al., submitted). Credits The library of regulatory DNA elements from human CD34+ cells was prepared, sequenced, and validated by Saverio Minucci and colleagues at the European Institute of Oncology, using a method developed by Fyodor Urnov, Alan Wolffe, and colleagues at Sangamo BioSciences, and validated in collaboration with Sam Levy and colleagues (J. Craig Venter Institute). References Elgin SC. The formation and function of DNase I hypersensitive sites in the process of gene activation. J Biol Chem. 1988 Dec 25;263(36):19259-62. Gross DS, Garrard WT. Nuclease hypersensitive sites in chromatin. Ann Rev Biochem. 1988;57:159-197. 
Levine M, Tjian R. Transcription regulation and animal diversity. Nature. 2003 Jul 10;424(6945):147-51. Nedospasov S, Georgiev G. Non-random cleavage of SV40 DNA in the compact minichromosome and free in solution by micrococcal nuclease. Biochem Biophys Res Commun. 1980 Jan 29;92(2):532-9. Turner BM. Chromatin and Gene Regulation: Mechanisms in Epigenetics. Blackwell Publishers, Oxford. 2001. Urnov FD, Minucci S, Levy S et al. Genome-wide chromatin-based isolation of active cis-regulatory DNA elements from human cells. Submitted. Wolffe AP. Chromatin Structure and Function. Academic Press, San Diego, CA. 1998. Wu C. The 5' ends of Drosophila heat shock genes in chromatin are hypersensitive to DNase I. Nature. 1980 Aug 28;286(5776):854-60. sgpGene SGP Genes SGP Gene Predictions Using Mouse/Human Homology Genes and Gene Predictions Description This track shows gene predictions from the SGP2 homology-based gene prediction program developed by Roderic Guigó's "Computational Biology of RNA Processing" group, which is part of the Centre de Regulació Genòmica (CRG) in Barcelona, Catalunya, Spain. To predict genes in a genomic query, SGP2 combines geneid predictions with tblastx comparisons of the genome of the target species against genomic sequences of other species (reference genomes) deemed to be at an appropriate evolutionary distance from the target. Credits Thanks to the "Computational Biology of RNA Processing" group for providing these data. wgRna sno/miRNA C/D and H/ACA Box snoRNAs, scaRNAs, and microRNAs from snoRNABase and miRBase Genes and Gene Predictions Description This track displays positions of four different types of RNA in the human genome: microRNAs from the miRBase at the Wellcome Trust Sanger Institute (WTSI). 
small nucleolar RNAs (C/D box and H/ACA box snoRNAs) and Cajal body-specific RNAs (scaRNAs) from the snoRNABase maintained at the Laboratoire de Biologie Moléculaire Eucaryote. C/D box and H/ACA box snoRNAs are guides for the 2'-O-ribose methylation and the pseudouridylation, respectively, of rRNAs and snRNAs, although many of them have no documented target RNA. The scaRNAs guide modifications of the spliceosomal snRNAs transcribed by RNA polymerase II, and often contain both C/D and H/ACA domains. Display Conventions and Configuration This track follows the general display conventions for gene prediction tracks. The miRNA precursor forms (pre-miRNA) are represented by red blocks. C/D box snoRNAs, H/ACA box snoRNAs and scaRNAs are represented by blue, green and magenta blocks, respectively. At a zoomed-in resolution, arrows superimposed on the blocks indicate the sense orientation of the snoRNAs. Methods Precursor miRNA genomic locations from miRBase were calculated using wublastn for sequence alignment with the requirement of 100% identity. The extents of the precursor sequences were not generally known and were predicted based on base-paired hairpin structure. miRBase is described in Griffiths-Jones, S. (2004) and Weber, M.J. (2005) in the References section below. The snoRNAs and scaRNAs from the snoRNABase were aligned against the human genome using blat. Credits Genome coordinates for this track were obtained from the miRBase sequences FTP site and from snoRNABase coordinates download page. References When making use of these data, please cite the following articles in addition to the primary sources of the miRNA sequences: Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008 Jan 1;36(Database issue):D154-8. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D140-4. 
Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D109-11. Weber MJ. New human and mouse microRNA genes found by homology search. You may also want to cite The Wellcome Trust Sanger Institute miRBase and The Laboratoire de Biologie Moleculaire Eucaryote snoRNABase. The following publication provides guidelines on miRNA annotation: Ambros V. et al., A uniform system for microRNA annotation. RNA. 2003;9(3):277-9. encodeRecombHotspot SNP Recomb Hots Oxford Recombination Hotspots from ENCODE resequencing data ENCODE Variation Description This track shows the location of recombination hotspots detected from patterns of genetic variation. It is based on the HapMap ENCODE data, in the ten ENCODE regions that have been resequenced: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Recombination hotspot estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination hotspots would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Methods Recombination hotspots are identified using the likelihood-ratio test described in McVean et al. (2004) and Winckler et al. (2005), referred to as LDhot. 
For successive intervals of 200 kb, the maximum likelihood of a model with a constant recombination rate is compared to the maximum likelihood of a model in which the central 2 kb is a recombination hotspot (likelihoods are approximated by the composite likelihood method of Hudson 2001). The observed difference in log composite likelihood is compared against the null distribution, which is obtained by simulations. Simulations are matched for sample size, SNP density, background recombination rate and an approximation to the ascertainment scheme (a panel of 12 individuals with a Poisson number of chromosomes, mean 1, sampled from this panel, using a single hit ascertainment scheme for dbSNP and resequencing of 16 individuals for the 10 HapMap ENCODE regions). Evidence for a hotspot was assessed in each analysis panel separately (YRI, CEU and combined CHB+JPT), and p-values were combined such that a hotspot requires that two of the three populations show some evidence of a hotspot (p < 0.05) and at least one population showed stronger evidence for a hotspot (p < 0.01). Hotspot centers were estimated at those locations where distinct recombination rate estimate peaks occurred with at least a factor of two separation between peaks, within the low p-value intervals. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The data are based on HapMap release 16a. The recombination hotspots were ascertained by Simon Myers from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. 
Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Hudson, R.R. Two-locus sampling distributions and their application. Genetics 159(4), 1805-17 (2001). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). encodeRecombRate SNP Recomb Rates Oxford Recombination Rates from ENCODE resequencing data ENCODE Variation Description This track shows recombination rates measured in centiMorgans/Megabase in ten ENCODE regions that have been resequenced: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. 
This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Fine-scale recombination rate estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination rate variation would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Display Conventions and Configuration This annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page. For more information, click the Graph configuration help link. Methods Fine-scale recombination rates are estimated using the reversible-jump Markov chain Monte Carlo method (McVean et al., 2004). This approach explores the posterior distribution of fine-scale recombination rate profiles, where the state-space considered is the distribution of piece-wise constant recombination maps. The Markov chain explores the distribution of both the number and location of change-points, in addition to the rates for each segment. A prior is set on the number of change-points that increases the smoothing effect of trans-dimensional MCMC, which is necessary because of the composite-likelihood scheme employed. This method is implemented in the package LDhat, which includes full details of installation and implementation. For the ENCODE regions, a block-penalty of 5 was used (calibrated by simulation and comparison to data from sperm-typing studies). Each region was analyzed as a single run with 10,000,000 iterations, sampling every 5000th iteration and discarding the first third of all samples as burn-in. The mean posterior rate for each SNP interval is the value reported. 
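The sampling arithmetic in the Methods above can be sketched in a few lines of Python. This is only an illustration of thinning, burn-in removal, and posterior averaging as described in the text; it is not the LDhat implementation, and the function name and the list-of-lists sample layout are assumptions made for the example.

```python
from statistics import fmean

# Figures quoted in the Methods text.
TOTAL_ITERATIONS = 10_000_000
THINNING = 5000
N_SAMPLES = TOTAL_ITERATIONS // THINNING   # 2000 retained samples per run
N_BURN_IN = N_SAMPLES // 3                 # first third discarded (666 samples)


def posterior_mean_rates(samples):
    """Average per-interval rates over the post-burn-in samples.

    `samples` is a hypothetical layout: one list of rates (one value per
    SNP interval) for each retained MCMC sample, in sampling order.
    """
    n_discard = len(samples) // 3          # drop the first third as burn-in
    kept = samples[n_discard:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    n_intervals = len(kept[0])
    return [fmean(sample[i] for sample in kept) for i in range(n_intervals)]
```

With the run length stated above, 1334 of the 2000 samples would contribute to each reported mean.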
Because of the non-independence of the composite likelihood scheme, the quantiles of the sampling distribution do not reflect true uncertainty and are therefore not given. Estimates were generated separately from each of the four ENCODE resequencing populations, and then combined to give a single figure. Differences between populations are not significant. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The data are based on HapMap release 16. The recombination rates were ascertained by Gil McVean from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). 
Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). encodeStanfordMeth Stanf Meth Stanford Methylation Digest: Be2C, CRL1690, HCT116, HT1080, HepG2, JEG3, Snu182, U87 ENCODE Chromosome, Chromatin and DNA Structure Description This track displays experimentally determined regions of unmethylated CpGs in the ENCODE regions. These experiments were performed in eight cell lines, each of which is displayed as a separate subtrack: Cell LineClassificationIsolated From BE(2)-Cneuroblastomabrain (metastatic, from bone marrow) CRL-1690™hybridomaB lymphocyte HCT 116colorectal carcinomacolon HT-1080fibrosarcomaconnective tissue HepG2hepatocellular carcinomaliver JEG-3choriocarcinomaplacenta SNU-182hepatocellular carcinomaliver U-87 MGglioblastoma-astrocytomabrain Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods High molecular weight genomic DNA was prepared from each cell line. The genomic DNA was digested with a cocktail of six methyl-sensitive restriction enzymes (AciI, HhaI, BstUI, HpaII, HgaI, and HpyCH4IV) and size selected to deplete the genome of unmethylated regions. Digested and undigested DNA (control) were amplified, labeled, and hybridized to oligo tiling arrays produced by NimbleGen. 
The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The value given for each array probe is the transformed mean ratio of undigested:digested genomic DNA. Higher scores in this track indicate regions that are more strongly methylated, due to the greater difference between the undigested and digested hybridization signals. Verification Three biological replicates and two technical replicates were done for each of the eight cell lines. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). Please contact David Johnson for further information regarding the methods and the data for this track. encodeStanfordMethU87 Stan Meth U87 Stanford Methylation Digest (U87 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSnu182 Stan Meth Snu182 Stanford Methylation Digest (Snu182 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethJEG3 Stan Meth JEG3 Stanford Methylation Digest (JEG3 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethHepG2 Stan Meth HepG2 Stanford Methylation Digest (HepG2 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethHT1080 Stan Meth HT1080 Stanford Methylation Digest (HT1080 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethHCT116 Stan Meth HCT116 Stanford Methylation Digest (HCT116 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethCRL1690 Stan Meth CRL1690 Stanford Methylation Digest (CRL1690 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethBe2C Stan Meth Be2C Stanford Methylation Digest (Be2C cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothed Stanf Meth Score Stanford Methylation Digest Smoothed Score ENCODE Chromosome, Chromatin and DNA Structure 
Description This track displays smoothed (sliding-window mean) scores for experimentally determined regions of unmethylated CpGs in the ENCODE regions. These experiments were performed in eight cell lines, each of which is displayed as a separate subtrack: Cell LineClassificationIsolated From BE(2)-Cneuroblastomabrain (metastatic, from bone marrow) CRL-1690™hybridomaB lymphocyte HCT 116colorectal carcinomacolon HT-1080fibrosarcomaconnective tissue HepG2hepatocellular carcinomaliver JEG-3choriocarcinomaplacenta SNU-182hepatocellular carcinomaliver U-87 MGglioblastoma-astrocytomabrain Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods High molecular weight genomic DNA was prepared from each cell line. The genomic DNA was digested with a cocktail of six methyl-sensitive restriction enzymes (AciI, HhaI, BstUI, HpaII, HgaI, and HpyCH4IV) and size selected to deplete the genome of unmethylated regions. Digested and undigested DNA (control) were amplified, labeled, and hybridized to oligo tiling arrays produced by NimbleGen. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The transformed mean ratios of undigested:digested genomic DNA for all probes were then smoothed by calculating a sliding-window mean. Windows of six neighboring probes (sliding two probes at a time) were used; within each window, the highest and lowest value were dropped, and the remaining four values were averaged. 
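The windowing step just described (six neighboring probes, sliding two probes at a time, dropping the highest and lowest value and averaging the remaining four) can be sketched as follows. This is a minimal illustration, not the Myers lab code; the function name is hypothetical, and the handling of probes at the ends of a region (here, incomplete windows are simply not emitted) is an assumption, since the text does not specify it.

```python
def smooth_trimmed_windows(values, window=6, step=2):
    """Sliding-window trimmed mean over probe values in genomic order.

    For each window, the single highest and single lowest value are
    dropped and the remaining values are averaged. Only complete
    windows are used (an assumption of this sketch).
    """
    if window < 3:
        raise ValueError("window must hold at least three probes")
    smoothed = []
    for start in range(0, len(values) - window + 1, step):
        chunk = sorted(values[start:start + window])
        inner = chunk[1:-1]                  # drop lowest and highest value
        smoothed.append(sum(inner) / len(inner))
    return smoothed
```

Dropping the extremes before averaging makes each window mean less sensitive to a single outlying probe, which is presumably why a trimmed mean was chosen over a plain one.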
In order to increase the contrast between high and low values for visual display, the average was converted to a score by the formula: score = 8^(average) * 10 These scores are for visualization purposes; for all analyses, the raw ratios, which are available in the Stanf Meth track, should be used. Higher scores in this track indicate regions that are more strongly methylated, due to the greater difference between the undigested and digested hybridization signals. Verification Three biological replicates and two technical replicates were done for each of the eight cell lines. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). Please contact David Johnson for further information regarding the methods and the data for this track. encodeStanfordMethSmoothedU87 Stan Meth Sc U87 Stanford Methylation Digest Smoothed Score (U87 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedSnu182 Stan Meth Sc Snu182 Stanford Methylation Digest Smoothed Score (Snu182 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedJEG3 Stan Meth Sc JEG3 Stanford Methylation Digest Smoothed Score (JEG3 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedHepG2 Stan Meth Sc HepG2 Stanford Methylation Digest Smoothed Score (HepG2 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedHT1080 Stan Meth Sc HT1080 Stanford Methylation Digest Smoothed Score (HT1080 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedHCT116 Stan Meth Sc HCT116 Stanford Methylation Digest Smoothed Score (HCT116 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedCRL1690 Stan Meth Sc CRL1690 Stanford Methylation Digest Smoothed Score (CRL1690 cells) ENCODE Chromosome, Chromatin and DNA 
Structure encodeStanfordMethSmoothedBe2C Stan Meth Sc Be2C Stanford Methylation Digest Smoothed Score (Be2C cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordPromoters Stanf Promoter Stanford Promoter Activity ENCODE Transcript Levels Description This track displays activity levels of 643 putative promoter fragments in the ENCODE regions, based on high-throughput transient transfection luciferase reporter assays. The activity of each putative promoter is indicated by color, ranging from black (no activity) to red (strong activity). Each of the fragments was tested in a panel of 16 cell lines: Cell LineClassificationIsolated From AGSgastric adenocarcinomastomach BE(2)-Cneuroblastomabrain (metastatic, from bone marrow) T98G (CRL-1690)glioblastomabrain G-402renal leiomyoblastomakidney HCT 116colorectal carcinomacolon HMCBmelanomaskin HT-1080fibrosarcomaconnective tissue SK-N-SH (HTB-11)neuroblastomabrain (metastatic, from bone marrow) HeLaadenocarcinomacervix HepG2hepatocellular carcinomaliver JEG-3choriocarcinomaplacenta MG-63osteosarcomabone MRC-5fibroblastlung PANC-1epithelioid carcinomapancreas (duct) SNU-182hepatocellular carcinomaliver U-87 MGglioblastoma-astrocytomabrain Methods Promoters in the ENCODE region were predicted using a variation on methods previously described (Trinklein et al., 2003, Trinklein et al., 2004). Using BLAT alignments of human cDNAs in Genbank to the genome, those with at least one bp of exon overlap were merged, generating gene models. The transcription start sites were predicted by assigning the 5' end of each gene model as one transcription start site and alternative 5' ends that were at least 500 bp downstream and supported by full-length cDNAs as other start sites. Promoters were defined as the regions approximately 600 bp upstream and 100 bp downstream of each transcription start site. Primer3 was used to pick primers yielding approximately 500 bp amplicons containing the predicted transcription start site. 
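The promoter definition above (approximately 600 bp upstream and 100 bp downstream of each transcription start site) depends on strand, which is easy to get backwards. The sketch below shows one way to compute such an interval; the function name, the simple (start, end) coordinate convention, and the clamping at zero are assumptions for illustration and are not taken from the Stanford pipeline.

```python
def promoter_interval(tss, strand, upstream=600, downstream=100):
    """Return (start, end) of the promoter region around a TSS.

    "Upstream" means lower coordinates on the plus strand and higher
    coordinates on the minus strand. Coordinates are clamped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return max(start, 0), end
```

For example, a plus-strand TSS at position 1000 gives the interval (400, 1100), while a minus-strand TSS at the same position gives (900, 1600).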
Each fragment of DNA represented in this track was cloned into a luciferase reporter vector (pGL3-Basic, Promega) using the BD Clontech Infusion Cloning System. The Dual Luciferase system (Promega) was used to co-transfect the experimental DNA along with a control plasmid expressing Renilla - to control for variation in transcription efficiency - in 96-well format into one of the sixteen cell types using FuGENE Transfection Reagent (Roche). Each transfection was done in duplicate. Data are reported as normalized and log2 transformed averages of the Luciferase/Renilla ratio. This normalization was based on the activity of 102 random genomic fragments (negative controls) derived from exons and intergenic regions. Such a normalization allows for a meaningful comparison between cell types. The average log transformed Luciferase/Renilla ratio was scaled linearly to create a score where the maximum value is 1000 and the minimum value is 0. This score is arbitrary and for visualization purposes only; the raw ratio values should be used for all analyses. Verification Data were verified by repeating the preparation and measurement of 48 random fragments. No significant variation between the two preparations was detected. A spreadsheet containing the negative control data can be downloaded here. Credits This work was done in collaboration at the Myers Lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). The following people contributed: Sara J. Cooper, Nathan D. Trinklein, Elizabeth D. Anton, Loan Nguyen, and Richard M. Myers. References Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006 Jan;16(1):1-10. Epub 2005 Dec 12. Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM. Identification and functional analysis of human transcriptional promoters. Genome Res. 2003 Feb;13(2):308-12. 
Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, Myers RM. An abundance of bidirectional promoters in the human genome. Genome Res. 2004 Jan;14(1):62-6. encodeStanfordPromotersAverage Stan Pro Average Stanford Promoter Activity (Average) ENCODE Transcript Levels encodeStanfordPromotersU87 Stan Pro U87 Stanford Promoter Activity (U87 cells) ENCODE Transcript Levels encodeStanfordPromotersSnu182 Stan Pro Snu182 Stanford Promoter Activity (Snu182 cells) ENCODE Transcript Levels encodeStanfordPromotersPanc1 Stan Pro Panc1 Stanford Promoter Activity (Panc1 cells) ENCODE Transcript Levels encodeStanfordPromotersMRC5 Stan Pro MRC5 Stanford Promoter Activity (MRC5 cells) ENCODE Transcript Levels encodeStanfordPromotersMG63 Stan Pro MG63 Stanford Promoter Activity (MG63 cells) ENCODE Transcript Levels encodeStanfordPromotersJEG3 Stan Pro JEG3 Stanford Promoter Activity (JEG3 cells) ENCODE Transcript Levels encodeStanfordPromotersHepG2 Stan Pro HepG2 Stanford Promoter Activity (HepG2 cells) ENCODE Transcript Levels encodeStanfordPromotersHela Stan Pro Hela Stanford Promoter Activity (HeLa cells) ENCODE Transcript Levels encodeStanfordPromotersHTB11 Stan Pro HTB11 Stanford Promoter Activity (HTB11 cells) ENCODE Transcript Levels encodeStanfordPromotersHT1080 Stan Pro HT1080 Stanford Promoter Activity (HT1080 cells) ENCODE Transcript Levels encodeStanfordPromotersHMCB Stan Pro HMCB Stanford Promoter Activity (HMCB cells) ENCODE Transcript Levels encodeStanfordPromotersHCT116 Stan Pro HCT116 Stanford Promoter Activity (HCT116 cells) ENCODE Transcript Levels encodeStanfordPromotersG402 Stan Pro G402 Stanford Promoter Activity (G402 cells) ENCODE Transcript Levels encodeStanfordPromotersCRL1690 Stan Pro CRL1690 Stanford Promoter Activity (CRL1690 cells) ENCODE Transcript Levels encodeStanfordPromotersBe2C Stan Pro Be2c Stanford Promoter Activity (Be2c cells) ENCODE Transcript Levels encodeStanfordPromotersAGS Stan Pro AGS Stanford Promoter Activity (AGS cells) 
ENCODE Transcript Levels encodeStanfordRtPcr Stanf RTPCR Stanford Endogenous Transcript Levels in HCT116 Cells ENCODE Transcript Levels Description This track displays absolute transcript copy numbers for 136 genes and 12 negative control intergenic regions, determined by RTPCR in HCT116 cells. Display Conventions and Configuration The genomic regions are indicated by solid blocks. The shade of an item gives a rough indication of its count, ranging from light gray for zero to black for a count of 7000 or greater. To display only those items that exceed a specific unnormalized score, enter a minimum score between 0 and 1000 in the text box at the top of the track description page. Methods Total RNA was prepared in quadruplicate from HCT116 cells grown in culture. cDNA was prepared as described in Trinklein et al. (2004). Duplicate primer pairs were designed to each gene, and the absolute number of cDNA molecules containing each amplicon were determined by real-time PCR. The submitted data are the calculated number of molecules of each transcript containing the defined amplicon. Verification Four biological replicates were performed, and two primer pairs were used to measure the abundance of each transcript. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Trinklein, N.D., Chen, W.C., Kingston, R.E. and Myers, R.M. Transcriptional regulation and binding of HSF1 and HSF2 to 32 human heat shock genes during thermal stress and differentiation. Cell Stress Chaperones 9(1), 21-28 (2004). stsMap STS Markers STS Markers on Genetic (blue) and Radiation Hybrid (black) Maps Mapping and Sequencing Description This track shows locations of Sequence Tagged Site (STS) markers along the draft assembly. 
These markers have been mapped using either genetic mapping (Genethon, Marshfield, and deCODE maps), radiation hybridization mapping (Stanford, Whitehead RH, and GeneMap99 maps) or YAC mapping (the Whitehead YAC map) techniques. Since August 2001, this track no longer displays fluorescent in situ hybridization (FISH) clones, which are now displayed in a separate track. Genetic map markers are shown in blue; radiation hybrid map markers are shown in black. When a marker maps to multiple positions in the genome, it is shown in a lighter color. Methods Positions of STS markers are determined using both full sequences and primer information. Full sequences are aligned using blat, while isPCR (Jim Kent) and ePCR are used to find locations using primer information. Both sets of placements are combined to give final positions. In nearly all cases, full sequence and primer-based locations are in agreement, but in cases of disagreement, full sequence positions are used. Sequence and primer information for the markers were obtained from the primary sites for each of the maps, and from NCBI UniSTS (now part of NCBI Probe). Using the Filter The track filter can be used to change the color or include/exclude a set of map data within the track. This is helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: In the pulldown menu, select the map whose data you would like to highlight or exclude in the display. By default, the "All Genetic" option is selected. Choose the color or display characteristic that will be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display data from the map selected in the pulldown list. If "include" is selected, the browser will display only data from the selected map. When you have finished configuring the filter, click the Submit button. Credits This track was designed and implemented by Terry Furey. 
Many thanks to the researchers who worked on these maps, and to Greg Schuler, Arek Kasprzyk, Wonhee Jang, and Sanja Rogic for helping process the data. Additional data on the individual maps can be found at the following links: Genethon map Marshfield map deCODE map GeneMap99 GB4 and G3 maps Stanford TNG (Center has closed) Whitehead YAC and RH maps switchDbTss SwitchGear TSS SwitchGear Genomics Transcription Start Sites Regulation Description This track describes the location of transcription start sites (TSS) throughout the human genome along with a confidence measure for each TSS based on experimental evidence. The TSSs of a gene are important landmarks that help define the promoter regions of a gene. These TSSs were determined by SwitchGear Genomics by integrating experimental data using an empirically derived scoring function. Each TSS has a unique identifier that associates it with a gene model (see details below), and each TSS is color-coded to reflect its confidence score. These TSSs are also available in a searchable format at SwitchDB, an open-access online database of human TSSs. Experimental tools are available through SwitchGear to study the function of the promoter regions associated with these TSSs. Methods The predicted TSSs are associated with a genome-wide set of gene models. SwitchGear gene models are defined as clusters of cDNA alignments that have overlapping exons on the same strand. These gene models were created from over 250,000 human cDNA alignments to construct a genome-wide set of ~37,000 gene models. Each gene model is identified by its chromosome number, strand, and unique identifier. For example, ID CHR7_P0362 indicates a cDNA cluster (0362) aligning to the plus strand (P) of chromosome 7 (CHR7). Existing gene annotation is mapped to the gene models through the NCBI annotation associated with RefSeq accession numbers. 
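The identifier layout in the example above (CHR7_P0362) can be split into its parts with a short parser. This sketch is based only on that single example: the regular expression, the type and function names, and the lowercase `chr` prefix on the output are assumptions. The text documents only `P` for the plus strand, so any other strand letter is returned unchanged and not interpreted.

```python
import re
from typing import NamedTuple


class GeneModelId(NamedTuple):
    chrom: str        # e.g. "chr7"
    strand_code: str  # "P" is documented as plus; other codes are not given
    cluster: str      # cDNA cluster number, kept as text to preserve zeros


# Pattern inferred from the one example in the text (an assumption).
_ID_PATTERN = re.compile(r"^CHR(?P<chrom>[0-9A-Z]+)_(?P<strand>[A-Z])(?P<cluster>\d+)$")


def parse_gene_model_id(identifier):
    """Split an identifier such as 'CHR7_P0362' into its components."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized gene model ID: {identifier!r}")
    return GeneModelId(
        chrom="chr" + match.group("chrom"),
        strand_code=match.group("strand"),
        cluster=match.group("cluster"),
    )
```

Calling `parse_gene_model_id("CHR7_P0362")` yields chromosome `chr7`, strand code `P`, and cluster `0362`.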
The SwitchGear TSS prediction algorithm identifies the most likely sites of transcription initiation for each gene model. The algorithm employs a scoring metric to assign a confidence level to each TSS prediction based on existing experimental evidence. In addition to the ~250,000 human cDNAs listed in GenBank, more than 5 million additional 5' human cDNA sequence tags have been generated using a combination of approaches. While these short sequence reads do not reveal gene structure, they provide a significant amount of experimental evidence for identifying transcript start sites. For each gene model, the algorithm counts the number of TSSs (defined as the 5' end of a cDNA) within 200 bp of one another. The TSS score is based on the total number of TSSs identified within this window, with each TSS weighted according to several discriminating features: cDNA library source, relative location within the gene model, and exon structure of the transcript. Furthermore, the TSSs for each gene model are ranked to identify the TSS representing the most likely transcription initiation site for a gene model. Rankings are indicated in the TSS unique identifier by the addition of a suffix (e.g., CHR7_P0362_R1 or CHR7_P0362_R2). Using the Filter This track has a filter that can be used to change the TSS elements displayed by the browser. This filter is based on the score of the TSS element. The filter is located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. By default, the track displays only those TSSs with a score of 10 or above, and the TSSs for predicted pseudogenes are not displayed. To display them, check the box next to the Include TSSs for predicted pseudogenes label. When you have finished configuring the filter, click the Submit button.
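The identifier convention described in the Methods above (chromosome, strand letter, cluster number, optional rank suffix) can be decoded mechanically. The following is a minimal sketch, not SwitchGear's own code: the function name parse_tss_id and the TssId record are hypothetical, and the use of "M" for the minus strand is an assumption, since the text only documents "P" for plus.

```python
import re
from typing import NamedTuple, Optional


class TssId(NamedTuple):
    chrom: str            # e.g. "chr7"
    strand: str           # "+" or "-"
    cluster: str          # cDNA cluster number, kept as text to preserve leading zeros
    rank: Optional[int]   # rank suffix (_R1, _R2, ...) if present


# "P" = plus strand comes from the track description.
# "M" = minus strand is an assumption made for this sketch.
_TSS_ID = re.compile(
    r"^CHR(?P<chrom>[0-9A-Z]+)_(?P<strand>[PM])(?P<cluster>\d+)(?:_R(?P<rank>\d+))?$"
)


def parse_tss_id(identifier: str) -> TssId:
    """Split an ID such as CHR7_P0362_R1 into its documented parts."""
    m = _TSS_ID.match(identifier)
    if m is None:
        raise ValueError(f"not a recognised TSS identifier: {identifier!r}")
    strand = "+" if m["strand"] == "P" else "-"
    rank = int(m["rank"]) if m["rank"] else None
    return TssId("chr" + m["chrom"], strand, m["cluster"], rank)
```

For example, parse_tss_id("CHR7_P0362_R1") yields chromosome chr7, plus strand, cluster 0362, rank 1; the same ID without the suffix yields a rank of None.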
Credits This track was created by Nathan Trinklein and Shelley Force Aldred of SwitchGear Genomics. targetScanS T-ScanS miRNA TargetScanS miRNA Regulatory Sites Regulation Description This track shows conserved mammalian microRNA regulatory target sites in the 3' UTR regions of Refseq Genes, as predicted by TargetScanS. Method Putative miRNA binding sites on mRNAs were identified using six-nucleotide seed sequences from all known miRNAs conserved among human, mouse, rat, dog and chicken. These seed sequences were probed against the 3' end of mRNAs conserved over the five genomes. For further details of the methods used to generate this annotation, see Lewis et al. (2005). Credit Thanks to Benjamin Lewis at MIT and the Whitehead Institute for providing this annotation. Additional information on microRNA target prediction is available on the TargetScan website. References Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005 Jan 14;120(1):15-20. 
encodeTbaAlign TBA Alignment TBA Alignments ENCODE Comparative Genomics Description This track displays human-centric multiple sequence alignments in the ENCODE regions for the 28 vertebrates included in the September 2005 ENCODE MSA freeze, based on comparative sequence data generated for the ENCODE project as well as whole-genome assemblies residing at UCSC, as listed: human (May 2004, hg17) armadillo (NISC and May 2005 Broad Assisted Assembly v 1.0) baboon (NISC) chicken (Feb 2004, galGal2) chimp (Nov 2003, panTro1) colobus_monkey (NISC) cow (BCM) dog (July 2004, canFam1) dusky_titi (NISC) elephant (NISC and May 2005 Broad Assisted Assembly v 1.0) fugu (Aug 2002, fr1) galago (NISC) hedgehog (NISC) macaque (Jan 2005, rheMac1) marmoset (NISC) monodelphis (Oct 2004, monDom1) mouse (Mar 2005, mm6) mouse_lemur (NISC) owl_monkey (NISC) platypus (NISC and Aug 2005 Mullikin Phusion Assembly of WUGSC Traces) rabbit (NISC and May 2005 Broad Assisted Assembly v 1.0) rat (June 2003, rn3) rfbat (NISC) shrew (NISC and Sep 2005 Mullikin Phusion Assembly of Broad Traces) tenrec (Apr 2005 Mullikin Phusion Assembly of Broad Traces) tetraodon (Feb 2004, tetNig1) xenopus (Oct 2004, xenTro1) zebrafish (June 2004, danRer2) The alignments in this track were generated using the Threaded Blockset Aligner (TBA). The Genome Browser companion tracks, TBA Cons and TBA Elements, display conservation scoring and conserved elements for these alignments based on various conservation methods. Display Conventions and Configuration In full display mode, this track shows pairwise alignments of each species aligned to the human genome. In dense mode, the alignments are depicted using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display. When zoomed-in to the base-display level, the track shows the base composition of each alignment. 
The numbers and symbols on the "Gaps" line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Methods The TBA was used to align sequences in the September 2005 ENCODE sequence data freeze. Multiple alignments were seeded from a series of combinatorial pairwise blastz alignments (not referenced to any one species). The specific combinations were determined by the species guide tree. The resulting multiple alignments were projected onto the human reference sequence. Credits The TBA multiple alignments were created by Elliott Margulies of NHGRI, while at the Green Lab. The programs Blastz and TBA, which were used to generate the alignments, were provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. The phylogenetic tree is based on Murphy et al. (2001). References Blanchette M, Kent WJ, Reimer C, Elnitski L, Smit A, Roskin K, Baertsch R, Rosenbloom KR, Clawson H et al. Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res. 2004;14(4):708-15. Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002;115-26. Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294(5550):2348-51. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome Res. 2003;13(1):103-7. 
encodeTbaCons TBA Cons TBA Conservation ENCODE Comparative Genomics Description This track displays different measurements of conservation based on the multiple sequence alignments of ENCODE regions generated by the Threaded Blockset Aligner (TBA) and shown in the TBA Alignment track. The conservation scoring used to create this track was generated by three programs: phastCons (phylogenetic hidden-Markov model method) GERP (Genomic Evolutionary Rate Profiling) SCONE (from Harvard Genetics) A related track, TBA Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track. For details on the conservation scores generated by each program, refer to the individual Methods subsections. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. A subtrack may be hidden from view by unchecking the box to the left of the track name in the list. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation scoring methods. See the Methods section for display information specific to each subtrack. Methods The methods used to create the TBA alignments in the ENCODE regions are described in the TBA Alignment track description. PhastCons The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model.
These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1). For determining the conservation for the ENCODE alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit. The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions. PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.01 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%. The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM. It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation. More details on phastCons can be found in Siepel et al. (2005), cited below. GERP The GERP score is the expected substitution rate minus the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA. The scores are both positive and negative, with negative values (i.e. observed > expected) corresponding to neutral or unconstrained sites and positive values (i.e. observed < expected) corresponding to constrained or slowly evolving sites. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites.
The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base. Species that have gaps at a particular column are not considered in the scoring for that column. Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e. bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns. Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment. SCONE SCONE is a probabilistic measure of purifying selection expressed as a p-value that a given position evolves neutrally. It has a model of evolution that considers both sequence-contextual effects on substitution rates and insertion/deletion events. This model may be used to compute the probability of any transitional event along a lineage. The score is computed for any column in a multiple sequence alignment by first parsimoniously inferring the evolutionary history of the site, using a given phylogenetic tree with known branch-lengths. Subsequently, transition probabilities are computed for each branch in the tree. A heuristic score is computed using the formula: S = ln(product(all i in M)/product(all j in C)) where M and C are the set of all branches in the tree that contain mutations and the set of all branches in the tree that do not contain mutations, respectively. This heuristic score serves to effectively partition sites according to the influence of purifying selection on the site. This heuristic score is used to compute a p-value by comparing it against the expected distribution of neutral scores as determined by Monte-Carlo simulation. 
Forward simulation of evolution is performed along the phylogenetic tree using the SCONE model of mutation events, and the above heuristic score is computed for a simulated tree. Repeated simulation produces a distribution of scores that reflects the neutral expected distribution. A p-value score may be computed by counting the fraction of simulated heuristic scores that fall below the heuristic score for the site. Credits PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler lab at UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). SCONE was developed by Saurabh Asthana in the lab of Shamil Sunyaev at Harvard Medical School and Brigham & Women's Hospital (Department of Medicine/Division of Genetics). TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. The GERP data for this track were generated by Greg Cooper. The PhastCons data were generated by Elliott Margulies, with assistance from Adam Siepel. The SCONE data were generated by Saurabh Asthana. References Asthana S, Roytberg M, Stamatoyannopoulos J, Sunyaev S. Analysis of Sequence Conservation at Nucleotide Resolution. PLoS Comput. Biol. 2007 Dec 28;3(12):e254. Blanchette M, Kent WJ, Reimer C, Elnitski L, Smit A, Roskin K, Baertsch R, Rosenbloom KR, Clawson H, Green ED, et al. Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res. 2004 Apr;14(4):708-15. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005 Jul;15(7):901-13. Epub 2005 Jun 17.
Margulies EH, Blanchette M, NISC Comparative Sequencing Program, Haussler D, Green ED. Identification and characterization of multi-species conserved sequences. Genome Res. 2003 Dec;13(12):2507-18. Siepel A, Bejerano G, Pedersen JS, Hinrichs A, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. Epub 2005 Jul 15. encodeTbaSconeConsUpdate TBA SCONE Cons TBA SCONE Conservation ENCODE Comparative Genomics encodeTbaGerpCons TBA GERP Cons TBA GERP Conservation ENCODE Comparative Genomics encodeTbaPhastCons TBA PhastCons TBA PhastCons Conservation ENCODE Comparative Genomics encodeTbaElements TBA Elements TBA Conserved Elements ENCODE Comparative Genomics Description This track displays multi-species conserved sequences (MCSs) derived from phastCons, binCons, GERP (Genomic Evolutionary Rate Profiling), and SCONE conservation scoring of Threaded Blockset Aligner (TBA) multiple sequence alignments in the ENCODE regions. The multiple sequence alignments may be viewed in the TBA Alignment track. Another related track, TBA Cons, shows the conservation scoring. The descriptions accompanying these tracks detail the methods used to create the alignments and conservation scoring. Display Conventions and Configuration The locations of conserved elements are indicated by blocks in the graphical display. This composite annotation track consists of several subtracks that show conserved elements derived by the various methods listed above. To view only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. 
Display characteristics specific to certain subtracks are described in the respective Methods sections below. Methods PhastCons-based Elements The predicted MCSs are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM, i.e., maximal segments in which the maximum-likelihood (Viterbi) path remains in the conserved state. BinCons-based Elements The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate sites. Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was instead used to weight the individual species binomial scores; this latter weighting scheme was found to closely match 4D weights. The negative log of these p-values was then averaged across all human-referenced pairwise combinations, and the highest-scoring overlapping 25 bp window for each base was the resulting score. This track shows the plotting of a ranked percentile score normalized between 0 and 1 across all ENCODE regions, such that the top 5% most conserved sequence across all ENCODE regions have a score of 0.95 or greater, the top 10% have a score of 0.9 or greater, and so on. For each ENCODE target, a conservation score threshold was picked to match the number of conserved bases predicted by phastCons, an alternative method for measuring conservation. This latter method has been found slightly more reliable for predicting the expected fraction of conserved sequence in each target. Clusters of bases that exceeded the given conservation score threshold were designated as MCSs. The minimum length of an MCS is 25 bases. 
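The thresholding step just described can be sketched as follows. This is an illustrative reading of the text rather than the binCons implementation: the function name call_mcs is hypothetical, scores are assumed to be one value per base, and any base that does not exceed the threshold is treated as ending the current cluster.

```python
from typing import List, Sequence, Tuple


def call_mcs(
    scores: Sequence[float], threshold: float, min_length: int = 25
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of bases scoring above the threshold.

    Runs shorter than min_length bases are discarded.
    """
    elements: List[Tuple[int, int]] = []
    start = None  # start index of the run currently being extended
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        else:
            # A base at or below the threshold closes the current run.
            if start is not None and i - start >= min_length:
                elements.append((start, i))
            start = None
    # Close a run that extends to the end of the region.
    if start is not None and len(scores) - start >= min_length:
        elements.append((start, len(scores)))
    return elements
```

With a run of 30 high-scoring bases, one low base, 10 high bases, one low base, and 25 high bases, only the first and last runs are reported; the 10-base run is dropped for being shorter than 25 bases.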
Strict cutoffs were used: if even one base fell below the conservation score threshold, it separated an MCS into two distinct regions. More details on binCons can be found in Margulies et al. (2003), cited below. GERP-based Elements GERP elements are scored according to the inferred intensity of purifying selection and are measured as "rejected substitutions" (RSs). RSs capture the magnitude of difference between the number of "observed" substitutions (estimated using maximum likelihood) and the number that would be "expected" under a neutral model of evolution. The RS is displayed as part of the item name. Items with higher RSs are displayed in a darker shade of blue. The score shown on the details page, which has been scaled by 300 for display purposes, is generally not as accurate as the RS count that is part of the item name. "Constrained elements" are identified as those groups of consecutive human bases that have an observed rate of evolution that is smaller than the expected rate. These groups of columns are merged if they are less than a few nucleotides apart and are scored according to the sum of the site-by-site difference between observed and expected rates (RS). Permutations of the actual alignments were analyzed, and the "constrained elements" identified in these permuted alignments were treated as "false positives". Subsequently, an RS threshold was picked such that the total length of "false positive" constrained elements (identified in the permuted alignments) was less than 5% of the length of constrained elements identified in the actual alignment. Thus, all annotated constrained elements are significant at better than 95% confidence, and the total fraction of the ENCODE regions annotated as constrained is 5-7%. SCONE-based Elements SCONE provides p-value scores per base. Constrained elements are defined based on SCONE site-specific scores as follows.
An additive score is first defined as the sum of (-log p + log t) along an interval, where p is the SCONE score and t is some threshold value for conservation. This additive score may be treated as a random walk; elements are defined as the intervals between local minima and maxima along this walk. Subsequently, a cutoff is set for the additive scores for each element. This cutoff is chosen such that the elements scoring above the cutoff for a random sequence of scores drawn from a uniform distribution [0,1] with threshold t = 0.25 will cover no more than 0.25% of the sequence. Credits PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler Lab at UCSC. BinCons was developed by Elliott Margulies of NHGRI, while at the Eric Green lab. BinCons and phastCons MCS data were contributed by Elliott Margulies, with assistance from Adam Siepel of UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). SCONE was developed by Saurabh Asthana in the lab of Shamil Sunyaev at Harvard Medical School and Brigham & Women's Hospital (Department of Medicine/Division of Genetics). TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. References See the TBA Alignment and TBA Cons tracks for references.
encodeTbaSconeElUpdate TBA SCONE TBA SCONE Conserved Elements ENCODE Comparative Genomics encodeTbaGerpEl TBA GERP TBA GERP Conserved Elements ENCODE Comparative Genomics encodeTbaBinConsEl TBA BinCons TBA BinCons Conserved Elements ENCODE Comparative Genomics encodeTbaPhastConsEl TBA PhastCons TBA PhastCons Conserved Elements ENCODE Comparative Genomics tfbsConsSites TFBS Conserved HMR Conserved Transcription Factor Binding Sites Regulation Description This track contains the location and score of transcription factor binding sites conserved in the human/mouse/rat alignment. A binding site is considered to be conserved across the alignment if its score meets the threshold score for its binding matrix in all 3 species. The score and threshold are computed with the Transfac Matrix Database (v7.0) created by Biobase. The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites. In the graphical display, each box represents one conserved tfbs. Clicking on a box brings up detailed information on the binding site, namely its Transfac I.D., a link to its Transfac Matrix (free registration with Transfac required), its location in the human genome (chromosome, start, end, and strand), its length in bases, its raw score, and its Z score. All binding factors that are known to bind to the particular binding matrix of the binding site are listed along with their species, SwissProt ID, and a link to that factor's page on the UCSC Protein Browser if such an entry exists. Methods The Transfac Matrix Database contains position-weight matrices for 398 transcription factor binding sites, as characterized through experimental results in the scientific literature. Only binding matrices for known transcription factors in human, mouse, or rat were used for this track (258 of the 398). 
A typical (in this case fictitious) matrix (call it mat) will look something like:

      A   C   G   T
01   15  15  15  15   N
02   20  10  15  15   N
03    0   0  60   0   G
04   60   0   0   0   A
05    0   0   0  60   T

The above matrix specifies the results of 60 (the sum of each row) experiments. In the experiments, the first position of the binding site was A 15 times, C 15 times, G 15 times, and T 15 times (and so on for each position). The consensus sequence of the above binding site as characterized by the matrix is NNGAT. The format of the consensus sequence is the deduced consensus in the IUPAC 15-letter code. In the general case, the goal is to find all matches to a matrix of length n that are conserved across n_s sequences. For this example, n=5 and n_s=3 (human, mouse, and rat). Denote the multispecies alignment s, such that s_ji is the nucleotide at position j of species i. Also, define an n_s x 4 background matrix (call it back) giving the background frequencies of each nucleotide in each species. A sliding window (of length n) calculates the "species score" for each species at each position. From this, a log-odds score is calculated for each species (normalizing by the length of the matrix and the number of species in the alignment). These scores are then summed for all species, yielding a final log-odds score for the current position. Note that the log-odds score of each species must exceed the threshold for that species. The threshold is calculated for each species such that the only hits that will be reported will have a Z score (to be discussed later) of 2.36 or higher in each species (corresponding to a p-value of 0.01).
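As a quick check on the example matrix above, the row sums and the consensus NNGAT can be reproduced with a few lines of code. The consensus rule used here (call a base only when it accounts for more than half of the observations at a position, otherwise N) is a deliberate simplification assumed for this sketch; it is not the rule Transfac uses to derive full IUPAC codes, and the names MATRIX and simple_consensus are hypothetical.

```python
from typing import Dict, List

# The fictitious count matrix from the text: one dict of counts per position.
MATRIX: List[Dict[str, int]] = [
    {"A": 15, "C": 15, "G": 15, "T": 15},
    {"A": 20, "C": 10, "G": 15, "T": 15},
    {"A": 0, "C": 0, "G": 60, "T": 0},
    {"A": 60, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 60},
]


def simple_consensus(matrix: List[Dict[str, int]], min_fraction: float = 0.5) -> str:
    """Call the majority base at each position, or N when no base dominates.

    Simplified rule assumed for illustration only.
    """
    letters = []
    for row in matrix:
        total = sum(row.values())
        base, count = max(row.items(), key=lambda item: item[1])
        dominant = total > 0 and count / total > min_fraction
        letters.append(base if dominant else "N")
    return "".join(letters)


# Every row should describe the same number of experiments (60 in the example).
row_sums = [sum(row.values()) for row in MATRIX]
```

Applied to the example, every row sums to 60 and the simplified rule returns NNGAT, matching the consensus given in the text.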
Next, the maximum and minimum possible log-odds scores are computed and summed across all species for the given binding matrix. These are then used to normalize the final, raw log-odds score so that its range is between 0 and 1. Next, the best raw score for each binding matrix is calculated for the 5,000 base upstream region of each human RefSeq gene (taken from the RefGene table for hg17). The mean and standard deviation for each binding matrix are then calculated across all RefSeq genes. These are then used to create the threshold for each binding matrix, namely, one standard deviation above the mean. Tfloc is then run with this threshold on each chromosome for the 3-way multiz alignments. Finally, a Z score is calculated for each binding site hit h to matrix m as Z = (raw score of h - mean raw score for m) / (standard deviation of raw scores for m). This final Z score can be interpreted as the number of standard deviations above the mean raw score for that binding matrix across the upstream regions of all RefSeq genes. After all hits have been recorded genome-wide, one final filtering step is performed. Due to the inherent redundancy of the Transfac database, several binding sites that all bind the same factor often appear together. For example, consider the following binding sites:

585  chr1  4021  4042  V$MEF2_02     875  -  2.83
585  chr1  4021  4042  V$MEF2_03     917  -  3.38
585  chr1  4021  4042  V$MEF2_04     844  -  3.45
585  chr1  4022  4037  V$HMEF2_Q6    810  -  2.34
585  chr1  4022  4037  V$MEF2_01     802  -  2.47
585  chr1  4022  4038  V$RSRFC4_Q2   875  -  2.65
585  chr1  4022  4039  V$AMEF2_Q6    823  -  2.44
585  chr1  4023  4038  V$RSRFC4_01   878  +  2.53
585  chr1  4024  4035  V$MEF2_Q6_01  913  +  2.41
585  chr1  4024  4039  V$MMEF2_Q6    861  -  2.39

These 10 overlapping binding sites bind a total of 19 factors. However, of these 19 factors, only 7 are unique. Many of the above binding sites are redundant (they add no additional factors). In fact, the first 3 binding sites all bind the same two factors (namely, aMEF-2 and MEF-2A).
These ten binding sites can therefore be filtered down to the following four binding sites, without any loss of information (in terms of transcription factors). The final table entry then has the following four lines, since these four binding sites account for all 7 of the unique factors:

585  chr1  4021  4042  V$MEF2_04     844  -  3.45
585  chr1  4022  4038  V$RSRFC4_Q2   875  -  2.65
585  chr1  4024  4035  V$MEF2_Q6_01  913  +  2.41
585  chr1  4024  4039  V$MMEF2_Q6    861  -  2.39

In the event that multiple binding sites bind the same factors, the site with the highest Z score is chosen. Only binding sites which overlap each other and whose start positions are within 5 bases of each other are considered for merging. It should be noted that the positions of many of these conserved binding sites coincide with known exons and other highly conserved regions. Regions such as these are more likely to contain false positive matches, as the high sequence identity across the alignment increases the likelihood of a short motif that looks like a binding site to be conserved. Conversely, matches found in introns and intergenic regions are more likely to be real binding sites, since these regions are mostly poorly conserved. These data were obtained by running the program tfloc (Transcription Factor binding site LOCater) on multiz alignments of the May 2004 (mm5) mouse genome assembly and the June 2003 rat assembly (rn3) to the May 2004 human genome assembly (hg17). Transcription factor information was culled from the Transfac Factor database, version 7.0. Table Format The format of the tfbsConsSites sql table is shown above. The columns are (from left to right): bin, chromosome, from, to, binding matrix ID, raw score, strand, and Z score. To get the corresponding transcription factor information for a given binding matrix, use the table tfbsConsFactors.
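To make the column layout and the "highest Z score wins" rule concrete, here is a small sketch that parses rows in the tfbsConsSites layout and keeps one site per set of bound factors. It is not the tfloc code: the names Site, parse_site, and best_per_factor_set are hypothetical, the factor lookup is passed in by the caller, and the overlap and 5-base start-position checks described above are omitted for brevity. The factor set used in the example (aMEF-2 and MEF-2A for the first three sites) is the one stated in the text.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple


class Site(NamedTuple):
    bin: int
    chrom: str
    start: int
    end: int
    name: str       # binding matrix ID, e.g. V$MEF2_04
    score: int      # raw score
    strand: str
    z_score: float


def parse_site(line: str) -> Site:
    """Parse one whitespace-separated tfbsConsSites row."""
    b, chrom, start, end, name, score, strand, z = line.split()
    return Site(int(b), chrom, int(start), int(end), name, int(score), strand, float(z))


def best_per_factor_set(
    sites: Iterable[Site], factors: Dict[str, FrozenSet[str]]
) -> List[Site]:
    """Keep, for each distinct set of bound factors, the site with the highest Z score."""
    best: Dict[FrozenSet[str], Site] = {}
    for site in sites:
        key = factors[site.name]
        if key not in best or site.z_score > best[key].z_score:
            best[key] = site
    return list(best.values())


ROWS = [
    "585 chr1 4021 4042 V$MEF2_02 875 - 2.83",
    "585 chr1 4021 4042 V$MEF2_03 917 - 3.38",
    "585 chr1 4021 4042 V$MEF2_04 844 - 3.45",
]
MEF2_FACTORS = frozenset({"aMEF-2", "MEF-2A"})
FACTORS = {name: MEF2_FACTORS for name in ("V$MEF2_02", "V$MEF2_03", "V$MEF2_04")}

kept = best_per_factor_set((parse_site(row) for row in ROWS), FACTORS)
```

Of the three redundant sites, only V$MEF2_04 (Z score 3.45) is kept, which agrees with the first line of the filtered table.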
The format of the tfbsConsFactors sql table is:

V$MYOD_01  M00001  mouse  MyoD    P10085
V$E47_01   M00002  human  E47     N
V$CMYB_01  M00004  mouse  c-Myb   P06876
V$AP4_01   M00005  human  AP-4    Q01664
V$MEF2_01  M00006  mouse  aMEF-2  Q60929
V$MEF2_01  M00006  rat    MEF-2   N
V$MEF2_01  M00006  human  MEF-2A  Q02078
V$ELK1_01  M00007  human  Elk-1   P19419
V$SP1_01   M00008  human  Sp1     P08047
V$EVI1_06  M00011  mouse  Evi-1   P14404

The columns are (from left to right): Transfac binding matrix ID, Transfac binding matrix accession number, transcription factor species, transcription factor name, SwissProt accession number. When no factor species, name, or ID information exists in the Transfac Factor database for a binding matrix, an 'N' appears in the corresponding column(s). Notice also that if more than one transcription factor is known for one binding matrix, each occurs on its own line, so multiple lines can exist for one binding matrix. Credits These data were generated using the Transfac Matrix and Factor databases created by Biobase. The tfloc program was developed at The Pennsylvania State University (with numerous updates done at UCSC) by Matt Weirauch. This track was created by Matt Weirauch and Brian Raney at The University of California at Santa Cruz. tigrGeneIndex TIGR Gene Index Alignment of TIGR Gene Index TCs Against the Human Genome mRNA and EST Description This track displays alignments of the TIGR Gene Index (TGI) against the human genome. The TIGR Gene Index is based largely on assemblies of EST sequences in the public databases. See the following page for more information about TIGR Gene Indices. Credits Thanks to Foo Cheung and Razvan Sultana of The Institute for Genomic Research, for converting these data into a track for the browser. encodeUCDavisChip UCD Ng ChIP UC Davis ChIP-chip NimbleGen (E2F1, c-Myc, TAF, POLII) ENCODE Chromatin Immunoprecipitation Description ChIP analysis was performed using antibodies to E2F1, c-Myc, TAFI and PolII in HeLa, GM06990 and/or HelaS3 cells.
E2F1 and c-Myc protein are transcription factors related to growth. E2F1 is important in controlling cell division, and c-Myc is associated with cell proliferation and neoplastic disease. TAFI is a general transcription factor that is a key part of the pre-initiation complex found on the promoter. PolII is RNA polymerase II. For E2F1 and c-Myc, three independently crosslinked preparations of HeLa cells were used to provide three independent biological replicates. ChIP assays were performed (with minor modifications which can be provided upon request) using the protocol found at The Farnham Laboratory. Array hybridizations were performed using standard NimbleGen Systems conditions. For TAFI and PolII, crosslinked cells were officially supplied by the ENCODE Consortium (for reference, see The Human Genetic Cell Repository). Hence, this data may be compared to other tracks using this exact source of cells. (Note that this is different from the E2F1 and c-myc subtracks — those Hela cells were grown in the Farnham lab.) ChIP-chip and amplification procedures are according to standard protocols available in detail from the Farnham Lab website. Whole Genome Amplification (WGA) was used for these samples. Array processing was performed by NimbleGen, Inc. The supplied array data is the result of three biological replicates in each case. Display Conventions and Configuration The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Ratio intensity values (antibody vs. total) for each of three biological replicates were calculated and converted to log2. 
Each set of ratio values was then independently scaled by its Tukey biweight mean. The three replicates were then combined by taking the median scaled log2 ratio for each oligo. Verification For E2F1, primers were chosen to correspond to 13 individual peaks. PCR reactions were performed for each of the 13 primer sets using amplicons derived from each of three biological samples (39 reactions). The PCR reactions confirmed that all of the 13 chosen peaks were bound by E2F1 in all three biological samples. For PolII, simple verification of the ChIP sample was performed at a known positive target (the promoter for POLII) and known negative target (the DHFR 3' UTR region). Quantitative PCR verifications of sites are in progress. Credits These data were contributed by Mike Singer, Kyle Munn, Nan Jiang, Xinmin Zhang, Todd Richmond and Roland Green of NimbleGen Systems, Inc., and Matt Oberley, David Inman, Mark Bieda, Shally Xu and Peggy Farnham of Farnham Lab. Reference Bieda M, Xu X, Singer MA, Green R, Farnham PJ. Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 2006 May;16(5):595-605. 
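The replicate handling described in the Methods for the UC Davis ChIP-chip track (scale each replicate by its Tukey biweight mean, then take the per-oligo median) can be sketched as follows. This is a minimal illustration, not the pipeline used to build the track: it assumes a one-step biweight estimate with tuning constants c = 5 and eps = 1e-4, and it assumes that "scaled" means subtracting the biweight mean from each log2 ratio. The function names are hypothetical.

```python
from statistics import median


def tukey_biweight_mean(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate.

    c and eps are assumed tuning constants, not values taken from the track.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Points far from the median (|u| >= 1) get zero weight.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        num += w * v
        den += w
    return num / den


def combine_replicates(replicates):
    """Center each replicate on its biweight mean, then take the per-oligo median.

    replicates: list of equal-length lists of log2 ratios, one list per replicate.
    """
    centered = []
    for rep in replicates:
        center = tukey_biweight_mean(rep)
        centered.append([v - center for v in rep])
    # zip(*centered) groups the values that belong to the same oligo.
    return [median(per_oligo) for per_oligo in zip(*centered)]
```

Taking the median rather than the mean across replicates keeps a single aberrant hybridization from dominating the combined value for an oligo.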
encodeUCDavisTafHelaS3 UCD Taf_HelaS3 UC Davis ChIP-chip NimbleGen (TAF, HelaS3 Cells) ENCODE Chromatin Immunoprecipitation encodeUCDavisTafGM UCD Taf_GM UC Davis ChIP-chip NimbleGen (TAF, GM06990 Cells) ENCODE Chromatin Immunoprecipitation encodeUCDavisPolIIHelaS3 UCD PolII_HelaS3 UC Davis ChIP-chip NimbleGen (PolII, HelaS3 Cells) ENCODE Chromatin Immunoprecipitation encodeUCDavisPolIIGM UCD PolII_GM UC Davis ChIP-chip NimbleGen (PolII, GM06990 Cells) ENCODE Chromatin Immunoprecipitation encodeUCDavisChipMyc UCD c-Myc UC Davis ChIP-chip NimbleGen (c-Myc ab, HeLa Cells) ENCODE Chromatin Immunoprecipitation encodeUCDavisE2F1Median UCD E2F1 UC Davis ChIP-chip NimbleGen (E2F1 ab, HeLa Cells) ENCODE Chromatin Immunoprecipitation encodeUcDavisChipHits UCD Ng ChIP Hits UC Davis ChIP-chip Hits NimbleGen (E2F1, Myc ab, HeLa Cells) ENCODE Chromatin Immunoprecipitation Description ChIP analysis was performed using antibodies to E2F1 and Myc in HeLa cells. E2F1 and Myc protein are transcription factors related to growth. E2F1 is important in controlling cell division, and C-Myc is associated with cell proliferation and neoplastic disease. Three independently crosslinked preparations of HeLa cells were used to provide three independent biological replicates. ChIP assays were performed using the protocol found at Farnham Lab Protocols. Array hybridizations were performed using standard NimbleGen Systems conditions. Methods Ratio intensity values (antibody vs. total) for each of three biological replicates were calculated and converted to log2. Peaks were identified independently for each of the three E2F1 and the three Myc ChIP-chip experiments using the Tamalpais program. The identified peaks from the L1 categories for the three E2F1 or three Myc experiments were then compared. All regions reported here as binding sites were identified in at least two of the three E2F1 or at least two of the three Myc ChIP-chip assays. 
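The "at least two of the three" rule above can be illustrated with a small interval sweep. This sketch only shows the replicate-agreement step; it does not reproduce the Tamalpais peak caller or its L1 categories. It assumes peaks are half-open (start, end) intervals on one chromosome and that peaks within a single replicate do not overlap each other. The function name and parameters are hypothetical.

```python
def consensus_regions(replicate_peaks, min_support=2):
    """Return regions covered by peaks from at least min_support replicates.

    replicate_peaks: one list of (start, end) half-open intervals per replicate.
    """
    events = []
    for peaks in replicate_peaks:
        for start, end in peaks:
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions, ends (-1) sort before starts (+1).
    events.sort()

    regions = []
    depth = 0
    region_start = None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_support <= depth:
            region_start = pos
        elif depth < min_support <= previous:
            if regions and regions[-1][1] == region_start:
                # Support never really dropped: merge with the abutting region.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((region_start, pos))
    return regions
```

For example, with peaks at 100-200 in one replicate, 150-250 in a second and 180-220 in a third, only the stretch where two or more replicates agree is reported.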
Verification Primers were chosen to correspond to 13 individual peaks. PCR reactions were performed for each of the 13 primer sets using amplicons derived from each of three biological samples (39 reactions). The PCR reactions confirmed that all of the 13 chosen peaks were bound by E2F1 in all three biological samples. Credits These data were contributed by Mike Singer, Kyle Munn, Nan Jiang, Todd Richmond and Roland Green of NimbleGen Systems, Inc., and Matt Oberley, David Inman, Mark Bieda, Shally Xu and Peggy Farnham of Farnham Lab. Reference Bieda M, Xu X, Singer MA, Green R, Farnham PJ. Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 2006 May;16(5):595-605. encodeUcDavisChipHitsMyc UCD c-Myc Hits UC Davis ChIP-chip Hits NimbleGen (c-Myc ab, HeLa Cells) ENCODE Chromatin Immunoprecipitation encodeUcDavisChipHitsE2F1 UCD E2F1 Hits UC Davis ChIP-chip Hits NimbleGen (E2F1 ab, HeLa Cells) ENCODE Chromatin Immunoprecipitation encodeEvoFold UCSC EvoFold UCSC, RNA secondary structure predicted by EvoFold (id_strand_score) ENCODE Regions and Genes Description This track shows RNA secondary structure predictions made with the EvoFold program, a comparative method that exploits the evolutionary signal of genomic multiple-sequence alignments for identifying conserved functional RNA structures. Display Conventions and Configuration Track elements are labeled using the convention ID_strand_score. When zoomed out beyond the base level, secondary structure prediction regions are indicated by blocks, with the stem-pairing regions shown in a darker shade than unpaired regions. Arrows indicate the predicted strand. When zoomed in to the base level, the specific secondary structure predictions are shown in parenthesis format. The confidence score for each position is indicated in grayscale, with darker shades corresponding to higher scores. 
The details page for each track element shows the predicted secondary structure (labeled SS anno), together with details of the multiple species alignments at that location. Substitutions relative to the human sequence are color-coded according to their compatibility with the predicted secondary structure (see the color legend on the details page). Each prediction is assigned an overall score and a sequence of position-specific scores. The overall score measures evidence for any functional RNA structures in the given prediction region, while the position-specific scores (0 - 9) measure the confidence of the base-specific annotations. Base-pairing positions are annotated with the same pair symbol. The offsets are provided to ease visual navigation of the alignment in terms of the human sequence. The offset is calculated (in units of ten) from the start position of the element on the positive strand or from the end position when on the negative strand. The graphical display may be filtered to show only those track elements with scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods EvoFold makes use of phylogenetic stochastic context-free grammars (phylo-SCFGs), which are combined probabilistic models of RNA secondary structure and primary sequence evolution. The predictions consist of both a specific RNA secondary structure and an overall score. The overall score is essentially a log-odds score between a phylo-SCFG modeling the constrained evolution of stem-pairing regions and one which only models unpaired regions. The predictions for this track were based on the conserved elements of the 28-way threaded blockset aligner (TBA) alignments present in the ENCODE regions (see the TBA Alignment track for more information). Credits The EvoFold program and browser track were developed by Jakob Skou Pedersen of the UCSC Genome Bioinformatics Group. 
The 28-way TBA multiple alignments were created by Elliott Margulies of the Green Lab at NHGRI. TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. References EvoFold Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006 Apr;2(4):e33. Phylo-SCFGs Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999 Jun;15(6):446-54. Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res. 2004 Sep 24;32(16):4925-36. PhastCons Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. encodeUncFaire UNC FAIRE UNC FAIRE (Formaldehyde Assisted Isolation of Regulatory Elements) ENCODE Chromosome, Chromatin and DNA Structure Description Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) is a procedure used to isolate chromatin that is resistant to the formation of protein-DNA cross-links. These tracks display FAIRE data from 2091 fibroblast cells hybridized to high-resolution NimbleGen arrays that tile the ENCODE regions. The four datasets, in practical terms, can be thought of as independent replicates. However, because they were part of a series of experiments aimed at optimizing cross-linking conditions in human cells, the data represent different cross-linking times (1, 2, 4, and 7 minutes). 
Although the individual replicates are not displayed, the replicate data and also the signal averages and the peaks for the averages can be downloaded. Display Conventions and Configuration The FAIRE data are represented by three subtracks. One subtrack shows the average normalized log2 ratios for the tiled probes; the other two subtracks display peaks. The peaks in one set were determined using PeakFinder software supplied by NimbleGen. A false positive rate (FPR) was estimated for the peak set using a permutation-based method. All peaks had an FPR of < 0.01. The peaks in the other set (Apr. 2006 update) were identified by ChIPOTle, a peak-finding algorithm that uses a sliding window to identify statistically significant signals that comprise a peak. A null distribution was determined by reflecting the negative data, which is presumed to be noise, about zero and a Gaussian distribution was fitted to it. Windows were considered significant with a p-value < 1e-25, after using the Benjamini-Hochberg correction for multiple tests. This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only one subtrack, uncheck the box next to the track you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Note that the graphical configuration options are available only for the Signal subtrack; the Peaks subtracks are fixed. Methods To perform FAIRE, proteins were cross-linked to DNA using 1% formaldehyde solution, the complex was sheared using sonication, and a phenol/chloroform extraction was performed to remove DNA fragments crosslinked to protein. 
The DNA recovered in the aqueous phase was fluorescently-labeled and hybridized to a microarray along with fluorescently-labeled genomic DNA as a control. Ratios were scaled by subtracting the Tukey Bi-weight mean for the log-ratio values from each log-ratio value, as recommended by NimbleGen. Results in yeast were consistent with enrichment for nucleosome-depleted regions of the genome. Therefore, the method may have utility as a positive selection for genomic regions with properties normally detected by assays like DNase hypersensitivity. Verification The data were verified using PCR with primers designed to promoters enriched with FAIRE and downstream coding regions. Credits Cell culture, fixing, and DNA amplification were performed by Jonghwan Kim in the Vishy Iyer lab at the University of Texas, Austin. FAIRE was done by Paul Giresi in the Jason Lieb lab at the University of North Carolina at Chapel Hill. Paul Giresi of NimbleGen did the sample labeling and hybridization with the help of Mike Singer and Roland Green. Nan Jiang at NimbleGen supplied the software used for the permutation analysis. References Buck, M.J., Nobel, A.B., and Lieb, J.D. ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol. 6(11), R97 (2005). Nagy, P.L., Cleary, M.L., Brown, P.O., and Lieb, J.D. Genomewide demarcation of RNA polymerase II transcription units revealed by physical fractionation of chromatin. PNAS 100(11), 6364-9 (2003). 
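The null-distribution idea described for the ChIPOTle peak set in the FAIRE track (reflect the negative values about zero, fit a Gaussian, then apply the Benjamini-Hochberg correction) can be sketched as below. This is an illustration under stated assumptions, not the ChIPOTle implementation: the sliding-window averaging step is omitted, the input is taken to be one value per window, and all function names are hypothetical.

```python
import math


def reflected_null_sigma(values):
    """Standard deviation of a zero-centered Gaussian fitted to reflected negatives.

    Reflecting the negative values about zero gives a symmetric sample with
    mean 0, so its variance is simply the mean of the squared negatives.
    """
    negatives = [v for v in values if v < 0]
    return math.sqrt(sum(v * v for v in negatives) / len(negatives))


def upper_tail_p(value, sigma):
    """One-sided p-value of value under a Gaussian with mean 0 and the given sigma."""
    return 0.5 * math.erfc(value / (sigma * math.sqrt(2)))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_windows(values, alpha=1e-25):
    """Indices of windows whose adjusted p-value falls below alpha."""
    sigma = reflected_null_sigma(values)
    adjusted = benjamini_hochberg([upper_tail_p(v, sigma) for v in values])
    return [i for i, p in enumerate(adjusted) if p < alpha]
```

The default `alpha` mirrors the 1e-25 cutoff quoted in the track description; with real data the choice of window size and cutoff would interact, which this sketch does not model.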
encodeUncFairePeaksChipotle FAIRE ChIPOTle University of North Carolina FAIRE Peaks (ChIPOTle) ENCODE Chromosome, Chromatin and DNA Structure encodeUncFairePeaks FAIRE PeakFinder University of North Carolina FAIRE Peaks (PeakFinder) ENCODE Chromosome, Chromatin and DNA Structure encodeUncFaireSignal FAIRE Signal University of North Carolina FAIRE Signal ENCODE Chromosome, Chromatin and DNA Structure uniGene_2 UniGene UniGene Alignments and SAGE Info mRNA and EST Description Serial analysis of gene expression (SAGE) is a quantitative measurement of gene expression. Data are presented for every cluster contained in the browser window and the selected cluster name is highlighted in the table. All data are from the repository at the SageMap project built on UniGene version Hs 171. Click on a UniGene cluster name on the track details page to display SageMap's page for that cluster. Please note that data are not available for every cluster. There is no data available for clusters that lie entirely within the bounds of larger clusters. Methods SAGE counts are produced by sequencing small "tags" of DNA believed to be associated with a gene. These tags were generated by attaching poly-A RNA to oligo-dT beads. After synthesis of double-stranded cDNA, transcripts were cleaved by an anchoring enzyme (usually NlaIII). Then, small tags were produced by ligation with a linker containing a type IIS restriction enzyme site and cleavage with the tagging enzyme (usually BsmFI). The tags were concatenated together and sequenced. The frequency of each tag was counted and used to infer expression level of transcripts that could be matched to that tag. Credits All SAGE data presented here were mapped to UniGene transcripts by the SageMap project at NCBI. 
encodeUppsalaChip Uppsala ChIP Uppsala University, Sweden ChIP-chip ENCODE Chromatin Immunoprecipitation Description This track displays the results of ENCODE region-wide localization for three transcription factors (HNF-3b, HNF-4a and USF-1) and acetylated histone H3 (H3ac). The heights of the peaks in the graphical display indicate the ratio of enriched non-amplified DNA to input DNA. The data for each of the transcription factors and H3ac are displayed in individual subtracks. The analysis cut-off threshold is indicated in each subtrack by a horizontal line. Tentative binding sites (TBSs) in spots passing the cut-off are displayed in a separate subtrack, ChIP-chip (HepG2) Sites. These sites are numbered corresponding to the ranking of spots based on enrichment ratios. Each TBS is assigned a value indicating how often it was found in separate BioProspector software runs for the prediction of TBSs (e.g. 1000 indicates that a TBS was found in ten out of ten runs). The raw data for this track is available at EBI ArrayExpress, as experiment E-MEXP-452. Methods Chromatin from HepG2 cells was cross-linked with formaldehyde and sonicated to produce DNA fragments of size 0.5-2 kb. Chromatin was precipitated using antibodies against HNF-4a, HNF-3b, USF-1 or H3ac. DNA from a single ChIP reaction was labeled with Cy5, and a fraction of the total input was labeled with Cy3. There was no amplification of the ChIP DNA or the input DNA prior to this step to avoid introducing bias. This DNA was combined and hybridized to PCR-based tiling path ENCODE arrays. Most array elements were printed only once on the slide, but X-chromosomal regions (ENm006 and ENr324) were printed in duplicate. There were approximately 19,000 spots/slide. The array provided about 75% coverage of the ENCODE regions. Spots flagged as bad by the image processing step were removed; those that remained were normalized. The average log2 ratio was calculated for spots that were replicated on the array. 
A log odds score for differential enrichment with the negative control was calculated using an empirical Bayes method. There were four log odds scores for each spot, one for each antibody. If this score was greater than 0 and the log2 ratio was greater than 1.25 (indicative of a strong positive signal), based on at least 2 replicates, the spots were considered to be enriched. Binding sites were identified using the BioProspector software. Because the software is non-deterministic, different runs may produce different results for the same data. Predictions consistent across many runs are more likely to be correct; therefore, the analysis was repeated, keeping all binding sites occurring in each top-scoring motif to generate a set of candidates. TBSs present in at least five out of ten runs were selected. Further method details are described in Rada-Iglesias et al. (2005). In the graphical display, overlapping sequences were removed by changing the start position of downstream spots to generate a continuous track. To give each track a comparable scale, the values for the most enriched spots were lowered to 15. Spots deemed as false positives, when compared to a no antibody ChIP-chip experiment, were assigned a value of 0. Verification A negative control was done using no antibody for the ChIP-chip to reduce the number of false positives. Three independent biological replicates were performed for each antibody; three negative control ChIPs were also analyzed. Semi-quantitative PCR was used to verify enrichment in at least ten positive spots for each antibody. Credits These experiments were performed in the Claes Wadelius lab. The statistical analysis was done at the Linnaeus Centre for Bioinformatics at Uppsala University. Microarrays were produced at the Sanger Institute. References Rada-Iglesias A, Wallerman O, Koch C, Ameur A, Enroth S, Clelland G, Wester K, Wilcox S, Dovey OM, Ellis PD et al. 
Binding sites for metabolic disease related transcription factors inferred at base pair resolution by chromatin immunoprecipitation and genomic microarrays. Hum Mol Genet. 2005 Nov 15;14(22):3435-47. encodeUppsalaChipSites UU Sites Uppsala University, Sweden ChIP-chip (HepG2) Sites ENCODE Chromatin Immunoprecipitation encodeUppsalaChipUsf1 UU USF-1 HepG2 Uppsala University, Sweden ChIP-chip (USF-1, HepG2) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipHnf4a UU HNF-4a HepG2 Uppsala University, Sweden ChIP-chip (HNF-4a, HepG2) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipHnf3b UU HNF-3b HepG2 Uppsala University, Sweden ChIP-chip (HNF-3b, HepG2) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipAch3 UU H3ac HepG2 Uppsala University, Sweden ChIP-chip (H3ac, HepG2) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipBut Uppsala ChIP Butyrate Uppsala University, Sweden ChIP-chip Na-butyrate time series ENCODE Chromatin Immunoprecipitation Description ENCODE regions were investigated by ChIP-chip, analyzing both histone H3 acetylation (H3ac; H3 acetylated lysines 9 and 14) and histone H4 acetylation (H4ac; H4 acetylated lysines 5, 8, 12 and 16). This analysis was performed using ChIP material obtained from cells that were either untreated or treated with 5 mM Na-butyrate for 12 hours. Na-butyrate is a histone deacetylase inhibitor (HDACi) that increases bulk levels of acetylated histones. Four tracks presented in the genome browser represent the ChIP-chip signal obtained for either H3ac or H4ac, using cells that were untreated or treated with butyrate: H3ac 0h, H3ac 12h, H4ac 0h, H4ac 12h. Two additional tracks indicate those spots where H3ac or H4ac levels are significantly changed by butyrate treatment. Methods Chromatin immunoprecipitation, DNA labelling and array hybridization were exactly as previously described (Rada-Iglesias, et al. 2005). 
A set of enriched spots was obtained for each of H3ac 0h, H3ac 12h, H4ac 0h and H4ac 12h using the same pre-processing and analysis procedures as in (Rada-Iglesias, et al.). Enriched spots showing different histone acetylation levels between 0h and 12h treatment were then detected through an empirical Bayes method (Smyth). All spots with B-score>0 were classified as either up or down, depending on whether the acetylation was increased or decreased. For spots missing all measurements at one of the time points due to filtering, the B-score was instead calculated on un-filtered, print-tip lowess normalized (Yang, et al.) raw data. Enriched spots that were not present in any of the up or down groups were classified as unchanged. The raw data for this track is available at EBI ArrayExpress, as experiment E-MEXP-693. Verification New ChIPs were performed for both H3ac and H4ac, both for untreated cells and cells treated with 5 mM Na-butyrate for 12 hours. Furthermore, ChIP was performed in cells that were treated with 5 mM Na-butyrate for 15 minutes, 2 hours, 6 hours and 12 hours+6 hours without butyrate. All these ChIP DNAs were analyzed by PCR. The regions tested included 10 regions where loss of acetylation after 12 hours of butyrate treatment was observed in ChIP-chip experiments, two regions where a trend towards increased acetylation was observed, one negative region where no acetylation and no change was observed, and three control regions not included in the ENCODE array and covering promoter regions of previously known butyrate-responsive genes. Credits These experiments were performed in the Claes Wadelius lab, Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University. The statistical analysis was done at the Linnaeus Centre for Bioinformatics at Uppsala University. Microarrays were produced at the Sanger Institute. References Ameur A, Yankovski V, Enroth S, Spjuth O, Komorowski J. The LCB Data Warehouse. Bioinformatics. 2006 Apr 15;22(8):1024-6. 
Rada-Iglesias A, Wallerman O, Koch C, Ameur A, Enroth S, Clelland G, Wester K, Wilcox S, Dovey OM, Ellis PD et al. Binding sites for metabolic disease related transcription factors inferred at base pair resolution by chromatin immunoprecipitation and genomic microarrays. Hum Mol Genet. 2005 Nov 15;14(22):3435-47. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15. encodeUppsalaChipH4acBut0vs12 UU H4ac 0h vs 12h Uppsala University, Sweden ChIP-chip (H4ac 0h vs. 12h) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipH3acBut0vs12 UU H3ac 0h vs 12h Uppsala University, Sweden ChIP-chip (H3ac 0h vs. 12h) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipH4acBut12h UU H4ac HepG2 12h Uppsala University, Sweden ChIP-chip (H4ac, HepG2, Butyrate 12h) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipH4acBut0h UU H4ac HepG2 0h Uppsala University, Sweden ChIP-chip (H4ac, HepG2, Butyrate 0h) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipH3acBut12h UU H3ac HepG2 12h Uppsala University, Sweden ChIP-chip (H3ac, HepG2, Butyrate 12h) ENCODE Chromatin Immunoprecipitation encodeUppsalaChipH3acBut0h UU H3ac HepG2 0h Uppsala University, Sweden ChIP-chip (H3ac, HepG2, Butyrate 0h) ENCODE Chromatin Immunoprecipitation encodeRegulomeDnaseArray UW DNase-array UW DNaseI hypersensitivity by DNase-array ENCODE Chromosome, Chromatin and DNA Structure Description This track displays DNaseI sensitivity/hypersensitivity mapped over ENCODE regions in lymphoblastoid cells (ENCODE common cell line GM06990) using the DNase-array methodology described in Sabo et al. (2006). 
DNaseI hypersensitivity signifies chromatin accessibility following binding of trans-acting factors in place of a canonical nucleosome, and is a universal feature of active cis-regulatory sequences in vivo. Peaks in DNaseI sensitivity signal measured using DNase/Array represent DNaseI hypersensitive sites. Methods DNase-array comprises the following steps: (1) treatment of nuclear chromatin with DNaseI; (2) isolation of short (avg. length ~450 bp) DNA segments released by two DNaseI “hits” occurring in close proximity on the same nuclear chromatin template; (3) differential labeling of fragments and a control (DNaseI-treated naked DNA); (4) hybridization to a tiling DNA microarray (Nimblegen ENCODE array), without amplification. Signal peaks correspond to DNaseI hypersensitive sites. Validation The data have been extensively validated by conventional DNaseI hypersensitivity assays (indirect end-label + Southern blotting method). The data have an overall sensitivity of 91.7%, and specificity of >99.5% for DNaseI hypersensitive sites. Note that the tiling array covers only non-repetitive regions. Credits These data were generated by the UW ENCODE group. References Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A, Weaver M, Shafer A, Lee K, Neri F, Humbert R, Singer MA, Richmond TA, Dorschner MO, McArthur M, Hawrylycz M, Green RD, Navas PA, Noble WS, Stamatoyannopoulos JA. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. 
Nature Methods 3:511-18 (2006) encodeRegulomeDnaseGM06990Sites DnaseI HSs UW DNase-array GM06990 HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeDnaseGM06990Sens DnaseI Sens UW DNase-array GM06990 Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBase UW QCP DNaseI Sens UW QCP DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure Description This track shows DNaseI sensitivity measured across ENCODE regions using the Quantitative Chromatin Profiling (QCP) method (Dorschner et al. (2004)). DNaseI has long been used to map general chromatin accessibility and the DNaseI "hyperaccessibility" or "hypersensitivity" that is a universal feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, insulators, promoters, locus control regions and novel elements. QCP provides a quantitative high-throughput method for mapping DNaseI sensitivity as a continuous function of genome position. The moving baseline of mean DNaseI sensitivity is computed using a locally-weighted least squares (LOWESS)-based algorithm. 
DNaseI-treated and untreated chromatin samples from the following cell lines/phenotypes were studied:

Cell Line  Description                               Source
CD4        CD4+ lymphoid                             Primary
CaCo2      intestinal cancer                         ATCC
CaLU3      lung cancer                               ATCC
EryAdult   CD34-derived primary adult erythroblasts  Primary
EryFetal   CD34-derived primary fetal erythroblasts  Primary
GM06990    EBV-transformed lymphoblastoid            Coriell
HMEC       mammary epithelium                        Cambrex
HRE        renal epithelial                          Cambrex
HeLa       cervical cancer                           ATCC
HepG2      hepatic                                   ATCC
Huh7       hepatic                                   JCRB
K562       erythroid                                 ATCC
NHBE       bronchial epithelial                      Cambrex
PANC       pancreatic                                ATCC
SAEC       small airway epithelial                   Cambrex
SKnSH      neural                                    ATCC

Key for Source entry in table: ATCC: American Type Culture Collection; Cambrex: Cambrex Corporation; JCRB: Japanese Collection of Research Bioresources.

Display Conventions and Configuration DNaseI sensitivity is expressed in standard units, where each increment of 1 unit corresponds to an increase of 1 standard deviation from the baseline. The displayed values are calculated as copies in DNaseI-untreated / copies in DNaseI-treated. Thus, increasing values represent increasing sensitivity. Major DNaseI hypersensitive sites are readily identified as peaks in the signal that exceed 2 standard deviations (corresponding to the ~95% confidence bound on outliers). This is reflected in the default viewing parameters, which apply a lower y-axis threshold of 2 (i.e., showing only sites that exceed the 95% confidence bound). The subtracks within this composite annotation track correspond to data from different tissues, and may be configured in a variety of ways to highlight different aspects of the displayed data. Four tissue types are present throughout all ENCODE regions: GM06990, CaCo2, HeLa, and SKnSH. Relevant tissues were also studied for several ENCODE regions that contain tissue-specific genes. These include the alpha- and beta-globin loci (ENm008 and ENm009); the apolipoprotein A1/C3 loci (ENm003); and the Th2 cytokine locus (ENm002). 
Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different cell lines/phenotypes. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods QCP was performed as described in Dorschner et al. Data were obtained from a tiling path across ENCODE that comprises 102,008 distinct amplicons (mean length = 243 +/- 13). The amplicon tiling path is available through UniSTS. The tiling path covers approximately 86% of ENCODE regions, including many repetitive regions. The Dorschner et al. article describes the methods of chromatin preparation, DNaseI digestion, and DNA purification utilized. DNaseI-treated and -untreated control samples were prepared from each tissue. For each tissue, 6-10 biological replicates (defined as replicate cultures grown from seed and harvested on different days) were pooled together to create a master sample. The relative number of intact copies of the genomic DNA sequence was quantified over the entire tiling path using real-time PCR for both DNaseI-treated and -untreated samples. Four to eight technical replicates were performed for each measurement from each amplicon in each tissue. Data shown are the means of these technical replicates. The results were analyzed as described in Dorschner et al. to compute the moving baseline of mean DNaseI sensitivity and to identify outliers that correspond with DNaseI hypersensitive sites. The standard deviation of trimmed mean measurements was used to convert data to standard units. Verification Biological replicate samples were pooled as described above. 
Results were extensively validated by conventional DNaseI hypersensitivity assays using end-labeling/Southern blotting method (Navas et al., in preparation). Credits Data generation, analysis, and validation were performed by the following members of the ENCODE group at the University of Washington (UW) in Seattle. UW Medical Genetics: Patrick Navas, Man Yu, Hua Cao, Brent Johnson, Ericka Johnson, Tristan Frum, and George Stamatoyannopoulos. UW Genome Sciences: Michael O. Dorschner, Richard Humbert, Peter J. Sabo, Scott Kuehn, Robert Thurman, Anthony Shafer, Jeff Goldy, Molly Weaver, Andrew Haydock, Kristin Lee, Fidencio Neri, Richard Sandstrom, Shane Neff, Brendan Henry, Michael Hawrylycz, Janelle Kawamoto, Paul Tittel, Jim Wallace, William S. Noble, and John A. Stamatoyannopoulos. References Dorschner MO, Hawrylycz M, Humbert R, Wallace JC, Shafer A, Kawamoto J, Mack J, Hall R, Goldy J, Sabo PJ et al. High-throughput localization of functional elements by quantitative chromatin profiling. Nat Methods. 2004 Dec;1(3):219-25. 
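The conversion to standard units described for the QCP track (ratio of untreated to treated copies, distance from a moving baseline, expressed in standard deviations, with sites above 2 treated as hypersensitive) can be sketched as follows. This is a simplified illustration and not the Dorschner et al. procedure: a moving trimmed mean stands in for the LOWESS-based baseline, the standard deviation is taken over all baseline-subtracted values, and the window size, trim fraction and function names are assumptions.

```python
from statistics import mean, stdev


def trimmed_mean(values, trim=0.2):
    """Mean after dropping the lowest and highest trim fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def to_standard_units(untreated, treated, window=5, trim=0.2):
    """Convert per-amplicon copy numbers to standard units.

    untreated, treated: copy numbers per amplicon, in tiling-path order.
    The moving trimmed mean is an assumed stand-in for the LOWESS baseline.
    """
    ratios = [u / t for u, t in zip(untreated, treated)]
    half = window // 2
    baseline = []
    for i in range(len(ratios)):
        lo, hi = max(0, i - half), min(len(ratios), i + half + 1)
        baseline.append(trimmed_mean(ratios[lo:hi], trim))
    residuals = [r - b for r, b in zip(ratios, baseline)]
    sd = stdev(residuals)
    return [res / sd for res in residuals]


def hypersensitive_indices(units, threshold=2.0):
    """Indices of amplicons whose signal exceeds the threshold in standard units."""
    return [i for i, z in enumerate(units) if z > threshold]
```

A site where DNaseI has destroyed most template copies has a high untreated/treated ratio, so it rises above the baseline and is flagged once it exceeds the threshold of 2.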
encodeUWRegulomeBaseSKnSH SKnSH SKnSH DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseSAEC SAEC SAEC DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBasePANC PANC PANC DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseNHBE NHBE NHBE DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseK562 K562 K562 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseHuh7 Huh7 Huh7 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseHepG2 HepG2 HepG2 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseHeLa HeLa HeLa DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseHRE HRE HRE DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseHMEC HMEC HMEC DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseGM GM GM DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseEryFetal EryFetal EryFetal DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseEryAdult EryAdult EryAdult DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseCaLU3 CaLU3 CaLU3 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseCaCo2 CaCo2 CaCo2 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeUWRegulomeBaseCD4 CD4 CD4 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure vegaGene Vega Genes Vega Annotations Genes and Gene Predictions Description and Methods This track shows gene annotations from the Vertebrate Genome Annotation (Vega) database. 
The following information is excerpted from the Vertebrate Genome Annotation home page: "The Vega database is designed to be a central repository for high-quality, frequently updated manual annotation of different vertebrate finished genome sequence. Vega attempts to present consistent high-quality curation of the published chromosome sequences. Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases as well as a series of ab initio gene predictions (GENSCAN, Fgenes). The annotation is based on supporting evidence only." "In addition, comparative analysis using vertebrate datasets such as the Riken mouse cDNAs and Genoscope Tetraodon nigroviridis Ecores (Evolutionary Conserved Regions) are used for novel gene discovery." NOTE: VEGA annotations do not appear on every chromosome in this assembly. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks using the following color scheme to indicate the status of the gene annotation: Known (Dark blue): genes that are identical to known human complementary DNA or protein sequences and have an entry in the species-specific model organism database. Novel (Dark blue): genes that have an open reading frame (ORF) and are identical or homologous to human cDNAs, human ESTs, or proteins in all species. Novel transcripts are genes that fit the criteria of novel genes with the exception that an unambiguous ORF cannot be assigned. Putative (Medium blue): genes whose sequences are identical or homologous to human ESTs but do not contain an ORF. Predicted (Light blue): genes based on ab initio prediction and for which at least one exon is supported by biological data (unspliced ESTs, protein sequence similarity with mouse or tetraodon genomes, or expression data from Rosetta). Unclassified (Gray). The details pages show only the Vega gene type and not the transcript type. 
A single gene can have more than one transcript which can belong to different classes, so the gene as a whole is classified according to the transcript with the "highest" level of classification. Transcript type (and other details) may be found by clicking on the transcript identifier which forms the outside link to the Vega transcript details page. Further information on the gene and transcript classification may be found here. Credits Thanks to Steve Searle at the Sanger Institute for providing the GTF and FASTA files for the Vega annotations. Vega gene annotations are generated by manual annotation from the following groups: Chromosome 6: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Mungall AJ et al., The DNA sequence and analysis of human chromosome 6. Nature. 2003 Oct 23;425:805-11. Chromosome 7: Hillier et al., The Genome Center at Washington University Relevant publication: Hillier LW et al., The DNA sequence of human chromosome 7. Nature. 2003 Jul 10;424:157-64. Chromosome 9: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Humphray SJ et al., The DNA sequence and analysis of human chromosome 9. Nature. 2004 May 27;429:369-74. Chromosome 10: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 10. Nature. 2004 May 27;429:375-81. Chromosome 13: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Dunham A et al., The DNA sequence and analysis of human chromosome 13. Nature. 2004 Apr 1;428:522-8. Chromosome 14: Genoscope Relevant publication: Heilig R et al., The DNA sequence and analysis of human chromosome 14. Nature. 2003 Feb 6;421:601-7. Chromosome 20: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 20. Nature. 2001 Dec 20;414:865-71. 
Chromosome 22: Chromosome 22 Group, Wellcome Trust Sanger Institute Relevant publications: — Collins JE et al., Reevaluating Human Gene Annotation: A Second-Generation Analysis of Chromosome 22. Genome Research. 2003 Jan;13(1):27-36. — Dawson E et al., A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002 Aug 1;418:544-8. — Dunham I, et al., The DNA sequence of human chromosome 22. Nature. 1999 Dec 2;402:489-95. Chromosome X: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Ross MT et al., The DNA sequence and comparative analysis of human chromosome X. Nature 2005 Mar 17;434:325-37. vegaPseudoGene Vega Pseudogenes Vega Annotated Pseudogenes and Immunoglobulin Segments Genes and Gene Predictions Description and Methods This track shows pseudogene annotations from the Vertebrate Genome Annotation (Vega) database. The following information is excerpted from the Vertebrate Genome Annotation home page: "The Vega database is designed to be a central repository for high-quality, frequently updated manual annotation of different vertebrate finished genome sequence. Vega attempts to present consistent high-quality curation of the published chromosome sequences. Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases as well as a series of ab initio gene predictions (GENSCAN, Fgenes). The annotation is based on supporting evidence only." "In addition, comparative analysis using vertebrate datasets such as the Riken mouse cDNAs and Genoscope Tetraodon nigroviridis Ecores (Evolutionary Conserved Regions) are used for novel gene discovery." NOTE: VEGA annotations do not appear on every chromosome in this assembly. 
Display Conventions and Configuration This track follows the display conventions for gene prediction tracks using the following color scheme to indicate the status of the gene annotation: Known (Dark blue): genes that are identical to known human complementary DNA or protein sequences and have an entry in the species-specific model organism database. Novel (Dark blue): genes that have an open reading frame (ORF) and are identical or homologous to human cDNAs, human ESTs, or proteins in all species. Novel transcripts are genes that fit the criteria of novel genes with the exception that an unambiguous ORF cannot be assigned. Putative (Medium blue): genes whose sequences are identical or homologous to human ESTs but do not contain an ORF. Predicted (Light blue): genes based on ab initio prediction and for which at least one exon is supported by biological data (unspliced ESTs, protein sequence similarity with mouse or tetraodon genomes, or expression data from Rosetta). Unclassified (Gray). The details pages show only the Vega gene type and not the transcript type. A single gene can have more than one transcript which can belong to different classes, so the gene as a whole is classified according to the transcript with the "highest" level of classification. Transcript type (and other details) may be found by clicking on the transcript identifier which forms the outside link to the Vega transcript details page. Further information on the gene and transcript classification may be found here. Credits Thanks to Steve Searle at the Sanger Institute for providing the GTF and FASTA files for the Vega annotations. Vega gene annotations are generated by manual annotation from the following groups: Chromosome 6: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Mungall AJ et al., The DNA sequence and analysis of human chromosome 6. Nature. 2003 Oct 23;425:805-11. 
Chromosome 7: Hillier et al., The Genome Institute at Washington University Relevant publication: Hillier LW et al., The DNA sequence of human chromosome 7. Nature. 2003 Jul 10;424:157-64. Chromosome 9: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Humphray SJ et al., The DNA sequence and analysis of human chromosome 9. Nature. 2004 May 27;429:369-74. Chromosome 10: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 10. Nature. 2004 May 27;429:375-81. Chromosome 13: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Dunham A et al., The DNA sequence and analysis of human chromosome 13. Nature. 2004 Apr 1;428:522-8. Chromosome 14: Genoscope Relevant publication: Heilig R et al., The DNA sequence and analysis of human chromosome 14. Nature. 2003 Feb 6;421:601-7. Chromosome 20: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 20. Nature. 2001 Dec 20;414:865-71. Chromosome 22: Chromosome 22 Group, Wellcome Trust Sanger Institute Relevant publications: — Collins JE et al., Reevaluating Human Gene Annotation: A Second-Generation Analysis of Chromosome 22. Genome Research. 2003 Jan;13(1):27-36. — Dawson E et al., A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002 Aug 1;418:544-8. — Dunham I, et al., The DNA sequence of human chromosome 22. Nature. 1999 Dec 2;402:489-95. Chromosome X: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Ross MT et al., The DNA sequence and comparative analysis of human chromosome X. Nature 2005 Mar 17;434:325-37. 
encodeUViennaRnaz Vienna RNAz University of Vienna, RNA secondary structure predicted by RNAz ENCODE Regions and Genes Description This track displays regions containing putative functional RNA secondary structures as predicted by RNAz on the basis of thermodynamic stability and evolutionary conservation. Methods RNAz evaluates multiple sequence alignments for unusually stable and conserved RNA secondary structures, two typical characteristics of functional RNA structures that can be found in noncoding RNAs or cis-acting regulatory elements of mRNAs. The RNAz algorithm works as follows: First a consensus secondary structure is predicted using the RNAalifold approach (Hofacker et al., 2002), which is an extension of classical minimum free energy folding algorithms for aligned sequences. The significance of a predicted consensus structure is evaluated by calculating a structure conservation index, which is the ratio of unconstrained folding energies relative to the folding energies under the constraint that all aligned sequences are forced to fold into a common structure. Thermodynamic stability is evaluated by calculating a normalized z-score of the sequences in the alignment. The z-score indicates whether the given sequences are more stable than random sequences of the same length and base composition. Based on these two features, structure conservation index and z-score, an alignment is classified as structural RNA or "other" using a support vector machine classification algorithm (Washietl et al., 2005; Washietl et al., 2007). This track shows the result of an RNAz screen of 28-way TBA/MULTIZ alignments. Alignments were sliced into overlapping windows of 120 nt in size with a step size of 40 nt. Sequences with more than 25% gaps with respect to the human sequence were discarded. Only alignments with more than four sequences, a minimum size of 50 columns and at most 1% repeat masked letters were considered. 
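The window slicing and gap filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build the track: the function names are hypothetical, windows are taken over alignment columns, a shorter final window is kept if it still has at least 50 columns, and "gaps with respect to the human sequence" is read as the fraction of human non-gap columns in which the other sequence has a gap.

```python
from typing import Iterator, Tuple

WINDOW = 120       # window size in columns (value from the text)
STEP = 40          # step size in columns (value from the text)
MIN_COLUMNS = 50   # minimum window size (value from the text)
MAX_GAP_FRACTION = 0.25  # sequences above this are discarded (value from the text)


def slice_windows(n_columns: int, window: int = WINDOW, step: int = STEP,
                  min_columns: int = MIN_COLUMNS) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) column ranges of overlapping windows.

    Assumption: a truncated window at the end of the alignment is kept
    only if it still spans at least `min_columns` columns.
    """
    for start in range(0, n_columns, step):
        end = min(start + window, n_columns)
        if end - start >= min_columns:
            yield start, end
        if end == n_columns:
            break


def gap_fraction(seq: str, human: str) -> float:
    """Fraction of human non-gap columns where `seq` has a gap.

    Assumption: this is one plausible reading of "gaps with respect to
    the human sequence"; the original definition is not given in the text.
    """
    human_cols = [i for i, base in enumerate(human) if base != "-"]
    if not human_cols:
        return 0.0
    gaps = sum(1 for i in human_cols if seq[i] == "-")
    return gaps / len(human_cols)


def keep_sequence(seq: str, human: str) -> bool:
    """True if the aligned sequence passes the 25% gap filter."""
    return gap_fraction(seq, human) <= MAX_GAP_FRACTION
```

For a 200-column alignment, `list(slice_windows(200))` gives the three windows starting at columns 0, 40 and 80.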
RNAz can only handle alignments with up to six sequences. From alignments with more than six sequences we chose a subset of six. For subset selection, we used a greedy algorithm and iteratively selected sequences optimizing the set for a mean pairwise identity of around 80%. In cases of alignments with more than 10 sequences we sampled three different such subsets. The windows were finally scored with RNAz version 0.1.1 in the forward and reverse complement direction. Overlapping hits with at least one sampled alignment with RNAz score > 0.5 were combined to a single genomic region. The track shows regions with at least one window in the cluster with an average RNAz score of all samples > 0.5 and at least one hit with RNAz score > 0.9. More details may be found in Washietl et al., 2007. Credits The RNAz program and browser track were developed by Stefan Washietl, Ivo Hofacker (Institute for Theoretical Chemistry, Univ. of Vienna) and Peter F. Stadler (Bioinformatics group, Department of Computer Science, Univ. of Leipzig). References Hofacker IL, Fekete M, Stadler PF. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol. 2002 Jun 21;319(5):1059-66. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc. Natl. Acad. Sci. USA. 2005 Feb 15;102(7):2454-59. Washietl S, Pedersen JS, Korbel JO, Fried C, Gruber AR, Hackermuller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, et al. Structured RNAs in the ENCODE Selected Regions of the Human Genome. Genome Res. 2007 Jun;17(6):852-64. celeraDupPositive WSSD Duplication Sequence Identified as Duplicate by High-Depth Celera Reads Mapping and Sequencing Description High-depth sequence reads from the Celera project were used to detect paralogy in the human genome reference sequence. 
This track shows confirmed segmental duplications, defined as having similarity to sequences in the Segmental Duplication Database (SDD) of greater than 90% over more than 250 bp of repeatmasked sequence. For a description of the whole-genome shotgun sequence detection (WSSD) "fuguization" method, see Bailey, J.A. et al. (2001) in the References section below. Credits The data were provided by Xinwei She and Evan Eichler as part of their efforts to map human paralogy at the University of Washington. References Bailey, J.A., et al., Recent segmental duplications in the human genome. Science 297(5583), 1003-7 (2002). Bailey, J.A., et al., Segmental duplications: organization and impact within the current human genome project assembly, Genome Res. 11(6), 1005-17 (2001). She, X., et al., Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431(7011), 927-30 (2004). encodeYaleChipPval Yale ChIP pVal Yale ChIP-chip P-Value ENCODE Chromatin Immunoprecipitation Description This track shows the map of -log10(P-value) for ChIP-chip using DNA from immunoprecipitated chromatin from either human HeLa S3 (cervix epithelial adenocarcinoma), GM06990 (lymphoblastoid) or K562 (myeloid leukemia-derived) cells hybridized to maskless photolithographic arrays. The arrays consist of 50-mer oligonucleotides tiled with 12-nt overlaps covering most of the non-repetitive DNA sequence of the ENCODE regions. Chromatin immunoprecipitation was carried out for each experiment using antibodies against the following targets: BAF155, BAF170, INI1/BAF47, c-Fos, c-Jun, TAF1/TAFII250, RNA polymerase II, histone H4 tetra-acetylated lysine (H4Kac4), histone H3 tri-methylated lysine (H3K27me3), STAT1, nuclear factor kappa B (NFKB) p65, SMARCA4/BRG1, SMARCA6 and NRSF. 
Additionally, HeLa S3 cells immunoprecipitated with STAT1 were pre-treated with interferon-alpha and HeLa S3 cells immunoprecipitated with NFKB antibody were pre-treated with tumor necrosis factor-alpha (TNF-alpha) (see table below). This track shows the combined results of three or four biological replicates. For all arrays, the ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. These data are available at NCBI GEO (see table below for links), which also provides additional information about the experimental protocols. Target GEO Accession(s) Description BAF155 (H-76) GSE3549 (HeLa S3 cells) and GSE6898 (K562 cells) BAF155 (Brg1-Associated Factor, 155 kD) is a human homolog of yeast SWI3. The Swi-Snf chromatin-remodeling complex was first described in yeast, and similar proteins have been found in mammalian cells. The human Swi-Snf complex is comprised of at least nine polypeptides, including two ATPase subunits, Brm and Brg-1. Other members of the human Swi-Snf complex are termed BAFs for Brg1-associated factors. BAF155 is a conserved (core) component that stimulates the chromatin remodeling activity of Brg1. BAF170 (H-116) GSE3550 (HeLa S3 cells) and GSE6896 (K562 cells) BAF170 (Brg1-Associated Factor, 170 kD) is a human homolog of yeast SWI3, a protein important in chromatin remodeling. It is a conserved (core) component of the Swi-Snf complex that stimulates the chromatin remodeling activity of Brg1 (see the description for BAF155). INI1/BAF47 (H-300) GSE6897 (K562 cells) INI1 (Integrase Interactor 1) or BAF47 is a human homolog of yeast SNF5, a protein important in chromatin remodeling. c-Fos GSE3449 (HeLa S3 cells) c-Fos (transcription factor) is the cellular homolog of the v-fos viral oncogene. It is a member of the leucine zipper protein family and its transcriptional activity has been implicated in cell growth, differentiation, and development. Fos is induced by many stimuli, ranging from mitogens to pharmacological agents. 
c-Fos has been shown to be associated with another proto-oncogene, c-Jun, and together they bind to the AP-1 binding site to regulate gene transcription. Like CREB, c-Fos is regulated by p90Rsk. c-Jun GSE3448 (HeLa S3 cells) c-Jun (transcription factor), also known as AP-1 (activator protein 1), is the cellular homolog of the avian sarcoma virus oncogene v-jun, and as such can be referred to as a proto-oncogene. TAF1/TAFII250 GSE3450 (HeLa S3 cells) TAF1 (TATA box binding protein (TBP)-associated factor, with molecular weight 250 kD, also known as TAFII250) is involved in the initiation of transcription by RNA polymerase II. It has histone acetyltransferase activity, which can relieve the binding between DNA and histones in the nucleosome. It is the largest subunit of the basal transcription factor, TFIID. RNA polymerase II (N-20), N-terminus GSE6390 (HeLa S3 cells) and GSE6392 (GM06990 cells) RNA polymerase II (pol II) catalyzes transcription of DNA for the production of mRNAs and most snoRNAs. RNA polymerase II (8WG16), C-terminus GSE6391 (HeLa S3 cells) and GSE6394 (GM06990 cells) RNA polymerase II (pol II) catalyzes transcription of DNA for the production of mRNAs and most snoRNAs. This antibody targets the pre-initiation complex form recognizing the C-terminal heptapeptide repeat of the large subunit of pol II. The initiation-complex form of RNA polymerase II is associated with the transcription start site. H4Kac4 GSE6389 (HeLa S3 cells) and GSE6393 (GM06990 cells) H4Kac4 (Histone H4 tetra-acetylated lysine) is a post-translational modification of the histone which affects chromatin remodeling. Acetylated histone H4 is found in transcriptionally active euchromatin. H3K27me3 GSE8073 (HeLa S3 cells) H3K27me3 (Histone H3 tri-methylated lysine) is a post-translational modification of the histone which affects chromatin remodeling. It is known to be associated with heterochromatin. 
STAT1 p91 (C-24) GSE6892 (HeLa S3 cells, interferon-alpha stimulated) STAT1 (Signal Transducer and Activator of Transcription 1) responds to many cytokines and growth factors and regulates genes important for apoptosis, inflammation, and the immune system. NFKB p65, N-terminus GSE6900 (HeLa S3 cells, TNF-alpha stimulated) NFKB p65 (RelA) is the strongest transcriptional-activator among the five members of the mammalian NF-kB/Rel family and plays an essential role in regulating the induction of genes involved in several physiological processes, including immune and inflammatory responses. NFKB p65 (C-20), C-terminus GSE6899 (HeLa S3 cells, TNF-alpha stimulated) NFKB p65 (RelA) is the strongest transcriptional-activator among the five members of the mammalian NF-kB/Rel family and plays an essential role in regulating the induction of genes involved in several physiological processes, including immune and inflammatory responses. SMARCA4/BRG1 GSE7370 (HeLa S3 cells) SMARCA4 (BRG1) is a catalytic subunit of the SWI/SNF chromatin remodeling complex. It is a member of the SNF2 family of chromatin remodeling ATPases. SMARCA6 GSE7371 (HeLa S3 cells) SMARCA6 is a SNF2-like helicase linked to cell proliferation and DNA methylation. It is a member of the SNF2 family of chromatin remodeling ATPases. NRSF GSE7372 (HeLa S3 cells) NRSF (neuron-restrictive silencer factor) represses neuron-specific genes in non-neuronal cells. Display Conventions and Configuration The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. 
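The Methods section that follows relies on two small computations: a pseudomedian (the median of pairwise averages) of log2(Cy5/Cy3) ratios in a window, and a thresholding step that turns the -log10(P-value) map into binding sites. The Python sketch below illustrates both under stated assumptions and is not the Yale pipeline itself. The function names are hypothetical; the pseudomedian pairs each value with itself as well as with every other value (one common convention for the Hodges-Lehmann estimator); and "requiring 1000 bp space between two sites" is interpreted as merging extended sites that lie less than 1000 bp apart, with sites ranked by their best score.

```python
from itertools import combinations_with_replacement
from statistics import median
from typing import List, Sequence, Tuple


def pseudomedian(values: Sequence[float]) -> float:
    """Median of all pairwise averages, including each value with itself."""
    if not values:
        raise ValueError("pseudomedian of an empty window is undefined")
    return median((a + b) / 2 for a, b in combinations_with_replacement(values, 2))


def call_sites(positions: Sequence[int], scores: Sequence[float],
               threshold: float = 4.0, flank: int = 250,
               min_gap: int = 1000, top_n: int = 400) -> List[Tuple[int, int, float]]:
    """Threshold probe positions, extend them, merge close sites, keep the best.

    Defaults (4, 250 bp, 1000 bp, 400) are the values quoted in the text.
    Assumption: sites closer than `min_gap` are merged rather than dropped.
    Returns (start, end, best_score) tuples sorted by descending score.
    """
    hits = sorted((p, s) for p, s in zip(positions, scores) if s >= threshold)
    sites: List[List[float]] = []
    for pos, score in hits:
        start, end = max(0, pos - flank), pos + flank
        if sites and start - sites[-1][1] < min_gap:
            # Too close to the previous site: extend it instead of starting a new one.
            sites[-1][1] = max(sites[-1][1], end)
            sites[-1][2] = max(sites[-1][2], score)
        else:
            sites.append([start, end, score])
    sites.sort(key=lambda site: site[2], reverse=True)
    return [(int(s[0]), int(s[1]), s[2]) for s in sites[:top_n]]
```

The pseudomedian is less sensitive to a single outlying probe than the mean, which is presumably why it is used for the signal map; the merge rule in `call_sites` is the part most open to interpretation.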
Methods The data from replicates were quantile-normalized and median-scaled to each other (both Cy3 and Cy5 channels). Using a 1000 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window, including replicates. Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding oligonucleotide positions with -log10(P-value) (>= 4), extending qualified positions upstream and downstream 250 bp, and requiring 1000 bp space between two sites. Top 400 sites are retained. Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 2004 Feb 20;116(4):499-509. Euskirchen G, Royce TE, Bertone P, Martone R, Rinn JL, Nelson FK, Sayward F, Luscombe NM, Miller P, Gerstein M et al. 
CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 2004 May;24(9):3804-14. Martone R, Euskirchen G, Bertone P, Hartman S, Royce TE, Luscombe NM, Rinn JL, Nelson FK, Miller P, Gerstein M et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 2003 Oct 14;100(21):12247-52. Quackenbush J. Microarray data normalization and transformation Nat Genet. 2002 Dec;32(Suppl):496-501. encodeYaleChipPvalBaf47K562 YU BAF47 K562 P Yale ChIP-chip (BAF47 ab, K562 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalBaf170K562 YU BAF170 K562 P Yale ChIP-chip (BAF170 ab, K562 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalBaf155K562 YU BAF155 K562 P Yale ChIP-chip (BAF155 ab, K562 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalH4kac4Gm06990 YU H4Kac4 GM P Yale ChIP-chip (H4Kac4 ab, GM06990 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalPol2nGm06990 YU Pol2N GM P Yale ChIP-chip (Pol2 N-terminus ab, GM06990 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalPol2Gm06990 YU Pol2 GM P Yale ChIP-chip (Pol2 ab, GM06990 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalNrsfHela YU NRSF HeLa P Yale ChIP-chip (NRSF, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalSmarca6Hela YU SMARCA6 HeLa P Yale ChIP-chip (SMARCA6, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalSmarca4Hela YU SMARCA4 HeLa P Yale ChIP-chip (SMARCA4, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalP65cHelaTnfa YU P65-C HeLa TNF P Yale ChIP-chip (NFKB p65 C-terminus ab, HeLa S3 cells, TNF-alpha treated) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalP65nHelaTnfa YU P65-N HeLa TNF P Yale ChIP-chip (NFKB p65 N-terminus ab, HeLa S3 cells, TNF-alpha treated) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalStat1HelaIfna YU STAT1 
HeLa IF P Yale ChIP-chip (STAT1 ab, HeLa S3 cells, Interferon-alpha treated) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalH3k27me3Hela YU H3K27me3 HeLa P Yale ChIP-chip (H3K27me3 ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalH4kac4Hela YU H4Kac4 HeLa P Yale ChIP-chip (H4Kac4 ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalPol2nHela YU Pol2N HeLa P Yale ChIP-chip (Pol2 N-terminus ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalPol2Hela YU Pol2 HeLa P Yale ChIP-chip (Pol2 ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalTaf YU TAF1 HeLa P Yale ChIP-chip (TAF1 ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalJun YU c-Jun HeLa P Yale ChIP-chip (c-Jun ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalFos YU c-Fos HeLa P Yale ChIP-chip (c-Fos ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalBaf170 YU BAF170 HeLa P Yale ChIP-chip (BAF170 ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipPvalBaf155 YU BAF155 HeLa P Yale ChIP-chip (BAF155 ab, HeLa S3 cells) P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChipRfbr Yale ChIP RFBR Yale ChIP-chip Regulatory Factor Binding Regions Analysis ENCODE Chromatin Immunoprecipitation Description Regulatory Factor Binding Regions (RFBRs) were identified from ChIP-Chip experimental data; they are non-randomly distributed in the ENCODE regions with local enrichment and depletion. By mapping the full set of RFBRs onto the human genome sequence, we identified 689 genomic subregions with RFBR enrichment and 726 subregions with RFBR depletion (the RFBR clusters and deserts, respectively) in the ENCODE regions. Methods The data set analyzed in this study consists of 105 lists of transcriptional regulatory elements (TREs) in the ENCODE regions. 
It was released on December 13, 2005 by the Transcriptional Regulation Group. TRE lists made available after this data freeze were not included in this study. A total of 29 transcription factors (BAF155, BAF170, Brg1, CEBPe, CTCF, E2F1, E2F4, H3ac, H4ac, H3K27me3, H3K27me3, H3K4me1, H3K4me2, H3K4me3, H3K9K14me2, HisH4, c-Jun, c-Myc, P300, P63, Pol2, PU1, RARecA, SIRT1, Sp1, Sp3, STAT1, Suz12, and TAF1) were assayed by seven laboratories (Affymetrix, Sanger, Stanford, UCD, UCSD, UT, Yale) using ChIP-chip experiments on three different microarray platforms (Affymetrix tiling array, NimbleGen tiling array, and traditional PCR array) in nine cell lines (HL-60, HeLa, GM06990, K562, IMR90, HCT116, THP1, Jurkat, and fibroblasts) or at two different experimental time points (P0, before addition of gamma-interferon, and P30, 30 minutes after the addition of gamma-interferon). The raw data from these 105 ChIP-chip experiments was uniformly processed using a method based on the false discovery rate (Efron, 2004). Three sets of TRE lists were generated at 1%, 5%, and 10% false discovery rates respectively, and the list generated at the lowest (1%) false discovery rate was used in this study. The non-redundant factor-specific RFBR lists were mapped onto the ENCODE regions. Uninterrupted genomic regions that are covered by one or more RFBRs were identified as RFBR groups. Neighboring groups that are less than 1 kb apart were collected into RFBR clusters. Un-clustered groups that are covered by more than three RFBRs were promoted into clusters. Further details of the method may be found in Zhang et al. (2007). Credits The data set was made available by the Transcriptional Regulation Group of the ENCODE Project Consortium. The RFBR cluster and desert tracks were generated by Zhengdong Zhang from Mark Gerstein's group at Yale University. References Efron B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association. 
2004;99(465):96-104. Zhang ZD, Paccanaro A, Fu Y, Weissman S, Weng Z, Chang J, Snyder M, Gerstein M. Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Res. 2007 Jun;17(6):787-97. encodeYaleChipRfbrDeserts Yale RFBR Deserts Yale ChIP-chip Regulatory Factor Binding Regions (RFBR) Deserts ENCODE Chromatin Immunoprecipitation encodeYaleChipRfbrClusters Yale RFBR Clusters Yale ChIP-chip Regulatory Factor Binding Regions (RFBR) Clusters ENCODE Chromatin Immunoprecipitation encodeYaleChipSig Yale ChIP Signal Yale ChIP-chip Signal ENCODE Chromatin Immunoprecipitation Description This track shows the map of signal intensity (estimating the fold enrichment [log2 scale] of chromatin immunoprecipitated DNA vs. input DNA) for ChIP-chip using DNA from immunoprecipitated chromatin from either human HeLa S3 (cervix epithelial adenocarcinoma), GM06990 (lymphoblastoid) or K562 (myeloid leukemia-derived) cells hybridized to maskless photolithographic arrays. The arrays consist of 50-mer oligonucleotides tiled with 12-nt overlaps covering most of the non-repetitive DNA sequence of the ENCODE regions. Chromatin immunoprecipitation was carried out for each experiment using antibodies against the following targets: BAF155, BAF170, INI1/BAF47, c-Fos, c-Jun, TAF1/TAFII250, RNA polymerase II, histone H4 tetra-acetylated lysine (H4Kac4), histone H3 tri-methylated lysine (H3K27me3), STAT1, nuclear factor kappa B (NFKB) p65, SMARCA4/BRG1, SMARCA6 and NRSF. Additionally, HeLa S3 cells immunoprecipitated with STAT1 were pre-treated with interferon-alpha and HeLa S3 cells immunoprecipitated with NFKB antibody were pre-treated with tumor necrosis factor-alpha (TNF-alpha) (see table below). This track shows the combined results of three or four biological replicates. For all arrays, the ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. 
These data are available at NCBI GEO (see table below for links), which also provides additional information about the experimental protocols. Target GEO Accession(s) Description BAF155 (H-76) GSE3549 (HeLa S3 cells) and GSE6898 (K562 cells) BAF155 (Brg1-Associated Factor, 155 kD) is a human homolog of yeast SWI3. The Swi-Snf chromatin-remodeling complex was first described in yeast, and similar proteins have been found in mammalian cells. The human Swi-Snf complex is comprised of at least nine polypeptides, including two ATPase subunits, Brm and Brg-1. Other members of the human Swi-Snf complex are termed BAFs for Brg1-associated factors. BAF155 is a conserved (core) component that stimulates the chromatin remodeling activity of Brg1. BAF170 (H-116) GSE3550 (HeLa S3 cells) and GSE6896 (K562 cells) BAF170 (Brg1-Associated Factor, 170 kD) is a human homolog of yeast SWI3, a protein important in chromatin remodeling. It is a conserved (core) component of the Swi-Snf complex that stimulates the chromatin remodeling activity of Brg1 (see the description for BAF155). INI1/BAF47 (H-300) GSE6897 (K562 cells) INI1 (Integrase Interactor 1) or BAF47 is a human homolog of yeast SNF5, a protein important in chromatin remodeling. c-Fos GSE3449 (HeLa S3 cells) c-Fos (transcription factor) is the cellular homolog of the v-fos viral oncogene. It is a member of the leucine zipper protein family and its transcriptional activity has been implicated in cell growth, differentiation, and development. Fos is induced by many stimuli, ranging from mitogens to pharmacological agents. c-Fos has been shown to be associated with another proto-oncogene, c-Jun, and together they bind to the AP-1 binding site to regulate gene transcription. Like CREB, c-Fos is regulated by p90Rsk. 
c-Jun GSE3448 (HeLa S3 cells) c-Jun (transcription factor), also known as AP-1 (activator protein 1), is the cellular homolog of the avian sarcoma virus oncogene v-jun, and as such can be referred to as a proto-oncogene. TAF1/TAFII250 GSE3450 (HeLa S3 cells) TAF1 (TATA box binding protein (TBP)-associated factor, with molecular weight 250 kD, also known as TAFII250) is involved in the initiation of transcription by RNA polymerase II. It has histone acetyltransferase activity, which can relieve the binding between DNA and histones in the nucleosome. It is the largest subunit of the basal transcription factor, TFIID. RNA polymerase II (N-20), N-terminus GSE6390 (HeLa S3 cells) and GSE6392 (GM06990 cells) RNA polymerase II (pol II) catalyzes transcription of DNA for the production of mRNAs and most snoRNAs. RNA polymerase II (8WG16), C-terminus GSE6391 (HeLa S3 cells) and GSE6394 (GM06990 cells) RNA polymerase II (pol II) catalyzes transcription of DNA for the production of mRNAs and most snoRNAs. This antibody targets the pre-initiation complex form recognizing the C-terminal hexapeptide repeat of the large subunit of pol II. The initiation-complex form of RNA polymerase II is associated with the transcription start site. H4Kac4 GSE6389 (HeLa S3 cells) and GSE6393 (GM06990 cells) H4Kac4 (Histone H4 tetra-acetylated lysine) is a post-translational modification of the histone which affects chromatin remodeling. Histone H4 is found in transcriptionally active euchromatin. H3K27me3 GSE8073 (HeLa S3 cells) H3K27me3 (Histone H3 tri-methylated lysine) is a post-translational modification of the histone which affects chromatin remodeling. It is known to be associated with heterochromatin. STAT1 p91 (C-24) GSE6892 (HeLa S3 cells, interferon-alpha stimulated) STAT1 (Signal Transducer and Activator of Transcription 1) responds to many cytokines and growth factors and regulates genes important for apoptosis, inflammation, and the immune system. 
NFKB p65, N-terminus GSE6900 (HeLa S3 cells, TNF-alpha stimulated) NFKB p65 (RelA) is the strongest transcriptional-activator among the five members of the mammalian NF-kB/Rel family and plays an essential role in regulating the induction of genes involved in several physiological processes, including immune and inflammatory responses. NFKB p65 (C-20), C-terminus GSE6899 (HeLa S3 cells, TNF-alpha stimulated) NFKB p65 (RelA) is the strongest transcriptional-activator among the five members of the mammalian NF-kB/Rel family and plays an essential role in regulating the induction of genes involved in several physiological processes, including immune and inflammatory responses. SMARCA4/BRG1 GSE7370 (HeLa S3 cells) SMARCA4 (BRG1) is a catalytic subunit of the SWI/SNF chromatin remodeling complex. It is a member of the SNF2 family of chromatin remodeling ATPases. SMARCA6 GSE7371 (HeLa S3 cells) SMARCA6 is a SNF2-like helicase linked to cell proliferation and DNA methylation. It is a member of the SNF2 family of chromatin remodeling ATPases. NRSF GSE7372 (HeLa S3 cells) NRSF (neuron-restrictive silencer factor) represses neuron-specific genes in non-neuronal cells. Display Conventions and Configuration The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods The data from replicates were quantile-normalized and median-scaled to each other (both Cy3 and Cy5 channels). 
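The normalization step can be illustrated with a minimal Python sketch. It is not the Yale pipeline: the function names, the tie handling (ties are broken by probe order) and the default scaling target (the median of the per-array medians) are assumptions made for illustration only.

```python
from statistics import median


def quantile_normalize(arrays):
    """Force every array onto a shared distribution (mean of sorted values per rank).

    `arrays` is a list of equal-length lists, one per replicate/channel.
    Assumption: ties are broken by probe order rather than averaged.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def median_scale(arrays, target=None):
    """Rescale each array so its median equals `target`.

    Assumption: when no target is given, the median of the array medians is used.
    """
    medians = [median(a) for a in arrays]
    if any(m == 0 for m in medians):
        raise ValueError("cannot scale an array whose median is zero")
    if target is None:
        target = median(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]
```

In this sketch each probe keeps its rank within its own array, while the values themselves are replaced by the shared reference distribution; median scaling then only changes the overall level of each array.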
Using a 1000 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window, including replicates. Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding oligonucleotide positions at -log10(P-value) >= 4, extending qualified positions upstream and downstream 250 bp, and requiring 1000 bp of space between two sites. The top 400 sites are retained. Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 2004 Feb 20;116(4):499-509. Euskirchen G, Royce TE, Bertone P, Martone R, Rinn JL, Nelson FK, Sayward F, Luscombe NM, Miller P, Gerstein M et al. CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 2004 May;24(9):3804-14.
Martone R, Euskirchen G, Bertone P, Hartman S, Royce TE, Luscombe NM, Rinn JL, Nelson FK, Miller P, Gerstein M et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 2003 Oct 14;100(21):12247-52. Quackenbush J. Microarray data normalization and transformation. Nat Genet. 2002 Dec;32(Suppl):496-501. encodeYaleChipSignalBaf47K562 YU BAF47 K562 S Yale ChIP-chip (BAF47 ab, K562 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalBaf170K562 YU BAF170 K562 S Yale ChIP-chip (BAF170 ab, K562 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalBaf155K562 YU BAF155 K562 S Yale ChIP-chip (BAF155 ab, K562 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalH4kac4Gm06990 YU H4Kac4 GM S Yale ChIP-chip (H4Kac4 ab, GM06990 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalPol2nGm06990 YU Pol2N GM S Yale ChIP-chip (Pol2 N-terminus ab, GM06990 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalPol2Gm06990 YU Pol2 GM S Yale ChIP-chip (Pol2 ab, GM06990 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalNrsfHela YU NRSF HeLa S Yale ChIP-chip (NRSF, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalSmarca6Hela YU SMARCA6 HeLa S Yale ChIP-chip (SMARCA6, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalSmarca4Hela YU SMARCA4 HeLa S Yale ChIP-chip (SMARCA4, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalP65cHelaTnfa YU P65-C HeLa TNF S Yale ChIP-chip (NFKB p65 C-terminus ab, HeLa S3 cells, TNF-alpha treated) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalP65nHelaTnfa YU P65-N HeLa TNF S Yale ChIP-chip (NFKB p65 N-terminus ab, HeLa S3 cells, TNF-alpha treated) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalStat1HelaIfna YU STAT1 HeLa IF S Yale ChIP-chip (STAT1 ab, HeLa S3 cells, Interferon-alpha treated)
Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalH3k27me3Hela YU H3K27me3 HeLa S Yale ChIP-chip (H3K27me3 ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalH4kac4Hela YU H4Kac4 HeLa S Yale ChIP-chip (H4Kac4 ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalPol2nHela YU Pol2N HeLa S Yale ChIP-chip (Pol2 N-terminus ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalPol2Hela YU Pol2 HeLa S Yale ChIP-chip (Pol2 ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalTaf YU TAF1 HeLa S Yale ChIP-chip (TAF1 ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalJun YU c-Jun HeLa S Yale ChIP-chip (c-Jun ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalFos YU c-Fos HeLa S Yale ChIP-chip (c-Fos ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalBaf170 YU BAF170 HeLa S Yale ChIP-chip (BAF170 ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSignalBaf155 YU BAF155 HeLa S Yale ChIP-chip (BAF155 ab, HeLa S3 cells) Signal ENCODE Chromatin Immunoprecipitation encodeYaleChipSites Yale ChIP Sites Yale ChIP-chip Sites ENCODE Chromatin Immunoprecipitation Description This track shows the map of -log10(P-value) of binding sites (as determined in the Methods below) for ChIP-chip using DNA from immunoprecipitated chromatin from either human HeLa S3 (cervix epithelial adenocarcinoma), GM06990 (lymphoblastoid) or K562 (myeloid leukemia-derived) cells hybridized to maskless photolithographic arrays. The arrays consist of 50-mer oligonucleotides tiled with 12-nt overlaps covering most of the non-repetitive DNA sequence of the ENCODE regions.
Chromatin immunoprecipitation was carried out for each experiment using antibodies against the following targets: BAF155, BAF170, INI1/BAF47, c-Fos, c-Jun, TAF1/TAFII250, RNA polymerase II, histone H4 tetra-acetylated lysine (H4Kac4), histone H3 tri-methylated lysine 27 (H3K27me3), STAT1, nuclear factor kappa B (NFKB) p65, SMARCA4/BRG1, SMARCA6 and NRSF. Additionally, HeLa S3 cells immunoprecipitated with STAT1 antibody were pre-treated with interferon-alpha and HeLa S3 cells immunoprecipitated with NFKB antibody were pre-treated with tumor necrosis factor-alpha (TNF-alpha) (see table below). This track shows the combined results of three or four biological replicates. For all arrays, the ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. These data are available at NCBI GEO (see table below for links), which also provides additional information about the experimental protocols. Target GEO Accession(s) Description BAF155 (H-76) GSE3549 (HeLa S3 cells) and GSE6898 (K562 cells) BAF155 (Brg1-Associated Factor, 155 kD) is a human homolog of yeast SWI3. The Swi-Snf chromatin-remodeling complex was first described in yeast, and similar proteins have been found in mammalian cells. The human Swi-Snf complex is comprised of at least nine polypeptides, including two ATPase subunits, Brm and Brg-1. Other members of the human Swi-Snf complex are termed BAFs for Brg1-associated factors. BAF155 is a conserved (core) component that stimulates the chromatin remodeling activity of Brg1. BAF170 (H-116) GSE3550 (HeLa S3 cells) and GSE6896 (K562 cells) BAF170 (Brg1-Associated Factor, 170 kD) is a human homolog of yeast SWI3, a protein important in chromatin remodeling. It is a conserved (core) component of the Swi-Snf complex that stimulates the chromatin remodeling activity of Brg1 (see the description for BAF155).
INI1/BAF47 (H-300) GSE6897 (K562 cells) INI1 (Integrase Interactor 1) or BAF47 is a human homolog of yeast SNF5, a protein important in chromatin remodeling. c-Fos GSE3449 (HeLa S3 cells) c-Fos (transcription factor) is the cellular homolog of the v-fos viral oncogene. It is a member of the leucine zipper protein family and its transcriptional activity has been implicated in cell growth, differentiation, and development. Fos is induced by many stimuli, ranging from mitogens to pharmacological agents. c-Fos has been shown to be associated with another proto-oncogene, c-Jun, and together they bind to the AP-1 binding site to regulate gene transcription. Like CREB, c-Fos is regulated by p90Rsk. c-Jun GSE3448 (HeLa S3 cells) c-Jun (transcription factor), also known as AP-1 (activator protein 1), is the cellular homolog of the avian sarcoma virus oncogene v-jun, and as such can be referred to as a proto-oncogene. TAF1/TAFII250 GSE3450 (HeLa S3 cells) TAF1 (TATA box binding protein (TBP)-associated factor, with molecular weight 250 kD, also known as TAFII250) is involved in the initiation of transcription by RNA polymerase II. It has histone acetyltransferase activity, which can relieve the binding between DNA and histones in the nucleosome. It is the largest subunit of the basal transcription factor, TFIID. RNA polymerase II (N-20), N-terminus GSE6390 (HeLa S3 cells) and GSE6392 (GM06990 cells) RNA polymerase II (pol II) catalyzes transcription of DNA for the production of mRNAs and most snoRNAs. RNA polymerase II (8WG16), C-terminus GSE6391 (HeLa S3 cells) and GSE6394 (GM06990 cells) RNA polymerase II (pol II) catalyzes transcription of DNA for the production of mRNAs and most snoRNAs. This antibody targets the pre-initiation complex form recognizing the C-terminal hexapeptide repeat of the large subunit of pol II. The initiation-complex form of RNA polymerase II is associated with the transcription start site. 
H4Kac4 GSE6389 (HeLa S3 cells) and GSE6393 (GM06990 cells) H4Kac4 (Histone H4 tetra-acetylated lysine) is a post-translational modification of the histone which affects chromatin remodeling. Histone H4 is found in transcriptionally active euchromatin. H3K27me3 GSE8073 (HeLa S3 cells) H3K27me3 (Histone H3 tri-methylated lysine) is a post-translational modification of the histone which affects chromatin remodeling. It is known to be associated with heterochromatin. STAT1 p91 (C-24) GSE6892 (HeLa S3 cells, interferon-alpha stimulated) STAT1 (Signal Transducer and Activator of Transcription 1) responds to many cytokines and growth factors and regulates genes important for apoptosis, inflammation, and the immune system. NFKB p65, N-terminus GSE6900 (HeLa S3 cells, TNF-alpha stimulated) NFKB p65 (RelA) is the strongest transcriptional-activator among the five members of the mammalian NF-kB/Rel family and plays an essential role in regulating the induction of genes involved in several physiological processes, including immune and inflammatory responses. NFKB p65 (C-20), C-terminus GSE6899 (HeLa S3 cells, TNF-alpha stimulated) NFKB p65 (RelA) is the strongest transcriptional-activator among the five members of the mammalian NF-kB/Rel family and plays an essential role in regulating the induction of genes involved in several physiological processes, including immune and inflammatory responses. SMARCA4/BRG1 GSE7370 (HeLa S3 cells) SMARCA4 (BRG1) is a catalytic subunit of the SWI/SNF chromatin remodeling complex. It is a member of the SNF2 family of chromatin remodeling ATPases. SMARCA6 GSE7371 (HeLa S3 cells) SMARCA6 is a SNF2-like helicase linked to cell proliferation and DNA methylation. It is a member of the SNF2 family of chromatin remodeling ATPases. NRSF GSE7372 (HeLa S3 cells) NRSF (neuron-restrictive silencer factor) represses neuron-specific genes in non-neuronal cells. 
Display Conventions and Configuration The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. Data may be thresholded by score and/or the user can specify the display of only the top N scoring items (default is 200) for all the subtracks. The score for each item is indicated in grayscale, with darker shades corresponding to higher scores. The details page for an item (displayed after clicking on an item in the track) shows the top 20 highest scoring items displayed in the current window. Methods The data from replicates were quantile-normalized and median-scaled to each other (both Cy3 and Cy5 channels). Using a 1000 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window, including replicates. Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding oligonucleotide positions at -log10(P-value) >= 4, extending qualified positions upstream and downstream 250 bp, and requiring 1000 bp of space between two sites. The top 400 sites are retained for the ENCODE Oct 2005 Freeze experiments; for the other datasets, sites found at false discovery rates (FDR) of 1%, 5% and 10% are displayed. Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted.
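As a rough illustration of the site-calling rule in the Methods and of the hit-list comparison just described, the following Python sketch calls sites from a -log10(P-value) map and measures how many sites from one replicate combination are recovered in another. Several details are assumptions, not statements about the original analysis: candidates closer together than the required spacing are merged into one site (the source does not say whether they are merged or discarded), overlap is measured as the fraction of sites in the first list that touch any site in the second, and all function and parameter names are hypothetical.

```python
def call_sites(positions, scores, threshold=4.0, extend=250, min_space=1000):
    """Return (start, end) sites from sorted probe positions and -log10(P) scores.

    Assumption: extended positions separated by less than `min_space` bp
    are merged into a single site.
    """
    sites = []
    for pos, score in zip(positions, scores):
        if score < threshold:
            continue  # probe does not pass the significance threshold
        start, end = max(0, pos - extend), pos + extend
        if sites and start - sites[-1][1] < min_space:
            # Too close to the previous site: grow that site instead.
            sites[-1] = (sites[-1][0], max(sites[-1][1], end))
        else:
            sites.append((start, end))
    return sites


def overlap_fraction(sites_a, sites_b):
    """Fraction of sites in `sites_a` overlapping at least one site in `sites_b`.

    Assumption: any shared base counts as an overlap.
    """
    if not sites_a:
        return 0.0
    hits = sum(
        1
        for a_start, a_end in sites_a
        if any(a_start < b_end and b_start < a_end for b_start, b_end in sites_b)
    )
    return hits / len(sites_a)
```

Under these assumptions, two hit lists would pass the check described above when `overlap_fraction` returns a value greater than 0.5.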
As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way. Sites for data from Nov. 2006, Jan. 2007, Apr. 2007 and Jun. 2007 were determined with false discovery rates (FDR) of 1%, 5% and 10%. The lowest FDR which includes each "Site" is displayed on that site's details page. For the ENCODE Oct 2005 Freeze data (BAF155, BAF170, Fos, Jun and TAF1 in HeLa S3 cells), the top 400 sites are shown. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 2004 Feb 20;116(4):499-509. Euskirchen G, Royce TE, Bertone P, Martone R, Rinn JL, Nelson FK, Sayward F, Luscombe NM, Miller P, Gerstein M et al. CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 2004 May;24(9):3804-14. Martone R, Euskirchen G, Bertone P, Hartman S, Royce TE, Luscombe NM, Rinn JL, Nelson FK, Miller P, Gerstein M et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 2003 Oct 14;100(21):12247-52. Quackenbush J. Microarray data normalization and transformation. Nat Genet. 2002 Dec;32(Suppl):496-501.
encodeYaleChipSitesBaf47K562 YU BAF47 K562 Yale ChIP-chip (BAF47 ab, K562 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesBaf170K562 YU BAF170 K562 Yale ChIP-chip (BAF170 ab, K562 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesBaf155K562 YU BAF155 K562 Yale ChIP-chip (BAF155 ab, K562 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesH4kac4Gm06990 YU H4Kac4 GM Yale ChIP-chip (H4Kac4 ab, GM06990 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesPol2nGm06990 YU Pol2N GM Yale ChIP-chip (Pol2 N-terminus ab, GM06990 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesPol2Gm06990 YU Pol2 GM Yale ChIP-chip (Pol2 ab, GM06990 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesNrsfHela YU NRSF HeLa Yale ChIP-chip (NRSF, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesSmarca6Hela YU SMARCA6 HeLa Yale ChIP-chip (SMARCA6, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesSmarca4Hela YU SMARCA4 HeLa Yale ChIP-chip (SMARCA4, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesP65cHelaTnfa YU P65-C HeLa TNF Yale ChIP-chip (NFKB p65 C-terminus ab, HeLa S3 cells, TNF-alpha treated) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesP65nHelaTnfa YU P65-N HeLa TNF Yale ChIP-chip (NFKB p65 N-terminus ab, HeLa S3 cells, TNF-alpha treated) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesStat1HelaIfna YU STAT1 HeLa IF Yale ChIP-chip (STAT1 ab, HeLa S3 cells, Interferon-alpha treated) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesH3k27me3Hela YU H3K27me3 HeLa Yale ChIP-chip (H3K27me3 ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesH4kac4Hela YU H4Kac4 HeLa Yale ChIP-chip (H4Kac4 ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesPol2nHela YU Pol2N HeLa Yale ChIP-chip (Pol2 N-terminus ab, HeLa S3 
cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesPol2Hela YU Pol2 HeLa Yale ChIP-chip (Pol2 ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesTaf YU TAF1 HeLa Yale ChIP-chip (TAF1 ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesJun YU c-Jun HeLa Yale ChIP-chip (c-Jun ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesFos YU c-Fos HeLa Yale ChIP-chip (c-Fos ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesBaf170 YU BAF170 HeLa Yale ChIP-chip (BAF170 ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation encodeYaleChipSitesBaf155 YU BAF155 HeLa Yale ChIP-chip (BAF155 ab, HeLa S3 cells) Sites ENCODE Chromatin Immunoprecipitation pseudoYale Yale Pseudo Yale Pseudogenes. Genes and Gene Predictions Description This track shows identified pseudogenes as recorded in the Yale Pseudogene Database. For information on how these pseudogenes were identified and access to the database, see http://www.pseudogene.org. encodeYaleAffyRNATransMap Yale RNA Yale RNA Transcript Map (Neutrophil, Placenta and NB4 cells) ENCODE Transcript Levels Description This track shows the transcript map of signal intensity (estimating RNA abundance) for the following, hybridized to the Affymetrix ENCODE oligonucleotide microarray: human neutrophil (PMN) total RNA (10 biological samples from different individuals) human placental Poly(A)+ RNA (3 biological replicates) total RNA from human NB4 cells (4 biological replicates), each sample divided into three parts and treated as follows: untreated, treated with retinoic acid (RA), and treated with 12-O-tetradecanoylphorbol-13 acetate (TPA) (three out of the four original samples). Total RNA was extracted from each treated sample and applied to arrays in duplicate (2 technical replicates). 
poly(A)+ and Total RNA for HeLa S3 (3 biological replicates for each) The human NB4 cell can be made to differentiate towards either monocytes (by treatment with TPA) or neutrophils (by treatment with RA). See Kluger et al., 2004 in the References section for more details about the differentiation of hematopoietic cells. This array has 25-mer oligonucleotide probes tiled approximately every 22 bp, covering all the non-repetitive DNA sequence of the ENCODE regions. The transcript map is a combined signal for both strands of DNA. This is derived from the number of different biological samples indicated above, each with at least two technical replicates. See the following NCBI Gene Expression Omnibus (GEO) accessions for details of experimental protocols: ENCODE Transcript Mapping for Human Neutrophil (PMN) Total RNA: GSE2678 ENCODE Transcript Mapping for Human Placental Poly(A)+ RNA: GSE2671 ENCODE Transcript Mapping for Total RNA from Human NB4 Cells untreated, treated with RA, and treated with TPA: GSE2679 Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different data samples. Methods The data from biological & technical replicates were quantile-normalized to each other and then median scaled to 25. 
Using a 101 bp sliding window centered on each oligonucleotide probe, a signal map estimating RNA abundance was generated by computing the pseudomedian signal of all PM-MM pairs (median of pairwise PM-MM averages) within the window, including replicates. Verification Independent biological replicates (as indicated above) were generated, and each was hybridized to at least two different arrays (technical replicates). Transcribed regions were then identified using a signal threshold at the 90th percentile of signal intensities, as well as a maximum gap of 50 bp and a minimum run of 50 bp (between oligonucleotide positions). Transcribed regions, as determined by individual biological samples, were compared to ensure significant overlap. Credits These data were generated and analyzed by the Yale/Affymetrix collaboration between the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University and Tom Gingeras at Affymetrix. References Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004 Dec 24;306(5705):2242-6. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005 May 20;308(5725):1149-54. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002 May 3;296(5569):916-9. Kluger Y, Tuck DP, Chang JT, Nakayama Y, Poddar R, Kohya N, Lian Z, Ben Nasr A, Halaban HR, Krause DS et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A. 2004 April 27;101(17):6508-13. Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P, Gerstein M et al. The transcriptional activity of human Chromosome 22. Genes Dev.
2003 Feb 15;17(4):529-40. encodeYaleAffyNB4UntrRNATransMap Yale RNA NB4 Un Yale NB4 RNA Transcript Map, Untreated ENCODE Transcript Levels encodeYaleAffyNB4TPARNATransMap Yale RNA NB4 TPA Yale NB4 RNA Transcript Map, Treated with 12-O-tetradecanoylphorbol-13 Acetate (TPA) ENCODE Transcript Levels encodeYaleAffyNB4RARNATransMap Yale RNA NB4 RA Yale NB4 RNA Transcript Map, Treated with Retinoic Acid ENCODE Transcript Levels encodeYaleAffyPlacRNATransMap Yale RNA Plcnta Yale Placenta RNA Transcript Map ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap Yale RNA Neutro Yale Neutrophil RNA Transcript Map ENCODE Transcript Levels encodeYaleChIPSTAT1Pval Yale STAT1 pVal Yale ChIP-chip (STAT1 ab, HeLa cells) P-Value ENCODE Chromatin Immunoprecipitation Description This track shows probable sites of STAT1 binding in HeLa cells as determined by chromatin immunoprecipitation followed by microarray analysis. STAT1 (Signal Transducer and Activator of Transcription) is a transcription factor that moves to the nucleus and binds DNA only in response to a cytokine signal such as interferon-gamma. HeLa cells are a common cell line derived from a cervical cancer. Each of the four subtracks represents a different microarray platform. The track as a whole can be used to compare results across microarray platforms. The first three platforms are custom maskless photolithographic arrays with oligonucleotides tiling most of the non-repetitive DNA sequence of the ENCODE regions: Maskless design #1: 50-mer oligonucleotides tiled every 38 bps (overlapping by 12 nts) Maskless design #2: 36-mer oligonucleotides tiled end to end Maskless design #3: 50-mer oligonucleotides tiled end to end The fourth array platform is an ENCODE PCR Amplicon array manufactured by Bing Ren's lab at UCSD. The subtracks show the ratio of immunoprecipitated DNA from cytokine-stimulated cells vs. unstimulated cells in each of the four platforms. The ratio is calculated as -log10(p-value) in a 501-base window. 
The data shown are the combined result of multiple biological replicates: five for the first maskless array (50-mer every 38 bp), two for the second maskless array (36-mer every 36 bp), three for the third maskless array (50-mer every 50 bp) and six for the PCR Amplicon array. These data are available at NCBI GEO as GSE2714, which also provides additional information about the experimental protocols. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods For all arrays, the STAT1 ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. Maskless photolithographic arrays The data from replicates were median-scaled and quantile-normalized to each other. After normalization, replicates were condensed to a single value. Using a 501 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window (including replicates). Using the same procedure, a -log10(p-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array).
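The per-window significance score can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used to build the track: the test is taken to be one-sided (Cy5 greater than Cy3), zero differences are dropped, tied absolute differences receive midranks, and the P-value comes from the exact sign-permutation distribution of the rank sum. The function name is hypothetical.

```python
import math


def minus_log10_signed_rank_p(cy5, cy3):
    """-log10 of a one-sided exact Wilcoxon signed-rank P-value (Cy5 > Cy3).

    `cy5` and `cy3` are paired intensities for the probes in one window.
    Assumptions: zero differences are dropped; ties get midranks.
    """
    diffs = [a - b for a, b in zip(cy5, cy3) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0  # no information: P-value of 1
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Doubled midranks, so that tied (half-integer) ranks stay integers.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # counts[s]: number of sign assignments whose positive doubled-rank sum is s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    tail = sum(counts[w_plus2:])  # assignments at least as extreme as observed
    # P = tail / 2**n, computed in log space to avoid underflow for large n.
    return n * math.log10(2) - math.log10(tail)
```

Working in log space keeps the score finite even for windows with many probe pairs, where the P-value itself would underflow a floating-point number.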
A binding site is determined by thresholding both on fold enrichment and -log10(p-value) and requiring a maximum gap and a minimum run between oligonucleotide positions. For the first maskless array (50-mer every 38 bp): log2(Cy5/Cy3) >= 1.25, -log10(p-value) >= 8.0, MaxGap <= 100 bp, MinRun >= 180 bp For the second maskless array (36-mer every 36 bp): log2(Cy5/Cy3) >= 0.25, -log10(p-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp For the third maskless array (50-mer every 50 bp): log2(Cy5/Cy3) >= 0.25, -log10(p-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp PCR Amplicon Arrays The Cy5 and Cy3 array data were loess-normalized between channels on the same slide and then between slides. A z-score was then determined for each PCR amplicon from the distribution of log(Cy5/Cy3) in a local log(Cy5*Cy3) intensity window (see Quackenbush, 2002 and the ExpressYourself website for more details). From the z-score, a P-value was then associated with each PCR amplicon. Hits were determined using a 3 sigma threshold and requiring a spot to be present on three out of six arrays. Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. The PCR Amplicon arrays were manufactured by Bing Ren's lab at UCSD. References Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J. et al.
Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P. et al. CREB binds to multiple loci on human chromosome 22, Mol Cell Biol. 24(9), 3804-14 (2004). Luscombe, N.M., Royce, T.E., Bertone, P., Echols, N., Horak, C.E., Chang, J.T., Snyder, M. and Gerstein, M. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 31(13), 3477-82 (2003). Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P. et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 100(21), 12247-52 (2003). Quackenbush, J.. Microarray data normalization and transformation, Nat Genet. 32(Suppl), 496-501 (2002). encodeYaleChIPSTAT1HeLaBingRenPval Yale LI PVal Yale ChIP-chip (STAT1 ab, HeLa cells) LI/UCSD PCR Amplicon, P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer50bpPval Yale 50-50 PVal Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 50-mer, 50bp Win, P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer38bpPval Yale 50-38 PVal Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 50-mer, 38bp Win, P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess36mer36bpPval Yale 36-36 PVal Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 36-mer, 36bp Win, P-Value ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1Sig Yale STAT1 Sig Yale ChIP-chip (STAT1 ab, HeLa cells) Signal ENCODE Chromatin Immunoprecipitation Description Each of these four tracks shows the map of signal intensity (estimating the fold enrichment [log2 scale] of ChIP DNA vs unstimulated DNA) for STAT1 ChIP-chip using Human Hela S3 cells hybridized to four different array 
designs/platforms. The first three platforms are custom maskless photolithographic arrays with oligonucleotides tiling most of the non-repetitive DNA sequence of the ENCODE regions: Maskless design #1: 50-mer oligonucleotides tiled every 38 bps (overlapping by 12 nts) Maskless design #2: 36-mer oligonucleotides tiled end to end Maskless design #3: 50-mer oligonucleotides tiled end to end The fourth array platform is an ENCODE PCR Amplicon array manufactured by Bing Ren's lab at UCSD. Each track shows the combined results of multiple biological replicates: five for the first maskless array (50-mer every 38 bp), two for the second maskless array (36-mer every 36 bp), three for the third maskless array (50-mer every 50 bp) and six for the PCR Amplicon array. For all arrays, the STAT1 ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. These data are available at NCBI GEO as GSE2714, which also provides additional information about the experimental protocols. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Maskless photolithographic arrays The data from replicates were median-scaled and quantile-normalized to each other (both Cy3 and Cy5 channels). Using a 501 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window, including replicates. 
Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) was made for all sliding windows by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding both on fold enrichment and -log10(P-value) and requiring a maximum gap and a minimum run between oligonucleotide positions.
For the first maskless array (50-mer every 38 bp):
    log2(Cy5/Cy3) >= 1.25, -log10(P-value) >= 8.0, MaxGap <= 100 bp, MinRun >= 180 bp
For the second maskless array (36-mer every 36 bp):
    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp
For the third maskless array (50-mer every 50 bp):
    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp

PCR Amplicon Arrays
The Cy5 and Cy3 array data were loess-normalized between channels on the same slide and then between slides. A z-score was then determined for each PCR amplicon from the distribution of log(Cy5/Cy3) in a local log(Cy5*Cy3) intensity window (see Quackenbush, 2002 and the ExpressYourself website for more details). From the z-score, a P-value was then associated with each PCR amplicon. Hits were determined using a 3 sigma threshold and requiring a spot to be present on three out of six arrays.

Verification
ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way.
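The site-calling rule for the maskless arrays (two thresholds plus MaxGap and MinRun) can be illustrated with a short sketch. The interpretation used here is an assumption, since the description does not spell it out: MaxGap is taken as the largest allowed distance between consecutive passing probe positions, and MinRun as the smallest allowed span from the first to the last passing probe of a run. The function name and input layout are hypothetical.

```python
def call_sites(probes, min_log2, min_neglog10p, max_gap, min_run):
    """Merge consecutive passing probes into candidate binding sites.

    probes: position-sorted list of (position, log2_ratio, neglog10_p)
            taken from the signal and P-value maps (hypothetical layout).
    Returns a list of (start, end) probe positions, one pair per site.
    """
    sites = []
    run_start = run_end = None
    for pos, ratio, score in probes:
        if ratio < min_log2 or score < min_neglog10p:
            continue  # probe fails one of the two thresholds
        if run_start is not None and pos - run_end <= max_gap:
            run_end = pos  # extend the current run
        else:
            # close the previous run if it is long enough, then start a new one
            if run_start is not None and run_end - run_start >= min_run:
                sites.append((run_start, run_end))
            run_start = run_end = pos
    if run_start is not None and run_end - run_start >= min_run:
        sites.append((run_start, run_end))
    return sites


# Thresholds quoted above for the first maskless array (50-mer every 38 bp)
ARRAY1 = dict(min_log2=1.25, min_neglog10p=8.0, max_gap=100, min_run=180)
# Thresholds quoted above for the second and third maskless arrays
ARRAY2_3 = dict(min_log2=0.25, min_neglog10p=4.0, max_gap=250, min_run=0)
```

With the same probe values, the stricter first-array settings split or drop runs that the looser settings for the second and third arrays would report as a single site.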
Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. The PCR Amplicon arrays were manufactured by Bing Ren's lab at UCSD. References Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P. et al. CREB binds to multiple loci on human chromosome 22, Mol Cell Biol. 24(9), 3804-14 (2004). Luscombe, N.M., Royce, T.E., Bertone, P., Echols, N., Horak, C.E., Chang, J.T., Snyder, M. and Gerstein, M. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 31(13), 3477-82 (2003). Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P. et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 100(21), 12247-52 (2003). Quackenbush, J.. Microarray data normalization and transformation, Nat Genet. 32(Suppl), 496-501 (2002). 
encodeYaleChIPSTAT1HeLaBingRenSig Yale LI Sig Yale ChIP-chip (STAT1 ab, HeLa cells) LI/UCSD PCR Amplicon, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer50bpSig Yale 50-50 Sig Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 50-mer, 50bp Win, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer38bpSig Yale 50-38 Sig Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 50-mer, 38bp Win, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess36mer36bpSig Yale 36-36 Sig Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 36-mer, 36bp Win, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1Sites Yale STAT1 Sites Yale ChIP-chip (STAT1 ab, HeLa cells) Binding Sites ENCODE Chromatin Immunoprecipitation Description Each of these four tracks shows the binding sites for STAT1 ChIP-chip using Human Hela S3 cells hybridized to four different array designs/platforms. The first three platforms are custom maskless photolithographic arrays with oligonucleotides tiling most of the non-repetitive DNA sequence of the ENCODE regions: Maskless design #1: 50mer oligonucleotides tiled every 38 bps (overlapping by 12 nts) Maskless design #2: 36mer oligonucleotides tiled end to end Maskless design #3: 50mer oligonucleotides tiled end to end The fourth array platform is an ENCODE PCR Amplicon array manufactured by Bing Ren's lab at UCSD. Each track shows the combined results of multiple biological replicates: five for the first maskless array (50-mer every 38 bp), two for the second maskless array (36-mer every 36 bp), three for the third maskless array (50-mer every 50 bp) and six for the PCR Amplicon array. For all arrays, the STAT1 ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. See NCBI GEO GSE2714 for details of the experimental protocols. 
Methods

Maskless photolithographic arrays
The data from replicates were median-scaled and quantile-normalized to each other (both Cy3 and Cy5 channels). Using a 501 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal (median of pairwise averages) of all log2(Cy5/Cy3) ratios within the window, including replicates. Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) was made for all sliding windows by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding both on fold enrichment and -log10(P-value) and requiring a maximum gap and a minimum run between oligonucleotide positions.
For the first maskless array (50-mer every 38 bp):
    log2(Cy5/Cy3) >= 1.25, -log10(P-value) >= 8.0, MaxGap <= 100 bp, MinRun >= 180 bp
For the second maskless array (36-mer every 36 bp):
    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp
For the third maskless array (50-mer every 50 bp):
    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp

PCR Amplicon Arrays
The Cy5 and Cy3 array data were loess-normalized between channels on the same slide and then between slides. A z-score was then determined for each PCR amplicon from the distribution of log(Cy5/Cy3) in a local log(Cy5*Cy3) intensity window (see Quackenbush, 2002 and the ExpressYourself website for more details). From the z-score, a P-value was then associated with each PCR amplicon. Hits were determined using a 3 sigma threshold and requiring a spot to be present on three out of six arrays.
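The PCR amplicon scoring can be sketched in the same spirit. Everything beyond the text above is an assumption made for illustration: the window is taken as a fixed number of intensity-ranked neighbors on each side, the 3 sigma rule is applied one-sided (enrichment only), and "present on three out of six arrays" is read as "z-score at or above the threshold on at least three arrays". The function names are hypothetical, and the loess normalization step is not shown.

```python
from statistics import mean, pstdev


def local_zscores(log_ratio, log_intensity, half_width=50):
    """Z-score of each spot's log(Cy5/Cy3) against spots of similar intensity.

    log_ratio:     log(Cy5/Cy3) per spot (already normalized)
    log_intensity: log(Cy5*Cy3) per spot, used only to rank neighbors
    half_width:    neighbors taken on each side in intensity rank (assumed)
    """
    order = sorted(range(len(log_ratio)), key=lambda i: log_intensity[i])
    z = [0.0] * len(log_ratio)
    for rank, i in enumerate(order):
        lo = max(0, rank - half_width)
        hi = min(len(order), rank + half_width + 1)
        window = [log_ratio[j] for j in order[lo:hi]]
        mu, sd = mean(window), pstdev(window)
        z[i] = (log_ratio[i] - mu) / sd if sd > 0 else 0.0
    return z


def call_hits(z_by_array, sigma=3.0, min_arrays=3):
    """Indices of spots with z >= sigma on at least min_arrays arrays.

    z_by_array: one list of z-scores per array, all in the same spot order.
    """
    n_spots = len(z_by_array[0])
    return [
        spot
        for spot in range(n_spots)
        if sum(1 for z in z_by_array if z[spot] >= sigma) >= min_arrays
    ]
```

Ranking by intensity before taking the window is what makes the z-score "local": a spot is compared only with spots of similar overall brightness, where the spread of log ratios is comparable.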
Verification
ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way.

Credits
These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. The PCR Amplicon arrays were manufactured by Bing Ren's lab at UCSD.

References
Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004).
Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P. et al. CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 24(9), 3804-14 (2004).
Luscombe, N.M., Royce, T.E., Bertone, P., Echols, N., Horak, C.E., Chang, J.T., Snyder, M. and Gerstein, M. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 31(13), 3477-82 (2003).
Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P. et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 100(21), 12247-52 (2003).
Quackenbush, J. Microarray data normalization and transformation. Nat Genet. 32(Suppl), 496-501 (2002).
encodeYaleChIPSTAT1HeLaBingRenSites Yale LI Sites Yale ChIP-chip (STAT1 ab, HeLa cells) LI/UCSD PCR Amplicon, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer50bpSite Yale 50-50 Sites Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 50-mer, 50bp Win, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer38bpSite Yale 50-38 Sites Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 50-mer, 38bp Win, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess36mer36bpSite Yale 36-36 Sites Yale ChIP-chip (STAT1 ab, HeLa cells) Maskless 36-mer, 36bp Win, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleAffyRNATars Yale TAR Yale RNA Transcriptionally Active Regions (TARs) ENCODE Transcript Levels Description This track shows the locations of transcriptionally active regions (TARs)/transcribed fragments (transfrags) for the following, hybridized to the Affymetrix ENCODE oligonucleotide microarray: human neutrophil (PMN) total RNA (10 biological samples from different individuals) human placental Poly(A)+ RNA (3 biological replicates) total RNA from human NB4 cells (4 biological replicates), each sample divided into three parts and treated as follows: untreated, treated with retinoic acid (RA), and treated with 12-O-tetradecanoylphorbol-13 acetate (TPA) (three out of the four original samples). Total RNA was extracted from each treated sample and applied to arrays in duplicate (2 technical replicates). poly(A)+ and Total RNA for HeLa S3 (3 biological replicates for each) The human NB4 cell can be made to differentiate towards either monocytes (by treatment with TPA) or neutrophils (by treatment with RA). See Kluger et al., 2004 in the References section for more details about the differentiation of hematopoietic cells. This array has 25-mer oligonucleotide probes tiled approximately every 22 bp, covering all the non-repetitive DNA sequence of the ENCODE regions. 
The transcript map is a combined signal for both strands of DNA. This is derived from the number of different biological samples indicated above, each with at least two technical replicates. See the following NCBI GEO accessions for details of experimental protocols: GSE2678, GSE2671, GSE2679.

Display Conventions and Configuration
TARs are represented by blocks in the graphical display. This composite annotation track consists of several subtracks that are listed at the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different data samples.

Methods
The data from biological and technical replicates were quantile-normalized to each other and then median-scaled to 25. Using a 101 bp sliding window centered on each oligonucleotide probe, a signal map estimating RNA abundance was generated by computing the pseudomedian signal of all PM-MM pairs (median of pairwise PM-MM averages) within the window, including replicates. Transcribed regions (TARs/transfrags) were then identified using a signal threshold determined from a 95% false positive rate (FPR) using the bacterial negatives on the array, as well as a maximum gap of 50 bp and a minimum run of 40 bp (between oligonucleotide positions). The TAR sites that are reported start and end at the middle nucleotide of the beginning and ending oligonucleotide probes.

Verification
Transcribed regions (TARs/transfrags), as determined by individual biological samples, were compared to ensure significant overlap.

Credits
These data were generated and analyzed by the Yale/Affymetrix collaboration between the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University and Tom Gingeras at Affymetrix.

References
Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M. et al.
Global identification of human transcribed sequences with genome tiling arrays. Science 306(5705), 2242-6 (2004). Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308(5725), 1149-54 (2005). Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P. and Gingeras, T.R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-9 (2002). Kluger, Y., Tuck, D.P., Chang, J.T., Nakayama, Y., Poddar, R., Kohya, N., Lian, Z., Ben Nasr, A., Halaban, H.R. et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A 101(17), 6508-13 (2004). Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P. et al. The transcriptional activity of human Chromosome 22. Genes Dev 17(4), 529-40 (2003). encodeYaleAffyNB4UntrRNATars Yale TAR NB4 Un Yale NB4 RNA, TAR, Untreated ENCODE Transcript Levels encodeYaleAffyNB4TPARNATars Yale TAR NB4 TPA Yale NB4 RNA, TAR, Treated with 12-O-tetradecanoylphorbol-13 Acetate (TPA) ENCODE Transcript Levels encodeYaleAffyNB4RARNATars Yale TAR NB4 RA Yale NB4 RNA, TAR, Treated with Retinoic Acid ENCODE Transcript Levels encodeYaleAffyPlacRNATars Yale TAR Plcnta Yale Placenta RNA Transcriptionally Active Region ENCODE Transcript Levels encodeYaleAffyNeutRNATars Yale TAR Neutro Yale Neutrophil RNA Transcriptionally Active Region (TAR) ENCODE Transcript Levels chainNetPanTro1 Chimp Chain/Net Chimp (Nov. 2003 (CGSC 1.1/panTro1)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. 
Chain Track The chain track shows alignments of chimp (Nov. 2003 (CGSC 1.1/panTro1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chimp and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chimp assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chimp/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chimp sequence used in this annotation is from the Nov. 2003 (CGSC 1.1/panTro1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. 
Net Track
In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap):
Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it.
Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation.
NonSyn - a match to a chromosome different from the gap in the level above.

Methods

Chain track
Transposons that have been inserted since the chimp/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single chimp chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      90   -330   -236   -356
C    -330    100   -318   -236
G    -236   -318    100   -330
T    -356   -236   -330     90

Chains scoring below a minimum score of 1000 were discarded; the remaining chains are displayed in this track.
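As an illustration of how the substitution matrix above contributes to a chain score, the following sketch scores a single gapless block. It is not the axtChain implementation: the function name is hypothetical, gap penalties are left out, and bases other than A, C, G and T (for example N) are not handled.

```python
BASES = "ACGT"

# Substitution scores from the matrix above; rows and columns in A, C, G, T order.
MATRIX = [
    [90, -330, -236, -356],
    [-330, 100, -318, -236],
    [-236, -318, 100, -330],
    [-356, -236, -330, 90],
]

SCORE = {
    (row_base, col_base): MATRIX[i][j]
    for i, row_base in enumerate(BASES)
    for j, col_base in enumerate(BASES)
}

MIN_CHAIN_SCORE = 1000  # chains scoring below this were discarded


def block_score(human, chimp):
    """Sum of per-column substitution scores for one ungapped block."""
    if len(human) != len(chimp):
        raise ValueError("an ungapped block needs sequences of equal length")
    return sum(SCORE[(h, c)] for h, c in zip(human.upper(), chimp.upper()))
```

A full chain score would add the block scores and subtract a gap cost for each gap between blocks; only then is the total compared with the minimum score.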
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize   11
smallSize   111
position    1    2    3    11   111  2111  12111  32111  72111  152111  252111
qGap        325  360  400  450  600  1100  3600   7600   15600  31600   56600
tGap        325  360  400  450  600  1100  3600   7600   15600  31600   56600
bothGap     625  660  700  750  900  1400  4000   8000   16000  32000   57000

Net track
Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.

Credits
Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent.

References
Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetPanTro1Viewnet Net Chimp (Nov. 2003 (CGSC 1.1/panTro1)), Chain and Net Alignments Comparative Genomics netPanTro1 panTro1 Net Chimp (Nov. 2003 (CGSC 1.1/panTro1)) Alignment net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chimp (Nov. 2003 (CGSC 1.1/panTro1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chimp and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chimp assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chimp/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chimp sequence used in this annotation is from the Nov. 2003 (CGSC 1.1/panTro1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. 
Syn - line-ups on the same chromosome as the gap in the level above it.
Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation.
NonSyn - a match to a chromosome different from the gap in the level above.

Methods

Chain track
Transposons that have been inserted since the chimp/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single chimp chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      90   -330   -236   -356
C    -330    100   -318   -236
G    -236   -318    100   -330
T    -356   -236   -330     90

Chains scoring below a minimum score of 1000 were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize   11
smallSize   111
position    1    2    3    11   111  2111  12111  32111  72111  152111  252111
qGap        325  360  400  450  600  1100  3600   7600   15600  31600   56600
tGap        325  360  400  450  600  1100  3600   7600   15600  31600   56600
bothGap     625  660  700  750  900  1400  4000   8000   16000  32000   57000

Net track
Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
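The greedy placement just described can be illustrated with a simplified one-dimensional sketch. This is not the chainNet program: chains are reduced to a score plus a list of non-overlapping aligned blocks on the human (target) coordinate only, and a chain's level is assumed to be one more than the deepest already-placed chain whose span contains it. All names are hypothetical.

```python
def uncovered(start, end, covered):
    """Parts of [start, end) not overlapped by the sorted, disjoint intervals in covered."""
    parts = []
    cursor = start
    for c_start, c_end in covered:
        if c_end <= cursor:
            continue  # interval lies entirely before the cursor
        if c_start >= end:
            break  # interval lies entirely after the region
        if c_start > cursor:
            parts.append((cursor, c_start))
        cursor = max(cursor, c_end)
        if cursor >= end:
            break
    if cursor < end:
        parts.append((cursor, end))
    return parts


def build_net(chains):
    """Place chains from highest to lowest score, trimming covered bases.

    chains: list of (score, [(block_start, block_end), ...]).
    Returns a list of (score, level, trimmed_blocks) in placement order.
    """
    covered = []  # sorted, disjoint intervals already claimed
    placed = []
    for score, blocks in sorted(chains, key=lambda c: c[0], reverse=True):
        trimmed = []
        for start, end in sorted(blocks):
            trimmed.extend(uncovered(start, end, covered))
        if not trimmed:
            continue  # nothing left of this chain after trimming
        span_start, span_end = trimmed[0][0], trimmed[-1][1]
        level = 1
        for _, parent_level, parent_blocks in placed:
            # a chain lying inside the span of an earlier chain fills one of its gaps
            if parent_blocks[0][0] <= span_start and span_end <= parent_blocks[-1][1]:
                level = max(level, parent_level + 1)
        placed.append((score, level, trimmed))
        covered = sorted(covered + trimmed)
    return placed
```

In this toy model a lower-scoring chain that overlaps a higher-scoring block keeps only its uncovered remainder, which mirrors the trimming step, while the level assignment mirrors the hierarchy in which gap-filling chains sit underneath the chain they fill.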
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetPanTro1Viewchain Chain Chimp (Nov. 2003 (CGSC 1.1/panTro1)), Chain and Net Alignments Comparative Genomics chainPanTro1 panTro1 Chain Chimp (Nov. 
2003 (CGSC 1.1/panTro1)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chimp (Nov. 2003 (CGSC 1.1/panTro1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chimp and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chimp assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chimp/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chimp sequence used in this annotation is from the Nov. 2003 (CGSC 1.1/panTro1) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chimp/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single chimp chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
       A     C     G     T
A     90  -330  -236  -356
C   -330   100  -318  -236
G   -236  -318   100  -330
T   -356  -236  -330    90
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRheMac2 Rhesus Chain/Net Rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rhesus and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. 
The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rhesus assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rhesus/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rhesus sequence used in this annotation is from the Jan. 2006 (MGSC Merged 1.0/rheMac2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps.
Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rhesus/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single rhesus chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
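To make the scoring concrete, the matrix above can be read as a 4x4 lookup table (rows and columns in the order A, C, G, T) and summed over an ungapped block. The following is a minimal Python sketch; the function name block_score and the example sequences are illustrative assumptions and are not taken from axtChain, and ambiguous bases such as N are not handled.

```python
# Substitution scores, rows and columns in the order A, C, G, T.
BASES = "ACGT"
ROWS = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
MATRIX = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}


def block_score(target, query):
    """Sum substitution scores over an ungapped block of equal-length sequences."""
    if len(target) != len(query):
        raise ValueError("an ungapped block needs sequences of equal length")
    return sum(MATRIX[(t, q)] for t, q in zip(target.upper(), query.upper()))


if __name__ == "__main__":
    print(block_score("ACGT", "ACGT"))  # four matches
    print(block_score("ACGT", "GCGT"))  # one A/G transition, three matches
```

Transitions (A/G, C/T) are penalized less than transversions in this matrix, so a block with a single transition keeps most of its score.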
The linear gap matrix used with axtChain: -linearGap=medium
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S.
(2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRheMac2Viewnet Net Rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)), Chain and Net Alignments Comparative Genomics netRheMac2 Rhesus Net Rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)) Alignment net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rhesus and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rhesus assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rhesus/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rhesus sequence used in this annotation is from the Jan. 2006 (MGSC Merged 1.0/rheMac2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rhesus/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single rhesus chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=medium
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRheMac2Viewchain Chain Rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)), Chain and Net Alignments Comparative Genomics chainRheMac2 Rhesus Chain Rhesus (Jan. 
2006 (MGSC Merged 1.0/rheMac2)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rhesus (Jan. 2006 (MGSC Merged 1.0/rheMac2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rhesus and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rhesus assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rhesus/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rhesus sequence used in this annotation is from the Jan. 2006 (MGSC Merged 1.0/rheMac2) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rhesus/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single rhesus chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=medium
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.
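The linear gap table quoted in the chain methods above can be read as a piecewise-linear cost function of gap length. The Python sketch below assumes that costs between listed positions are interpolated linearly and that gaps longer than the last listed position continue along the final slope. Both points are assumptions made for illustration and should be checked against the axtChain source. Only the qGap row of the medium table is used, and the function name gap_cost is hypothetical.

```python
from bisect import bisect_right

# "position" and "qGap" rows of the -linearGap=medium table.
POSITION = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]


def gap_cost(length, positions=POSITION, costs=Q_GAP):
    """Cost of a single-sided gap, interpolated linearly between table entries.

    Assumption: lengths beyond the last table position extend the last slope.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if length >= positions[-1]:
        lo, hi = len(positions) - 2, len(positions) - 1
    else:
        hi = bisect_right(positions, length)  # first position greater than length
        lo = hi - 1
    slope = (costs[hi] - costs[lo]) / (positions[hi] - positions[lo])
    return costs[lo] + slope * (length - positions[lo])


if __name__ == "__main__":
    for n in (1, 7, 11, 5000):
        print(n, gap_cost(n))
```

Under these assumptions the per-base cost of extending a gap falls as the gap grows, which is how the table permits longer gaps than a single affine penalty would.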
Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRn4 Rat Chain/Net Rat (Nov. 2004 (Baylor 3.4/rn4)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rat (Nov. 2004 (Baylor 3.4/rn4)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rat and human simultaneously. 
These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rat assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rat/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rat sequence used in this annotation is from the Nov. 2004 (Baylor 3.4/rn4) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2.
The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rat/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single rat chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track.
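The score cutoff mentioned above could be reproduced on an existing chain file along the following lines. This is a minimal Python sketch, assuming the UCSC chain text format in which each chain begins with a header line of the form "chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id" followed by its block lines. The function name filter_chains and the sample records are invented for illustration.

```python
def filter_chains(lines, min_score=1000):
    """Yield the lines of every chain whose header score is at least min_score."""
    keep = False
    for line in lines:
        if line.startswith("chain "):
            # Header line: the second whitespace-separated field is the score.
            keep = int(line.split()[1]) >= min_score
        if keep:
            yield line


if __name__ == "__main__":
    sample = [
        "chain 4900 chr1 1000 + 0 100 chr2 1000 + 0 100 1",
        "100",
        "",
        "chain 900 chr1 1000 + 200 250 chr3 1000 - 0 50 2",
        "50",
        "",
    ]
    for kept in filter_chains(sample):
        print(kept)
```

Because the function is a generator over lines, it can be applied to an open file object without loading the whole chain file into memory.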
The linear gap matrix used with axtChain: -linearGap=medium
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S.
(2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRn4Viewnet Net Rat (Nov. 2004 (Baylor 3.4/rn4)), Chain and Net Alignments Comparative Genomics netRn4 Rat Net Rat (Nov. 2004 (Baylor 3.4/rn4)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rat (Nov. 2004 (Baylor 3.4/rn4)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rat and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rat assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rat/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rat sequence used in this annotation is from the Nov. 2004 (Baylor 3.4/rn4) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rat/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single rat chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
   A    91  -114   -31  -123
   C  -114   100  -125   -31
   G   -31  -125   100  -114
   T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=medium
   tableSize 11
   smallSize 111
   position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
   qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
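The placement step just described can be pictured with a minimal Python sketch: chains are visited from highest to lowest score, and each one keeps only the parts of its blocks that no higher-scoring chain already covers. This is a simplified illustration, not the chainNet implementation. The names subtract and place_chains and the (name, score, blocks) tuple layout are assumptions, and the level hierarchy and any size or score cutoffs are left out.

```python
def subtract(start, end, covered):
    """Return the parts of [start, end) not overlapped by `covered`.

    `covered` is a sorted list of disjoint half-open (start, end) intervals.
    """
    pieces = []
    pos = start
    for c_start, c_end in covered:
        if c_end <= pos:
            continue          # covered interval lies entirely to the left
        if c_start >= end:
            break             # covered interval lies entirely to the right
        if c_start > pos:
            pieces.append((pos, c_start))
        pos = max(pos, c_end)
        if pos >= end:
            break
    if pos < end:
        pieces.append((pos, end))
    return pieces


def place_chains(chains):
    """Greedy placement; `chains` is a list of (name, score, blocks)."""
    covered = []   # everything claimed so far, kept sorted and disjoint
    placed = []
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        kept = []
        for start, end in blocks:
            kept.extend(subtract(start, end, covered))
        if kept:                      # chains trimmed to nothing are dropped
            placed.append((name, kept))
            covered = sorted(covered + kept)
    return placed
```

For example, a lower-scoring chain whose single block spans 90-210 would be trimmed to 100-200 if a higher-scoring chain already holds blocks 0-100 and 200-300, so it ends up filling that chain's gap.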
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRn4Viewchain Chain Rat (Nov. 2004 (Baylor 3.4/rn4)), Chain and Net Alignments Comparative Genomics chainRn4 Rat Chain Rat (Nov. 
2004 (Baylor 3.4/rn4)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rat (Nov. 2004 (Baylor 3.4/rn4)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rat and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rat assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rat/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rat sequence used in this annotation is from the Nov. 2004 (Baylor 3.4/rn4) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rat/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single rat chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
   A    91  -114   -31  -123
   C  -114   100  -125   -31
   G   -31  -125   100  -114
   T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=medium
   tableSize 11
   smallSize 111
   position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
   qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.
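To make the linear gap matrix in the Methods above more concrete, here is a minimal Python sketch that turns the tabulated values into a gap cost by piecewise-linear interpolation. Several points are assumptions rather than statements about axtChain itself: that costs between tabulated positions are interpolated linearly, that sizes beyond the last position continue along the final segment's slope, and that a double-sided gap is looked up in the bothGap row by the sum of the two gap sizes. The function names are hypothetical.

```python
from bisect import bisect_right

# Values copied from the -linearGap=medium table quoted in the Methods.
POSITIONS = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]
T_GAP = list(Q_GAP)  # the tGap row is identical to qGap in this table
BOTH_GAP = [750, 825, 850, 1000, 1300, 3300, 23300, 58300, 118300, 218300, 318300]


def interpolate(size, costs):
    """Piecewise-linear lookup of a gap cost (assumed behaviour)."""
    if size < 1:
        raise ValueError("gap size must be at least 1")
    if size >= POSITIONS[-1]:
        i = len(POSITIONS) - 2            # extend the final segment (assumption)
    else:
        i = bisect_right(POSITIONS, size) - 1
    x0, x1 = POSITIONS[i], POSITIONS[i + 1]
    y0, y1 = costs[i], costs[i + 1]
    return y0 + (y1 - y0) * (size - x0) / (x1 - x0)


def gap_cost(q_gap, t_gap):
    """Cost of a gap of q_gap bases in the query and t_gap in the target."""
    if q_gap == 0 and t_gap == 0:
        return 0
    if t_gap == 0:
        return interpolate(q_gap, Q_GAP)
    if q_gap == 0:
        return interpolate(t_gap, T_GAP)
    return interpolate(q_gap + t_gap, BOTH_GAP)  # double-sided gap (assumption)
```

Under these assumptions, a 7-base single-sided gap falls between the tabulated positions 3 and 11 and costs 525, halfway between 450 and 600.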
Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMm7 Mouse Chain/Net Mouse (Aug. 2005 (NCBI35/mm7)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of mouse (Aug. 2005 (NCBI35/mm7)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and human simultaneously. 
These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the mouse assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The mouse sequence used in this annotation is from the Aug. 2005 (NCBI35/mm7) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2.
The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the mouse/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
   A    91  -114   -31  -123
   C  -114   100  -125   -31
   G   -31  -125   100  -114
   T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track.
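The dynamic program mentioned above can be illustrated with a small Python sketch that finds the best-scoring chain of gapless blocks. It deliberately swaps the kd-tree for a plain quadratic scan over earlier blocks, so it shows the recurrence rather than axtChain's actual search. The block tuple layout, the function name best_chain and the caller-supplied gap_cost function are all assumptions made for illustration.

```python
def best_chain(blocks, gap_cost):
    """Return (score, block indices) of the best-scoring chain.

    `blocks` holds (t_start, t_end, q_start, q_end, score) tuples with
    half-open coordinates; `gap_cost(dq, dt)` prices the gap between two blocks.
    """
    if not blocks:
        return 0, []
    order = sorted(range(len(blocks)), key=lambda i: blocks[i][0])
    best = {}   # best chain score ending at each block
    prev = {}   # predecessor of each block in that chain
    for pos, i in enumerate(order):
        t_start, _t_end, q_start, _q_end, score = blocks[i]
        best[i], prev[i] = score, None
        for j in order[:pos]:
            _jt_start, jt_end, _jq_start, jq_end, _ = blocks[j]
            # A predecessor must end before this block starts in both genomes.
            if jt_end <= t_start and jq_end <= q_start:
                candidate = best[j] - gap_cost(q_start - jq_end, t_start - jt_end) + score
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    end = max(best, key=best.get)
    chain = []
    i = end
    while i is not None:
        chain.append(i)
        i = prev[i]
    return best[end], chain[::-1]
```

A chain found this way would then be kept only if its score reaches the minimum quoted above (3000 for the mouse chains).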
The linear gap matrix used with axtChain: -linearGap=medium
   tableSize 11
   smallSize 111
   position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
   qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S.
(2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMm7Viewnet Net Mouse (Aug. 2005 (NCBI35/mm7)), Chain and Net Alignments Comparative Genomics netMm7 Mouse Net Mouse (Aug. 2005 (NCBI35/mm7)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of mouse (Aug. 2005 (NCBI35/mm7)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the mouse assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The mouse sequence used in this annotation is from the Aug. 2005 (NCBI35/mm7) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the mouse/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
   A    91  -114   -31  -123
   C  -114   100  -125   -31
   G   -31  -125   100  -114
   T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=medium
   tableSize 11
   smallSize 111
   position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
   qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMm7Viewchain Chain Mouse (Aug. 2005 (NCBI35/mm7)), Chain and Net Alignments Comparative Genomics chainMm7 Mouse Chain Mouse (Aug. 
2005 (NCBI35/mm7)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of mouse (Aug. 2005 (NCBI35/mm7)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the mouse assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The mouse sequence used in this annotation is from the Aug. 2005 (NCBI35/mm7) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the mouse/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
   A    91  -114   -31  -123
   C  -114   100  -125   -31
   G   -31  -125   100  -114
   T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=medium
   tableSize 11
   smallSize 111
   position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
   qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
   bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.
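The four item types listed under Display Conventions (Top, Syn, Inv, NonSyn) can be expressed as a short decision rule. The Python sketch below follows only the wording of those definitions; it is not the netSyntenic implementation, which may apply further criteria. The function name classify_fill and its parameters are hypothetical.

```python
def classify_fill(level, chrom, strand, parent_chrom=None, parent_strand=None):
    """Label a net item using the four display categories.

    `chrom` and `strand` describe where the item aligns in the other species;
    `parent_chrom` and `parent_strand` describe the item one level above it.
    """
    if level == 1:
        return "Top"        # best, longest match, displayed on level 1
    if chrom != parent_chrom:
        return "NonSyn"     # aligns to a different chromosome than the level above
    if strand != parent_strand:
        return "Inv"        # same chromosome, opposite orientation
    return "Syn"            # same chromosome, same orientation
```

For instance, a level 2 item on chr5 in the minus orientation, sitting in a gap of a level 1 chain on chr5 in the plus orientation, would be labelled Inv under this rule.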
Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetCanFam2 Dog Chain/Net Dog (May 2005 (Broad/canFam2)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of dog (May 2005 (Broad/canFam2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both dog and human simultaneously. 
These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the dog assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best dog/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The dog sequence used in this annotation is from the May 2005 (Broad/canFam2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. 
The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the dog/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single dog chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track.
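To make the role of the substitution matrix and the minimum score concrete, here is a minimal Python sketch that scores one ungapped block column by column with the matrix above and applies the 1000-point cutoff. The names block_score and keep_chain are hypothetical, the sketch handles only the four bases A, C, G and T, and it does not include gap penalties, so it is an illustration under those assumptions and not the axtChain code.

```python
BASES = "ACGT"

# Rows and columns are in A, C, G, T order, copied from the matrix above.
ROWS = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]

# Lookup from a pair of aligned bases to its substitution score.
SCORE = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}


def block_score(query, target):
    """Sum the substitution scores over one gapless block of equal length."""
    if len(query) != len(target):
        raise ValueError("an ungapped block must have equal-length sequences")
    return sum(SCORE[(q, t)] for q, t in zip(query.upper(), target.upper()))


def keep_chain(score, min_score=1000):
    """Chains scoring below the minimum are discarded; others are kept."""
    return score >= min_score


if __name__ == "__main__":
    print(block_score("ACGT", "ACGT"))  # four matches
    print(block_score("ACGT", "GCGT"))  # one A/G transition, three matches
    print(keep_chain(999), keep_chain(1000))
```

Transitions (A/G and C/T) cost 31, less than transversions, so a block with a few transitions can still contribute a positive score to its chain.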
The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize 11
  smallSize 111
  position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
  qGap       325  360  400   450   600  1100   3600   7600   15600   31600   56600
  tGap       325  360  400   450   600  1100   3600   7600   15600   31600   56600
  bothGap    625  660  700   750   900  1400   4000   8000   16000   32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetCanFam2Viewnet Net Dog (May 2005 (Broad/canFam2)), Chain and Net Alignments Comparative Genomics netCanFam2 Dog Net Dog (May 2005 (Broad/canFam2)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of dog (May 2005 (Broad/canFam2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both dog and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the dog assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best dog/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The dog sequence used in this annotation is from the May 2005 (Broad/canFam2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. 
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the dog/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single dog chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize 11
  smallSize 111
  position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
  qGap       325  360  400   450   600  1100   3600   7600   15600   31600   56600
  tGap       325  360  400   450   600  1100   3600   7600   15600   31600   56600
  bothGap    625  660  700   750   900  1400   4000   8000   16000   32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
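The placement step just described can be illustrated with a small Python sketch. It is a deliberately simplified model and not the chainNet program: chains are reduced to a score, a name, and a sorted list of gapless blocks on a single human sequence, and the function name build_net and the tuple layout are hypothetical. Chains are taken in descending score order, trimmed to whatever space is still free, and the gaps inside a placed chain become free space one level deeper.

```python
def build_net(chains, region):
    """Greedy, score-ordered placement of chains into a nested hierarchy.

    chains: list of (score, name, blocks); blocks is a sorted list of
            half-open (start, end) intervals on the reference sequence.
    region: (start, end) of the reference sequence being covered.
    Returns a list of (name, level, start, end) placements.
    """
    # Free spaces still available for filling: (start, end, level).
    spaces = [(region[0], region[1], 1)]
    placed = []
    for score, name, blocks in sorted(chains, key=lambda c: -c[0]):
        next_spaces = []
        for s, e, level in spaces:
            # Trim the chain's blocks to the part that fits in this space.
            trimmed = [
                (max(bs, s), min(be, e))
                for bs, be in blocks
                if min(be, e) > max(bs, s)
            ]
            if not trimmed:
                next_spaces.append((s, e, level))
                continue
            fill_start, fill_end = trimmed[0][0], trimmed[-1][1]
            placed.append((name, level, fill_start, fill_end))
            # Space left of the fill stays available at the same level.
            if s < fill_start:
                next_spaces.append((s, fill_start, level))
            # Gaps between blocks can be filled by lower-scoring chains,
            # which are then placed underneath, one level deeper.
            for (_, a_end), (b_start, _) in zip(trimmed, trimmed[1:]):
                if a_end < b_start:
                    next_spaces.append((a_end, b_start, level + 1))
            # Space right of the fill also stays at the same level.
            if fill_end < e:
                next_spaces.append((fill_end, e, level))
        spaces = next_spaces
    return placed


if __name__ == "__main__":
    example = [
        (9000, "A", [(100, 300), (600, 900)]),
        (5000, "B", [(350, 550)]),
        (4000, "C", [(250, 400)]),
    ]
    for item in build_net(example, (0, 1000)):
        print(item)
```

In the example, chain A is placed at level 1, chain B fills A's gap at level 2, and chain C is trimmed to the 300-350 stretch that neither A nor B covers. Minimum fill sizes, score thresholds, and the second-species coordinates that a real net records are left out of this sketch.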
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. 
PMID: 12529312; PMC: PMC430961 chainNetCanFam2Viewchain Chain Dog (May 2005 (Broad/canFam2)), Chain and Net Alignments Comparative Genomics chainCanFam2 Dog Chain Dog (May 2005 (Broad/canFam2)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of dog (May 2005 (Broad/canFam2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both dog and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the dog assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best dog/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. 
The dog sequence used in this annotation is from the May 2005 (Broad/canFam2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the dog/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. 
The axt alignments were fed into axtChain, which organizes all alignments between a single dog chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize 11
  smallSize 111
  position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
  qGap       325  360  400   450   600  1100   3600   7600   15600   31600   56600
  tGap       325  360  400   450   600  1100   3600   7600   15600   31600   56600
  bothGap    625  660  700   750   900  1400   4000   8000   16000   32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 bacendsCow Cow BAC Ends Cow (Sep. 2004 (Baylor 1.0/bosTau1)) BAC Ends (BLASTn) Comparative Genomics Description The track shows BLASTn results of approximately 300,000 cattle BAC-ends from the CHORI-240 library against the human genome sequence (hg17). The track displays approximately 53,000 unique BLASTn hits (E less than e-5) in the human genome. Credits Thanks to Harris Lewin and Denis Larkin, University of Illinois at Urbana-Champaign, for providing these data. chainNetBosTau2 Cow Chain/Net Cow (Mar. 2005 (Baylor 2.0/bosTau2)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. 
Chain Track The chain track shows alignments of cow (Mar. 2005 (Baylor 2.0/bosTau2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both cow and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the cow assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best cow/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The cow sequence used in this annotation is from the Mar. 2005 (Baylor 2.0/bosTau2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. 
Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the cow/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single cow chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track.
The linear gap matrix used with axtChain:
  -linearGap=medium
  tableSize 11
  smallSize 111
  position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
  qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
  tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
  bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S.
(2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetBosTau2Viewnet Net Cow (Mar. 2005 (Baylor 2.0/bosTau2)), Chain and Net Alignments Comparative Genomics netBosTau2 Cow Net Cow (Mar. 2005 (Baylor 2.0/bosTau2)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of cow (Mar. 2005 (Baylor 2.0/bosTau2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both cow and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the cow assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best cow/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The cow sequence used in this annotation is from the Mar. 2005 (Baylor 2.0/bosTau2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. 
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the cow/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single cow chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91

Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=medium
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
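The placement step just described (highest-scoring chains first, later chains trimmed to whatever is still uncovered) can be illustrated with a small Python sketch. This is not the chainNet program itself: the chain names, the tuple layout, and the helper functions are hypothetical, coordinates are half-open positions on a single human chromosome, and the level hierarchy that chainNet records is left out.

```python
from typing import List, Tuple

Interval = Tuple[int, int]                 # half-open [start, end) on the human sequence
Chain = Tuple[str, int, List[Interval]]    # (name, score, aligned blocks) -- hypothetical layout


def subtract(start: int, end: int, covered: List[Interval]) -> List[Interval]:
    """Return the parts of [start, end) that no covered interval overlaps."""
    pieces = []
    pos = start
    for c_start, c_end in sorted(covered):
        if c_end <= pos:
            continue                       # covered interval lies entirely to the left
        if c_start >= end:
            break                          # nothing further can overlap
        if c_start > pos:
            pieces.append((pos, c_start))  # uncovered stretch before this interval
        pos = max(pos, c_end)
        if pos >= end:
            break
    if pos < end:
        pieces.append((pos, end))
    return pieces


def place_chains(chains: List[Chain]) -> List[Tuple[str, int, int]]:
    """Greedy placement: best-scoring chain first, later chains trimmed to free space."""
    covered: List[Interval] = []
    placed = []
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        for b_start, b_end in blocks:
            for piece in subtract(b_start, b_end, covered):
                placed.append((name, piece[0], piece[1]))
                covered.append(piece)
    return placed


example = [
    ("B", 4000, [(50, 250)]),
    ("A", 9000, [(0, 100), (200, 300)]),
    ("C", 1000, [(150, 350)]),
]
result = place_chains(example)
```

In this toy input, chain B survives only in the gap of chain A (100 to 200), which mirrors how a lower-scoring chain ends up filling a gap in a higher-scoring one.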
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetBosTau2Viewchain Chain Cow (Mar. 2005 (Baylor 2.0/bosTau2)), Chain and Net Alignments Comparative Genomics chainBosTau2 Cow Chain Cow (Mar. 
2005 (Baylor 2.0/bosTau2)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of cow (Mar. 2005 (Baylor 2.0/bosTau2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both cow and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the cow assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best cow/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The cow sequence used in this annotation is from the Mar. 2005 (Baylor 2.0/bosTau2) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the cow/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single cow chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91

Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=medium
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.
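To make the linear gap matrix in the Methods above more concrete, the following Python sketch looks up a gap cost from the "medium" table. It rests on two assumptions that this description does not state: costs between tabulated positions are linearly interpolated (and extrapolated past the last position using the final slope), and a gap present in both species is looked up in the bothGap row by the summed gap length. The function names are hypothetical, and axtChain may differ in detail.

```python
from bisect import bisect_right
from typing import Sequence

# Values copied from the -linearGap=medium table in the Methods above.
POSITIONS = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]
T_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]
BOTH_GAP = [750, 825, 850, 1000, 1300, 3300, 23300, 58300, 118300, 218300, 318300]


def interpolate(length: int, positions: Sequence[int], costs: Sequence[int]) -> float:
    """Piecewise-linear cost for a gap of the given length (assumed behaviour)."""
    if length < positions[0]:
        raise ValueError("gap length must be at least %d" % positions[0])
    if length >= positions[-1]:
        i = len(positions) - 2             # extrapolate with the last segment's slope
    else:
        i = bisect_right(positions, length) - 1
    x0, x1 = positions[i], positions[i + 1]
    y0, y1 = costs[i], costs[i + 1]
    return y0 + (y1 - y0) * (length - x0) / (x1 - x0)


def gap_cost(q_gap: int, t_gap: int) -> float:
    """Cost of a gap of q_gap bases in the query and t_gap bases in the target."""
    if q_gap == 0 and t_gap == 0:
        return 0.0
    if t_gap == 0:
        return interpolate(q_gap, POSITIONS, Q_GAP)
    if q_gap == 0:
        return interpolate(t_gap, POSITIONS, T_GAP)
    # Double-sided gap: looked up by summed length (an assumption of this sketch).
    return interpolate(q_gap + t_gap, POSITIONS, BOTH_GAP)
```

For example, a single-sided gap of 61 bases falls halfway between the tabulated positions 11 (cost 600) and 111 (cost 900), so this sketch returns 750 for it.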
Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 syntenyCow Cow Synteny Cow Synteny Using RH Mapping Comparative Genomics Description This track depicts human-cattle synteny segments as defined on the basis of a cattle-human comparative map containing 3,200 BAC-end sequences and EST markers with a single significant hit (E-value less than 0.00001) in the human genome sequence (hg15) as defined by the TimeLogic Tera-BLASTn program (Everts-van der Wind et al., 2005). The synteny blocks were defined according to the rules described in Murphy et al. (2005). Credits Thanks to Harris Lewin, Denis Larkin, and Annelie Everts-van der Wind, University of Illinois at Urbana-Champaign, for providing these data. 
References Everts-van der Wind, A., Larkin, D., Green, C., Elliott, J., Olmstead, C., Chiu, R., Schein, J., Marra, M., Womack, J., and Lewin, H. A high-resolution whole-genome cattle-human comparative map reveals details of mammalian chromosome evolution. Proc Natl Acad Sci 102(51) 18526-18531 (2005). Murphy, W., Larkin, D., Everts-van der Wind, A., Bourque, G., Tesler, G., Auvil, L., Beever, J., Chowdhary, B., Galibert, F., Gatzke, L., Hitte, C., Meyers, S., Milan, D., Ostrander, E., Pape, G., Parker, H., Raudsepp, T., Rogatcheva, M., Schook, L., Skow, L., Welge, M., Womack, J., O'Brien, S., Pevzner, P., and Lewin, H. Dynamics of Mammalian Chromosome Evolution Inferred from Multispecies Comparative Maps. Science 309(5734) 613-617 (2005). chainNetMonDom1 Opossum Chain/Net Opossum (Oct. 2004 (Broad prelim/monDom1)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of opossum (Oct. 2004 (Broad prelim/monDom1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both opossum and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the opossum assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best opossum/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The opossum sequence used in this annotation is from the Oct. 2004 (Broad prelim/monDom1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the opossum/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single opossum chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
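Returning briefly to the chain-building step in the Chain track methods above: the idea of a dynamic program that links gapless blocks into a maximally scoring chain can be sketched in Python as follows. This is a simplified illustration, not axtChain. It compares every pair of blocks (quadratic time) instead of using a kd-tree, handles one strand of one chromosome pair, and uses a made-up gap penalty (`toy_gap_cost`); the `Block` type and all names are hypothetical.

```python
from typing import Callable, List, NamedTuple, Tuple


class Block(NamedTuple):
    """A gapless aligned block; coordinates are half-open (hypothetical layout)."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    score: int


def toy_gap_cost(dt: int, dq: int) -> int:
    """Made-up penalty for the gap between two blocks (not the axtChain table)."""
    return 0 if dt == 0 and dq == 0 else 100 + 2 * (dt + dq)


def best_chain(blocks: List[Block],
               gap_cost: Callable[[int, int], int] = toy_gap_cost) -> Tuple[int, List[Block]]:
    """Return (score, blocks) of the highest-scoring collinear chain."""
    if not blocks:
        return 0, []
    # A valid predecessor ends before the current block starts in both sequences,
    # so sorting by target start guarantees predecessors are processed first.
    order = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in order]        # best chain score ending at each block
    prev = [-1] * len(order)               # back-pointer to the chosen predecessor
    for i, cur in enumerate(order):
        for j in range(i):
            p = order[j]
            if p.t_end <= cur.t_start and p.q_end <= cur.q_start:
                cand = best[j] + cur.score - gap_cost(cur.t_start - p.t_end,
                                                      cur.q_start - p.q_end)
                if cand > best[i]:
                    best[i], prev[i] = cand, j
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


blocks = [
    Block(0, 100, 0, 100, 5000),
    Block(120, 200, 130, 210, 4000),
    Block(150, 180, 10, 40, 3000),   # out of order in the query, cannot join the chain
]
score, chain = best_chain(blocks)
```

The third block overlaps the second in the target and precedes the first in the query, so it cannot be chained with either; the best chain links the first two blocks across a gap of 20 target bases and 30 query bases.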
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMonDom1Viewnet Net Opossum (Oct. 2004 (Broad prelim/monDom1)), Chain and Net Alignments Comparative Genomics netMonDom1 Opossum Net Opossum (Oct. 
2004 (Broad prelim/monDom1)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of opossum (Oct. 2004 (Broad prelim/monDom1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both opossum and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the opossum assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best opossum/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The opossum sequence used in this annotation is from the Oct. 2004 (Broad prelim/monDom1) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the opossum/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single opossum chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMonDom1Viewchain Chain Opossum (Oct. 2004 (Broad prelim/monDom1)), Chain and Net Alignments Comparative Genomics chainMonDom1 Opossum Chain Opossum (Oct. 2004 (Broad prelim/monDom1)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of opossum (Oct. 2004 (Broad prelim/monDom1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both opossum and human simultaneously. 
These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the opossum assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best opossum/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The opossum sequence used in this annotation is from the Oct. 2004 (Broad prelim/monDom1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2.
The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the opossum/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single opossum chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
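As an illustration of how the substitution matrix and minimum score above are applied, the Python sketch below scores one gapless block by summing matrix entries column by column and then checks a chain score against the threshold. The function names are hypothetical, only the four bases in the matrix are handled (an N or other symbol raises a KeyError), and gap penalties are not included.

```python
# Matrix values copied from the opossum/human Methods text above.
BASES = "ACGT"
ROWS = [
    [91, -90, -25, -100],    # A
    [-90, 100, -100, -25],   # C
    [-25, -100, 100, -90],   # G
    [-100, -25, -90, 91],    # T
]
MATRIX = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}

MIN_SCORE = 5000  # chains scoring below this value were discarded


def block_score(target: str, query: str) -> int:
    """Sum the substitution scores over the aligned columns of a gapless block."""
    if len(target) != len(query):
        raise ValueError("a gapless block needs equal-length sequences")
    return sum(MATRIX[(t, q)] for t, q in zip(target.upper(), query.upper()))


def keep_chain(score: int) -> bool:
    """True if a chain with this total score passes the minimum-score filter."""
    return score >= MIN_SCORE
```

A perfectly matching four-base block "ACGT" scores 91 + 100 + 100 + 91 = 382, so a chain needs many such matching columns, net of mismatches and gap costs, to reach the 5000 threshold.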
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetGalGal2 Chicken Chain/Net Chicken (Feb. 2004 (WUGSC 1.0/galGal2)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chicken (Feb. 2004 (WUGSC 1.0/galGal2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chicken and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chicken assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chicken/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chicken sequence used in this annotation is from the Feb. 2004 (WUGSC 1.0/galGal2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chicken/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single chicken chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
  A     91   -90   -25  -100
  C    -90   100  -100   -25
  G    -25  -100   100   -90
  T   -100   -25   -90    91
Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize  11
  smallSize  111
  position   1    2    3    11   111  2111  12111  32111  72111  152111  252111
  qGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  tGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  bothGap    625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
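The placement step described above can be sketched in a few lines of Python. This is a simplified, single-level illustration and not the chainNet implementation: each chain is treated as one solid interval on the human (target) genome, gaps inside chains and the resulting nesting levels are ignored, and the names `subtract` and `place_chains` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the target genome


def subtract(span: Interval, covered: List[Interval]) -> List[Interval]:
    """Return the parts of span not overlapped by covered.

    covered must be sorted and pairwise disjoint.
    """
    start, end = span
    pieces = []
    for c_start, c_end in covered:
        if c_end <= start:
            continue  # covered interval lies entirely to the left
        if c_start >= end:
            break  # everything else lies to the right
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def place_chains(chains: List[Tuple[str, int, int, int]]) -> List[Tuple[str, int, int]]:
    """Place chains (name, score, start, end) highest score first.

    Each chain is trimmed to the sections not already taken by a
    higher-scoring chain; a chain with nothing left is dropped.
    """
    covered: List[Interval] = []
    placed = []
    for name, _score, start, end in sorted(chains, key=lambda c: -c[1]):
        for piece in subtract((start, end), covered):
            placed.append((name, piece[0], piece[1]))
            covered.append(piece)
        covered.sort()
    return sorted(placed, key=lambda p: p[1])
```

For example, with chains `c1` (score 9000, 0-100), `c2` (7000, 50-200), `c3` (6000, 90-120) and `c4` (5000, 180-260), `c2` is trimmed to 100-200, `c3` is dropped because it is fully covered, and `c4` is trimmed to 200-260.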
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetGalGal2Viewnet Net Chicken (Feb. 2004 (WUGSC 1.0/galGal2)), Chain and Net Alignments Comparative Genomics netGalGal2 Chicken Net Chicken (Feb. 
2004 (WUGSC 1.0/galGal2)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chicken (Feb. 2004 (WUGSC 1.0/galGal2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chicken and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chicken assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chicken/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chicken sequence used in this annotation is from the Feb. 2004 (WUGSC 1.0/galGal2) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chicken/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single chicken chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
  A     91   -90   -25  -100
  C    -90   100  -100   -25
  G    -25  -100   100   -90
  T   -100   -25   -90    91
Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize  11
  smallSize  111
  position   1    2    3    11   111  2111  12111  32111  72111  152111  252111
  qGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  tGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  bothGap    625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetGalGal2Viewchain Chain Chicken (Feb. 2004 (WUGSC 1.0/galGal2)), Chain and Net Alignments Comparative Genomics chainGalGal2 Chicken Chain Chicken (Feb. 2004 (WUGSC 1.0/galGal2)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chicken (Feb. 2004 (WUGSC 1.0/galGal2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chicken and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chicken assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chicken/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chicken sequence used in this annotation is from the Feb. 2004 (WUGSC 1.0/galGal2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth.
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chicken/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single chicken chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
  A     91   -90   -25  -100
  C    -90   100  -100   -25
  G    -25  -100   100   -90
  T   -100   -25   -90    91
Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
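The chaining step above can be illustrated with a small dynamic program. This sketch deliberately swaps the kd-tree search for a plain quadratic loop over all earlier blocks, which gives the same kind of answer on small inputs but is not how axtChain is implemented. `Block`, `toy_gap_cost` and `best_chain` are hypothetical names, and the toy gap cost is a stand-in rather than the actual gap table.

```python
from typing import Callable, List, NamedTuple, Tuple


class Block(NamedTuple):
    """One gapless aligned block, half-open coordinates."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    score: int


def toy_gap_cost(dq: int, dt: int) -> int:
    """Hypothetical stand-in penalty for a gap of dq query and dt target bases."""
    return 100 + dq + dt if (dq or dt) else 0


def best_chain(
    blocks: List[Block],
    gap_cost: Callable[[int, int], int] = toy_gap_cost,
) -> Tuple[int, List[Block]]:
    """Return the maximally scoring chain of collinear, non-overlapping blocks."""
    order = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not order:
        return 0, []
    best = [b.score for b in order]  # best chain score ending at each block
    prev = [-1] * len(order)         # back-pointers for the traceback
    for i, cur in enumerate(order):
        for j in range(i):
            p = order[j]
            # p may precede cur only if it ends before cur on both genomes.
            if p.t_end <= cur.t_start and p.q_end <= cur.q_start:
                gap = gap_cost(cur.q_start - p.q_end, cur.t_start - p.t_end)
                candidate = best[j] + cur.score - gap
                if candidate > best[i]:
                    best[i] = candidate
                    prev[i] = j
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

The quadratic loop keeps the recurrence easy to read; a spatial index such as a kd-tree matters only when the number of blocks is large.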
The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize  11
  smallSize  111
  position   1    2    3    11   111  2111  12111  32111  72111  152111  252111
  qGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  tGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  bothGap    625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetXenTro1 X. tropicalis Chain/Net X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both X. tropicalis and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the X. tropicalis assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best X. tropicalis/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The X. tropicalis sequence used in this annotation is from the Oct. 2004 (JGI 3.0/xenTro1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the X. tropicalis/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single X. tropicalis chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
         A     C     G     T
  A     91   -90   -25  -100
  C    -90   100  -100   -25
  G    -25  -100   100   -90
  T   -100   -25   -90    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
  -linearGap=loose
  tablesize  11
  smallSize  111
  position   1    2    3    11   111  2111  12111  32111  72111  152111  252111
  qGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  tGap       325  360  400  450  600  1100  3600   7600   15600  31600   56600
  bothGap    625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
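One way to read the gap matrix listed in the Methods above is as a lookup table from gap length to penalty. The Python sketch below rests on two assumptions that the text does not spell out: that costs between tabulated positions are linearly interpolated (and extended past the last position along the final segment), and that the qGap row applies to query-only gaps, tGap to target-only gaps, and bothGap to the combined length of double-sided gaps. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import List

# Values copied from the "loose" linear gap table above.
POSITION = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [325, 360, 400, 450, 600, 1100, 3600, 7600, 15600, 31600, 56600]
T_GAP = [325, 360, 400, 450, 600, 1100, 3600, 7600, 15600, 31600, 56600]
BOTH_GAP = [625, 660, 700, 750, 900, 1400, 4000, 8000, 16000, 32000, 57000]


def interpolate(size: int, costs: List[int]) -> int:
    """Penalty for a gap of the given size (assumed linear interpolation)."""
    if size <= 0:
        return 0
    if size <= POSITION[0]:
        return costs[0]
    if size >= POSITION[-1]:
        i = len(POSITION) - 1  # extend along the last segment
    else:
        i = bisect_left(POSITION, size)
    x0, x1 = POSITION[i - 1], POSITION[i]
    y0, y1 = costs[i - 1], costs[i]
    return round(y0 + (y1 - y0) * (size - x0) / (x1 - x0))


def gap_cost(dq: int, dt: int) -> int:
    """Penalty for a gap of dq query bases and dt target bases.

    Assumed row selection: one-sided gaps use qGap or tGap; double-sided
    gaps use bothGap with the two lengths added together.
    """
    dq, dt = max(dq, 0), max(dt, 0)
    if dt == 0:
        return interpolate(dq, Q_GAP)
    if dq == 0:
        return interpolate(dt, T_GAP)
    return interpolate(dq + dt, BOTH_GAP)
```

Under these assumptions a 7-base query-only gap falls halfway between the tabulated positions 3 and 11 and costs 425, while a double-sided gap of 5 and 6 bases is looked up at length 11 in the bothGap row and costs 750.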
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetXenTro1Viewnet Net X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)), Chain and Net Alignments Comparative Genomics netXenTro1 X. tropicalis Net X. tropicalis (Oct. 
2004 (JGI 3.0/xenTro1)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both X. tropicalis and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the X. tropicalis assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best X. tropicalis/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The X. tropicalis sequence used in this annotation is from the Oct. 2004 (JGI 3.0/xenTro1) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the X. tropicalis/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single X.
tropicalis chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91   -90   -25  -100
C    -90   100  -100   -25
G    -25  -100   100   -90
T   -100   -25   -90    91

Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program.
The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetXenTro1Viewchain Chain X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)), Chain and Net Alignments Comparative Genomics chainXenTro1 X. tropicalis Chain X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of X. tropicalis (Oct. 2004 (JGI 3.0/xenTro1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both X. tropicalis and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the X. tropicalis assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best X. tropicalis/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The X. tropicalis sequence used in this annotation is from the Oct. 2004 (JGI 3.0/xenTro1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth.
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the X. tropicalis/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single X. tropicalis chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91   -90   -25  -100
C    -90   100  -100   -25
G    -25  -100   100   -90
T   -100   -25   -90    91

Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track.
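As an illustration of how the substitution matrix above applies to a gapless block, the following minimal Python sketch sums per-base scores over two equal-length sequences and applies the minimum-score cutoff of 1000 quoted for this track. The function names `block_score` and `passes_min_score` are hypothetical and are not part of axtChain; a real chain score also subtracts gap costs, which this sketch leaves out, and it assumes only the bases A, C, G, and T occur.

```python
# Substitution scores copied from the matrix in the track description.
BASES = "ACGT"
MATRIX = [
    [91, -90, -25, -100],   # A vs A, C, G, T
    [-90, 100, -100, -25],  # C vs A, C, G, T
    [-25, -100, 100, -90],  # G vs A, C, G, T
    [-100, -25, -90, 91],   # T vs A, C, G, T
]
SCORE = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}

# Cutoff quoted for this track; other tracks quote other values.
MIN_SCORE = 1000


def block_score(target: str, query: str) -> int:
    """Sum substitution scores over one ungapped (gapless) block.

    Hypothetical helper: assumes both strings contain only A, C, G, T.
    """
    if len(target) != len(query):
        raise ValueError("ungapped blocks must have equal length")
    return sum(SCORE[(t, q)] for t, q in zip(target.upper(), query.upper()))


def passes_min_score(chain_score: int, min_score: int = MIN_SCORE) -> bool:
    """Chains scoring below the minimum are discarded; the rest are kept."""
    return chain_score >= min_score
```

For example, `block_score("ACGT", "ACGT")` adds the four diagonal entries of the matrix, while a single A/C mismatch contributes -90.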
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetDanRer3 Zebrafish Chain/Net Zebrafish (May 2005 (Zv5/danRer3)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of zebrafish (May 2005 (Zv5/danRer3)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both zebrafish and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the zebrafish assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best zebrafish/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The zebrafish sequence used in this annotation is from the May 2005 (Zv5/danRer3) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the zebrafish/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single zebrafish chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91   -90   -25  -100
C    -90   100  -100   -25
G    -25  -100   100   -90
T   -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
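The greedy placement just described (highest-scoring chains first, each later chain trimmed to the parts not already covered) can be sketched in a few lines of Python. This is a simplified illustration, not the chainNet implementation: each chain is reduced to a single half-open interval on the human (target) genome, gaps inside chains and the resulting level hierarchy are ignored, and the names `subtract`, `merge_into`, and `place_chains` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]            # half-open [start, end) on the target
Chain = Tuple[str, int, int, int]     # (name, score, start, end)


def subtract(span: Interval, covered: List[Interval]) -> List[Interval]:
    """Return the parts of span not overlapped by covered.

    covered must be sorted and non-overlapping.
    """
    start, end = span
    pieces: List[Interval] = []
    for c_start, c_end in covered:
        if c_end <= start:
            continue            # covered interval lies entirely to the left
        if c_start >= end:
            break               # covered interval lies entirely to the right
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def merge_into(covered: List[Interval], pieces: List[Interval]) -> List[Interval]:
    """Merge new pieces into the covered list, keeping it sorted and disjoint."""
    merged: List[Interval] = []
    for s, e in sorted(covered + pieces):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def place_chains(chains: List[Chain]) -> Dict[str, List[Interval]]:
    """Place chains in descending score order, trimming each to uncovered target."""
    covered: List[Interval] = []
    placed: Dict[str, List[Interval]] = {}
    for name, _score, start, end in sorted(chains, key=lambda c: -c[1]):
        pieces = subtract((start, end), covered)
        if pieces:              # chains that are fully covered are dropped
            placed[name] = pieces
            covered = merge_into(covered, pieces)
    return placed
```

With the hypothetical input `[("a", 5000, 0, 100), ("b", 3000, 50, 150), ("c", 1000, 20, 80)]`, chain `a` keeps its full span, `b` is trimmed to the part beyond `a`, and `c` is dropped because nothing uncovered remains for it.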
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. 
PMID: 12529312; PMC: PMC430961 chainNetDanRer3Viewnet Net Zebrafish (May 2005 (Zv5/danRer3)), Chain and Net Alignments Comparative Genomics netDanRer3 Zebrafish Net Zebrafish (May 2005 (Zv5/danRer3)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of zebrafish (May 2005 (Zv5/danRer3)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both zebrafish and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the zebrafish assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best zebrafish/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. 
The zebrafish sequence used in this annotation is from the May 2005 (Zv5/danRer3) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the zebrafish/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single zebrafish chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91   -90   -25  -100
C    -90   100  -100   -25
G    -25  -100   100   -90
T   -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetDanRer3Viewchain Chain Zebrafish (May 2005 (Zv5/danRer3)), Chain and Net Alignments Comparative Genomics chainDanRer3 Zebrafish Chain Zebrafish (May 2005 (Zv5/danRer3)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of zebrafish (May 2005 (Zv5/danRer3)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both zebrafish and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the zebrafish assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best zebrafish/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The zebrafish sequence used in this annotation is from the May 2005 (Zv5/danRer3) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth.
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the zebrafish/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single zebrafish chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91   -90   -25  -100
C    -90   100  -100   -25
G    -25  -100   100   -90
T   -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
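To make the chaining step more concrete, here is a minimal Python sketch of a dynamic program that picks the maximally scoring chain of gapless blocks, where each block must start after the previous one ends in both genomes. It is an illustration under stated assumptions, not axtChain itself: a plain quadratic scan over predecessors stands in for the kd-tree search, `toy_gap_cost` is a hypothetical placeholder rather than the track's linear gap table, and the `Block` type and `best_chain` function are names invented for this example.

```python
from typing import Callable, List, NamedTuple, Tuple


class Block(NamedTuple):
    """One gapless aligned block; coordinates are half-open [start, end)."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    score: int


def toy_gap_cost(dt: int, dq: int) -> int:
    """Hypothetical gap penalty, NOT the axtChain linear gap table."""
    if dt == 0 and dq == 0:
        return 0
    return 300 + dt + dq


def best_chain(
    blocks: List[Block],
    gap_cost: Callable[[int, int], int] = toy_gap_cost,
) -> Tuple[int, List[Block]]:
    """Return (score, blocks) of the highest-scoring chain.

    Quadratic dynamic program: best[i] is the best score of any chain
    ending at block i. Assumes every block is non-empty.
    """
    if not blocks:
        return 0, []
    order = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in order]
    prev = [-1] * len(order)
    for i, cur in enumerate(order):
        for j in range(i):
            p = order[j]
            # A predecessor must end before cur starts in both genomes.
            if p.t_end <= cur.t_start and p.q_end <= cur.q_start:
                gap = gap_cost(cur.t_start - p.t_end, cur.q_start - p.q_end)
                candidate = best[j] + cur.score - gap
                if candidate > best[i]:
                    best[i] = candidate
                    prev[i] = j
    end = max(range(len(order)), key=best.__getitem__)
    chain: List[Block] = []
    k = end
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

A chain found this way would then be compared against the minimum score quoted for the track and discarded if it falls below it.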
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 blatFr1 Fugu Blat Fugu (Aug. 2002 (JGI 3.0/fr1)) Translated Blat Alignments Comparative Genomics Description This track shows blat translated protein alignments of the Fugu (Aug. 2002 (JGI 3.0/fr1)) genome assembly to the human genome. The v3.0 Fugu whole genome shotgun assembly was provided by the US DOE Joint Genome Institute (JGI). The strand information (+/-) for this track is in two parts. The first + or - indicates the orientation of the query sequence whose translated protein produced the match. The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --; nor is +- the same as -+. Methods The alignments were made with blat in translated protein mode requiring two nearby 4-mer matches to trigger a detailed alignment. The human genome was masked with RepeatMasker and Tandem Repeat Finder before running blat. Credits The 3.0 draft from JGI was used in the UCSC Fugu blat alignments. These data were provided freely by the JGI for use in this publication only. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 chainNetFr1 Fugu Chain/Net Fugu (Aug. 
2002 (JGI 3.0/fr1)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of fugu (Aug. 2002 (JGI 3.0/fr1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both fugu and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the fugu assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best fugu/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The fugu sequence used in this annotation is from the Aug. 2002 (JGI 3.0/fr1) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the fugu/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. 
The axt alignments were fed into axtChain, which organizes all alignments between a single fugu chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      91    -90    -25   -100
C     -90    100   -100    -25
G     -25   -100    100    -90
T    -100    -25    -90     91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.

The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track

Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.

Credits

Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetFr1Viewnet Net Fugu (Aug. 2002 (JGI 3.0/fr1)), Chain and Net Alignments Comparative Genomics netFr1 Fugu Net Fugu (Aug. 2002 (JGI 3.0/fr1)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of fugu (Aug. 2002 (JGI 3.0/fr1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both fugu and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the fugu assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best fugu/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The fugu sequence used in this annotation is from the Aug. 2002 (JGI 3.0/fr1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. 
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap):

Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it.
Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation.
NonSyn - a match to a chromosome different from the gap in the level above.

Methods

Chain track

Transposons that have been inserted since the fugu/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single fugu chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      91    -90    -25   -100
C     -90    100   -100    -25
G     -25   -100    100    -90
T    -100    -25    -90     91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
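As a rough illustration of how the substitution matrix and the minimum-score cutoff above could be applied, the following Python sketch scores one gapless block and checks a chain score against the threshold. The names SCORES, block_score, and keep_chain are hypothetical and are not part of axtChain; the real program also subtracts gap costs when it joins blocks into a chain.

```python
# Substitution scores copied from the matrix above, indexed as SCORES[a][b].
SCORES = {
    "A": {"A": 91, "C": -90, "G": -25, "T": -100},
    "C": {"A": -90, "C": 100, "G": -100, "T": -25},
    "G": {"A": -25, "C": -100, "G": 100, "T": -90},
    "T": {"A": -100, "C": -25, "G": -90, "T": 91},
}

# Minimum chain score quoted in the text for this track.
MIN_CHAIN_SCORE = 5000


def block_score(target, query):
    """Sum the per-base scores of one gapless (ungapped) block.

    Hypothetical helper: both sequences must have the same length and
    contain only A, C, G, or T.
    """
    if len(target) != len(query):
        raise ValueError("a gapless block needs sequences of equal length")
    return sum(SCORES[t][q] for t, q in zip(target.upper(), query.upper()))


def keep_chain(score, min_score=MIN_CHAIN_SCORE):
    """Return True when a chain score is not below the cutoff."""
    return score >= min_score


if __name__ == "__main__":
    print(block_score("ACGT", "ACGT"))  # four matches
    print(keep_chain(4999), keep_chain(5000))
```

A perfect four-base match scores 91 + 100 + 100 + 91 = 382, so a chain needs on the order of fifty or more matching bases, net of mismatches and gaps, to reach the cutoff of 5000.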
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track

Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.

Credits

Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent.

References

Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetFr1Viewchain Chain Fugu (Aug. 2002 (JGI 3.0/fr1)), Chain and Net Alignments Comparative Genomics chainFr1 Fugu Chain Fugu (Aug. 2002 (JGI 3.0/fr1)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of fugu (Aug. 2002 (JGI 3.0/fr1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both fugu and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the fugu assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best fugu/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The fugu sequence used in this annotation is from the Aug. 2002 (JGI 3.0/fr1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. 
Syn - line-ups on the same chromosome as the gap in the level above it.
Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation.
NonSyn - a match to a chromosome different from the gap in the level above.

Methods

Chain track

Transposons that have been inserted since the fugu/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single fugu chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      91    -90    -25   -100
C     -90    100   -100    -25
G     -25   -100    100    -90
T    -100    -25    -90     91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.

The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track

Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
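The placement step described above can be sketched in Python as a greedy pass over chains sorted by score. This is only a simplified illustration, not the chainNet implementation: each chain is treated as one solid interval on the human genome, so the gap-filling hierarchy (levels 2, 3, and so on) is left out, and the names subtract and place_chains are hypothetical.

```python
def subtract(interval, covered):
    """Return the parts of `interval` that do not overlap `covered`.

    `interval` is a (start, end) pair; `covered` is a list of disjoint
    (start, end) pairs. Coordinates are half-open.
    """
    start, end = interval
    pieces = []
    for c_start, c_end in sorted(covered):
        if c_end <= start or c_start >= end:
            continue  # no overlap with what is left of the interval
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def place_chains(chains):
    """Place chains one at a time, highest score first.

    `chains` holds (name, score, start, end) tuples. Each chain is trimmed
    to the sections not already covered by a higher-scoring chain.
    """
    covered = []
    placed = []
    for name, _score, start, end in sorted(chains, key=lambda c: -c[1]):
        for piece in subtract((start, end), covered):
            placed.append((name, piece))
            covered.append(piece)
    return placed


if __name__ == "__main__":
    demo = [("a", 9000, 0, 100), ("b", 5000, 50, 150), ("c", 3000, 20, 60)]
    for name, piece in place_chains(demo):
        print(name, piece)
```

In the demo, chain "b" is trimmed to the part that "a" does not cover, and chain "c" disappears because it lies entirely under "a". In the real net, a chain like "c" could still be placed at a lower level if it fell inside a gap of "a".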
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetTetNig1 Tetraodon Chain/Net Tetraodon (Feb. 
2004 (Genoscope 7/tetNig1)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of tetraodon (Feb. 2004 (Genoscope 7/tetNig1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both tetraodon and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the tetraodon assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best tetraodon/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The tetraodon sequence used in this annotation is from the Feb. 2004 (Genoscope 7/tetNig1) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the tetraodon/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. 
The axt alignments were fed into axtChain, which organizes all alignments between a single tetraodon chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      91   -114    -31   -123
C    -114    100   -125    -31
G     -31   -125    100   -114
T    -123    -31   -114     91

Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track.

The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track

Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.

Credits

Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetTetNig1Viewnet Net Tetraodon (Feb. 2004 (Genoscope 7/tetNig1)), Chain and Net Alignments Comparative Genomics netTetNig1 Tetraodon Net Tetraodon (Feb. 2004 (Genoscope 7/tetNig1)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of tetraodon (Feb. 2004 (Genoscope 7/tetNig1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both tetraodon and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the tetraodon assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best tetraodon/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The tetraodon sequence used in this annotation is from the Feb. 2004 (Genoscope 7/tetNig1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. 
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap):

Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it.
Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation.
NonSyn - a match to a chromosome different from the gap in the level above.

Methods

Chain track

Transposons that have been inserted since the tetraodon/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single tetraodon chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A      C      G      T
A      91   -114    -31   -123
C    -114    100   -125    -31
G     -31   -125    100   -114
T    -123    -31   -114     91

Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track.
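The dynamic program over gapless blocks can be pictured with the Python sketch below. It is a simplification under stated assumptions, not the axtChain code: a plain quadratic search over predecessors stands in for the kd-tree, gap costs are taken from the loose linear gap table quoted in this description and are assumed to be linearly interpolated between the tabulated sizes, and the names interpolate, gap_cost, and best_chain are hypothetical.

```python
from bisect import bisect_right

# Values copied from the -linearGap=loose table in this description.
POSITION = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [325, 360, 400, 450, 600, 1100, 3600, 7600, 15600, 31600, 56600]
T_GAP = list(Q_GAP)  # the table lists identical tGap and qGap rows
BOTH_GAP = [625, 660, 700, 750, 900, 1400, 4000, 8000, 16000, 32000, 57000]
MIN_CHAIN_SCORE = 1000  # cutoff quoted for this track


def interpolate(size, costs):
    """Assumption: linear interpolation between tabulated gap sizes,
    continuing with the last segment's slope beyond the table."""
    if size <= POSITION[0]:
        return float(costs[0])
    i = min(bisect_right(POSITION, size) - 1, len(POSITION) - 2)
    x0, x1 = POSITION[i], POSITION[i + 1]
    return costs[i] + (costs[i + 1] - costs[i]) * (size - x0) / (x1 - x0)


def gap_cost(dt, dq):
    """Cost of a gap of dt target bases and dq query bases (assumed rule:
    one-sided gaps use tGap or qGap, double-sided gaps use bothGap)."""
    if dt == 0 and dq == 0:
        return 0.0
    if dq == 0:
        return interpolate(dt, T_GAP)
    if dt == 0:
        return interpolate(dq, Q_GAP)
    return interpolate(dt + dq, BOTH_GAP)


def best_chain(blocks):
    """Return (score, blocks) for the highest-scoring chain.

    Each block is (t_start, q_start, length, score). A block may follow
    another only if it starts at or after the other's end in both genomes.
    """
    if not blocks:
        return 0.0, []
    blocks = sorted(blocks)
    best = [float(b[3]) for b in blocks]
    prev = [None] * len(blocks)
    for i, (t_i, q_i, _len_i, s_i) in enumerate(blocks):
        for j in range(i):
            t_j, q_j, len_j, _s_j = blocks[j]
            dt, dq = t_i - (t_j + len_j), q_i - (q_j + len_j)
            if dt < 0 or dq < 0:
                continue  # overlapping or out of order: cannot be chained
            candidate = best[j] - gap_cost(dt, dq) + s_i
            if candidate > best[i]:
                best[i], prev[i] = candidate, j
    k = max(range(len(blocks)), key=best.__getitem__)
    chain = []
    while k is not None:
        chain.append(blocks[k])
        k = prev[k]
    return best[k if k is not None else 0] if False else max(best), chain[::-1]


if __name__ == "__main__":
    score, chain = best_chain([(0, 0, 100, 3000), (111, 100, 50, 2000)])
    print(score, chain, score >= MIN_CHAIN_SCORE)
```

In the demo, the two blocks are separated by an 11-base gap on the target only, which costs 450 in the table, so the chain scores 3000 + 2000 - 450 = 4550 and would pass the cutoff of 1000.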
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track

Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.

Credits

Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent.

References

Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetTetNig1Viewchain Chain Tetraodon (Feb. 2004 (Genoscope 7/tetNig1)), Chain and Net Alignments Comparative Genomics chainTetNig1 Tetraodon Chain Tetraodon (Feb. 2004 (Genoscope 7/tetNig1)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of tetraodon (Feb. 2004 (Genoscope 7/tetNig1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both tetraodon and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the tetraodon assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best tetraodon/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The tetraodon sequence used in this annotation is from the Feb. 2004 (Genoscope 7/tetNig1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the tetraodon/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single tetraodon chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
      A     C     G     T
A    91  -114   -31  -123
C  -114   100  -125   -31
G   -31  -125   100  -114
T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
-linearGap=loose
tablesize 11
smallSize 111
position     1     2     3    11   111  2111 12111 32111 72111 152111 252111
qGap       325   360   400   450   600  1100  3600  7600 15600  31600  56600
tGap       325   360   400   450   600  1100  3600  7600 15600  31600  56600
bothGap    625   660   700   750   900  1400  4000  8000 16000  32000  57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
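The placement step described above can be sketched in Python as a greedy pass over chains sorted by score. This is a simplified illustration under stated assumptions, not the chainNet implementation: it tracks coordinates on the target genome only, the names Chain, Fill, clip_blocks, and build_net are hypothetical, and the real program also accounts for the query genome and applies size and score thresholds.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    name: str
    score: int
    blocks: list  # sorted, non-overlapping half-open (start, end) target intervals


@dataclass
class Fill:
    chain: str
    level: int
    start: int
    end: int


def clip_blocks(blocks, lo, hi):
    """Return the parts of the blocks that fall inside [lo, hi)."""
    clipped = []
    for start, end in blocks:
        start, end = max(start, lo), min(end, hi)
        if start < end:
            clipped.append((start, end))
    return clipped


def build_net(chains, target_size):
    """Place chains from highest to lowest score into still-open regions."""
    free = [(0, target_size, 1)]  # open regions as (start, end, level)
    fills = []
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        next_free = []
        for lo, hi, level in free:
            kept = clip_blocks(chain.blocks, lo, hi)  # trim chain to the region
            if not kept:
                next_free.append((lo, hi, level))
                continue
            start, end = kept[0][0], kept[-1][1]
            fills.append(Fill(chain.name, level, start, end))
            # Flanks not covered by this chain stay open at the same level.
            if lo < start:
                next_free.append((lo, start, level))
            if end < hi:
                next_free.append((end, hi, level))
            # Gaps between the kept blocks open up one level below.
            for (_, prev_end), (next_start, _) in zip(kept, kept[1:]):
                if prev_end < next_start:
                    next_free.append((prev_end, next_start, level + 1))
        free = next_free
    return fills
```

With this sketch, a lower-scoring chain that overlaps a gap in a higher-scoring chain is trimmed to that gap and reported one level down, which mirrors the level 1, level 2, level 3 hierarchy shown in the net display.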
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 ecoresTetNig1 Tetraodon Ecores Human(hg17)/Tetraodon (Feb. 2004 (Genoscope 7/tetNig1)) Evolutionary Conserved Regions Comparative Genomics Description This track shows Evolutionary Conserved Regions computed by the Exofish program at Genoscope. 
Each singleton block corresponds to an "ecore"; two blocks connected by a thin line correspond to an "ecotig", a set of colinear ecores in a syntenic region. Methods Genome-wide sequence comparisons were done at the protein-coding level between the genome sequences of human, Homo sapiens, and Tetraodon (green spotted pufferfish), Tetraodon nigroviridis, to detect evolutionarily conserved regions (ECORES). Credits Thanks to Olivier Jaillon at Genoscope for contributing the data.