cons35way Conservation Vertebrate Multiz Alignment & Conservation (35 Species) Comparative Genomics

Description

This track shows multiple alignments of 35 vertebrate species and measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track. The conservation measurements were created using the phastCons package from Adam Siepel at Cold Spring Harbor Laboratory. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data.

See also: lastz parameters and other details for the chaining minimum score and gap parameters used in these alignments.

PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites.

The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution).
In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1.

Missing sequence in the assemblies is highlighted in the track display by regions of yellow when zoomed out and Ns displayed at base level (see Gap Annotation, below).

Organism | Species | Release date (assembly) | Alignment type
Mouse | Mus musculus | Jun. 2020 (GRCm39/mm39) | reference species
Beaver | Castor canadensis | Feb. 2017 (C.can genome v1.0/casCan1) | reciprocal best
Bonobo | Pan paniscus | May 2020 (Mhudiblu_PPA_v0/panPan3) | syntenic net
Bushbaby | Otolemur garnettii | Mar. 2011 (Broad/otoGar3) | reciprocal best
Chicken | Gallus gallus | Mar. 2018 (GRCg6a/galGal6) | maf net
Chimp | Pan troglodytes | Jan. 2018 (Clint_PTRv2/panTro6) | syntenic net
Chinese hamster | Cricetulus griseus | Jun. 2020 (GCF_0003668045.3 CriGri-PICRH-1.0) | syntenic net
Chinese pangolin | Manis pentadactyla | Aug 2014 (M_pentadactyla-1.1.1/manPen1) | reciprocal best
Cow | Bos taurus | Apr. 2018 (ARS-UCD1.2/bosTau9) | reciprocal best
Dog | Canis lupus familiaris | Mar. 2020 (UU_Cfam_GSD_1.0/canFam4) | syntenic net
Dolphin | Tursiops truncatus | Oct. 2011 (Baylor Ttru_1.4/turTru2) | reciprocal best
Elephant | Loxodonta africana | Jul. 2009 (Broad/loxAfr3) | reciprocal best
Gorilla | Gorilla gorilla gorilla | Aug. 2019 (Kamilah_GGO_v0/gorGor6) | syntenic net
Guinea pig | Cavia porcellus | Feb. 2008 (Broad/cavPor3) | syntenic net
Hawaiian monk seal | Neomonachus schauinslandi | Jun. 2017 (ASM220157v1/neoSch1) | syntenic net
Hedgehog | Erinaceus europaeus | May 2012 (EriEur2.0/eriEur2) | reciprocal best
Horse | Equus caballus | Jan. 2018 (EquCab3.0/equCab3) | syntenic net
Human | Homo sapiens | Dec. 2013 (GRCh38/hg38) | syntenic net
Lamprey | Petromyzon marinus | Dec. 2017 (Pmar_germline 1.0/petMar3) | maf net
Malayan flying lemur | Galeopterus variegatus | Jun. 2014 (G_variegatus-3.0.2/galVar1) | maf net
Marmoset | Callithrix jacchus | May 2020 (Callithrix_jacchus_cj1700_1.1/calJac4) | syntenic net
Opossum | Monodelphis domestica | Oct. 2006 (Broad/monDom5) | maf net
Pig | Sus scrofa | Aug. 2011 (SGSC Sscrofa10.2/susScr3) | reciprocal best
Pika | Ochotona princeps | May 2012 (OchPri3.0/ochPri3) | reciprocal best
Rabbit | Oryctolagus cuniculus | Apr. 2009 (Broad/oryCun2) | reciprocal best
Rat | Rattus norvegicus | Jul. 2014 (RGSC 6.0/rn6) | syntenic net
Rhesus | Macaca mulatta | Feb. 2019 (Mmul_10/rheMac10) | syntenic net
Sheep | Ovis aries | Nov. 2015 (Oar_v4.0/oviAri4) | syntenic net
Shrew | Sorex araneus | Aug. 2008 (Broad/sorAra2) | reciprocal best
Squirrel | Spermophilus tridecemlineatus | Nov. 2011 (Broad/speTri2) | reciprocal best
Tarsier | Tarsius syrichta | Sep. 2013 (Tarsius_syrichta-2.0.1/tarSyr2) | reciprocal best
Tenrec | Echinops telfairi | Nov. 2012 (Broad/echTel2) | reciprocal best
Tree shrew | Tupaia belangeri | Dec. 2006 (Broad/tupBel1) | reciprocal best
X. tropicalis | Xenopus tropicalis | Jul. 2016 (Xenopus_tropicalis_v9.1/xenTro9) | maf net
Zebrafish | Danio rerio | May 2017 (GRCz11/danRer11) | maf net

Table 1. Genome assemblies included in the 35-way Conservation track.
* Data download only, browser not available.

Display Conventions and Configuration

The track configuration options allow the user to display either the vertebrate or placental mammal conservation scores, or both simultaneously. In full and pack display modes, conservation scores are displayed as a wiggle track (histogram) in which the height reflects the size of the score. The conservation wiggles can be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options.

Pairwise alignments of each species to the mouse genome are displayed below the conservation histogram as a grayscale density plot (in pack mode) or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, conservation is shown in grayscale using darker values to indicate higher levels of overall conservation as scored by phastCons.

Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Note that excluding species from the pairwise display does not alter the conservation score display.

To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.

Gap Annotation

The Display chains between alignments configuration option enables display of gaps between alignment blocks in the pairwise alignments in a manner similar to the Chain track display. The following conventions are used:
Single line: No bases in the aligned species.
Possibly due to a lineage-specific insertion between the aligned blocks in the mouse genome or a lineage-specific deletion between the aligned blocks in the aligning species.
Double line: Aligning species has one or more unalignable bases in the gap region. Possibly due to excessive evolutionary distance between species or independent indels in the region between the aligned blocks in both species.
Pale yellow coloring: Aligning species has Ns in the gap region. Reflects uncertainty in the relationship between the DNA of both species, due to lack of sequence in relevant portions of the aligning species.

Genomic Breaks

Discontinuities in the genomic context (chromosome, scaffold or region) of the aligned DNA in the aligning species are shown as follows:
Vertical blue bar: Represents a discontinuity that persists indefinitely on either side, e.g. a large region of DNA on either side of the bar comes from a different chromosome in the aligned species due to a large scale rearrangement.
Green square brackets: Enclose shorter alignments consisting of DNA from one genomic context in the aligned species nested inside a larger chain of alignments from a different genomic context. The alignment within the brackets may represent a short misalignment, a lineage-specific insertion of a transposon in the mouse genome that aligns to a paralogous copy somewhere else in the aligned species, or other similar occurrence.

Base Level

When zoomed-in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the mouse sequence at those alignment positions relative to the longest non-mouse sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+".
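The rule for the Gaps line can be written as a small function. This is only an illustrative sketch, not the browser's own code: the function name gap_label and the idea of measuring "sufficient space" as a number of available character cells are assumptions made for the example.

```python
def gap_label(gap_size: int, chars_available: int) -> str:
    """Return the text drawn on the Gaps line for one gap.

    gap_size        -- length of the gap in the mouse sequence (bases)
    chars_available -- hypothetical measure of display space, in characters
    """
    if gap_size <= 0:
        raise ValueError("gap_size must be a positive number of bases")
    digits = str(gap_size)
    # Enough room: show the gap size itself.
    if len(digits) <= chars_available:
        return digits
    # Not enough room: "*" marks a multiple of 3, "+" marks any other size.
    return "*" if gap_size % 3 == 0 else "+"


if __name__ == "__main__":
    for size, room in [(3, 1), (12, 2), (12, 1), (10, 1)]:
        print(size, room, gap_label(size, room))
```

The multiple-of-3 case is singled out because such gaps do not shift the reading frame, which matters when codon translation is switched on.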
Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes:
No codon translation: The gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.

Codon translation uses the following gene tracks as the basis for translation, depending on the species chosen (Table 2).

Gene Track | Species
Known Genes | human
Ensembl Genes | tree shrew, opossum
NCBI RefSeq | beaver, bonobo, bushbaby, chicken, Chinese hamster, chimp, cow, elephant, gorilla, guinea pig, hawaiian monk seal, hedgehog, horse, malayan flying lemur, marmoset, mouse, pig, pika, rabbit, rat, rhesus, sheep, shrew, squirrel, tarsier, tenrec, X. tropicalis, zebrafish
Xeno RefGene | Chinese pangolin, dog, dolphin, lamprey

Table 2. Gene tracks used for codon translation.

Methods

Pairwise alignments with the mouse genome were generated for each species using lastz from repeat-masked genomic sequence.
Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. The scoring matrix and parameters for pairwise alignment and chaining were tuned for each species based on phylogenetic distance from the reference. High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net. For more information about the chaining and netting process and parameters for each species, see the description pages for the Chain and Net tracks. An additional filtering step was introduced in the generation of the 35-way conservation track to reduce the number of paralogs and pseudogenes from the high-quality assemblies and the suspect alignments from the low-quality assemblies: the pairwise alignments of high-quality mammalian sequences (placental and marsupial) were filtered based on synteny; those for 2X mammalian genomes were filtered to retain only alignments of best quality in both the target and query ("reciprocal best"). The resulting best-in-genome pairwise alignments were progressively aligned using multiz/autoMZ, following the tree topology diagrammed above, to produce multiple alignments. The multiple alignments were post-processed to add annotations indicating alignment gaps, genomic breaks, and base quality of the component sequences. The annotated multiple alignments, in MAF format, are available for bulk download. An alignment summary table containing an entry for each alignment block in each species was generated to improve track display performance at large scales. Framing tables were constructed to enable visualization of codons in the multiple alignment display. Conservation scoring was performed using the PhastCons package (A. Siepel), which computes conservation based on a two-state phylogenetic hidden Markov model (HMM). 
PhastCons measurements rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix. The vertebrate tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 28-way human (hg18) alignment (msa_view). The 4d sites were derived from the Oct 2005 Gencode Reference Gene set, which was filtered to select single-coverage long transcripts. The phastCons parameters used for the conservation measurement were: expected-length=45, target-coverage=0.3, and rho=0.31.

The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Note that, unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly-conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al. 2005.
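To make the "posterior probability of the conserved state" concrete, the following is a minimal sketch of forward-backward decoding on a two-state HMM. It is not phastCons itself. It assumes the per-column likelihoods under each state are already available (phastCons derives them from the tree model; here they are plain input lists of positive numbers), and it assumes the parameterization described in Siepel et al. 2005, in which expected-length is the reciprocal of the probability of leaving the conserved state and target-coverage is the stationary probability of the conserved state. The function name and the toy likelihood values are hypothetical, and rho is not used because it only affects how the column likelihoods are computed.

```python
def conserved_posteriors(lik_cons, lik_non,
                         expected_length=45.0, target_coverage=0.3):
    """Posterior probability of the conserved state at each column.

    lik_cons[i], lik_non[i] -- likelihood of alignment column i under the
    conserved and non-conserved states (assumed precomputed and positive).
    """
    mu = 1.0 / expected_length                           # conserved -> non-conserved
    nu = mu * target_coverage / (1.0 - target_coverage)  # non-conserved -> conserved
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]             # state 0 = conserved
    start = [target_coverage, 1.0 - target_coverage]     # stationary distribution
    emit = list(zip(lik_cons, lik_non))
    n = len(emit)
    if n == 0:
        return []

    # Forward pass, rescaled at every column to avoid underflow.
    first = [start[s] * emit[0][s] for s in (0, 1)]
    fwd = [[x / sum(first) for x in first]]
    for t in range(1, n):
        cur = [emit[t][s] * sum(fwd[-1][r] * trans[r][s] for r in (0, 1))
               for s in (0, 1)]
        fwd.append([x / sum(cur) for x in cur])

    # Backward pass, rescaled the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[t + 1][r] * bwd[t + 1][r] for r in (0, 1))
               for s in (0, 1)]
        bwd[t] = [x / sum(cur) for x in cur]

    # Combine: the per-column scaling constants cancel in the ratio.
    post = []
    for t in range(n):
        a = fwd[t][0] * bwd[t][0]
        b = fwd[t][1] * bwd[t][1]
        post.append(a / (a + b))
    return post


if __name__ == "__main__":
    # Toy likelihoods: five columns favouring conservation, then five that do not.
    scores = conserved_posteriors([0.9] * 5 + [0.1] * 5, [0.2] * 5 + [0.5] * 5)
    print([round(p, 3) for p in scores])
```

Because the transition probabilities make each state "sticky", a column's score depends on its neighbours as well as its own likelihoods, which is the autocorrelation the text describes and the reason no fixed window is needed.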
PhastCons currently treats alignment gaps as missing data, which sometimes has the effect of producing undesirably high conservation scores in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps.

Data Access

You can access this data in the Table Browser for position or identifier based queries in multiple formats. Downloads for data in this track are available in the following locations:
Multiz alignments (MAF format), and phylogenetic trees
PhyloP conservation (WIG format)
PhastCons conservation (WIG format)

Credits

This track was created using the following programs:
Alignment tools: lastz (formerly blastz) and multiz by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group
Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC
Conservation scoring: PhastCons, phyloFit, tree_doctor, msa_view by Adam Siepel while at UCSC, now at Cold Spring Harbor Laboratory
MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC
Tree image generator: phyloPng by Galt Barber, UCSC
Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC

The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community as of March 2007.

References

Phylo-HMMs and phastCons:
Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911
Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351.
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216
Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396

Chain/Net:
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784

Multiz:
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317

Lastz (formerly Blastz):
Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468
Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961

Phylogenetic Tree:
Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51.
PMID: 11743200

cons35wayViewalign Multiz Alignments Vertebrate Multiz Alignment & Conservation (35 Species) Comparative Genomics
multiz35way Multiz Align Multiz Alignments of 35 Vertebrates Comparative Genomics
cons35wayViewphastcons Element Conservation (phastCons) Vertebrate Multiz Alignment & Conservation (35 Species) Comparative Genomics
phastCons35way Cons 35 Verts 35 vertebrates conservation by PhastCons Comparative Genomics
cons35wayViewelements Conserved Elements Vertebrate Multiz Alignment & Conservation (35 Species) Comparative Genomics
phastConsElements35way 35 Vert. El 35 vertebrates Conserved Elements Comparative Genomics
cons35wayViewphyloP Basewise Conservation (phyloP) Vertebrate Multiz Alignment & Conservation (35 Species) Comparative Genomics
phyloP35way Cons 35 Verts 35 vertebrates Basewise Conservation by PhyloP Comparative Genomics

cpgIslandExt CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Expression and Regulation

Description

CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole.
The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version.

By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page.

Methods

CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria:
GC content of 50% or greater
length greater than 200 bp
ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment

The entire genome sequence, masking areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed.

The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)):

    Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)

where N = length of sequence.

The calculation of the track data is performed by the following command sequence:

    twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
      | cpg_lh /dev/stdin 2> cpg_lh.err \
      | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
      | sort -k1,1 -k2,2n > cpgIsland.bed

The unmasked track data is constructed from twoBitToFa -noMask output for the twoBitToFa command.

Data access

The CpG Islands track and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator.
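Returning to the Methods above, the Obs/Exp formula and the three acceptance criteria can be sketched in a few lines. This is an illustration of the arithmetic only, not the cpg_lh program: it takes a candidate segment as a string and does not perform the +17/-1 segment search. The function name and the dictionary keys are made up for the example.

```python
def cpg_island_stats(seq: str) -> dict:
    """Compute the per-segment statistics described in Methods.

    Returns GC percentage, CpG count, percentage CpG, the observed/expected
    CpG ratio, and whether the segment meets all three criteria.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("empty sequence")
    num_c = s.count("C")
    num_g = s.count("G")
    num_cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact

    gc_percent = 100.0 * (num_c + num_g) / n
    # Percentage CpG: bases that sit in a CpG (twice the count) over the length.
    cpg_percent = 100.0 * 2 * num_cpg / n
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = num_cpg * n / (num_c * num_g) if num_c and num_g else 0.0

    passes = gc_percent >= 50.0 and n > 200 and obs_exp > 0.6
    return {
        "length": n,
        "cpg_count": num_cpg,
        "gc_percent": gc_percent,
        "cpg_percent": cpg_percent,
        "obs_exp": obs_exp,
        "passes": passes,
    }


if __name__ == "__main__":
    print(cpg_island_stats("CG" * 150))   # long, CpG-rich toy segment
    print(cpg_island_stats("ACGT" * 10))  # too short to qualify
```

The percentage CpG computed here matches the 100.0*2*$6/width term in the awk command, where $6 is the CpG count and width is the segment length.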
All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file")

Credits

This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished).

References

Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447

cpgIslandSuper CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Expression and Regulation

knownGene GENCODE VM33 GENCODE VM33 Genes and Gene Predictions

Description

The GENCODE Genes track (version M33, Jul 2023) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. By default, only the basic gene set is displayed, which is a subset of the comprehensive gene set. The basic set represents transcripts that GENCODE believes will be useful to the majority of users.

The track includes protein-coding genes, non-coding RNA genes, and pseudo-genes, though pseudo-genes are not displayed by default. It contains annotations on the reference chromosomes as well as assembly patches and alternative loci (haplotypes).

The following table provides statistics for the VM33 release derived from the GTF file that contains annotations only on the main chromosomes. More information on how they were generated can be found in the GENCODE site.

GENCODE VM33 Release Stats

Genes | Observed | Transcripts | Observed
Protein-coding genes | 21,403 | Protein-coding transcripts | 58,750
Long non-coding RNA genes | 14,842 | - full length protein-coding | 45,112
Small non-coding RNA genes | 6,105 | - partial length protein-coding | 13,638
Pseudogenes | 13,809 | Nonsense mediated decay transcripts | 7,218
Immunoglobulin/T-cell receptor gene segments | 701 | Long non-coding RNA loci transcripts | 26,564
Total No of distinct translations | 44,993 | Genes that have more than one distinct translations | 10,893

For more information on the different gene tracks, see our Genes FAQ.
Display Conventions and Configuration

By default, this track displays only the basic GENCODE set, splice variants, and non-coding genes. It includes options to display the entire GENCODE set and pseudogenes. To customize these options, the respective boxes can be checked or unchecked at the top of this description page.

This track also includes a variety of labels which identify the transcripts when visibility is set to "full" or "pack". Gene symbols (e.g. NIPA1) are displayed by default, but additional options include GENCODE Transcript ID (ENSMUST00000052204.6), UCSC Known Gene ID (uc009hdu.3), UniProt Display ID (Q8BHK1). Additional information about gene and transcript names can be found in our FAQ.

This track, in general, follows the display conventions for gene prediction tracks. The exons for putative non-coding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker.

Coloring for the gene annotations is based on the annotation type:
coding
non-coding
pseudogene
problem
all 2-way pseudogenes
all polyA annotations

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. There is also an option to display the data as a density graph, which can be helpful for visualizing the distribution of items over a region.

Methods

The GENCODE VM33 track was built from the GENCODE downloads comprehensive gene annotation (all regions) file gencode.vM33.chr_patch_hapl_scaff.annotation.gff3.gz. Data from other sources were correlated with the GENCODE data to build association tables.

Related Data

The GENCODE Genes transcripts are annotated in numerous tables, each of which is also available as a downloadable file.

One can see a full list of the associated tables in the Table Browser by selecting GENCODE Genes from the track menu; this list is then available on the table menu.
Data access

GENCODE Genes and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. The genePred format files for mm39 are available from our downloads directory or in our GTF download directory. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog.

Credits

The GENCODE Genes track was produced at UCSC from the GENCODE comprehensive gene set using a computational pipeline developed by Jim Kent and Brian Raney. This version of the track was generated by Jonathan Casper.

References

Frankish A, Carbonell-Sala S, Diekhans M, Jungreis I, Loveland JE, Mudge JM, Sisu C, Wright JC, Arnan C, Barnes I et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2023 Jan 6;51(D1):D942-D949. PMID: 36420896; PMC: PMC9825462

A full list of GENCODE publications is available at The GENCODE Project web site.

Data Release Policy

GENCODE data are available for use without restrictions.

rmsk RepeatMasker Repeating Elements by RepeatMasker Variation and Repeats

Description

This track was created by using Arian Smit's RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences. The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track), as well as a modified version of the query sequence in which all the annotated repeats have been masked (generally available on the Downloads page). RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka (2000) in the References section below. Some newer assemblies have been made with Dfam, not Repbase. You can find the details for how we make our database data here in our "makeDb/doc/" directory.
Display Conventions and Configuration In full display mode, this track displays up to ten different classes of repeats: Short interspersed nuclear elements (SINE), which include ALUs Long interspersed nuclear elements (LINE) Long terminal repeat elements (LTR), which include retroposons DNA repeat elements (DNA) Simple repeats (micro-satellites) Low complexity repeats Satellite repeats RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA) Other repeats, which includes class RC (Rolling Circle) Unknown The level of color shading in the graphical display reflects the amount of base mismatch, base deletion, and base insertion associated with a repeat element. The higher the combined number of these, the lighter the shading. A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed. Methods Data are generated using the RepeatMasker -s flag. Additional flags may be used for certain organisms. Repeats are soft-masked. Alignments may extend through repeats, but are not permitted to initiate in them. See the FAQ for more information. Credits Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track. References Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. http://www.repeatmasker.org. 1996-2010. Repbase Update is described in: Jurka J. Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420. PMID: 10973072 For a discussion of repeats in mammalian genomes, see: Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6):657-63. PMID: 10607616 Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8. 
PMID: 8994846 refSeqComposite NCBI RefSeq RefSeq genes from NCBI Genes and Gene Predictions Description The NCBI RefSeq Genes composite track shows mouse protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). All subtracks use coordinates provided by RefSeq, except for the UCSC RefSeq track, which UCSC produces by realigning the RefSeq RNAs to the genome. This realignment may result in occasional differences between the annotation coordinates provided by UCSC and NCBI. See the Methods section for more details about how the different tracks were created. Please visit NCBI's Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track is a composite track that contains differing data sets. To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to hide. Note: Not all subtracks are available on all assemblies. The possible subtracks include: RefSeq aligned annotations and UCSC alignment of RefSeq annotations RefSeq All – all curated and predicted annotations provided by RefSeq. RefSeq Curated – subset of RefSeq All that includes only those annotations whose accessions begin with NM, NR, NP or YP. (NP and YP are used only for protein-coding genes on the mitochondrion; YP is used for human only.) RefSeq Predicted – subset of RefSeq All that includes those annotations whose accessions begin with XM or XR. RefSeq Other – all other annotations produced by the RefSeq group that do not fit the requirements for inclusion in the RefSeq Curated or the RefSeq Predicted tracks. RefSeq Alignments – alignments of RefSeq RNAs to the mouse genome provided by the RefSeq group, following the display conventions for PSL tracks.
RefSeq Diffs – alignment differences between the mouse reference genome(s) and RefSeq transcripts. (Track not currently available for every assembly.) UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the mouse genome. This track was previously known as the "RefSeq Genes" track. RefSeq Select – Subset of RefSeq Curated, transcripts marked as part of the RefSeq Select dataset. A single Select transcript is chosen as representative for each protein-coding gene. See NCBI RefSeq Select. The RefSeq All, RefSeq Curated, RefSeq Predicted, RefSeq Select and UCSC RefSeq tracks follow the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), or reviewed (dark), as defined by RefSeq. Color Level of review Reviewed: the RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information. Provisional: the RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff. Predicted: the RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted. The item labels and codon display properties for features within this track can be configured through the check-box controls at the top of the track description page. To adjust the settings for an individual subtrack, click the wrench icon next to the track name in the subtrack list . Label: By default, items are labeled by gene name. 
Click the appropriate Label option to display the accession name or OMIM identifier instead of the gene name, show all or a subset of these labels including the gene name, OMIM identifier and accession names, or turn off the label completely. Codon coloring: This track has an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. The RefSeq Diffs track contains five different types of inconsistency between the reference genome sequence and the RefSeq transcript sequences. The five types of differences are as follows: mismatch – aligned but mismatching bases, plus HGVS g. to show the genomic change required to match the transcript and HGVS c./n. to show the transcript change required to match the genome. short gap – genomic gaps that are too small to be introns (arbitrary cutoff of < 45 bp), most likely insertions/deletion variants or errors, with HGVS g. and c./n. showing differences. shift gap – shortGap items whose placement could be shifted left and/or right on the genome due to repetitive sequence, with HGVS c./n. position range of ambiguous region in transcript. Here, thin and thick lines are used -- the thin line shows the span of the repetitive sequence, and the thick line shows the rightmost shifted gap. double gap – genomic gaps that are long enough to be introns but that skip over transcript sequence (invisible in default setting), with HGVS c./n. deletion. skipped – sequence at the beginning or end of a transcript that is not aligned to the genome (invisible in default setting), with HGVS c./n. deletion HGVS Terminology (Human Genome Variation Society): g. = genomic sequence ; c. = coding DNA sequence ; n. = non-coding RNA reference sequence. 
When reporting HGVS with RefSeq sequences, to make sure that results from research articles can be mapped to the genome unambiguously, please specify the RefSeq annotation release displayed on the transcript's Genome Browser details page and also the RefSeq transcript ID with version (e.g. NM_012309.4 not NM_012309). Methods Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and converted to the genePred and PSL table formats for display in the Genome Browser. Information about the NCBI annotation pipeline can be found here. The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments. The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks. RefSeq RNAs were aligned against the mouse genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Data Access The raw data for these tracks can be accessed in multiple ways. It can be explored interactively using the Table Browser or Data Integrator. The tables can also be accessed programmatically through our public MySQL server or downloaded from our downloads server for local processing. The previous track versions are available in the archives of our downloads server. You can also access any RefSeq table entries in JSON format through our JSON API. The data in the RefSeq Other and RefSeq Diffs tracks are organized in bigBed file format; more information about accessing the information in this bigBed file can be found below. 
The other subtracks are associated with database tables as follows: genePred format: RefSeq All - ncbiRefSeq RefSeq Curated - ncbiRefSeqCurated RefSeq Predicted - ncbiRefSeqPredicted RefSeq Select - ncbiRefSeqSelect UCSC RefSeq - refGene PSL format: RefSeq Alignments - ncbiRefSeqPsl The first column of each of these tables is "bin". This column is designed to speed up access for display in the Genome Browser, but can be safely ignored in downstream analysis. You can read more about the bin indexing system here. The annotations in the RefSeqOther and RefSeqDiffs tracks are stored in bigBed files, which can be obtained from our downloads server here, ncbiRefSeqOther.bb and ncbiRefSeqDiffs.bb. Individual regions or the whole set of genome-wide annotations can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system from the utilities directory linked below. For example, to extract only annotations in a given region, you could use the following command: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/mm39/ncbiRefSeq/ncbiRefSeqOther.bb -chrom=chr16 -start=34990190 -end=36727467 stdout You can download a GTF format version of the RefSeq All table from the GTF downloads directory. The genePred format tracks can also be converted to GTF format using the genePredToGtf utility, available from the utilities directory on the UCSC downloads server. The utility can be run from the command line like so: genePredToGtf mm39 ncbiRefSeqPredicted ncbiRefSeqPredicted.gtf Note that using genePredToGtf in this manner accesses our public MySQL server, and you therefore must set up your hg.conf as described on the MySQL page linked near the beginning of the Data Access section. A file containing the RNA sequences in FASTA format for all items in the RefSeq All, RefSeq Curated, and RefSeq Predicted tracks can be found on our downloads server here. Please refer to our mailing list archives for questions. 
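As an illustration of ignoring the "bin" column in downstream analysis, the following minimal Python sketch parses one tab-separated row dumped from a genePred-format table such as ncbiRefSeq. The column order assumed here (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...) is the usual extended genePred layout; the function name, the Transcript type, and the example row are hypothetical and are not part of any UCSC tool.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    exons: List[Tuple[int, int]]  # (start, end) pairs, 0-based half-open


def parse_genepred_row(line: str, has_bin: bool = True) -> Transcript:
    """Parse one tab-separated genePred table row (assumed column order)."""
    fields = line.rstrip("\n").split("\t")
    if has_bin:
        # The leading "bin" column only speeds up browser display; drop it.
        fields = fields[1:]
    name, chrom, strand = fields[0], fields[1], fields[2]
    tx_start, tx_end = int(fields[3]), int(fields[4])
    # exonStarts and exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in fields[8].split(",") if x]
    ends = [int(x) for x in fields[9].split(",") if x]
    return Transcript(name, chrom, strand, tx_start, tx_end, list(zip(starts, ends)))


# Hypothetical row, for illustration only (not a real RefSeq record).
example_row = (
    "585\tNM_HYPOTHETICAL.1\tchr16\t+\t1000\t5000\t1200\t4800\t2\t"
    "1000,4000,\t2000,5000,\t0\tGeneX\tcmpl\tcmpl\t0,1,"
)
transcript = parse_genepred_row(example_row)
```

Rows exported without the bin column (for example, from a genePred file rather than a database table) can be parsed with has_bin=False.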
Credits This track was produced at UCSC from data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 ncbiRefSeqSelect RefSeq Select NCBI RefSeq Select: One representative transcript per protein-coding gene Genes and Gene Predictions refGene UCSC RefSeq UCSC annotations of RefSeq RNAs (NM_* and NR_*) Genes and Gene Predictions Description The RefSeq Genes track shows known mouse protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Please visit the Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. Label: By default, items are labeled by gene name. 
Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods RefSeq RNAs were aligned against the mouse genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. 
PMID: 15608248; PMC: PMC539979 ncbiRefSeqGenomicDiff RefSeq Diffs Differences between NCBI RefSeq Transcripts and the Reference Genome Genes and Gene Predictions ncbiRefSeqPsl RefSeq Alignments RefSeq Alignments of RNAs Genes and Gene Predictions ncbiRefSeqOther RefSeq Other NCBI RefSeq Other Annotations (not NM_*, NR_*, XM_*, XR_*, NP_* or YP_*) Genes and Gene Predictions ncbiRefSeqPredicted RefSeq Predicted NCBI RefSeq genes, predicted subset (XM_* or XR_*) Genes and Gene Predictions ncbiRefSeqCurated RefSeq Curated NCBI RefSeq genes, curated subset (NM_*, NR_*, NP_* or YP_*) Genes and Gene Predictions ncbiRefSeq RefSeq All NCBI RefSeq genes, curated and predicted (NM_*, XM_*, NR_*, XR_*, NP_*, YP_*) Genes and Gene Predictions xenoMRnas RefSeq mRNAs RefSeq mRNAs mapped to this assembly mRNA and EST Description The RefSeq mRNAs gene track for the mouse (Jun. 2020 (GRCm39/mm39)) genome assembly displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank. Track statistics summary Total genome size: 2,654,624,157 (not counting gaps); Gene count: 22,442; Bases in genes: 838,462,469 (txStart to txEnd); Genes percent genome coverage: 31.585%; Bases in exons: 53,564,706; Exons percent genome coverage: 2.018%. Search tips Please note, the name searching system is not completely case insensitive. When in doubt, enter search names in all lower case to find gene names. Methods The mRNAs were aligned against the mouse (Jun. 2020 (GRCm39/mm39)) genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept.
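The retention rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the text, not the scoring used internally by any UCSC filter program; the function name, the (name, identity) tuple layout, and the reading of "within 1%" as an absolute difference in identity are assumptions.

```python
from typing import List, Tuple


def keep_near_best(
    alignments: List[Tuple[str, float]],
    near_best: float = 0.01,     # "within 1% of the best" (assumed absolute)
    min_identity: float = 0.25,  # "at least 25% base identity"
) -> List[Tuple[str, float]]:
    """Keep alignments of one mRNA that are near the best and above the floor.

    Each alignment is a hypothetical (label, base_identity) pair with
    base_identity expressed as a fraction between 0 and 1.
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [
        (label, identity)
        for label, identity in alignments
        if identity >= best - near_best and identity >= min_identity
    ]


# Four hypothetical placements of the same mRNA.
placements = [("chr1", 0.900), ("chr7", 0.895), ("chr9", 0.800), ("chrX", 0.200)]
kept = keep_near_best(placements)
```

The same function expresses the stricter thresholds quoted for other tracks by changing the two keyword arguments.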
Specifically, the translated blat command is:

    blat -noHead -q=rnax -t=dnax -mask=lower target.fa query.fa target.query.psl

where target.fa is one of the chromosome sequences of the genome assembly and query.fa contains the mRNAs from RefSeq. The resulting PSL outputs are filtered:

    pslCDnaFilter -minId=0.35 -minCover=0.25 -globalNearBest=0.0100 -minQSize=20 \
      -ignoreIntrons -repsAsMatch -ignoreNs -bestOverlap \
      all.results.psl mm39.xenoRefGene.psl

The filtered mm39.xenoRefGene.psl is converted to genePred data to display for this track. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 cpgIslandExtUnmasked Unmasked CpG CpG Islands on All Sequence (Islands < 300 Bases are Light Green) Expression and Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination.
The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater; length greater than 200 bp; and a ratio greater than 0.6 of the observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment. The entire genome sequence, masked areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G), where N = length of sequence.
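The three criteria and the observed/expected formula above can be checked for a candidate segment with a short Python sketch. It covers only the evaluation step, not the dinucleotide scoring that finds maximally scoring segments; the function name and the returned dictionary keys are illustrative assumptions.

```python
def cpg_island_stats(seq: str) -> dict:
    """Compute the CpG island statistics described in the text for one segment."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    num_c = seq.count("C")
    num_g = seq.count("G")
    cpg_count = seq.count("CG")  # number of CG dinucleotides
    gc_fraction = (num_c + num_g) / n
    # Percentage CpG: CpG bases (twice the CpG count) relative to the length.
    percent_cpg = 100.0 * 2 * cpg_count / n
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = cpg_count * n / (num_c * num_g) if num_c and num_g else 0.0
    return {
        "length": n,
        "cpg_count": cpg_count,
        "gc_fraction": gc_fraction,
        "percent_cpg": percent_cpg,
        "obs_exp": obs_exp,
        # GC content >= 50%, length > 200 bp, Obs/Exp ratio > 0.6
        "passes": gc_fraction >= 0.5 and n > 200 and obs_exp > 0.6,
    }


# Artificial sequences, for illustration only.
long_rich = cpg_island_stats("CG" * 150)   # 300 bp of alternating C and G
short_rich = cpg_island_stats("ACGT" * 10)  # 40 bp, too short to qualify
```

For the 300 bp artificial sequence, the counts are 150 Cs, 150 Gs and 150 CG dinucleotides, giving Obs/Exp = 150 * 300 / (150 * 150) = 2.0.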
The calculation of the track data is performed by the following command sequence:

    twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
      | cpg_lh /dev/stdin 2> cpg_lh.err \
      | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
      | sort -k1,1 -k2,2n > cpgIsland.bed

The unmasked track data is constructed by using twoBitToFa -noMask for the twoBitToFa step. Data access CpG islands and their associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447 mrna Mouse mRNAs Mouse mRNAs from GenBank mRNA and EST Description The mRNA track shows alignments between mouse mRNAs in GenBank and the genome. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display.
For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods GenBank mouse mRNAs were aligned against the genome using the blat program. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 
GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 wgEncodeGencodeVM34 All GENCODE VM34 All GENCODE annotations from VM34 (Ensembl 111) Genes and Gene Predictions Description The GENCODE Genes track (version M34, Jan 2024) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation process and Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M34 annotation was carried out on genome assembly GRCm39 (mm39). The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. 
PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by transcript class manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene 
features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts that are nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. 
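The selection rule for coding transcripts at a single locus can be sketched in Python as follows. The CodingTranscript fields and the function name are hypothetical and do not correspond to GENCODE file attributes, and the sketch leaves out both the non-coding rule and the final fallback to the longest problem transcript.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodingTranscript:
    transcript_id: str
    full_length: bool  # CDS start and end both found
    problem: bool      # flagged as a problem transcript
    nmd: bool          # nonsense-mediated decay
    cds_length: int


def select_basic_coding(transcripts: List[CodingTranscript]) -> List[CodingTranscript]:
    """Apply the two coding-transcript criteria described in the text."""
    # Rule 1: every full-length coding transcript that is neither problem nor NMD.
    full = [t for t in transcripts if t.full_length and not t.problem and not t.nmd]
    if full:
        return full
    # Rule 2: otherwise the partial coding transcript with the largest CDS,
    # excluding problem transcripts.
    partial = [t for t in transcripts if not t.full_length and not t.problem]
    if partial:
        return [max(partial, key=lambda t: t.cds_length)]
    return []


# Hypothetical locus with no usable full-length transcript.
locus = [
    CodingTranscript("tx1", full_length=True, problem=False, nmd=True, cds_length=900),
    CodingTranscript("tx2", full_length=False, problem=False, nmd=False, cds_length=600),
    CodingTranscript("tx3", full_length=False, problem=False, nmd=False, cds_length=750),
]
basic = select_basic_coding(locus)
```

Here tx1 is excluded as NMD, so the rule falls through to the partial transcript with the largest CDS, tx3.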
Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases. Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length.
The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. 
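The comparison criteria above can be approximated by comparing intron chains, as in the Python sketch below. This is a simplification under stated assumptions and not the GENCODE implementation: exons are (start, end) pairs on one chromosome and strand, small insertions and deletions in the evidence are not modeled, and the function names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based half-open


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the introns implied by a list of exons, in genomic order."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def evidence_supports(annotation: List[Exon], evidence: List[Exon]) -> bool:
    """True if one evidence alignment supports a multi-exon annotation.

    Every annotated intron must match an evidence intron exactly, and the
    evidence must not contain extra introns (which would imply unannotated
    exons). Transcript start and end positions are free to differ.
    """
    annotated = introns(annotation)
    if not annotated:
        # Single-exon annotations are not evaluated in the current analysis.
        return False
    return annotated == introns(evidence)


annotation = [(100, 200), (300, 400), (500, 600)]
same_chain = [(150, 200), (300, 400), (500, 650)]              # ends differ only
extra_exon = [(100, 200), (300, 400), (500, 600), (700, 800)]  # unannotated exon
shifted = [(100, 201), (300, 400), (500, 600)]                 # boundary off by one
```

With these inputs, only same_chain supports the annotation: its outer boundaries differ from the annotation, but all intron boundaries match exactly.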
The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the core modules in APPRIS. PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M34 site. Release Notes GENCODE version M34 corresponds to Ensembl 111. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeSuper GENCODE Versions Container of all new and previous GENCODE releases Genes and Gene Predictions Description The aim of the GENCODE Genes project (Harrow et al., 2006) is to produce a set of highly accurate annotations of evidence-based gene features on the human reference genome.
This includes the identification of all protein-coding loci with associated alternative splice variants, non-coding loci with transcript evidence in the public databases (NCBI/EMBL/DDBJ) and pseudogenes. A high quality set of gene structures is necessary for many research studies such as comparative or evolutionary analyses, or for experimental design and interpretation of the results. The GENCODE Genes tracks display the high-quality manual annotations merged with evidence-based automated annotations across the entire human genome. The GENCODE gene set presents a full merge between HAVANA manual annotation and Ensembl automatic annotation. Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. With each release, there is an increase in the number of annotations that have undergone manual curation. This annotation was carried out on the GRCm38 (mm10) genome assembly. Experimental verification details are given in the description for each track. For more information on the different gene tracks, see our Genes FAQ. Display Conventions These are multi-view composite tracks that contain differing data sets (views). Instructions for configuring multi-view tracks are here. Only some subtracks are shown by default. The user can select which subtracks are displayed via the display controls on the track details pages. Further details on display conventions and data interpretation are available in the track descriptions. Release Notes GENCODE version M34 corresponds to Ensembl 111. GENCODE version M33 corresponds to Ensembl 110. GENCODE version M32 corresponds to Ensembl 109. GENCODE version M31 corresponds to Ensembl 108. GENCODE version M30 corresponds to Ensembl 107. GENCODE version M29 corresponds to Ensembl 106. GENCODE version M28 corresponds to Ensembl 105. GENCODE version M27 corresponds to Ensembl 104. GENCODE version M26 corresponds to Ensembl 103.
See also: The GENCODE Project Release History. Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM34ViewPolya PolyA All GENCODE annotations from VM34 (Ensembl 111) Genes and Gene Predictions wgEncodeGencodePolyaVM34 PolyA PolyA Transcript Annotation Set from GENCODE Version M34 (Ensembl 111) Genes and Gene Predictions wgEncodeGencodeVM34ViewGenes Genes All GENCODE annotations from VM34 (Ensembl 111) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM34 Pseudogenes Pseudogene Annotation Set from GENCODE Version M34 (Ensembl 111) Genes and Gene Predictions wgEncodeGencodeCompVM34 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M34 (Ensembl 111) Genes and Gene Predictions wgEncodeGencodeBasicVM34 Basic Basic Gene Annotation Set from GENCODE Version M34 (Ensembl 111) Genes and Gene Predictions wgEncodeGencodeVM33 All GENCODE VM33 All GENCODE annotations from VM33 (Ensembl 110) Genes and Gene Predictions Description The GENCODE Genes track (version M33, July 2023) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation process and Ensembl automatic annotation pipeline.
Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M33 annotation was carried out on genome assembly GRCm39 (mm39). The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
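The "maximum number of transcripts to display" setting described above amounts to keeping the best-ranked transcripts in each gene. The sketch below illustrates that idea only; it is not the Genome Browser's code. The function name and the `(gene_id, transcript_id, rank)` record layout are hypothetical, and it assumes that rank 1 is the highest-priority transcript.

```python
from collections import defaultdict


def limit_by_rank(transcripts, max_per_gene):
    """Keep at most max_per_gene transcripts per gene, best rank first.

    transcripts: iterable of (gene_id, transcript_id, rank) tuples, where
    rank 1 is assumed to be the highest-priority transcript in its gene.
    Returns {gene_id: [transcript_id, ...]} ordered by rank.
    """
    by_gene = defaultdict(list)
    for gene_id, transcript_id, rank in transcripts:
        by_gene[gene_id].append((rank, transcript_id))

    kept = {}
    for gene_id, items in by_gene.items():
        # Sorting the (rank, id) pairs orders by rank, then by id for ties.
        kept[gene_id] = [tid for _, tid in sorted(items)[:max_per_gene]]
    return kept
```

With a limit of 2, a gene carrying transcripts ranked 1, 2 and 3 would show only the first two, while a gene with a single transcript is unaffected.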
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users.
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts that are nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases.
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the core modules in APPRIS.
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M33 site. Release Notes GENCODE version M33 corresponds to Ensembl 110. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM33ViewPolya PolyA All GENCODE annotations from VM33 (Ensembl 110) Genes and Gene Predictions wgEncodeGencodePolyaVM33 PolyA PolyA Transcript Annotation Set from GENCODE Version M33 (Ensembl 110) Genes and Gene Predictions wgEncodeGencodeVM33ViewGenes Genes All GENCODE annotations from VM33 (Ensembl 110) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM33 Pseudogenes Pseudogene Annotation Set from GENCODE Version M33 (Ensembl 110) Genes and Gene Predictions wgEncodeGencodeCompVM33 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M33 (Ensembl 110) Genes and Gene Predictions wgEncodeGencodeBasicVM33 Basic Basic Gene Annotation Set from GENCODE Version M33 (Ensembl 110) Genes and Gene Predictions wgEncodeGencodeVM32 All GENCODE VM32 All GENCODE annotations from VM32 (Ensembl 109) Genes and Gene Predictions Description The GENCODE Genes track (version M32, Feb 2023) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation process and Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M32 annotation was carried out on genome assembly GRCm39 (mm39).
The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users.
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts that are nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases.
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the core modules in APPRIS.
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M32 site. Release Notes GENCODE version M32 corresponds to Ensembl 109. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM32ViewPolya PolyA All GENCODE annotations from VM32 (Ensembl 109) Genes and Gene Predictions wgEncodeGencodePolyaVM32 PolyA PolyA Transcript Annotation Set from GENCODE Version M32 (Ensembl 109) Genes and Gene Predictions wgEncodeGencodeVM32ViewGenes Genes All GENCODE annotations from VM32 (Ensembl 109) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM32 Pseudogenes Pseudogene Annotation Set from GENCODE Version M32 (Ensembl 109) Genes and Gene Predictions wgEncodeGencodeCompVM32 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M32 (Ensembl 109) Genes and Gene Predictions wgEncodeGencodeBasicVM32 Basic Basic Gene Annotation Set from GENCODE Version M32 (Ensembl 109) Genes and Gene Predictions wgEncodeGencodeVM31 All GENCODE VM31 All GENCODE annotations from VM31 (Ensembl 108) Genes and Gene Predictions Description The GENCODE Genes track (version M31, Oct 2022) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation process and Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M31 annotation was carried out on genome assembly GRCm39 (mm39).
The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
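The rank-based limit on displayed transcripts can be pictured with a minimal Python sketch. The record layout (`gene`, `id`, `rank` keys) and the function name `top_ranked` are hypothetical, and the sketch assumes that rank 1 is the highest-priority transcript within a gene, which the text above does not state explicitly.

```python
from collections import defaultdict

def top_ranked(transcripts, max_per_gene):
    """Keep at most max_per_gene transcripts per gene, lowest rank number first.

    Assumption: rank 1 is the best-ranked transcript within its gene.
    """
    by_gene = defaultdict(list)
    for t in transcripts:
        by_gene[t["gene"]].append(t)
    kept = []
    for gene in by_gene:  # dicts preserve first-seen gene order
        ranked = sorted(by_gene[gene], key=lambda t: t["rank"])
        kept.extend(ranked[:max_per_gene])
    return kept

# Hypothetical records, not real GENCODE identifiers.
transcripts = [
    {"gene": "geneA", "id": "txA1", "rank": 2},
    {"gene": "geneA", "id": "txA2", "rank": 1},
    {"gene": "geneA", "id": "txA3", "rank": 3},
    {"gene": "geneB", "id": "txB1", "rank": 1},
]
shown = [t["id"] for t in top_ranked(transcripts, 2)]
print(shown)
```

With a limit of two, the third-ranked transcript of geneA is dropped while geneB, which has a single transcript, is unaffected.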
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. 
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts subject to nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases. 
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
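As an aside on the transcript ranking criteria listed earlier under Methods, the ordering for protein-coding genes can be read as a lexicographic sort key. The following Python sketch is one possible reading, not the GENCODE implementation. The field names (`canonical`, `biotype`, `full_length`, `cars_score`) are hypothetical, the biotype strings for NMD and NSD are assumed spellings, and the direction of the CARS score (higher ranks first) is an assumption. The criterion on genomic span and length is omitted.

```python
# Lower tuple values sort first, so each criterion maps "1st" to 0, "2nd" to 1, and so on.
CANONICAL_ORDER = {"MANE_Select": 0, "Ensembl_canonical": 0, "MANE_Plus_Clinical": 1}
BIOTYPE_ORDER = {
    "protein_coding": 0, "protein_coding_LoF": 0,
    "nonsense_mediated_decay": 1, "non_stop_decay": 1,  # assumed spellings of NMD / NSD
    "retained_intron": 2, "protein_coding_CDS_not_defined": 2,
}

def coding_gene_sort_key(t):
    """Sort key for transcripts of a protein-coding gene (illustrative only)."""
    return (
        CANONICAL_ORDER.get(t.get("canonical"), 2),  # canonical status first
        BIOTYPE_ORDER.get(t["biotype"], 3),          # then coding biotype tier
        0 if t["full_length"] else 1,                # then completeness
        -t.get("cars_score", 0.0),                   # assumption: higher CARS score ranks first
    )

# Hypothetical transcripts, not real annotations.
transcripts = [
    {"id": "t1", "canonical": None, "biotype": "protein_coding", "full_length": True, "cars_score": 0.5},
    {"id": "t2", "canonical": "MANE_Select", "biotype": "protein_coding", "full_length": True, "cars_score": 0.9},
    {"id": "t3", "canonical": None, "biotype": "nonsense_mediated_decay", "full_length": True, "cars_score": 0.7},
    {"id": "t4", "canonical": None, "biotype": "protein_coding", "full_length": False, "cars_score": 0.8},
]
ranked_ids = [t["id"] for t in sorted(transcripts, key=coding_gene_sort_key)]
print(ranked_ids)
```

Because the key is a tuple, a later criterion only breaks ties left by the earlier ones, which matches the ordered way the criteria are listed.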
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules. 
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M31 site. Release Notes GENCODE version M31 corresponds to Ensembl 108. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM31ViewPolya PolyA All GENCODE annotations from VM31 (Ensembl 108) Genes and Gene Predictions wgEncodeGencodePolyaVM31 PolyA PolyA Transcript Annotation Set from GENCODE Version M31 (Ensembl 108) Genes and Gene Predictions wgEncodeGencodeVM31ViewGenes Genes All GENCODE annotations from VM31 (Ensembl 108) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM31 Pseudogenes Pseudogene Annotation Set from GENCODE Version M31 (Ensembl 108) Genes and Gene Predictions wgEncodeGencodeCompVM31 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M31 (Ensembl 108) Genes and Gene Predictions wgEncodeGencodeBasicVM31 Basic Basic Gene Annotation Set from GENCODE Version M31 (Ensembl 108) Genes and Gene Predictions wgEncodeGencodeVM30 All GENCODE VM30 All GENCODE annotations from VM30 (Ensembl 107) Genes and Gene Predictions Description The GENCODE Genes track (version M30, July 2022) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation; predicted Ensembl annotations are used when there are no corresponding manual annotations. The M30 annotation was carried out on genome assembly GRCm39 (mm39). 
The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
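Because the Basic set is a subset of the Comprehensive set, it can in principle be recovered from a Comprehensive GTF download. The Python sketch below assumes that Basic transcripts carry a `tag "basic";` attribute in the ninth GTF column, which is how the GENCODE GTF format marks them to the best of my knowledge but is not stated in this description. The helper `make_line` and the identifiers `GENE0001`, `TX0001` and `TX0002` are hypothetical.

```python
import re

# Matches every tag "..." attribute in the GTF attribute column.
TAG_RE = re.compile(r'tag "([^"]+)"')

def is_basic_transcript(gtf_line):
    """Return True for a transcript record whose attributes include tag "basic"."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "transcript":
        return False
    return "basic" in TAG_RE.findall(fields[8])

def make_line(transcript_id, tags):
    """Build a hypothetical GTF transcript line for the example."""
    attrs = f'gene_id "GENE0001"; transcript_id "{transcript_id}"; '
    attrs += " ".join(f'tag "{t}";' for t in tags)
    return "\t".join(["chr1", "HAVANA", "transcript", "100", "900", ".", "+", ".", attrs])

comprehensive = [
    make_line("TX0001", ["basic", "Ensembl_canonical"]),
    make_line("TX0002", ["cds_start_NF"]),
]
basic_subset = [line for line in comprehensive if is_basic_transcript(line)]
print(len(basic_subset))
```

Matching on the whole tag value rather than on a substring avoids picking up other tags that merely contain the word.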
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. 
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts subject to nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases. 
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules. 
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M30 site. Release Notes GENCODE version M30 corresponds to Ensembl 107. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM30ViewPolya PolyA All GENCODE annotations from VM30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodePolyaVM30 PolyA PolyA Transcript Annotation Set from GENCODE Version M30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodeVM30ViewGenes Genes All GENCODE annotations from VM30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM30 Pseudogenes Pseudogene Annotation Set from GENCODE Version M30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodeCompVM30 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodeBasicVM30 Basic Basic Gene Annotation Set from GENCODE Version M30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodeVM30View2Way 2-Way All GENCODE annotations from VM30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencode2wayConsPseudoVM30 2-way Pseudogenes 2-way Pseudogene Annotation Set from GENCODE Version M30 (Ensembl 107) Genes and Gene Predictions wgEncodeGencodeVM29 All GENCODE VM29 All GENCODE annotations from VM29 (Ensembl 106) Genes and Gene Predictions Description The GENCODE Genes track (version M29, Feb 2022) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. 
Priority is given to the manually curated HAVANA annotation; predicted Ensembl annotations are used when there are no corresponding manual annotations. The M29 annotation was carried out on genome assembly GRCm39 (mm39). The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. 
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts subject to nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases. 
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules. 
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M29 site. Release Notes GENCODE version M29 corresponds to Ensembl 106. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM29ViewPolya PolyA All GENCODE annotations from VM29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodePolyaVM29 PolyA PolyA Transcript Annotation Set from GENCODE Version M29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodeVM29ViewGenes Genes All GENCODE annotations from VM29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM29 Pseudogenes Pseudogene Annotation Set from GENCODE Version M29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodeCompVM29 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodeBasicVM29 Basic Basic Gene Annotation Set from GENCODE Version M29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodeVM29View2Way 2-Way All GENCODE annotations from VM29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencode2wayConsPseudoVM29 2-way Pseudogenes 2-way Pseudogene Annotation Set from GENCODE Version M29 (Ensembl 106) Genes and Gene Predictions wgEncodeGencodeVM28 All GENCODE VM28 All GENCODE annotations from VM28 (Ensembl 105) Genes and Gene Predictions Description The GENCODE Genes track (version M28, Oct 2021) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. 
Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M28 annotation was carried out on genome assembly GRCm39 (mm39). The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
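The rank-based limit described above can be illustrated with a short sketch. It is not the Genome Browser's implementation: the Transcript record, the top_ranked function, and the convention that rank 1 is the highest-ranked transcript are all assumptions made for illustration, and ranks are only assigned starting with the human V42 and mouse VM31 releases.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical record; field names are illustrative, not a UCSC schema."""
    gene_id: str
    transcript_id: str
    rank: int  # assumption: 1 is the highest-ranked transcript within its gene


def top_ranked(transcripts, max_per_gene):
    """Keep at most max_per_gene transcripts per gene, best rank first."""
    by_gene = defaultdict(list)
    for t in transcripts:
        by_gene[t.gene_id].append(t)

    kept = []
    for gene_id in sorted(by_gene):  # sorted only to make the output order stable
        ranked = sorted(by_gene[gene_id], key=lambda t: t.rank)
        kept.extend(ranked[:max_per_gene])
    return kept


if __name__ == "__main__":
    demo = [
        Transcript("geneA", "tx1", 2),
        Transcript("geneA", "tx2", 1),
        Transcript("geneA", "tx3", 3),
        Transcript("geneB", "tx4", 1),
    ]
    for t in top_ranked(demo, max_per_gene=2):
        print(t.gene_id, t.transcript_id, t.rank)
```

Limiting per gene rather than across the whole set keeps every gene represented, which matches the stated purpose of reducing the number of transcripts displayed for each gene.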
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. 
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts subject to nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript was included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases. 
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules. 
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M28 site. Release Notes GENCODE version M28 corresponds to Ensembl 105. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM28ViewPolya PolyA All GENCODE annotations from VM28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodePolyaVM28 PolyA PolyA Transcript Annotation Set from GENCODE Version M28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodeVM28ViewGenes Genes All GENCODE annotations from VM28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM28 Pseudogenes Pseudogene Annotation Set from GENCODE Version M28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodeCompVM28 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodeBasicVM28 Basic Basic Gene Annotation Set from GENCODE Version M28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodeVM28View2Way 2-Way All GENCODE annotations from VM28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencode2wayConsPseudoVM28 2-way Pseudogenes 2-way Pseudogene Annotation Set from GENCODE Version M28 (Ensembl 105) Genes and Gene Predictions wgEncodeGencodeVM27 All GENCODE VM27 All GENCODE annotations from VM27 (Ensembl 104) Genes and Gene Predictions Description The GENCODE Genes track (version M27, May 2021) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. 
Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M27 annotation was carried out on genome assembly GRCm39 (mm39). The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation All - don't filter by transcript class coding - display protein coding transcripts, including polymorphic pseudogenes nonCoding - display non-protein coding transcripts pseudo - display pseudogene transcript annotations problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain) Transcript Annotation Method: filter by the method used to create the annotation All - don't filter by annotation method manual - display manually created annotations, including those that are also created automatically automatic - display automatically created annotations, including those that are also created manually manual_only - display manually created annotations that were not annotated by the automatic method automatic_only - display automatically created annotations that were not annotated by the manual method Transcript Biotype: filter transcripts by Biotype Support Level: filter transcripts by transcription support level Coloring for the gene annotations is based on the annotation type: coding non-coding pseudogene problem all polyA annotations Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. 
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or transcripts subject to nonsense-mediated decay) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript was included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases. 
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs tsl3 - the only support is from a single EST tsl4 - the best supporting EST is flagged as suspect tsl5 - no single transcript supports the model structure tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes immunoglobulin gene transcript T-cell receptor transcript single-exon transcript (will be included in a future version) APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules. 
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M27 site. Release Notes GENCODE version M27 corresponds to Ensembl 104. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM27ViewPolya PolyA All GENCODE annotations from VM27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodePolyaVM27 PolyA PolyA Transcript Annotation Set from GENCODE Version M27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodeVM27ViewGenes Genes All GENCODE annotations from VM27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM27 Pseudogenes Pseudogene Annotation Set from GENCODE Version M27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodeCompVM27 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodeBasicVM27 Basic Basic Gene Annotation Set from GENCODE Version M27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodeVM27View2Way 2-Way All GENCODE annotations from VM27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencode2wayConsPseudoVM27 2-way Pseudogenes 2-way Pseudogene Annotation Set from GENCODE Version M27 (Ensembl 104) Genes and Gene Predictions wgEncodeGencodeVM26 All GENCODE VM26 All GENCODE annotations from VM26 (Ensembl 103) Genes and Gene Predictions Description The GENCODE Genes track (version M26, Feb 2021) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. 
Priority is given to the manually curated HAVANA annotation using predicted Ensembl annotations when there are no corresponding manual annotations. The M26 annotation was carried out on genome assembly GRCm39 (mm39). The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release. Display Conventions and Configuration This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Views available on this track are: Genes The gene annotations in this view are divided into three subtracks: GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section. GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set. GENCODE Pseudogenes include all annotations except polymorphic pseudogenes. PolyA GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome. Maximum number of transcripts to display is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned rank within the gene. The ranks may be used to filter the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment. 
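As a rough illustration of how a per-gene rank could drive a "maximum number of transcripts" filter, the sketch below keeps the N best-ranked transcripts of each gene. The record fields (gene, name, rank) and the function name are hypothetical, and the sketch assumes that a lower rank number means higher priority, which the text above does not state explicitly.

```python
from collections import defaultdict


def top_ranked(transcripts, max_per_gene):
    """Keep at most `max_per_gene` transcripts per gene, best rank first.

    `transcripts` is a list of dicts with hypothetical keys
    'gene', 'name' and 'rank' (assumed: rank 1 is the most preferred).
    """
    by_gene = defaultdict(list)
    for record in transcripts:
        by_gene[record["gene"]].append(record)

    kept = []
    for records in by_gene.values():
        # Sort a copy by rank so the caller's list is left untouched.
        ordered = sorted(records, key=lambda r: r["rank"])
        kept.extend(ordered[:max_per_gene])
    return kept


# Hypothetical records, for illustration only.
example_transcripts = [
    {"gene": "g1", "name": "t1", "rank": 2},
    {"gene": "g1", "name": "t2", "rank": 1},
    {"gene": "g1", "name": "t3", "rank": 3},
    {"gene": "g2", "name": "t4", "rank": 1},
]
print([r["name"] for r in top_ranked(example_transcripts, 2)])
```

The browser's own filter may break ties or order genes differently; this only shows the general idea of rank-based truncation.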
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria: Transcript class: filter by the basic biological function of a transcript annotation; All - don't filter by transcript class; coding - display protein coding transcripts, including polymorphic pseudogenes; nonCoding - display non-protein coding transcripts; pseudo - display pseudogene transcript annotations; problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain). Transcript Annotation Method: filter by the method used to create the annotation; All - don't filter by annotation method; manual - display manually created annotations, including those that are also created automatically; automatic - display automatically created annotations, including those that are also created manually; manual_only - display manually created annotations that were not annotated by the automatic method; automatic_only - display automatically created annotations that were not annotated by the manual method. Transcript Biotype: filter transcripts by Biotype. Support Level: filter transcripts by transcription support level. Coloring for the gene annotations is based on the annotation type: coding, non-coding, pseudogene, problem, and all polyA annotations. Methods The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006). GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users.
The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus. Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus: All full-length coding transcripts (except problem transcripts or nonsense-mediated decay transcripts) were included in the basic set. If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts). Criteria for selection of non-coding transcripts at a given locus: All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set. If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts). If no transcripts were included by either of the above criteria, the longest problem transcript is included. Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria: well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA; poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping. Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases.
Protein_coding genes MANE or Ensembl canonical -1st: MANE Select / Ensembl canonical -2nd: MANE Plus Clinical Coding biotypes -1st: protein_coding and protein_coding_LoF -2nd: NMDs and NSDs -3rd: retained intron and protein_coding_CDS_not_defined Completeness -1st: full length -2nd: CDS start/end not found CARS score (only for coding transcripts) Transcript genomic span and length (only for non-coding transcripts) Non-coding genes Transcript biotype -1st: transcript biotype identical to gene biotype Ensembl canonical GENCODE basic Transcript genomic span Transcript length Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl. The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. 
GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments. Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ. The following categories are assigned to each of the evaluated annotations: tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA; tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs; tsl3 - the only support is from a single EST; tsl4 - the best supporting EST is flagged as suspect; tsl5 - no single transcript supports the model structure; tslNA - the transcript was not analyzed for one of the following reasons: pseudogene annotation, including transcribed pseudogenes; immunoglobulin gene transcript; T-cell receptor transcript; single-exon transcript (will be included in a future version). APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It adds value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the core modules in APPRIS.
PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way: ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species. ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species. Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website. Downloads GENCODE GFF3 and GTF files are available from the GENCODE release M26 site. Release Notes GENCODE version M26 corresponds to Ensembl 103. See also: The GENCODE Project Credits The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. 
More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here. References Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087 A full list of GENCODE publications is available at The GENCODE Project web site. Data Release Policy GENCODE data are available for use without restrictions. wgEncodeGencodeVM26ViewPolya PolyA All GENCODE annotations from VM26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencodePolyaVM26 PolyA PolyA Transcript Annotation Set from GENCODE Version M26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencodeVM26ViewGenes Genes All GENCODE annotations from VM26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencodePseudoGeneVM26 Pseudogenes Pseudogene Annotation Set from GENCODE Version M26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencodeCompVM26 Comprehensive Comprehensive Gene Annotation Set from GENCODE Version M26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencodeBasicVM26 Basic Basic Gene Annotation Set from GENCODE Version M26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencodeVM26View2Way 2-Way All GENCODE annotations from VM26 (Ensembl 103) Genes and Gene Predictions wgEncodeGencode2wayConsPseudoVM26 2-way Pseudogenes 2-way Pseudogene Annotation Set from GENCODE Version M26 (Ensembl 103) Genes and Gene Predictions gold Assembly Assembly from Fragments Mapping and Sequencing Description This track shows the sequences used in the Jun. 2020 mouse genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file.
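To make the AGP layout concrete, here is a minimal sketch that tallies component types from AGP-formatted lines. It relies on the AGP convention of tab-separated columns with the component type in column 5, where N and U denote gaps and lines starting with # are comments. The function name and the example rows (including the accessions) are invented for illustration and are not taken from the mm39 AGP file.

```python
from collections import Counter

# In AGP, component types N and U describe gaps rather than sequence.
GAP_TYPES = {"N", "U"}


def count_component_types(agp_lines):
    """Count non-gap component types (column 5) in AGP-formatted lines."""
    counts = Counter()
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        component_type = fields[4]
        if component_type in GAP_TYPES:
            continue  # gap rows carry gap length/type, not a component
        counts[component_type] += 1
    return counts


# Hypothetical rows: object, start, end, part number, type, then
# component id/start/end/orientation (or gap details for N rows).
example_agp = [
    "# hypothetical example, not real assembly data",
    "chr1\t1\t1000\t1\tF\tAC000001.1\t1\t1000\t+",
    "chr1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chr1\t1101\t2100\t3\tW\tAAAA01000001.1\t1\t1000\t-",
]
print(dict(count_component_types(example_agp)))
```

Run over a real AGP file (for example, by passing an open file handle), a tally of this kind is one way to obtain per-type component counts.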
In dense mode, this track depicts the contigs that make up the currently viewed scaffold. Contig boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist between contigs, spaces are shown between the gold and brown blocks. The relative order and orientation of the contigs within a scaffold is always known; therefore, a line is drawn in the graphical display to bridge the blocks. Component types found in this track (with counts of that type in parentheses): F - finished sequence (20,878) W - whole genome shotgun (1,264) O - other sequence (118) P - pre draft (12) A - active finishing (1) augustusGene AUGUSTUS AUGUSTUS ab initio gene predictions v3.1 Genes and Gene Predictions Description This track shows ab initio predictions from the program AUGUSTUS (version 3.1). The predictions are based on the genome sequence alone. For more information on the different gene tracks, see our Genes FAQ. Methods Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. Alternative splicing transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2). The different models used by Augustus were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option allows one to choose the species used for training the models. 
Different training species were used for the --species option when generating these predictions for different groups of assemblies.
Assembly Group: Training Species
Fish: zebrafish
Birds: chicken
Human and all other vertebrates: human
Nematodes: caenorhabditis
Drosophila: fly
A. mellifera: honeybee1
A. gambiae: culex
S. cerevisiae: saccharomyces
This table describes which training species was used for a particular group of assemblies. When available, the closest related training species was used. Credits Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke. References Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656 Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192 mm39ChainNet Chain/Net Chain and Net Alignments Comparative Genomics Description Chain Track The chain track shows alignments of mouse (Jun. 2020 (GRCm39/mm39)/mm39) to other genomes using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and the other genome simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the other assembly or an insertion in the mouse assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species.
In cases where multiple chains align over a particular region of the other genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/other chain for every part of the other genome. It is useful for finding orthologous regions and for studying genome rearrangement. The mouse sequence used in this annotation is from the Jun. 2020 (GRCm39/mm39) (mm39) assembly. Display Conventions and Configuration Multiple species are grouped together in a composite track. In the display and on the configuration page, an effort was made to group them loosely into "clades." These groupings are based on the taxonomic classification at NCBI, using the CommonTree tool. Some organisms may be pulled from a larger group into a subgroup, to emphasize a relationship. For example, members of an Order may be listed together, while other organisms in the same Superorder may be grouped separately under the Superorder name. Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. 
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track The lastz alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single chromosome from the other genome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. See also: lastz parameters and other details (e.g., update time) and chain minimum score and gap parameters used in these alignments. Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. 
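The score-ordered placement described above can be sketched in one dimension. This is a deliberately simplified illustration, not the chainNet implementation: chains are reduced to a score plus a list of ungapped blocks on the reference (half-open coordinates), each chain keeps only the bases not yet claimed by a higher-scoring chain, and the nesting level is approximated as one more than the level of any already-placed chain whose span encloses it. All names and the example data are hypothetical.

```python
def subtract(interval, covered):
    """Return the parts of half-open `interval` not overlapped by `covered`."""
    pieces = [interval]
    for c_start, c_end in covered:
        next_pieces = []
        for start, end in pieces:
            if c_end <= start or c_start >= end:  # no overlap: keep as is
                next_pieces.append((start, end))
                continue
            if start < c_start:                   # part left of the overlap
                next_pieces.append((start, c_start))
            if c_end < end:                       # part right of the overlap
                next_pieces.append((c_end, end))
        pieces = next_pieces
    return pieces


def place_chains(chains):
    """Greedily place chains, highest score first, trimming to free bases.

    `chains` is a list of dicts with keys 'name', 'score' and 'blocks'
    (a list of (start, end) tuples). Returns (name, level, kept_blocks).
    """
    covered = []  # blocks already claimed by placed chains
    spans = []    # (start, end, level) of each placed chain, gaps included
    placed = []
    for chain in sorted(chains, key=lambda c: c["score"], reverse=True):
        kept = []
        for block in chain["blocks"]:
            kept.extend(subtract(block, covered))
        if not kept:
            continue  # nothing left after trimming: the chain is dropped
        kept.sort()
        lo, hi = kept[0][0], kept[-1][1]
        # Simplification: nest under the deepest placed chain enclosing us.
        level = 1 + max(
            (lvl for s, e, lvl in spans if s <= lo and hi <= e), default=0
        )
        covered.extend(kept)
        spans.append((lo, hi, level))
        placed.append((chain["name"], level, kept))
    return placed


example_chains = [
    {"name": "A", "score": 1000, "blocks": [(0, 100), (200, 300)]},
    {"name": "B", "score": 500, "blocks": [(90, 180)]},
    {"name": "C", "score": 100, "blocks": [(50, 80)]},
    {"name": "D", "score": 50, "blocks": [(400, 450)]},
]
for name, level, blocks in place_chains(example_chains):
    print(name, level, blocks)
```

In this toy input, chain B is trimmed to the gap inside chain A and lands at level 2, chain C is entirely covered and disappears, and chain D is placed at level 1 on its own.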
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits lastz was developed by: Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. blastz was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 mm39ChainNetViewnet Nets Chain and Net Alignments Comparative Genomics netPetMar3 Lamprey Net Lamprey (Dec. 
2017 (Pmar_germline 1.0/petMar3)) Alignment Net Comparative Genomics netDanRer11 Zebrafish Net Zebrafish (May 2017 (GRCz11/danRer11)) Alignment Net Comparative Genomics netXenTro10 X. tropicalis Net X. tropicalis (Nov. 2019 (UCB_Xtro_10.0/xenTro10)) Alignment Net Comparative Genomics netGalGal6 Chicken Net Chicken (Mar. 2018 (GRCg6a/galGal6)) Alignment Net Comparative Genomics netMonDom5 Opossum Net Opossum (Oct. 2006 (Broad/monDom5)) Alignment Net Comparative Genomics netLoxAfr3 Elephant Net Elephant (Jul. 2009 (Broad/loxAfr3)) Alignment Net Comparative Genomics netEchTel2 Tenrec Net Tenrec (Nov. 2012 (Broad/echTel2)) Alignment Net Comparative Genomics netSusScr3 Pig Net Pig (Aug. 2011 (SGSC Sscrofa10.2/susScr3)) Alignment Net Comparative Genomics netTurTru2 Dolphin Net Dolphin (Oct. 2011 (Baylor Ttru_1.4/turTru2)) Alignment Net Comparative Genomics netBosTau9 Cow Net Cow (Apr. 2018 (ARS-UCD1.2/bosTau9)) Alignment Net Comparative Genomics netOviAri4 Sheep Net Sheep (Nov. 2015 (Oar_v4.0/oviAri4)) Alignment Net Comparative Genomics netEquCab3 Horse Net Horse (Jan. 2018 (EquCab3.0/equCab3)) Alignment Net Comparative Genomics netManPen1 Chinese pangolin Net Chinese pangolin (Aug 2014 (M_pentadactyla-1.1.1/manPen1)) Alignment Net Comparative Genomics netCanFam5 Dog Net Dog (May 2019 (UMICH_Zoey_3.1/canFam5)) Alignment Net Comparative Genomics netCanFam6 Dog Net Dog (Oct. 2020 (Dog10K_Boxer_Tasha/canFam6)) Alignment Net Comparative Genomics netNeoSch1 Hawaiian monk seal Net Hawaiian monk seal (Jun. 2017 (ASM220157v1/neoSch1)) Alignment Net Comparative Genomics netEriEur2 Hedgehog Net Hedgehog (May 2012 (EriEur2.0/eriEur2)) Alignment Net Comparative Genomics netSorAra2 Shrew Net Shrew (Aug. 2008 (Broad/sorAra2)) Alignment Net Comparative Genomics netHg38 Human Net Human (Dec. 2013 (GRCh38/hg38)) Alignment Net Comparative Genomics netPanTro6 Chimp Net Chimp (Jan. 
2018 (Clint_PTRv2/panTro6)) Alignment Net Comparative Genomics netPanPan3 Bonobo Net Bonobo (May 2020 (Mhudiblu_PPA_v0/panPan3)) Alignment Net Comparative Genomics netGorGor6 Gorilla Net Gorilla (Aug. 2019 (Kamilah_GGO_v0/gorGor6)) Alignment Net Comparative Genomics netRheMac10 Rhesus Net Rhesus (Feb. 2019 (Mmul_10/rheMac10)) Alignment Net Comparative Genomics netCalJac4 Marmoset Net Marmoset (May 2020 (Callithrix_jacchus_cj1700_1.1/calJac4)) Alignment Net Comparative Genomics netTarSyr2 Tarsier Net Tarsier (Sep. 2013 (Tarsius_syrichta-2.0.1/tarSyr2)) Alignment Net Comparative Genomics netOtoGar3 Bushbaby Net Bushbaby (Mar. 2011 (Broad/otoGar3)) Alignment Net Comparative Genomics netTupBel1 Tree shrew Net Tree shrew (Dec. 2006 (Broad/tupBel1)) Alignment Net Comparative Genomics netGalVar1 Malayan flying lemur Net Malayan flying lemur (Jun. 2014 (G_variegatus-3.0.2/galVar1)) Alignment Net Comparative Genomics netOryCun2 Rabbit Net Rabbit (Apr. 2009 (Broad/oryCun2)) Alignment Net Comparative Genomics netOchPri3 Pika Net Pika (May 2012 (OchPri3.0/ochPri3)) Alignment Net Comparative Genomics netCavPor3 Guinea pig Net Guinea pig (Feb. 2008 (Broad/cavPor3)) Alignment Net Comparative Genomics netSpeTri2 Squirrel Net Squirrel (Nov. 2011 (Broad/speTri2)) Alignment Net Comparative Genomics netGCF_003668045.3 Chinese hamster mafNet Chinese hamster (Jun. 2020 GCF_003668045.3_CriGri-PICRH-1.0) mafNet Alignment Comparative Genomics netRn6 Rat Net Rat (Jul. 2014 (RGSC 6.0/rn6)) Alignment Net Comparative Genomics netRn7 Rat Net Rat (Nov. 2020 (mRatBN7.2/rn7)) Alignment Net Comparative Genomics mm39ChainNetViewchain Chains Chain and Net Alignments Comparative Genomics chainPetMar3 Lamprey Chain Lamprey (Dec. 2017 (Pmar_germline 1.0/petMar3)) Chained Alignments Comparative Genomics chainDanRer11 Zebrafish Chain Zebrafish (May 2017 (GRCz11/danRer11)) Chained Alignments Comparative Genomics chainXenTro10 X. tropicalis Chain X. tropicalis (Nov. 
2019 (UCB_Xtro_10.0/xenTro10)) Chained Alignments Comparative Genomics chainGalGal6 Chicken Chain Chicken (Mar. 2018 (GRCg6a/galGal6)) Chained Alignments Comparative Genomics chainMonDom5 Opossum Chain Opossum (Oct. 2006 (Broad/monDom5)) Chained Alignments Comparative Genomics chainLoxAfr3 Elephant Chain Elephant (Jul. 2009 (Broad/loxAfr3)) Chained Alignments Comparative Genomics chainEchTel2 Tenrec Chain Tenrec (Nov. 2012 (Broad/echTel2)) Chained Alignments Comparative Genomics chainSusScr3 Pig Chain Pig (Aug. 2011 (SGSC Sscrofa10.2/susScr3)) Chained Alignments Comparative Genomics chainTurTru2 Dolphin Chain Dolphin (Oct. 2011 (Baylor Ttru_1.4/turTru2)) Chained Alignments Comparative Genomics chainBosTau9 Cow Chain Cow (Apr. 2018 (ARS-UCD1.2/bosTau9)) Chained Alignments Comparative Genomics chainOviAri4 Sheep Chain Sheep (Nov. 2015 (Oar_v4.0/oviAri4)) Chained Alignments Comparative Genomics chainEquCab3 Horse Chain Horse (Jan. 2018 (EquCab3.0/equCab3)) Chained Alignments Comparative Genomics chainManPen1 Chinese pangolin Chain Chinese pangolin (Aug 2014 (M_pentadactyla-1.1.1/manPen1)) Chained Alignments Comparative Genomics chainCanFam5 Dog Chain Dog (May 2019 (UMICH_Zoey_3.1/canFam5)) Chained Alignments Comparative Genomics chainCanFam6 Dog Chain Dog (Oct. 2020 (Dog10K_Boxer_Tasha/canFam6)) Chained Alignments Comparative Genomics chainNeoSch1 Hawaiian monk seal Chain Hawaiian monk seal (Jun. 2017 (ASM220157v1/neoSch1)) Chained Alignments Comparative Genomics chainEriEur2 Hedgehog Chain Hedgehog (May 2012 (EriEur2.0/eriEur2)) Chained Alignments Comparative Genomics chainSorAra2 Shrew Chain Shrew (Aug. 2008 (Broad/sorAra2)) Chained Alignments Comparative Genomics chainHg38 Human Chain Human (Dec. 2013 (GRCh38/hg38)) Chained Alignments Comparative Genomics chainPanTro6 Chimp Chain Chimp (Jan. 
2018 (Clint_PTRv2/panTro6)) Chained Alignments Comparative Genomics chainPanPan3 Bonobo Chain Bonobo (May 2020 (Mhudiblu_PPA_v0/panPan3)) Chained Alignments Comparative Genomics chainGorGor6 Gorilla Chain Gorilla (Aug. 2019 (Kamilah_GGO_v0/gorGor6)) Chained Alignments Comparative Genomics chainRheMac10 Rhesus Chain Rhesus (Feb. 2019 (Mmul_10/rheMac10)) Chained Alignments Comparative Genomics chainCalJac4 Marmoset Chain Marmoset (May 2020 (Callithrix_jacchus_cj1700_1.1/calJac4)) Chained Alignments Comparative Genomics chainTarSyr2 Tarsier Chain Tarsier (Sep. 2013 (Tarsius_syrichta-2.0.1/tarSyr2)) Chained Alignments Comparative Genomics chainOtoGar3 Bushbaby Chain Bushbaby (Mar. 2011 (Broad/otoGar3)) Chained Alignments Comparative Genomics chainTupBel1 Tree shrew Chain Tree shrew (Dec. 2006 (Broad/tupBel1)) Chained Alignments Comparative Genomics chainGalVar1 Malayan flying lemur Chain Malayan flying lemur (Jun. 2014 (G_variegatus-3.0.2/galVar1)) Chained Alignments Comparative Genomics chainOryCun2 Rabbit Chain Rabbit (Apr. 2009 (Broad/oryCun2)) Chained Alignments Comparative Genomics chainOchPri3 Pika Chain Pika (May 2012 (OchPri3.0/ochPri3)) Chained Alignments Comparative Genomics chainCavPor3 Guinea pig Chain Guinea pig (Feb. 2008 (Broad/cavPor3)) Chained Alignments Comparative Genomics chainSpeTri2 Squirrel Chain Squirrel (Nov. 2011 (Broad/speTri2)) Chained Alignments Comparative Genomics chainGCF_003668045.3 Chinese hamster Chain Chinese hamster (Jun. 2020 GCF_003668045.3_CriGri-PICRH-1.0) Chained Alignments Comparative Genomics chainRn6 Rat Chain Rat (Jul. 2014 (RGSC 6.0/rn6)) Chained Alignments Comparative Genomics chainRn7 Rat Chain Rat (Nov. 
2020 (mRatBN7.2/rn7)) Chained Alignments Comparative Genomics cytoBandIdeo Chromosome Band (Ideogram) Chromosome Bands Based on Microscopy (for Ideogram) Mapping and Sequencing crisprAllTargets CRISPR Targets CRISPR/Cas9 -NGG Targets, whole genome Genes and Gene Predictions Description This track shows the DNA sequences targetable by CRISPR RNA guides using the Cas9 enzyme from S. pyogenes (PAM: NGG) over the entire mouse (mm39) genome. CRISPR target sites were annotated with predicted specificity (off-target effects) and predicted efficiency (on-target cleavage) by various algorithms through the tool CRISPOR. Sp-Cas9 usually cuts double-stranded DNA three or four base pairs 5' of the PAM site. Display Conventions and Configuration The track "CRISPR Targets" shows all potential -NGG target sites across the genome. The target sequence of the guide is shown with a thick (exon) bar. The PAM motif match (NGG) is shown with a thinner bar. Guides are colored to reflect both predicted specificity and efficiency. Specificity reflects the "uniqueness" of a 20mer sequence in the genome; the less unique a sequence is, the more likely it is to cleave other locations of the genome (off-target effects). Efficiency is the frequency of cleavage at the target site (on-target efficiency). Shades of gray stand for sites that are hard to target specifically, as the 20mer is not very unique in the genome: impossible to target: target site has at least one identical copy in the genome and was not scored; hard to target: so many similar sequences in the genome that alignment was stopped (possibly a repeat);
hard to target: target site was aligned but results in a low specificity score <= 50 (see below). Colors highlight targets that are specific in the genome (MIT specificity > 50) but have different predicted efficiencies: unable to calculate Doench/Fusi 2016 efficiency score; low predicted cleavage: Doench/Fusi 2016 Efficiency percentile <= 30; medium predicted cleavage: Doench/Fusi 2016 Efficiency percentile > 30 and < 55; high predicted cleavage: Doench/Fusi 2016 Efficiency percentile > 55. Mouse-over a target site to show predicted specificity and efficiency scores: The MIT Specificity score summarizes all off-targets into a single number from 0-100. The higher the number, the fewer off-target effects are expected. We recommend guides with an MIT specificity > 50. The efficiency score tries to predict whether a guide leads to rather strong or weak cleavage. According to Haeussler et al. (2016), the Doench 2016 Efficiency score should be used to select the guide with the highest cleavage efficiency when expressing guides from RNA PolIII Promoters such as U6. Scores are given as percentiles, e.g. "70%" means that 70% of mammalian guides have a score equal to or lower than this guide. The raw score number is also shown in parentheses after the percentile. The Moreno-Mateos 2015 Efficiency score should be used instead of the Doench 2016 score when transcribing the guide in vitro with a T7 promoter, e.g. for injections in mouse, zebrafish or Xenopus embryos. The Moreno-Mateos score is given in percentiles and the raw value in parentheses, see the note above. Click on features to show all scores and predicted off-targets with up to four mismatches. The Out-of-Frame score by Bae et al. 2014 is correlated with the probability that mutations induced by the guide RNA will disrupt the open reading frame. The authors recommend out-of-frame scores > 66 to create knock-outs with a single guide efficiently. Off-target sites are sorted by the CFD (Cutting Frequency Determination) score (Doench et al.
2016). The higher the CFD score, the more likely there is off-target cleavage at that site. Off-targets with a CFD score < 0.023 are not shown on this page, but are available when following the link to the external CRISPOR tool. When compared against experimentally validated off-targets by Haeussler et al. 2016, the large majority of predicted off-targets with CFD scores < 0.023 were false-positives. For storage and performance reasons, on the level of individual off-targets, only CFD scores are available. Methods Relationship between predictions and experimental data Like most algorithms, the MIT specificity score is not always a perfect predictor of off-target effects. Despite low scores, many tested guides caused few and/or weak off-target cleavage when tested with whole-genome assays (Figure 2 from Haeussler et al. 2016), as shown below, and the published data contains few data points with high specificity scores. Overall though, the assays showed that the higher the specificity score, the lower the off-target effects. Similarly, efficiency scoring is not very accurate: guides with low scores can be efficient and vice versa. As a general rule, however, the higher the score, the less likely that a guide is very inefficient. The following histograms illustrate, for each type of score, how the share of inefficient guides drops with increasing efficiency scores: When reading this plot, keep in mind that both scores were evaluated on their own training data. Especially for the Moreno-Mateos score, the results are too optimistic, due to overfitting. When evaluated on independent datasets, the correlation of the prediction with other assays was around 25% lower, see Haeussler et al. 2016. At the time of writing, there is no independent dataset available yet to determine the Moreno-Mateos accuracy for each score percentile range. Track methods The entire mouse (mm39) genome was scanned for the -NGG motif. 
Flanking 20mer guide sequences were aligned to the genome with BWA and scored with MIT Specificity scores using the command-line version of crispor.org. Non-unique guide sequences were skipped. Flanking sequences were extracted from the genome and input for Crispor efficiency scoring, available from the Crispor downloads page, which includes the Doench 2016, Moreno-Mateos 2015 and Bae 2014 algorithms, among others. Note that the Doench 2016 scores were updated by the Broad institute in 2017 ("Azimuth" update). As a result, earlier versions of the track show the old Doench 2016 scores and this version of the track shows new Doench 2016 scores. Old and new scores are almost identical, they are correlated to 0.99 and for more than 80% of the guides the difference is below 0.02. However, for very few guides, the difference can be bigger. In case of doubt, we recommend the new scores. Crispor.org can display both scores and many more with the "Show all scores" link. Data Access Positional data can be explored interactively with the Table Browser or the Data Integrator. For small programmatic positional queries, the track can be accessed using our REST API. For genome-wide data or automated analysis, CRISPR genome annotations can be downloaded from our download server as a bigBedFile. The files for this track are called crispr.bb, which lists positions and scores, and crisprDetails.tab, which has information about off-target matches. Individual regions or whole genome annotations can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a pre-compiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. 
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/mm39/crisprAllTargets/crispr.bb -chrom=chr1 -start=0 -end=1000000 stdout Credits Track created by Maximilian Haeussler, with helpful input from Jean-Paul Concordet (MNHN Paris) and Alberto Stolfi (NYU). References Haeussler M, Schönig K, Eckert H, Eschstruth A, Mianné J, Renaud JB, Schneider-Maunoury S, Shkumatava A, Teboul L, Kent J et al. Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol. 2016 Jul 5;17(1):148. PMID: 27380939; PMC: PMC4934014 Bae S, Kweon J, Kim HS, Kim JS. Microhomology-based choice of Cas9 nuclease target sites. Nat Methods. 2014 Jul;11(7):705-6. PMID: 24972169 Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016 Feb;34(2):184-91. PMID: 26780180; PMC: PMC4744125 Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, Li Y, Fine EJ, Wu X, Shalem O et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013 Sep;31(9):827-32. PMID: 23873081; PMC: PMC3969858 Moreno-Mateos MA, Vejnar CE, Beaudoin JD, Fernandez JP, Mis EK, Khokha MK, Giraldez AJ. CRISPRscan: designing highly efficient sgRNAs for CRISPR-Cas9 targeting in vivo. Nat Methods. 2015 Oct;12(10):982-8. PMID: 26322839; PMC: PMC4589495 evaSnp EVA SNP Release 3 Short Genetic Variants from European Variant Archive Release 3 Variation and Repeats Description This track contains mappings of single nucleotide variants and small insertions and deletions (indels) from the European Variation Archive (EVA) Release 3 for the mouse mm39 genome. The dbSNP database at NCBI no longer hosts non-human variants. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels.
When viewing the track at or near base-level resolution, the displayed width of the SNP variant corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. The display is set to automatically collapse to dense visibility when there are more than 100k variants in the window. When the window size is more than 250k bp, the display is switched to density graph mode. Searching, details, and filtering Navigation to an individual variant can be accomplished by typing or copying the variant identifier (rsID) or the genomic coordinates into the Position/Search box on the Browser. A click on an item in the graphical display displays a page with data about that variant. Data fields include the Reference and Alternate Alleles, the class of the variant as reported by EVA, the source of the data, the amino acid change, if any, and the functional class as determined by UCSC's Variant Annotation Integrator. Variants can be filtered using the track controls to show subsets of the data by either EVA Sequence Ontology (SO) term, UCSC-generated functional effect, or by color, which bins the UCSC functional effects into general classes. Mouse-over Mousing over an item shows the ucscClass, which is the consequence according to the Variant Annotation Integrator, and the aaChange when one is available, which is the change in amino acid in HGVS.p terms. Items may have multiple ucscClasses, which will all be shown in the mouse-over in a comma-separated list. Likewise, multiple HGVS.p terms may be shown for each rsID separated by spaces describing all possible AA changes. Multiple items may appear due to different variant predictions on multiple gene transcripts. 
For all organisms the gene models used were ncbiRefSeqCurated, except for mm39 which used ncbiRefSeqSelect. Track colors Variants are colored by the most potentially deleterious functional effect predicted by the Variant Annotation Integrator. Specific bins can be seen in the Methods section below. Color Variant Type Protein-altering variants and splice site variants Synonymous codon variants Non-coding transcript or Untranslated Region (UTR) variants Intergenic and intronic variants Sequence ontology (SO) Variants are classified by EVA into one of the following sequence ontology terms: substitution — A single nucleotide in the reference is replaced by another, alternate allele. deletion — One or more nucleotides are deleted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is a deletion of an A may be represented as Ref = GA and Alt = G. insertion — One or more nucleotides are inserted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is an insertion of a T may be represented as Ref = G and Alt = GT. delins — Similar to tandemRepeat, in that the runs of Ref and Alt Alleles are of different length, except that there is more than one type of nucleotide, e.g., Ref = CCAAAAACAAAAACA, Alt = ACAAAAAC. multipleNucleotideVariant — More than one nucleotide is substituted by an equal number of different nucleotides, e.g., Ref = AA, Alt = GC. sequence alteration — A parent term meant to signify a deviation from another sequence. Can be assigned to variants that have not been characterized yet. Methods Data were downloaded from the European Variation Archive EVA release 3 (2022-02-24) current_ids.vcf.gz files corresponding to the proper assembly.
Chromosome names were converted to UCSC-style, a few problematic variants were removed, and the variants were passed through the Variant Annotation Integrator to predict consequence. For every organism the ncbiRefSeqCurated gene models were used to predict the consequences, except for mm39 which used the ncbiRefSeqSelect models. Variants were then colored according to their predicted consequence in the following fashion: Protein-altering variants and splice site variants - exon_loss_variant, frameshift_variant, inframe_deletion, inframe_insertion, initiator_codon_variant, missense_variant, splice_acceptor_variant, splice_donor_variant, splice_region_variant, stop_gained, stop_lost, coding_sequence_variant, transcript_ablation; Synonymous codon variants - synonymous_variant, stop_retained_variant; Non-coding transcript or Untranslated Region (UTR) variants - 5_prime_UTR_variant, 3_prime_UTR_variant, complex_transcript_variant, non_coding_transcript_exon_variant; Intergenic and intronic variants - upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_variant, NMD_transcript_variant, no_sequence_alteration. Sequence Ontology ("SO:") terms were converted to the variant classes, then the files were converted to BED, and then bigBed format. No functional annotations were provided by the EVA (e.g., missense, nonsense, etc). These were computed using UCSC's Variant Annotation Integrator (Hinrichs, et al., 2016). Amino-acid substitutions for missense variants are based on RefSeq alignments of mRNA transcripts, which do not always match the amino acids predicted from translating the genomic sequence. Therefore, in some instances, the variant and the genomic nucleotide and associated amino acid may be reversed. E.g., a Pro > Arg change from the perspective of the mRNA would be Arg > Pro from the perspective of the genomic sequence. For complete documentation of the processing of these tracks, read the EVA Release 3 MakeDoc.
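The coloring rule described in the Methods above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the track: the bin names are shortened labels, the function name `color_bin` is hypothetical, and the sketch assumes that the four bins are ranked from most to least deleterious in the order in which they are listed, and that an item's ucscClass is a comma-separated list of consequence terms as shown in the mouse-over.

```python
# Illustrative sketch only; not the UCSC build code.
# Bins are copied from the Methods text. Assumption: they are ranked
# from most to least deleterious in the order they are listed.
BINS = [
    ("protein-altering or splice site", {
        "exon_loss_variant", "frameshift_variant", "inframe_deletion",
        "inframe_insertion", "initiator_codon_variant", "missense_variant",
        "splice_acceptor_variant", "splice_donor_variant",
        "splice_region_variant", "stop_gained", "stop_lost",
        "coding_sequence_variant", "transcript_ablation",
    }),
    ("synonymous codon", {"synonymous_variant", "stop_retained_variant"}),
    ("non-coding transcript or UTR", {
        "5_prime_UTR_variant", "3_prime_UTR_variant",
        "complex_transcript_variant", "non_coding_transcript_exon_variant",
    }),
    ("intergenic or intronic", {
        "upstream_gene_variant", "downstream_gene_variant", "intron_variant",
        "intergenic_variant", "NMD_transcript_variant",
        "no_sequence_alteration",
    }),
]


def color_bin(ucsc_class: str) -> str:
    """Return the highest-ranked bin matched by a comma-separated ucscClass."""
    terms = {t.strip() for t in ucsc_class.split(",") if t.strip()}
    for name, members in BINS:
        if terms & members:
            return name
    # Hypothetical fallback label for terms not listed in the Methods text.
    return "unclassified"


if __name__ == "__main__":
    print(color_bin("intron_variant,missense_variant"))
```

Because the bins are checked in rank order, an item carrying both an intronic and a missense consequence falls into the protein-altering bin.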
Data Access Note: Using LiftOver to convert SNPs between assemblies is not recommended; more information about how to convert SNPs between assemblies can be found in the following FAQ entry. The data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the data may be queried from our REST API. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. For automated download and analysis, this annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called evaSnp.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed https://hgdownload.soe.ucsc.edu/gbdb/mm39/bbi/evaSnp.bb -chrom=chr1 -start=0 -end=100000000 stdout Credits This track was produced from the European Variation Archive release 3 data. Consequences were predicted using UCSC's Variant Annotation Integrator and NCBI's RefSeq gene models. References Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, Shen A, Silva AF, Tsukanov K, Venkataraman S et al. The European Variation Archive: a FAIR resource of genomic variation for all species. Nucleic Acids Res. 2021 Oct 28:gkab960. doi:10.1093/nar/gkab960. Epub ahead of print. PMID: 34718739; PMC: PMC8728205 Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ. UCSC Data Integrator and Variant Annotation Integrator. Bioinformatics. 2016 May 1;32(9):1430-2.
PMID: 26740527; PMC: PMC4848401 evaSnp4 EVA SNP Release 4 Short Genetic Variants from European Variant Archive Release 4 Variation and Repeats Description This track contains mappings of single nucleotide variants and small insertions and deletions (indels) from the European Variation Archive (EVA) Release 4 for the mouse mm39 genome. The dbSNP database at NCBI no longer hosts non-human variants. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP variant corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. The display is set to automatically collapse to dense visibility when there are more than 100k variants in the window. When the window size is more than 250k bp, the display is switched to density graph mode. Searching, details, and filtering Navigation to an individual variant can be accomplished by typing or copying the variant identifier (rsID) or the genomic coordinates into the Position/Search box on the Browser. A click on an item in the graphical display displays a page with data about that variant. Data fields include the Reference and Alternate Alleles, the class of the variant as reported by EVA, the source of the data, the amino acid change, if any, and the functional class as determined by UCSC's Variant Annotation Integrator. Variants can be filtered using the track controls to show subsets of the data by either EVA Sequence Ontology (SO) term, UCSC-generated functional effect, or by color, which bins the UCSC functional effects into general classes. 
Mouse-over Mousing over an item shows the ucscClass, which is the consequence according to the Variant Annotation Integrator, and the aaChange when one is available, which is the change in amino acid in HGVS.p terms. Items may have multiple ucscClasses, which will all be shown in the mouse-over in a comma-separated list. Likewise, multiple HGVS.p terms may be shown for each rsID separated by spaces describing all possible AA changes. Multiple items may appear due to different variant predictions on multiple gene transcripts. For all organisms the gene models used were the NCBI RefSeq curated when available, if not then Ensembl genes, or finally UCSC mappings of RefSeq if neither of the previous models was possible. Track colors Variants are colored by the most potentially deleterious functional effect predicted by the Variant Annotation Integrator. Specific bins can be seen in the Methods section below. Color Variant Type Protein-altering variants and splice site variants Synonymous codon variants Non-coding transcript or Untranslated Region (UTR) variants Intergenic and intronic variants Sequence ontology (SO) Variants are classified by EVA into one of the following sequence ontology terms: substitution — A single nucleotide in the reference is replaced by another, alternate allele. deletion — One or more nucleotides are deleted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is a deletion of an A may be represented as Ref = GA and Alt = G. insertion — One or more nucleotides are inserted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g.
a variant that is an insertion of a T may be represented as Ref = G and Alt = GT. delins — Similar to tandemRepeat, in that the runs of Ref and Alt Alleles are of different length, except that there is more than one type of nucleotide, e.g., Ref = CCAAAAACAAAAACA, Alt = ACAAAAAC. multipleNucleotideVariant — More than one nucleotide is substituted by an equal number of different nucleotides, e.g., Ref = AA, Alt = GC. sequence alteration — A parent term meant to signify a deviation from another sequence. Can be assigned to variants that have not been characterized yet. Methods Data were downloaded from the European Variation Archive EVA release 4 (2022-11-21) current_ids.vcf.gz files corresponding to the proper assembly. Chromosome names were converted to UCSC-style and the variants were passed through the Variant Annotation Integrator to predict consequence. For every organism the NCBI RefSeq curated models were used when available, followed by Ensembl genes, and finally UCSC mapping of RefSeq when neither of the previous models was possible. Variants were then colored according to their predicted consequence in the following fashion: Protein-altering variants and splice site variants - exon_loss_variant, frameshift_variant, inframe_deletion, inframe_insertion, initiator_codon_variant, missense_variant, splice_acceptor_variant, splice_donor_variant, splice_region_variant, stop_gained, stop_lost, coding_sequence_variant, transcript_ablation; Synonymous codon variants - synonymous_variant, stop_retained_variant; Non-coding transcript or Untranslated Region (UTR) variants - 5_prime_UTR_variant, 3_prime_UTR_variant, complex_transcript_variant, non_coding_transcript_exon_variant; Intergenic and intronic variants - upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_variant, NMD_transcript_variant, no_sequence_alteration. Sequence Ontology ("SO:") terms were converted to the variant classes, then the files were converted to BED, and then bigBed format.
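The Ref/Alt conventions given for the sequence ontology classes can be illustrated with a small heuristic in Python. In the track the class is assigned by EVA, so the function below is not how the classes are actually derived; the name `guess_variant_class` and the decision rules are assumptions that merely reproduce the examples given in the class descriptions (anchored deletions and insertions, equal-length substitutions, and delins as the remaining unequal-length case).

```python
# Heuristic illustration only. In the track the class comes from EVA;
# these rules are an assumption that reproduces the documented examples.
def guess_variant_class(ref: str, alt: str) -> str:
    """Guess a variant class from Ref and Alt allele strings."""
    ref, alt = ref.upper(), alt.upper()
    if not ref or not alt or ref == alt:
        # Assumed fallback: use the parent term when nothing can be inferred.
        return "sequence alteration"
    if len(ref) == len(alt):
        # Equal lengths: one base is a substitution, several bases an MNV.
        return "substitution" if len(ref) == 1 else "multipleNucleotideVariant"
    if len(ref) > len(alt) and ref.startswith(alt):
        # Anchored deletion, e.g. Ref = GA, Alt = G.
        return "deletion"
    if len(alt) > len(ref) and alt.startswith(ref):
        # Anchored insertion, e.g. Ref = G, Alt = GT.
        return "insertion"
    # Unequal lengths that are not a simple anchored indel.
    return "delins"


if __name__ == "__main__":
    for ref, alt in [("GA", "G"), ("G", "GT"), ("AA", "GC"),
                     ("CCAAAAACAAAAACA", "ACAAAAAC")]:
        print(ref, alt, guess_variant_class(ref, alt))
```

The four example pairs are the ones quoted in the class descriptions; real EVA records may be classified differently where the source data carry more context than the two allele strings.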
No functional annotations were provided by the EVA (e.g., missense, nonsense, etc). These were computed using UCSC's Variant Annotation Integrator (Hinrichs, et al., 2016). Amino-acid substitutions for missense variants are based on RefSeq alignments of mRNA transcripts, which do not always match the amino acids predicted from translating the genomic sequence. Therefore, in some instances, the variant and the genomic nucleotide and associated amino acid may be reversed. E.g., a Pro > Arg change from the perspective of the mRNA would be Arg > Pro from the perspective of the genomic sequence. Also, in bosTau9, galGal5, rheMac8, danRer10 and danRer11 the mitochondrial sequence was removed or renamed to match UCSC. For complete documentation of the processing of these tracks, read the EVA Release 4 MakeDoc. Data Access Note: Using LiftOver to convert SNPs between assemblies is not recommended; more information about how to convert SNPs between assemblies can be found in the following FAQ entry. The data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the data may be queried from our REST API. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. For automated download and analysis, this annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called evaSnp4.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed https://hgdownload.soe.ucsc.edu/gbdb/mm39/bbi/evaSnp4.bb -chrom=chr1 -start=0 -end=100000000 stdout Credits This track was produced from the European Variation Archive release 4 data.
Consequences were predicted using UCSC's Variant Annotation Integrator and NCBI's RefSeq as well as Ensembl gene models. References Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, Shen A, Silva AF, Tsukanov K, Venkataraman S et al. The European Variation Archive: a FAIR resource of genomic variation for all species. Nucleic Acids Res. 2021 Oct 28:gkab960. doi:10.1093/nar/gkab960. Epub ahead of print. PMID: 34718739; PMC: PMC8728205 Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ. UCSC Data Integrator and Variant Annotation Integrator. Bioinformatics. 2016 May 1;32(9):1430-2. PMID: 26740527; PMC: PMC4848401 evaSnp5 EVA SNP Release 5 Short Genetic Variants from European Variant Archive Release 5 Variation and Repeats Description This track contains mappings of single nucleotide variants and small insertions and deletions (indels) from the European Variation Archive (EVA) Release 5 for the mouse mm39 genome. The dbSNP database at NCBI no longer hosts non-human variants. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP variant corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. The display is set to automatically collapse to dense visibility when there are more than 100k variants in the window. When the window size is more than 250k bp, the display is switched to density graph mode.
Searching, details, and filtering Navigation to an individual variant can be accomplished by typing or copying the variant identifier (rsID) or the genomic coordinates into the Position/Search box on the Browser. A click on an item in the graphical display displays a page with data about that variant. Data fields include the Reference and Alternate Alleles, the class of the variant as reported by EVA, the source of the data, the amino acid change, if any, and the functional class as determined by UCSC's Variant Annotation Integrator. Variants can be filtered using the track controls to show subsets of the data by either EVA Sequence Ontology (SO) term, UCSC-generated functional effect, or by color, which bins the UCSC functional effects into general classes. Mouse-over Mousing over an item shows the ucscClass, which is the consequence according to the Variant Annotation Integrator, and the aaChange when one is available, which is the change in amino acid in HGVS.p terms. Items may have multiple ucscClasses, which will all be shown in the mouse-over in a comma-separated list. Likewise, multiple HGVS.p terms may be shown for each rsID separated by spaces describing all possible AA changes. Multiple items may appear due to different variant predictions on multiple gene transcripts. For all organisms the gene models used were the NCBI RefSeq curated when available, if not then ensembl genes, or finally UCSC mappings of RefSeq if neither of the previous models was possible. Track colors Variants are colored according to the most potentially deleterious functional effect prediction according to the Variant Annotation Integrator. Specific bins can be seen in the Methods section below. 
Color Variant Type Protein-altering variants and splice site variants Synonymous codon variants Non-coding transcript or Untranslated Region (UTR) variants Intergenic and intronic variants Sequence ontology (SO) Variants are classified by EVA into one of the following sequence ontology terms: substitution — A single nucleotide in the reference is replaced by another, alternate allele. deletion — One or more nucleotides are deleted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is a deletion of an A may be represented as Ref = GA and Alt = G. insertion — One or more nucleotides are inserted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is an insertion of a T may be represented as Ref = G and Alt = GT. delins — Similar to tandemRepeat, in that the runs of Ref and Alt Alleles are of different length, except that there is more than one type of nucleotide, e.g., Ref = CCAAAAACAAAAACA, Alt = ACAAAAAC. multipleNucleotideVariant — More than one nucleotide is substituted by an equal number of different nucleotides, e.g., Ref = AA, Alt = GC. sequence alteration — A parent term meant to signify a deviation from another sequence. Can be assigned to variants that have not been characterized yet. Methods Data were downloaded from the European Variation Archive EVA release 5 (2023-9-7) current_ids.vcf.gz files corresponding to the proper assembly. Chromosome names were converted to UCSC-style and the variants were passed through the Variant Annotation Integrator to predict consequence. For every organism the NCBI RefSeq curated models were used when available, followed by Ensembl genes, and finally UCSC mapping of RefSeq when neither of the previous models was possible.
Variants were then colored according to their predicted consequence in the following fashion: Protein-altering variants and splice site variants - exon_loss_variant, frameshift_variant, inframe_deletion, inframe_insertion, initiator_codon_variant, missense_variant, splice_acceptor_variant, splice_donor_variant, splice_region_variant, stop_gained, stop_lost, coding_sequence_variant, transcript_ablation; Synonymous codon variants - synonymous_variant, stop_retained_variant; Non-coding transcript or Untranslated Region (UTR) variants - 5_prime_UTR_variant, 3_prime_UTR_variant, complex_transcript_variant, non_coding_transcript_exon_variant; Intergenic and intronic variants - upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_variant, NMD_transcript_variant, no_sequence_alteration. Sequence Ontology ("SO:") terms were converted to the variant classes, then the files were converted to BED, and then bigBed format. No functional annotations were provided by the EVA (e.g., missense, nonsense, etc). These were computed using UCSC's Variant Annotation Integrator (Hinrichs, et al., 2016). Amino-acid substitutions for missense variants are based on RefSeq alignments of mRNA transcripts, which do not always match the amino acids predicted from translating the genomic sequence. Therefore, in some instances, the variant and the genomic nucleotide and associated amino acid may be reversed. E.g., a Pro > Arg change from the perspective of the mRNA would be Arg > Pro from the perspective of the genomic sequence. Also, in bosTau9, galGal5, rheMac8, danRer10 and danRer11 the mitochondrial sequence was removed or renamed to match UCSC. For complete documentation of the processing of these tracks, read the EVA Release 5 MakeDoc. Data Access Note: Using LiftOver to convert SNPs between assemblies is not recommended; more information about how to convert SNPs between assemblies can be found in the following FAQ entry.
The data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the data may be queried from our REST API. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. For automated download and analysis, this annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called evaSnp5.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed https://hgdownload.soe.ucsc.edu/gbdb/mm39/bbi/evaSnp5.bb -chrom=chr1 -start=0 -end=100000000 stdout Credits This track was produced from the European Variation Archive release 5 data. Consequences were predicted using UCSC's Variant Annotation Integrator and NCBI's RefSeq as well as Ensembl gene models. References Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, Shen A, Silva AF, Tsukanov K, Venkataraman S et al. The European Variation Archive: a FAIR resource of genomic variation for all species. Nucleic Acids Res. 2021 Oct 28:gkab960. doi:10.1093/nar/gkab960. Epub ahead of print. PMID: 34718739; PMC: PMC8728205 Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ. UCSC Data Integrator and Variant Annotation Integrator. Bioinformatics. 2016 May 1;32(9):1430-2. PMID: 26740527; PMC: PMC4848401 gap Gap Gap Locations Mapping and Sequencing Description This track shows the gaps in the Jun. 2020 mouse genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly.
The definition of the gaps in this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. Gaps are represented as black boxes in this track. If the relative order and orientation of the contigs on either side of the gap are supported by read pair data, it is a bridged gap and a white line is drawn through the black box representing the gap. This assembly contains the following principal types of gaps: centromere - gaps for centromeres are included when they can be reasonably localized (count: 20; all of size 2,890,000 bases) short_arm - a gap inserted at the start of an acrocentric chromosome (count: 21; all of size 10,000 bases) telomere - telomere gaps (count: 42; all of size 100,000 bases) contig - gaps between contigs in scaffolds (count: 60; size range: 8,000 - 500,000 bases) scaffold - gaps between scaffolds in chromosome assemblies (count: 181; size range: 27 - 522,000 bases) gc5BaseBw GC Percent GC Percent in 5-Base Windows Mapping and Sequencing Description The GC percent track shows the percentage of G (guanine) and C (cytosine) bases in 5-base windows. High GC content is typically associated with gene-rich areas. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options. Credits The data and presentation of this graph were prepared by Hiram Clawson. genscan Genscan Genes Genscan Gene Predictions Genes and Gene Predictions Description This track shows predictions from the Genscan program written by Chris Burge. The predictions are based on transcriptional, translational and donor/acceptor splicing signals as well as the length and compositional distributions of exons, introns and intergenic regions. For more information on the different gene tracks, see our Genes FAQ.
Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods For a description of the Genscan program and the model that underlies it, refer to Burge and Karlin (1997) in the References section below. The splice site models used are described in more detail in Burge (1998) below. Credits Thanks to Chris Burge for providing the Genscan program. References Burge C. Modeling Dependencies in Pre-mRNA Splicing Signals. In: Salzberg S, Searls D, Kasif S, editors. Computational Methods in Molecular Biology. Amsterdam: Elsevier Science; 1998. p. 127-163. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997 Apr 25;268(1):78-94. PMID: 9149143 grcIncidentDb GRC Incident GRC Incident Database Mapping and Sequencing Description This track shows locations in the mouse assembly where assembly problems have been noted or resolved, as reported by the Genome Reference Consortium (GRC). If you would like to report an assembly problem, please use the GRC issue reporting system. Methods Data for this track are extracted from the GRC incident database from the specific species *_issues.gff3 file. The track is synchronized once daily to incorporate new updates. Credits The data and presentation of this track were prepared by Hiram Clawson. ucscToINSDC INSDC Accession at INSDC - International Nucleotide Sequence Database Collaboration Mapping and Sequencing Description This track associates UCSC Genome Browser chromosome names to accession names from the International Nucleotide Sequence Database Collaboration (INSDC). 
The data were downloaded from the NCBI assembly database. Credits The data for this track was prepared by Hiram Clawson. nestedRepeats Interrupted Rpts Fragments of Interrupted Repeats Joined by RepeatMasker ID Variation and Repeats Description This track shows joined fragments of interrupted repeats extracted from the output of the RepeatMasker program which screens DNA sequences for interspersed repeats and low complexity DNA sequences using the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka (2000) in the References section below. The detailed annotations from RepeatMasker are in the RepeatMasker track. This track shows fragments of original repeat insertions which have been interrupted by insertions of younger repeats or through local rearrangements. The fragments are joined using the ID column of RepeatMasker output. Display Conventions and Configuration In pack or full mode, each interrupted repeat is displayed as boxes (fragments) joined by horizontal lines, labeled with the repeat name. If all fragments are on the same strand, arrows are added to the horizontal line to indicate the strand. In dense or squish mode, labels and arrows are omitted and in dense mode, all items are collapsed to fit on a single row. Items are shaded according to the average identity score of their fragments. Usually, the shade of an item is similar to the shades of its fragments unless some fragments are much more diverged than others. The score displayed above is the average identity score, clipped to a range of 50% - 100% and then mapped to the range 0 - 1000 for shading in the browser. Methods UCSC has used the most current versions of the RepeatMasker software and repeat libraries available to generate these data. Note that these versions may be newer than those that are publicly available on the Internet. Data are generated using the RepeatMasker -s flag. 
Additional flags may be used for certain organisms. See the FAQ for more information. Credits Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track. References Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. http://www.repeatmasker.org. 1996-2010. Repbase Update is described in: Jurka J. Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420. PMID: 10973072 For a discussion of repeats in mammalian genomes, see: Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6):657-63. PMID: 10607616 Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8. PMID: 8994846 jaspar JASPAR Transcription Factors JASPAR Transcription Factor Binding Site Database Expression and Regulation Description This track represents genome-wide predicted binding sites for TF (transcription factor) binding profiles in the JASPAR CORE collection. This open-source database contains a curated, non-redundant set of binding profiles derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. Display Conventions and Configuration Shaded boxes represent predicted binding sites for each of the TF profiles in the JASPAR CORE collection. The shading of the boxes indicates the p-value of the profile's match to that position (scaled to scores between 0 and 1000, where 0 corresponds to a p-value of 1 and 1000 to a p-value ≤ 10^-10). Thus, the darker the shade, the lower (better) the p-value. The default view shows only predicted binding sites with scores of 400 or greater but can be adjusted in the track settings. Multi-select filters allow viewing of particular transcription factors. At window sizes of greater than 10,000 base pairs, this track turns to density graph mode.
Zoom to a smaller region and click into an item to see more detail. From BED format documentation, the shade of an item is determined by the range its score falls in:
≤ 166, 167-277, 278-388, 389-499, 500-611, 612-722, 723-833, 834-944, ≥ 945
Conversion table:
Item score    p-value
0             1
100           0.1
131           0.049
200           10^-2
300           10^-3
400           10^-4
500           10^-5
600           10^-6
700           10^-7
800           10^-8
900           10^-9
1000          ≤ 10^-10
Methods The JASPAR 2024 update expanded the JASPAR CORE collection by 20% (329 added and 72 upgraded profiles). The new profiles were introduced after manual curation, in which 26 629 TF binding motifs were curated and obtained as PFMs or discovered from ChIP-seq/-exo or DAP-seq data. 2500 profiles from JASPAR 2022 were revised to either promote them to the CORE collection, update the associated metadata, or remove them because of validation inconsistencies or poor quality. The JASPAR database stores and focuses mostly on PFMs as the model of choice for TF-DNA interactions. More information on the methods can be found in the JASPAR 2024 publication or on the JASPAR website. JASPAR 2022 contains updated transcription factor binding sites with additional transcription factor profiles. More information on the methods can be found in the JASPAR 2022 publication or on the JASPAR website. JASPAR 2020 scanned DNA sequences with JASPAR CORE TF-binding profiles for each taxon independently using PWMScan. TFBS predictions were selected with a PWM relative score ≥ 0.8 and a p-value < 0.05. P-values were scaled between 0 (corresponding to a p-value of 1) and 1000 (p-value ≤ 10^-10) for coloring of the genome tracks and to allow for comparison of prediction confidence between different profiles. JASPAR 2018 used the TFBS Perl module (Lenhard and Wasserman 2002) and FIMO (Grant, Bailey, and Noble 2011), as distributed within the MEME suite (version 4.11.2) (Bailey et al. 2009). For scanning genomes with the BioPerl TFBS module, profiles were converted to PWMs and matches were kept with a relative score ≥ 0.8.
For the FIMO scan, profiles were reformatted to MEME motifs and matches with a p-value < 0.05 were kept. TFBS predictions that were not consistent between the two methods (TFBS Perl module and FIMO) were removed. The remaining TFBS predictions were colored according to their FIMO p-value to allow for comparison of prediction confidence between different profiles. Please refer to the JASPAR 2024, 2022, 2020, and 2018 publications for more details (citations below). Data Access JASPAR Transcription Factor Binding data includes billions of items. Limited regions can be explored interactively with the Table Browser and cross-referenced with Data Integrator, although positional queries that are too big can time out. This results in a blank page or truncated output. In this case, try reducing the chromosomal query to a smaller window. For programmatic access, the track can be accessed using the Genome Browser's REST API. JASPAR annotations can be downloaded from the Genome Browser's download server as a bigBed file. This compressed binary format can be remotely queried through command line utilities. Please note that some of the download files can be quite large. The utilities for working with bigBed-formatted binary files can be downloaded here. Run a utility with no arguments to see a brief description of the utility and its options.
bigBedInfo provides summary statistics about a bigBed file including the number of items in the file. With the -as option, the output includes an autoSql definition of data columns, useful for interpreting the column values.
bigBedToBed converts the binary bigBed data to tab-separated text. Output can be restricted to a particular region by using the -chrom, -start and -end options.
Example: retrieve all JASPAR items in chr1:200001-200400
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/mm39/jaspar/JASPAR2024.bb -chrom=chr1 -start=200000 -end=200400 stdout
All data are freely available.
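The score scaling and the region example above can be checked with a short Python sketch. The conversion table is consistent with an item score of -100 x log10(p-value), capped at 1000; this formula is inferred from the table and is not taken from the JASPAR code. The region example uses a 1-based position (chr1:200001-200400) with a 0-based -start value (200000). The function names below are hypothetical helpers, not part of any UCSC or JASPAR tool.

```python
import math

MAX_SCORE = 1000  # scores are capped here (p-value <= 10^-10)


def pvalue_to_score(p):
    """Scale a p-value to a 0-1000 item score.

    Assumption: score = -100 * log10(p), capped at 1000, which
    reproduces every row of the conversion table.
    """
    if not 0 < p <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return min(MAX_SCORE, round(-100 * math.log10(p)))


def score_to_pvalue(score):
    """Invert the scaling; a score of 1000 means 'p-value <= 10^-10'."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("score must be in [0, 1000]")
    return 10 ** (-score / 100)


def region_to_bigbed_args(region):
    """Convert a 1-based position such as 'chr1:200001-200400' into
    bigBedToBed region options, which take a 0-based start."""
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return [f"-chrom={chrom}", f"-start={start - 1}", f"-end={end}"]
```

Under this scaling, the default display threshold of 400 corresponds to a p-value of 10^-4.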
Additional resources are available directly from the JASPAR group: Binding site predictions for all and individual TF profiles are available for download at http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/. Code and data used to create the UCSC tracks are available at https://github.com/wassermanlab/JASPAR-UCSC-tracks. The underlying JASPAR motif data is available through the JASPAR website at https://jaspar.genereg.net/. Other Genomes The JASPAR group provides TFBS predictions for many additional species and genomes, accessible by connection to their Public Hub or by clicking the assembly links below:
Species: Genome assembly versions
Human - Homo sapiens: hg19, hg38
Mouse - Mus musculus: mm10, mm39
Zebrafish - Danio rerio: danRer11
Fruitfly - Drosophila melanogaster: dm6
Nematode - Caenorhabditis elegans: ce10, ce11
Vase tunicate - Ciona intestinalis: ci3
Thale cress - Arabidopsis thaliana: araTha1
Yeast - Saccharomyces cerevisiae: sacCer3
Credits The JASPAR database is a joint effort between several labs (please see the latest JASPAR paper, below). Binding site predictions and UCSC tracks were computed by the Wasserman Lab. For enquiries about the data please contact Oriol Fornes (oriol@cmmt.ubc.ca). Wasserman Lab, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, Canada References Castro-Mondragon JA, Riudavets-Puig R, Rauluseviciute I, Berhanu Lemma R, Turchi L, Blanc-Mathieu R, Lucas J, Boddie P, Khan A, Manosalva Pérez N et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2021 Nov 30;. PMID: 34850907 Fornes O, Castro-Mondragon JA, Khan A, van der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020 Jan 8;48(D1):D87-D92.
PMID: 31701148; PMC: PMC7145627 Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Chèneby J, Kulkarni SR, Tan G et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2018 Jan 4;46(D1):D260-D266. PMID: 29140473; PMC: PMC5753243 Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon JA, Ferenc K, Kumar V, Lemma RB, Lucas J, Chèneby J, Baranasic D et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2023 Nov 14;. PMID: 37962376 jaspar2022 JASPAR 2022 TFBS JASPAR CORE 2022 - Predicted Transcription Factor Binding Sites Expression and Regulation jaspar2024 JASPAR 2024 TFBS JASPAR CORE 2024 - Predicted Transcription Factor Binding Sites Expression and Regulation microsat Microsatellite Microsatellites - Di-nucleotide and Tri-nucleotide Repeats Variation and Repeats Description This track displays regions that are likely to be useful as microsatellite markers. These are sequences of at least 15 perfect di-nucleotide and tri-nucleotide repeats and tend to be highly polymorphic in the population. Methods The data shown in this track are a subset of the Simple Repeats track, selecting only those repeats of period 2 and 3, with 100% identity and no indels and with at least 15 copies of the repeat. The Simple Repeats track is created using the Tandem Repeats Finder. For more information about this program, see Benson (1999). Credits Tandem Repeats Finder was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 est Mouse ESTs Mouse ESTs Including Unspliced mRNA and EST Description This track shows alignments between mouse expressed sequence tags (ESTs) in GenBank and the genome. 
ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. 
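The filter steps above can be sketched in Python. This is a minimal illustration, not the browser's implementation: the field names, the use of shell-style wildcards through fnmatch, case-insensitive matching, and the treatment of space-separated terms within one text box as alternatives are all assumptions made for the example.

```python
from fnmatch import fnmatchcase


def term_matches(value, terms):
    """True if value matches any space-separated term (wildcards allowed).

    Assumption: terms within one text box act as alternatives.
    """
    value = value.lower()
    return any(fnmatchcase(value, t.lower()) for t in terms.split())


def matches_filter(item, criteria, logic="and"):
    """Combine per-field results with the chosen 'and' / 'or' logic."""
    results = [term_matches(item.get(field, ""), terms)
               for field, terms in criteria.items() if terms.strip()]
    if not results:
        return False
    return all(results) if logic == "and" else any(results)


def apply_filter(items, criteria, logic="and", mode="include"):
    """'include' keeps only matching items; 'exclude' drops them."""
    if not any(t.strip() for t in criteria.values()):
        return list(items)  # nothing to filter on
    keep_matching = (mode == "include")
    return [it for it in items
            if matches_filter(it, criteria, logic) == keep_matching]


# Hypothetical records; real ESTs carry many more fields.
ests = [
    {"acc": "A1", "tissue": "brain", "library": "cap-selected"},
    {"acc": "A2", "tissue": "liver", "library": "cap-selected"},
    {"acc": "A3", "tissue": "brainstem", "library": "random-primed"},
]
criteria = {"tissue": "brain*", "library": "cap*"}
```

With "and" logic only items matching every box pass, while "or" passes items matching at least one box; "exclude" inverts the selection made by the same criteria.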
This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, mouse ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. 
Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoMrna Other mRNAs Non-Mouse mRNAs from GenBank mRNA and EST Description This track displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank from organisms other than mouse. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) for this track is in two parts. The first + indicates the orientation of the query sequence whose translated protein produced the match (here always 5' to 3', hence +). The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --, nor is +- the same as -+. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. 
To use the filter: Type a term in one or more of the text boxes to filter the mRNA display. For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods The mRNAs were aligned against the mouse genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. 
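The selection rule in the Methods paragraph above (keep alignments within 1% of the best identity for that mRNA, and with at least 25% identity) can be sketched in Python as follows. This is an illustration under stated assumptions, not the UCSC pipeline: identities are given as fractions, "within 1%" is read as an absolute difference of 0.01, and the record layout and function name are hypothetical.

```python
from collections import defaultdict


def near_best_filter(alignments, near_best=0.01, min_identity=0.25):
    """Keep alignments close to the best identity of their query.

    alignments: list of dicts with 'query' and 'identity' (0-1) keys.
    Assumption: 'within near_best of the best' is an absolute
    difference in identity, not a relative one.
    """
    alignments = list(alignments)

    # First pass: best identity seen for each query sequence.
    best = defaultdict(float)
    for aln in alignments:
        best[aln["query"]] = max(best[aln["query"]], aln["identity"])

    # Second pass: apply both thresholds.
    return [aln for aln in alignments
            if aln["identity"] >= min_identity
            and aln["identity"] >= best[aln["query"]] - near_best]


# Hypothetical alignments of two mRNAs.
alns = [
    {"query": "NM_1", "target": "chr1", "identity": 0.960},
    {"query": "NM_1", "target": "chr5", "identity": 0.955},
    {"query": "NM_1", "target": "chr9", "identity": 0.900},
    {"query": "NM_2", "target": "chr2", "identity": 0.200},
]
kept = near_best_filter(alns)
```

The thresholds are parameters so that other tracks' stated cutoffs can be substituted.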
References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoRefGene Other RefSeq Non-Mouse RefSeq Genes Genes and Gene Predictions Description The RefSeq mRNAs gene track for the mouse (Jun. 2020 (GRCm39/mm39)) genome assembly displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank. Track statistics summary
Total genome size: 2,654,624,157 (not counting gaps)
Gene count: 22,442
Bases in genes: 838,462,469 (txStart to txEnd)
Genes percent genome coverage: 31.585%
Bases in exons: 53,564,706
Exons percent genome coverage: 2.018%
Search tips Please note, the name searching system is not completely case insensitive. When in doubt, enter search names in all lower case to find gene names. Methods The mRNAs were aligned against the mouse (Jun. 2020 (GRCm39/mm39)) genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept.
Specifically, the translated blat command is:
blat -noHead -q=rnax -t=dnax -mask=lower target.fa query.fa target.query.psl
where target.fa is one of the chromosome sequences of the genome assembly, and query.fa contains the mRNAs from RefSeq. The resulting PSL outputs are filtered:
pslCDnaFilter -minId=0.35 -minCover=0.25 -globalNearBest=0.0100 -minQSize=20 \
  -ignoreIntrons -repsAsMatch -ignoreNs -bestOverlap \
  all.results.psl mm39.xenoRefGene.psl
The filtered mm39.xenoRefGene.psl is converted to genePred data to display for this track. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 ucscGenePfam Pfam in UCSC Gene Pfam Domains in UCSC Genes Genes and Gene Predictions Description Most proteins are composed of one or more conserved functional regions called domains. This track shows the high-quality, manually-curated Pfam-A domains found in transcripts located in the UCSC Genes track by the software HMMER3. Display Conventions and Configuration This track follows the display conventions for gene tracks. Methods The sequences from the knownGenePep table (see UCSC Genes description page) are submitted to the set of Pfam-A HMMs which annotate regions within the predicted peptide that are recognizable as Pfam protein domains. These regions are then mapped to the transcripts themselves using the pslMap utility.
A complete shell script log for every version of UCSC genes can be found in our GitHub repository under hg/makeDb/doc/ucscGenes, e.g. mm10.knownGenes17.csh is for the database mm10 and version 17 of UCSC known genes. Of the several options for filtering out false positives, the "Trusted cutoff (TC)" threshold method is used in this track to determine significance. For more information regarding thresholds and scores, see the HMMER documentation and results interpretation pages. Note: There is currently an undocumented but known HMMER problem which results in lessened sensitivity and possible missed searches for some zinc finger domains. Until a fix is released for HMMER /PFAM thresholds, please also consult the "UniProt Domains" subtrack of the UniProt track for more comprehensive zinc finger annotations. Credits pslMap was written by Mark Diekhans at UCSC. References Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K et al. The Pfam protein families database. Nucleic Acids Res. 2010 Jan;38(Database issue):D211-22. PMID: 19920124; PMC: PMC2808889 ReMap ReMap ChIP-seq ReMap Atlas of Regulatory Regions Expression and Regulation Description This track represents the ReMap Atlas of regulatory regions, which consists of a large-scale integrative analysis of all Public ChIP-seq data for transcriptional regulators from GEO, ArrayExpress, and ENCODE. Below is a schematic diagram of the types of regulatory regions: ReMap 2022 Atlas (all peaks for each analyzed data set) ReMap 2022 Non-redundant peaks (merged similar target) ReMap 2022 Cis Regulatory Modules Display Conventions and Configuration Each transcription factor follows a specific RGB color. ChIP-seq peak summits are represented by vertical bars. Hsap: A data set is defined as a ChIP/Exo-seq experiment in a given GEO/ArrayExpress/ENCODE series (e.g. GSE41561), for a given TF (e.g. ESR1), in a particular biological condition (e.g. MCF-7). 
Data sets are labeled with the concatenation of these three pieces of information (e.g. GSE41561.ESR1.MCF-7). Atha: The data set is defined as a ChIP-seq experiment in a given series (e.g. GSE94486), for a given target (e.g. ARR1), in a particular biological condition (i.e. ecotype, tissue type, experimental conditions; e.g. Col-0_seedling_3d-6BA-4h). Data sets are labeled with the concatenation of these three pieces of information (e.g. GSE94486.ARR1.Col-0_seedling_3d-6BA-4h). Methods This release of ReMap (2022) presents the analysis of 5,505 quality-controlled mouse ChIP-seq data sets (n=7,317 before QC) from public sources (GEO & ENCODE). Those ChIP-seq data sets have been mapped to the GRCm38/mm10 mouse assembly. The data set is defined as a ChIP-seq experiment in a given series (e.g. GSE122715), for a given TF (e.g. USF1), in a particular biological condition (i.e. cell line, tissue type, disease state, or experimental conditions; e.g. mESC). Data sets were labeled by concatenating these three pieces of information, such as GSE122715.USF1.mESC. Those merged analyses cover a total of 656 DNA-binding proteins (transcriptional regulators) such as a variety of transcription factors (TFs), transcription co-activators (TCFs), and chromatin-remodeling factors (CRFs) for 123 million peaks. ENCODE Available ENCODE ChIP-seq data sets for transcriptional regulators from the ENCODE portal were processed with the standardized ReMap pipeline. The list of ENCODE data was retrieved as FASTQ files from the ENCODE portal using filters. Metadata information in JSON format and FASTQ files were retrieved using the Python requests module. ChIP-seq processing Both Public and ENCODE data were processed similarly. Bowtie 2 (PMC3322381) (version 2.2.9) with options --end-to-end --sensitive was used to align all reads on the genome. Biological and technical replicates for each unique combination of GSE/TF/Cell type or Biological condition were used for peak calling.
TFBS were identified using the MACS2 peak-calling tool (PMC3120977) (version 2.1.1.2) in order to follow ENCODE ChIP-seq guidelines, with stringent thresholds (MACS2 default thresholds, p-value: 1e-5). An input data set was used when available. Quality assessment To assess the quality of public data sets, a score was computed based on the cross-correlation and the FRiP (fraction of reads in peaks) metrics developed by the ENCODE Consortium (https://genome.ucsc.edu/ENCODE/qualityMetrics.html). Two thresholds were defined for each of the two cross-correlation ratios (NSC, normalized strand coefficient: 1.05 and 1.10; RSC, relative strand coefficient: 0.8 and 1.0). Detailed descriptions of the ENCODE quality coefficients can be found at https://genome.ucsc.edu/ENCODE/qualityMetrics.html. The phantompeak tools suite was used (https://code.google.com/p/phantompeakqualtools/) to compute RSC and NSC. Please refer to the ReMap 2022, 2020, and 2018 publications for more details (citations below). Data Access ReMap Atlas of regulatory regions data can be explored interactively with the Table Browser and cross-referenced with the Data Integrator. For programmatic access, the track can be accessed using the Genome Browser's REST API. ReMap annotations can be downloaded from the Genome Browser's download server as a bigBed file. This compressed binary format can be remotely queried through command line utilities. Please note that some of the download files can be quite large. Individual BED files for specific TFs, cells/biotypes, or data sets can be found and downloaded on the ReMap website. References Chèneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 2018 Jan 4;46(D1):D267-D275.
PMID: 29126285; PMC: PMC5753247 Chèneby J, Ménétrier Z, Mestdagh M, Rosnet T, Douida A, Rhalloussi W, Bergon A, Lopez F, Ballester B. ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Res. 2020 Jan 8;48(D1):D180-D188. PMID: 31665499; PMC: PMC7145625 Griffon A, Barbier Q, Dalino J, van Helden J, Spicuglia S, Ballester B. Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape. Nucleic Acids Res. 2015 Feb 27;43(4):e27. PMID: 25477382; PMC: PMC4344487 Hammal F, de Langen P, Bergon A, Lopez F, Ballester B. ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 2022 Jan 7;50(D1):D316-D325. PMID: 34751401; PMC: PMC8728178 ReMapTFs ReMap ChIP-seq ReMap Atlas of Regulatory Regions Expression and Regulation Description This track represents the ReMap Atlas of regulatory regions, which consists of a large-scale integrative analysis of all Public ChIP-seq data for transcriptional regulators from GEO, ArrayExpress, and ENCODE. Below is a schematic diagram of the types of regulatory regions: ReMap 2022 Atlas (all peaks for each analyzed data set) ReMap 2022 Non-redundant peaks (merged similar target) ReMap 2022 Cis Regulatory Modules Display Conventions and Configuration Each transcription factor follows a specific RGB color. ChIP-seq peak summits are represented by vertical bars. Hsap: A data set is defined as a ChIP/Exo-seq experiment in a given GEO/ArrayExpress/ENCODE series (e.g. GSE41561), for a given TF (e.g. ESR1), in a particular biological condition (e.g. MCF-7). Data sets are labeled with the concatenation of these three pieces of information (e.g. GSE41561.ESR1.MCF-7). Atha: The data set is defined as a ChIP-seq experiment in a given series (e.g. GSE94486), for a given target (e.g. 
ARR1), in a particular biological condition (i.e. ecotype, tissue type, experimental conditions; e.g. Col-0_seedling_3d-6BA-4h). Data sets are labeled with the concatenation of these three pieces of information (e.g. GSE94486.ARR1.Col-0_seedling_3d-6BA-4h). Methods This release of ReMap (2022) presents the analysis of 5,505 quality-controlled mouse ChIP-seq data sets (n=7,317 before QC) from public sources (GEO & ENCODE). Those ChIP-seq data sets have been mapped to the GRCm38/mm10 mouse assembly. The data set is defined as a ChIP-seq experiment in a given series (e.g. GSE122715), for a given TF (e.g. USF1), in a particular biological condition (i.e. cell line, tissue type, disease state, or experimental conditions; e.g. mESC). Data sets were labeled by concatenating these three pieces of information, such as GSE122715.USF1.mESC. Those merged analyses cover a total of 656 DNA-binding proteins (transcriptional regulators), including a variety of transcription factors (TFs), transcription co-activators (TCFs), and chromatin-remodeling factors (CRFs), for 123 million peaks. ENCODE Available ENCODE ChIP-seq data sets for transcriptional regulators from the ENCODE portal were processed with the standardized ReMap pipeline. The list of ENCODE data was retrieved as FASTQ files from the ENCODE portal using filters. Metadata information in JSON format and FASTQ files were retrieved using the Python requests module. ChIP-seq processing Both Public and ENCODE data were processed similarly. Bowtie 2 (PMC3322381) (version 2.2.9) with options --end-to-end --sensitive was used to align all reads to the genome. Biological and technical replicates for each unique combination of GSE/TF/Cell type or Biological condition were used for peak calling. TFBS were identified using the MACS2 peak-calling tool (PMC3120977) (version 2.1.1.2) in order to follow ENCODE ChIP-seq guidelines, with stringent thresholds (MACS2 default thresholds, p-value: 1e-5). An input data set was used when available.
Quality assessment To assess the quality of public data sets, a score was computed based on the cross-correlation and the FRiP (fraction of reads in peaks) metrics developed by the ENCODE Consortium (https://genome.ucsc.edu/ENCODE/qualityMetrics.html). Two thresholds were defined for each of the two cross-correlation ratios (NSC, normalized strand coefficient: 1.05 and 1.10; RSC, relative strand coefficient: 0.8 and 1.0). Detailed descriptions of the ENCODE quality coefficients can be found at https://genome.ucsc.edu/ENCODE/qualityMetrics.html. The phantompeak tools suite (https://code.google.com/p/phantompeakqualtools/) was used to compute RSC and NSC. Please refer to the ReMap 2022, 2020, and 2018 publications for more details (citations below). Data Access ReMap Atlas of regulatory regions data can be explored interactively with the Table Browser and cross-referenced with the Data Integrator. For programmatic access, the track can be accessed using the Genome Browser's REST API. ReMap annotations can be downloaded from the Genome Browser's download server as a bigBed file. This compressed binary format can be remotely queried through command line utilities. Please note that some of the download files can be quite large. Individual BED files for specific TFs, cells/biotypes, or data sets can be found and downloaded on the ReMap website. References Chèneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 2018 Jan 4;46(D1):D267-D275. PMID: 29126285; PMC: PMC5753247 Chèneby J, Ménétrier Z, Mestdagh M, Rosnet T, Douida A, Rhalloussi W, Bergon A, Lopez F, Ballester B. ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Res.
2020 Jan 8;48(D1):D180-D188. PMID: 31665499; PMC: PMC7145625 Griffon A, Barbier Q, Dalino J, van Helden J, Spicuglia S, Ballester B. Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape. Nucleic Acids Res. 2015 Feb 27;43(4):e27. PMID: 25477382; PMC: PMC4344487 Hammal F, de Langen P, Bergon A, Lopez F, Ballester B. ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 2022 Jan 7;50(D1):D316-D325. PMID: 34751401; PMC: PMC8728178 ReMapDensity ReMap density ReMap density Expression and Regulation Description This track represents the ReMap Atlas of regulatory regions, which consists of a large-scale integrative analysis of all Public ChIP-seq data for transcriptional regulators from GEO, ArrayExpress, and ENCODE. Below is a schematic diagram of the types of regulatory regions: ReMap 2022 Atlas (all peaks for each analyzed data set) ReMap 2022 Non-redundant peaks (merged similar target) ReMap 2022 Cis Regulatory Modules Display Conventions and Configuration Each transcription factor follows a specific RGB color. ChIP-seq peak summits are represented by vertical bars. Hsap: A data set is defined as a ChIP/Exo-seq experiment in a given GEO/ArrayExpress/ENCODE series (e.g. GSE41561), for a given TF (e.g. ESR1), in a particular biological condition (e.g. MCF-7). Data sets are labeled with the concatenation of these three pieces of information (e.g. GSE41561.ESR1.MCF-7). Atha: The data set is defined as a ChIP-seq experiment in a given series (e.g. GSE94486), for a given target (e.g. ARR1), in a particular biological condition (i.e. ecotype, tissue type, experimental conditions; e.g. Col-0_seedling_3d-6BA-4h). Data sets are labeled with the concatenation of these three pieces of information (e.g. GSE94486.ARR1.Col-0_seedling_3d-6BA-4h). 
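The labeling convention described above (series, target, and biological condition joined by dots) can be illustrated with a minimal Python sketch. The names DataSetLabel, make_label, and parse_label are hypothetical and not part of ReMap; the sketch also assumes that the series and target fields contain no dots, which holds for the examples given here but is not stated by ReMap.

```python
from typing import NamedTuple


class DataSetLabel(NamedTuple):
    """The three pieces of information that make up a data set label."""
    series: str     # GEO/ArrayExpress/ENCODE series, e.g. GSE41561
    target: str     # transcriptional regulator, e.g. ESR1
    condition: str  # biological condition, e.g. MCF-7


def make_label(series: str, target: str, condition: str) -> str:
    """Concatenate the three fields with dots, as in the examples above."""
    return ".".join((series, target, condition))


def parse_label(label: str) -> DataSetLabel:
    """Split a label back into its fields.

    Only the first two dots are treated as separators (an assumption),
    so any dot inside the condition field is preserved.
    """
    parts = label.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected series.target.condition, got {label!r}")
    return DataSetLabel(*parts)


if __name__ == "__main__":
    print(parse_label("GSE94486.ARR1.Col-0_seedling_3d-6BA-4h"))
    print(make_label("GSE41561", "ESR1", "MCF-7"))
```

Splitting on at most two dots keeps the parser tolerant of conditions that happen to contain a dot, at the cost of assuming the first two fields never do.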
Methods This release of ReMap (2022) presents the analysis of 5,505 quality-controlled mouse ChIP-seq data sets (n=7,317 before QC) from public sources (GEO & ENCODE). Those ChIP-seq data sets have been mapped to the GRCm38/mm10 mouse assembly. The data set is defined as a ChIP-seq experiment in a given series (e.g. GSE122715), for a given TF (e.g. USF1), in a particular biological condition (i.e. cell line, tissue type, disease state, or experimental conditions; e.g. mESC). Data sets were labeled by concatenating these three pieces of information, such as GSE122715.USF1.mESC. Those merged analyses cover a total of 656 DNA-binding proteins (transcriptional regulators), including a variety of transcription factors (TFs), transcription co-activators (TCFs), and chromatin-remodeling factors (CRFs), for 123 million peaks. ENCODE Available ENCODE ChIP-seq data sets for transcriptional regulators from the ENCODE portal were processed with the standardized ReMap pipeline. The list of ENCODE data was retrieved as FASTQ files from the ENCODE portal using filters. Metadata information in JSON format and FASTQ files were retrieved using the Python requests module. ChIP-seq processing Both Public and ENCODE data were processed similarly. Bowtie 2 (PMC3322381) (version 2.2.9) with options --end-to-end --sensitive was used to align all reads to the genome. Biological and technical replicates for each unique combination of GSE/TF/Cell type or Biological condition were used for peak calling. TFBS were identified using the MACS2 peak-calling tool (PMC3120977) (version 2.1.1.2) in order to follow ENCODE ChIP-seq guidelines, with stringent thresholds (MACS2 default thresholds, p-value: 1e-5). An input data set was used when available. Quality assessment To assess the quality of public data sets, a score was computed based on the cross-correlation and the FRiP (fraction of reads in peaks) metrics developed by the ENCODE Consortium (https://genome.ucsc.edu/ENCODE/qualityMetrics.html).
Two thresholds were defined for each of the two cross-correlation ratios (NSC, normalized strand coefficient: 1.05 and 1.10; RSC, relative strand coefficient: 0.8 and 1.0). Detailed descriptions of the ENCODE quality coefficients can be found at https://genome.ucsc.edu/ENCODE/qualityMetrics.html. The phantompeak tools suite (https://code.google.com/p/phantompeakqualtools/) was used to compute RSC and NSC. Please refer to the ReMap 2022, 2020, and 2018 publications for more details (citations below). Data Access ReMap Atlas of regulatory regions data can be explored interactively with the Table Browser and cross-referenced with the Data Integrator. For programmatic access, the track can be accessed using the Genome Browser's REST API. ReMap annotations can be downloaded from the Genome Browser's download server as a bigBed file. This compressed binary format can be remotely queried through command line utilities. Please note that some of the download files can be quite large. Individual BED files for specific TFs, cells/biotypes, or data sets can be found and downloaded on the ReMap website. References Chèneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 2018 Jan 4;46(D1):D267-D275. PMID: 29126285; PMC: PMC5753247 Chèneby J, Ménétrier Z, Mestdagh M, Rosnet T, Douida A, Rhalloussi W, Bergon A, Lopez F, Ballester B. ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Res. 2020 Jan 8;48(D1):D180-D188. PMID: 31665499; PMC: PMC7145625 Griffon A, Barbier Q, Dalino J, van Helden J, Spicuglia S, Ballester B. Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape. Nucleic Acids Res.
2015 Feb 27;43(4):e27. PMID: 25477382; PMC: PMC4344487 Hammal F, de Langen P, Bergon A, Lopez F, Ballester B. ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 2022 Jan 7;50(D1):D316-D325. PMID: 34751401; PMC: PMC8728178 simpleRepeat Simple Repeats Simple Tandem Repeats by TRF Variation and Repeats Description This track displays simple tandem repeats (possibly imperfect repeats) located by Tandem Repeats Finder (TRF) which is specialized for this purpose. These repeats can occur within coding regions of genes and may be quite polymorphic. Repeat expansions are sometimes associated with specific diseases. Methods For more information about the TRF program, see Benson (1999). Credits TRF was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 tanDups Tandem Dups Paired identical sequences Mapping and Sequencing Description There are two tracks in this composite collection: Gap Overlaps - Paired exactly identical sequence on each side of a gap Tandem Dups - Paired exactly identical sequence survey over entire genome assembly The Gap Overlaps is thus a subset of the full Tandem Dups track. This investigation began when an unusual number of paired sequences around gaps was noticed during the mouse strain sequencing project. This naturally raised the question, how common is this feature, and what type of assemblies can it be found in. The Gap Overlaps track indicates any pair of exactly identical sequence on each side of gaps. Where a gap is any run of N's, including a single N. The end of an upstream sequence before the gap is duplicated exactly at the beginning of the downstream sequence following the gap in the assembly. The Tandem Dups track is a similar survey over the entire genome assembly. 
The separation gap between these paired sequences can range from 1 base up to 20,000 bases. Methods The Gap Overlap duplicate sequences were found by extracting 1,000 bases before and after each gap and aligning them to each other with the blat command: blat -q=dna -minIdentity=95 -repMatch=10 upstreamContig.fa downstreamContig.fa The PSL output was then filtered for perfect matches (no mismatches, and therefore matching sequences of equal size) in which the alignment ends exactly at the end of the upstream sequence and begins exactly at the start of the downstream sequence. The Tandem Dups paired sequences were found with the following procedure: Generate 29-base kmers for the entire genome, allowing only kmers composed of the bases A, C, T, and G (no N's allowed). Pair up identical kmers separated by at least one base and by at most 20,000 bases. Collapse overlapping kmer pairs when they have the same sequence size and the same spacing between the pairs. This procedure preserves the definition of duplicated identical pairs. The resulting pairs can now be longer sequences with smaller separation than the constituent pairs. The final result selects paired sequences of 30 bases or more that retain at least one base as a separation gap. Collapsed pairs that close the gap are discarded; these appear to indicate simple repeat sequences. It would be interesting to have those discarded results available, but they are not available at this time. The procedure starts with 29-base kmers and then selects results of at least 30 bases because this yields a reasonable number of 30-base pairs. Starting with 30-base kmers instead produces far too many 30-base kmer pairs for a reasonable count. See Also Interactive tables of all results: Gap Overlaps Tandem Dups Credits Thank you to Joel Armstrong and Benedict Paten of the Computational Genomics Lab at the U.C. Santa Cruz Genomics Institute for identifying this characteristic of genome assemblies.
The data and presentation of this track were prepared by Hiram Clawson, U.C. Santa Cruz Genomics Institute tandemDups Tandem Dups Paired exactly identical sequence survey over entire genome assembly Mapping and Sequencing HLTOGAannotvHg38v1 TOGA vs. hg38 TOGA annotations using human/hg38 as reference Genes and Gene Predictions Description TOGA (Tool to infer Orthologs from Genome Alignments) is a homology-based method that integrates gene annotation, inferring orthologs and classifying genes as intact or lost. Methods As input, TOGA uses a gene annotation of a reference species (human/hg38 for mammals, chicken/galGal6 for birds) and a whole genome alignment between the reference and query genome. TOGA implements a novel paradigm that relies on alignments of intronic and intergenic regions and uses machine learning to accurately distinguish orthologs from paralogs or processed pseudogenes. To annotate genes, CESAR 2.0 is used to determine the positions and boundaries of coding exons of a reference transcript in the orthologous genomic locus in the query species. Display Conventions and Configuration Each annotated transcript is shown in a color-coded classification as "intact": middle 80% of the CDS (coding sequence) is present and exhibits no gene-inactivating mutation. These transcripts likely encode functional proteins. "partially intact": 50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation. These transcripts may also encode functional proteins, but the evidence is weaker as parts of the CDS are missing, often due to assembly gaps. "missing": <50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation. "uncertain loss": there is 1 inactivating mutation in the middle 80% of the CDS, but evidence is not strong enough to classify the transcript as lost. These transcripts may or may not encode a functional protein. 
"lost": typically several inactivating mutations are present, thus there is strong evidence that the transcript is unlikely to encode a functional protein. Clicking on a transcript provides additional information about the orthology classification, inactivating mutations, the protein sequence and protein/exon alignments. Credits This data was prepared by the Michael Hiller Lab References The TOGA software is available from github.com/hillerlab/TOGA Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L et al. Integrating gene annotation with orthology inference at scale. Science. 2023 Apr 28;380(6643):eabn3107. PMID: 37104600; PMC: PMC10193443 knownAlt UCSC Alt Events Alternative Splicing, Alternative Promoter and Similar Events in UCSC Genes Genes and Gene Predictions Description This track shows various types of alternative splicing and other events that result in more than a single transcript from the same gene. The label by an item describes the type of event. The events are: Alternate Promoter (altPromoter) - Transcription starts at multiple places. The altPromoter extends from 100 bases before to 50 bases after transcription start. Alternate Finish Site (altFinish) - Transcription ends at multiple places. Cassette Exon (cassetteExon) - Exon is present in some transcripts but not others. These are found by looking for exons that overlap an intron in the same transcript. Retained Intron (retainedIntron) - Introns are spliced out in some transcripts but not others. In some cases, particularly when the intron is near the 3' end, this can reflect an incompletely processed transcript rather than a true alt-splicing event. Overlapping Exon (bleedingExon) - Initial or terminal exons overlap in an intron in another transcript. These often are associated with incompletely processed transcripts. Alternate 3' End (altThreePrime) - Variations on the 3' end of an intron. 
Alternate 5' End (altFivePrime) - Variations on the 5' end of an intron. Intron Ends have AT/AC (atacIntron) - An intron with AT/AC ends rather than the usual GT/AG. These are associated with the minor spliceosome. Strange Intron Ends (strangeSplice) - An intron with ends that are not GT/AG, GC/AG, or AT/AC. These are usually artifacts of some sort due to sequencing error or polymorphism. Credits This track is based on an analysis by the txgAnalyse program of splicing graphs produced by the txGraph program. Both of these programs were written by Jim Kent at UCSC. liftOverMm10 UCSC liftOver mm10 UCSC liftOver alignments to mm10 Mapping and Sequencing Description This track shows the 'lift over' calculated alignments from the mm10/mouse to the mm39/mouse genome assembly, used by the UCSC liftOver tool. Display Conventions and Configuration The alignments are shown as "chains" of alignable regions. The display is similar to the other chain tracks, see our chain display documentation for more information. Data access UCSC liftOver chain files for mm10 to mm39 can be obtained from a dedicated directory on our download server mm10/liftOver directory. Specifically, the mm39ToMm10.over.chain.gz file. The table can also be explored interactively with the Table Browser or the Data Integrator. Methods The lift over chain file is constructed with the doSameSpeciesLiftOver processing using the blat genome alignment program. uniprot UniProt UniProt SwissProt/TrEMBL Protein Annotations Genes and Gene Predictions Description This track shows protein sequences and annotations on them from the UniProt/SwissProt database, mapped to genomic coordinates. UniProt/SwissProt data has been curated from scientific publications by the UniProt staff, UniProt/TrEMBL data has been predicted by various computational algorithms. The annotations are divided into multiple subtracks, based on their "feature type" in UniProt. 
The first two subtracks below - one for SwissProt, one for TrEMBL - show the alignments of protein sequences to the genome, all other tracks below are the protein annotations mapped through these alignments to the genome. Track Name Description UCSC Alignment, SwissProt = curated protein sequences Protein sequences from SwissProt mapped to the genome. All other tracks are (start,end) SwissProt annotations on these sequences mapped through this alignment. Even protein sequences without a single curated annotation (splice isoforms) are visible in this track. Each UniProt protein has one main isoform, which is colored in dark. Alternative isoforms are sequences that do not have annotations on them and are colored in light-blue. They can be hidden with the TrEMBL/Isoform filter (see below). UCSC Alignment, TrEMBL = predicted protein sequences Protein sequences from TrEMBL mapped to the genome. All other tracks below are (start,end) TrEMBL annotations mapped to the genome using this track. This track is hidden by default. To show it, click its checkbox on the track configuration page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Regions of Interest Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. 
UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between GenBank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations, e.g. compositional bias. For consistency and convenience for users of mutation-related tracks, the subtrack "UniProt/SwissProt Variants" is a copy of the track "UniProt Variants" in the track group "Phenotype and Literature", or "Variation and Repeats", depending on the assembly. Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). Clicking on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Duplicate annotations are removed as far as possible: if a TrEMBL annotation has the same genome position and same feature type, comment, disease and mutated amino acids as a SwissProt annotation, it is not shown again. Two annotations mapped through different protein sequence alignments but with the same genome coordinates are only shown once.
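The duplicate-removal rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the track: the Annotation record, its field names, and remove_duplicates are hypothetical, and the sketch assumes that "same genome position" means identical chromosome, start, and end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one annotation mapped to the genome."""
    chrom: str
    start: int
    end: int
    feature_type: str
    comment: str
    disease: str
    mutated_aa: str
    source: str        # "SwissProt" or "TrEMBL"
    alignment_id: str  # protein-genome alignment the annotation came through


def dedup_key(a: Annotation) -> tuple:
    """Fields that must all match for two annotations to count as duplicates.

    source and alignment_id are deliberately left out, so a TrEMBL copy of a
    SwissProt annotation, or the same annotation mapped through a different
    alignment, collapses onto one entry.
    """
    return (a.chrom, a.start, a.end, a.feature_type,
            a.comment, a.disease, a.mutated_aa)


def remove_duplicates(annotations: list) -> list:
    """Keep one annotation per key, preferring SwissProt over TrEMBL."""
    # sorted() is stable, so SwissProt entries move to the front while the
    # original order within each source is preserved.
    ordered = sorted(annotations, key=lambda a: a.source != "SwissProt")
    seen = set()
    kept = []
    for a in ordered:
        key = dedup_key(a)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept
```

Sorting SwissProt entries first means the curated copy is the one that survives when a predicted TrEMBL annotation is otherwise identical.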
On the configuration page of this track, you can choose to hide any TrEMBL annotations. This filter will also hide the UniProt alternative isoform protein sequences because both types of information are less relevant to most users. Please contact us if you want more detailed filtering features. Note that for the human hg38 assembly and SwissProt annotations, there is also a public track hub prepared by UniProt itself, with genome annotations maintained by UniProt using their own mapping method based on those Gencode/Ensembl gene models that are annotated in UniProt for a given protein. For proteins that differ from the genome, UniProt's mapping method will, in most cases, map a protein and its annotations to an unexpected location (see below for details on UCSC's mapping method). Methods Briefly, UniProt protein sequences were aligned to the transcripts associated with the protein, the top-scoring alignments were retained, and the result was projected to the genome through a transcript-to-genome alignment. Depending on the genome, the transcript-genome alignment was either provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI RefSeq for hg38 and UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus are tried, in this order. The resulting protein-genome alignments of this process are available in the file formats for liftOver or pslMap from our data archive (see "Data Access" section below). An important step of the mapping process protein -> transcript -> genome is filtering the alignment from protein to transcript.
Due to differences between the UniProt proteins and the transcripts (proteins were made many years before the transcripts were made, and human genomes have variants), the transcript with the highest BLAST score when aligning the protein to all transcripts is not always the correct transcript for a protein sequence. Therefore, the protein sequence is aligned to only a very short list of one or sometimes more transcripts, selected by a three-step procedure: Use transcripts directly annotated by UniProt: for organisms that have a RefSeq transcript track, proteins are aligned to the RefSeq transcripts that are annotated by UniProt for this particular protein. Use transcripts for NCBI Gene ID annotated by UniProt: If no transcripts are annotated on the protein, or the annotated ones have been deprecated by NCBI, but a NCBI Gene ID is annotated, the RefSeq transcripts for this Gene ID are used. This can result in multiple matching transcripts for a protein. Use best matching transcript: If no NCBI Gene is annotated, then BLAST scores are used to pick the transcripts. There can be multiple transcripts for one protein, as their coding sequences can be identical. All transcripts within 1% of the highest observed BLAST score are used. For strategy 2 and 3, many of the transcripts found do not differ in coding sequence, so the resulting alignments on the genome will be identical. Therefore, any identical alignments are removed in a final filtering step. The details page of these alignments will contain a list of all transcripts that result in the same protein-genome alignment. On hg38, only a handful of edge cases (pseudogenes, very recently added proteins) remain in 2023 where strategy 3 has to be used. 
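The three-step selection of candidate transcripts described above can be sketched as follows. This is a simplified illustration, not the track's build script: select_transcripts and its parameters are hypothetical, and "within 1% of the highest observed BLAST score" is interpreted here as a score of at least 99% of the best score, which is an assumption.

```python
def select_transcripts(annotated, gene_transcripts, blast_scores, available):
    """Pick the transcripts a protein should be aligned to.

    annotated        -- transcript IDs annotated by UniProt for this protein
    gene_transcripts -- RefSeq transcript IDs for the UniProt-annotated
                        NCBI Gene ID (empty if no Gene ID is annotated)
    blast_scores     -- dict of transcript ID -> BLAST score of the protein
                        against every transcript
    available        -- set of transcript IDs present in the transcript track

    Returns (strategy_number, list_of_transcript_ids).
    """
    # Strategy 1: transcripts directly annotated by UniProt, provided they
    # still exist in the transcript track (deprecated ones are skipped).
    direct = [t for t in annotated if t in available]
    if direct:
        return 1, direct

    # Strategy 2: all transcripts of the annotated NCBI Gene ID.
    by_gene = [t for t in gene_transcripts if t in available]
    if by_gene:
        return 2, by_gene

    # Strategy 3: fall back to BLAST scores and keep every transcript whose
    # score is within 1% of the best one (assumed: score >= 0.99 * best).
    if not blast_scores:
        return 3, []
    best = max(blast_scores.values())
    cutoff = 0.99 * best
    return 3, sorted(t for t, s in blast_scores.items() if s >= cutoff)
```

Because strategies 2 and 3 can return several transcripts with identical coding sequence, a real pipeline would still need the final step mentioned above that removes identical protein-genome alignments.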
In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a protein sequence to the correct transcript, we use a three stage process: If UniProt has annotated a given RefSeq transcript for a given protein sequence, the protein is aligned to this transcript. Any difference in the version suffix is tolerated in this comparison. If no transcript is annotated or the transcript cannot be found in the NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a set of NCBI RefSeq transcript IDs via the most current version of NCBI genes tables. Only the top match of the resulting alignments and all others within 1% of its score are used for the mapping. If no transcript can be found after step (2), the protein is aligned to all transcripts, the top match, and all others within 1% of its score are used. This system was designed to resolve the problem of incorrect mappings of proteins, mostly on hg38, due to differences between the SwissProt sequences and the genome reference sequence, which has changed since the proteins were defined. The problem is most pronounced for gene families composed of either very repetitive or very similar proteins. To make sure that the alignments always go to the best chromosome location, all _alt and _fix reference patch sequences are ignored for the alignment, so the patches are entirely free of UniProt annotations. Please contact us if you have feedback on this process or example edge cases. We are not aware of a way to evaluate the results completely and in an automated manner. Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome positions with pslMap and filtered again with pslReps. UniProt annotations were obtained from the UniProt XML file. The UniProt annotations were then mapped to the genome through the alignment described above using the pslMap program. 
This approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on Github. Older releases This track is automatically updated on an ongoing basis, every 2-3 months. The current version name is always shown on the track details page, it includes the release of UniProt, the version of the transcript set and a unique MD5 that is based on the protein sequences, the transcript sequences, the mapping file between both and the transcript-genome alignment. The exact transcript that was used for the alignment is shown when clicking a protein alignment in one of the two alignment tracks. For reproducibility of older analysis results and for manual inspection, previous versions of this track are available for browsing in the form of the UCSC UniProt Archive Track Hub (click this link to connect the hub now). The underlying data of all releases of this track (past and current) can be obtained from our downloads server, including the UniProt protein-to-genome alignment. Data Access The raw data of the current track can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/mm39/uniprot/unipStruct.bb -chrom=chr6 -start=0 -end=1000000 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. 
Lifting from UniProt to genome coordinates in pipelines To facilitate mapping protein coordinates to the genome, we provide the alignment files in formats that are suitable for our command line tools. Our command line programs liftOver or pslMap can be used to map coordinates on protein sequences to genome coordinates. The filenames are unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap). Example commands: wget -q https://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/uniprot/2022_03/unipToGenome.over.chain.gz wget -q https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver chmod a+x liftOver echo 'Q99697 1 10 annotationOnProtein' > prot.bed liftOver prot.bed unipToGenome.over.chain.gz genome.bed cat genome.bed Credits This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipConflict Seq. Conflicts UniProt Sequence Conflicts Genes and Gene Predictions unipRepeat Repeats UniProt Repeats Genes and Gene Predictions unipStruct Structure UniProt Protein Primary/Secondary Structure Annotations Genes and Gene Predictions unipOther Other Annot. 
UniProt Other Annotations Genes and Gene Predictions unipMut Mutations UniProt Amino Acid Mutations Genes and Gene Predictions unipModif AA Modifications UniProt Amino Acid Modifications Genes and Gene Predictions unipDomain Domains UniProt Domains Genes and Gene Predictions unipDisulfBond Disulf. Bonds UniProt Disulfide Bonds Genes and Gene Predictions unipChain Chains UniProt Mature Protein Products (Polypeptide Chains) Genes and Gene Predictions unipLocCytopl Cytoplasmic UniProt Cytoplasmic Domains Genes and Gene Predictions unipLocTransMemb Transmembrane UniProt Transmembrane Domains Genes and Gene Predictions unipInterest Interest UniProt Regions of Interest Genes and Gene Predictions unipLocExtra Extracellular UniProt Extracellular Domain Genes and Gene Predictions unipLocSignal Signal Peptide UniProt Signal Peptides Genes and Gene Predictions unipAliTrembl TrEMBL Aln. UCSC alignment of TrEMBL proteins to genome Genes and Gene Predictions unipAliSwissprot SwissProt Aln. UCSC alignment of SwissProt proteins to genome (dark blue: main isoform, light blue: alternative isoforms) Genes and Gene Predictions vistaEnhancersBb VISTA Enhancers VISTA Enhancers Expression and Regulation Description This track shows potential enhancers whose activity was experimentally validated in transgenic mice. Most of these noncoding elements were selected for testing based on their extreme conservation in other vertebrates or epigenomic evidence (ChIP-Seq) of putative enhancer marks. More information can be found on the VISTA Enhancer Browser page. Display Conventions and Configuration Items appearing in red (positive) indicate that a reproducible pattern was observed in the in vivo enhancer assay. Items appearing in blue (negative) indicate that NO reproducible pattern was observed in the in vivo enhancer assay. 
Note that this annotation refers only to the single developmental timepoint that was tested in this screen (e11.5) and does not exclude the possibility that this region is a reproducible enhancer active at earlier or later timepoints in development. Methods Excerpted from the VISTA Enhancer Mouse Enhancer Screen Handbook and Methods page at the Lawrence Berkeley National Laboratory (LBNL) website: Enhancer Candidate Identification Most enhancer candidate sequences are identified by extreme evolutionary sequence conservation or by ChIP-seq. Detailed information related to enhancer identification by extreme evolutionary conservation can be found in the following publications:
Pennacchio et al., Genomic strategies to identify mammalian regulatory sequences. Nature Rev Genet 2001
Nobrega et al., Scanning human gene deserts for long-range enhancers. Science 2003
Pennacchio et al., In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006
Visel et al., Enhancer identification through comparative genomics. Semin Cell Dev Biol. 2007
Visel et al., Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nature Genet 2008
Detailed information related to enhancer identification by ChIP-seq can be found in the following publications:
Visel et al., ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 2009
Visel et al., Genomic views of distant-acting enhancers. Nature 2009
See the Transgenic Mouse Assay section of the Mouse Enhancer Screen Handbook and Methods for the experimental procedures that were used to perform the transgenic assays. UCSC converted the experimental data for hg19 and mm9 into bigBed format using the bedToBigBed utility. The data for hg38 were lifted over from hg19. The data for mm10 and mm39 were lifted over from mm9. Data Access VISTA Enhancers data can be explored interactively with the Table Browser and cross-referenced with the Data Integrator.
For programmatic access, the track can be accessed using the Genome Browser's REST API. VISTA Enhancers annotations can be downloaded from the Genome Browser's download server as a bigBed file. This compressed binary format can be remotely queried through command line utilities. Please note that some of the download files can be quite large. Credits Thanks to the Lawrence Berkeley National Laboratory for providing this data. References Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res. 2007 Jan;35(Database issue):D88-92. PMID: 17130149; PMC: PMC1716724 windowmaskerSdust WM + SDust Genomic Intervals Masked by WindowMasker + SDust Variation and Repeats Description This track depicts masked sequence as determined by WindowMasker. The WindowMasker tool is included in the NCBI C++ toolkit. The source code for the entire toolkit is available from the NCBI FTP site. Methods To create this track, WindowMasker was run with the following parameters:
windowmasker -mk_counts true -input mm39.fa -output wm_counts
windowmasker -ustat wm_counts -sdust true -input mm39.fa -output repeats.bed
The repeats.bed (BED3) file was loaded into the "windowmaskerSdust" table for this track. References Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006 Jan 15;22(2):134-41. PMID: 16287941