cons26way Conservation Nematode Multiz Alignment & Conservation (26 Species) Comparative Genomics Downloads for data in this track are available: Multiz alignments (MAF format), and phylogenetic trees PhyloP conservation (WIG format) PhastCons conservation (WIG format) Description This track shows multiple alignments of 26 nematode species and measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package, for all 26 species. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track. PhastCons is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites. The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution). In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). 
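As a small illustration of this sign convention (the sentence that follows notes that score magnitudes are -log p-values), the sketch below converts a p-value and a direction into a signed score and back. It assumes base-10 logarithms; the function names are hypothetical and are not part of PHAST.

```python
import math


def phylop_style_score(p_value, conserved):
    """Return a signed score: positive for conservation, negative for acceleration.

    Assumption: the magnitude is -log10(p); the track text only says "-log".
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    magnitude = -math.log10(p_value)
    return magnitude if conserved else -magnitude


def p_value_from_score(score):
    """Recover the p-value from a signed score (the sign only encodes direction)."""
    return 10.0 ** (-abs(score))


if __name__ == "__main__":
    print(phylop_style_score(0.001, conserved=True))   # conserved site, positive score
    print(phylop_style_score(0.01, conserved=False))   # fast-evolving site, negative score
```

Under this reading, a score of about 3 or -3 corresponds to p = 0.001, with the sign indicating whether the site evolves slower or faster than the neutral expectation.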
The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data, and both were run with the same parameters. UCSC has repeatmasked and aligned all genome assemblies, and provides all the sequences for download. For genome assemblies not available in the genome browser, there are alternative browser views in the preview genome browser. The species aligned for this track include 26 nematode genome sequences. Compared to the previous 6-nematode alignment (ce11), this track includes 4 new nematode genomes and 2 nematode genomes with updated sequence assemblies (Table 1). The four new assemblies are H. contortus (haeCon1) at an unknown coverage, M. hapla (melHap1) at 10.4X coverage, M. incognita (melInc1) at 5X coverage, and B. malayi (bruMal1) at 9X coverage. The C. japonica (22X, caeJap3) and P. pacificus (8.92X, priPac2) assemblies have been updated from those used in the previous 6-species nematode alignment.

Organism | Species | Release date | UCSC/WormBase version | Alignment type
C. elegans | Caenorhabditis elegans | Aug. 2014 | ce11/WBcel235/GCA_000002985.3 | reference species
C. brenneri | Caenorhabditis brenneri | Nov. 2010 | caePb3/WS227_C. brenneri 6.0.1b | MAF Net
C. remanei | Caenorhabditis remanei | Jul. 2007 | caeRem4/WS220 | MAF Net
C. briggsae | Caenorhabditis briggsae | Apr. 2011 | cb4/WS225 | MAF Net
C. japonica | Caenorhabditis japonica | Aug. 2010 | caeJap4/WS227_WUSTL 7.0.1/GCA_000147155.1 | MAF Net
C. tropicalis | Caenorhabditis tropicalis | Nov. 2010 | caeSp111/WS226_WUSTL 3.0.1 | MAF Net
C. angaria | Caenorhabditis angaria | Apr. 2012 | caeAng2/WS232/ps1010rel8 | MAF Net
C. sp. 5 ju800 | Caenorhabditis sp5 ju800 | Jan. 2012 | caeSp51/WS230_Caenorhabditis_sp_5-JU800-1.0 | MAF Net
H. bacteriophora/m31e | Heterorhabditis bacteriophora | Aug. 2011 | hetBac1/WS229_H. bacteriophora 7.0/GCA_000223415.1 | MAF Net
Threadworm | Strongyloides ratti | Sep. 2014 | strRat2/S. ratti ED321/GCA_001040885.1 | MAF Net
Microworm | Panagrellus redivivus | Feb. 2013 | panRed1/WS240_Pred3/GCA_000341325.1 | MAF Net
A. ceylanicum | Ancylostoma ceylanicum | Mar. 2014 | ancCey1/WS243_Acey_2013.11.30.genDNA/GCA_000688135.1 | MAF Net
N. americanus | Necator americanus | Dec. 2013 | necAme1/WS242_N_americanus_v1/GCA_000507365.1 | MAF Net
Barber pole worm | Haemonchus contortus | Jul. 2013 | haeCon2/WS239_Haemonchus_contortus_MHco3-2.0 | MAF Net
Pig roundworm | Ascaris suum | Sep. 2012 | ascSuu1/GCA_000298755.1 | MAF Net
P. exspectatus | Pristionchus exspectatus | Mar. 2014 | priExs1/WS243_P_exspectatus_v1 | MAF Net
P. pacificus | Pristionchus pacificus | Aug. 2014 | priPac3/WS221_P_pacificus-v2 | MAF Net
M. hapla | Meloidogyne hapla | Sep. 2008 | melHap1/WS210_M. hapla VW9 | MAF Net
M. incognita | Meloidogyne incognita | Feb. 2008 | melInc2/WS245_M. incognita PRJEA28837 | MAF Net
Pine wood nematode | Bursaphelenchus xylophilus | Nov. 2011 | burXyl1/WS229_B. xylophilus Ka4C1 | MAF Net
Dog heartworm | Dirofilaria immitis | Sep. 2013 | dirImm1/WS240_D. immitis v2.2 | MAF Net
Eye worm | Loa loa | Jul. 2012 | loaLoa1/WS235_L_loa_Cameroon_isolate/GCA_000183805.3 | MAF Net
O. volvulus | Onchocerca volvulus | Nov. 2013 | oncVol1/WS241_O_volvulus_Cameroon_v3/GCA_000499405.1 | MAF Net
Filarial worm | Brugia malayi | May 2014 | bruMal2/WS244_B_malayi-3.1 | MAF Net
Trichinella | Trichinella spiralis | Jan. 2011 | triSpi1/WS225_Trichinella_spiralis-3.7.1/GCA_000181795.2 | MAF Net
Whipworm | Trichuris suis | Jul. 2014 | triSui1/WS243_T. suis DCEP-RM93M male/GCA_000701005.1 | MAF Net
Table 1. Genome assemblies included in the 26-way Conservation track.

Display Conventions and Configuration In full and pack display modes, conservation scores are displayed as a wiggle track (histogram) in which the height reflects the size of the score. The conservation wiggles can be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. 
Pairwise alignments of each species to the C. elegans genome are displayed below the conservation histogram as a grayscale density plot (in pack mode) or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, conservation is shown in grayscale using darker values to indicate higher levels of overall conservation as scored by phastCons. Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Note that excluding species from the pairwise display does not alter the conservation score display. To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment. Gap Annotation The Display chains between alignments configuration option enables display of gaps between alignment blocks in the pairwise alignments in a manner similar to the Chain track display. The following conventions are used: Single line: No bases in the aligned species. Possibly due to a lineage-specific insertion between the aligned blocks in the C. elegans genome or a lineage-specific deletion between the aligned blocks in the aligning species. Double line: Aligning species has one or more unalignable bases in the gap region. Possibly due to excessive evolutionary distance between species or independent indels in the region between the aligned blocks in both species. Pale yellow coloring: Aligning species has Ns in the gap region. Reflects uncertainty in the relationship between the DNA of both species, due to lack of sequence in relevant portions of the aligning species. Genomic Breaks Discontinuities in the genomic context (chromosome, scaffold or region) of the aligned DNA in the aligning species are shown as follows: Vertical blue bar: Represents a discontinuity that persists indefinitely on either side, e.g. 
a large region of DNA on either side of the bar comes from a different chromosome in the aligned species due to a large scale rearrangement. Green square brackets: Enclose shorter alignments consisting of DNA from one genomic context in the aligned species nested inside a larger chain of alignments from a different genomic context. The alignment within the brackets may represent a short misalignment, a lineage-specific insertion of a transposon in the C. elegans genome that aligns to a paralogous copy somewhere else in the aligned species, or other similar occurrence. Base Level When zoomed-in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the C. elegans sequence at those alignment positions relative to the longest non-C. elegans sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+". Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes: No codon translation: The gene annotation is not used; the bases are displayed without translation. Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment. Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding. 
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation. Codon translation uses the following gene tracks as the basis for translation, depending on the species chosen (Table 2). Species listed in the row labeled "no annotation" do not have species-specific reading frames for gene translation.

Gene track | Species
WS245 WormBase Genes | A. ceylanicum, Barber pole worm/H. contortus, C. angaria, C. brenneri, C. briggsae, C. elegans, C. japonica, C. remanei, C. sp. 5 ju800, C. tropicalis, Dog heartworm/D. immitis, Eye worm/L. loa, Filarial worm/B. malayi, H. bacteriophora/m31e, Microworm/P. redivivus, M. hapla, M. incognita, N. americanus, O. volvulus, P. exspectatus, P. pacificus, Pine wood nematode/B. xylophilus, Trichinella/T. spiralis, Whipworm/T. suis
NCBI gene annotations | Threadworm/S. ratti
no annotation | Pig roundworm/A. suum
Table 2. Gene tracks used for codon translation.

Methods Pairwise alignments with the C. elegans genome were generated for each species using lastz from repeat-masked genomic sequence. Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. All pairwise alignment and chaining parameters are the same for all pairs. See also: nematode 26-way alignment parameters. High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net. For more information about the chaining and netting process and parameters for each species, see the description pages for the Chain and Net tracks. 
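The chaining step described above can be sketched as a dynamic program over gapless blocks. This is a simplified illustration, not the axtChain implementation: it uses a plain O(n^2) scan instead of a kd-tree, and the linear gap cost (gap_cost_per_base) and the Block type are hypothetical choices made only for this example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    t_start: int   # start in the target (reference) genome
    t_end: int     # end (exclusive) in the target genome
    q_start: int   # start in the query (aligning) genome
    q_end: int     # end (exclusive) in the query genome
    score: float   # score of this gapless block


def best_chain(blocks: List[Block], gap_cost_per_base: float = 0.01) -> Tuple[float, List[Block]]:
    """Return the highest-scoring chain of blocks that is collinear in both genomes."""
    if not blocks:
        return 0.0, []
    # Any valid predecessor ends before this block starts, so it sorts earlier.
    order = sorted(range(len(blocks)), key=lambda i: (blocks[i].t_start, blocks[i].q_start))
    best = {}   # best chain score ending at block i
    prev = {}   # predecessor of block i in that chain
    for pos, i in enumerate(order):
        b = blocks[i]
        best[i] = b.score
        prev[i] = None
        for j in order[:pos]:
            a = blocks[j]
            if a.t_end <= b.t_start and a.q_end <= b.q_start:
                gap = (b.t_start - a.t_end) + (b.q_start - a.q_end)
                candidate = best[j] + b.score - gap_cost_per_base * gap
                if candidate > best[i]:
                    best[i] = candidate
                    prev[i] = j
    end: Optional[int] = max(best, key=best.get)
    top_score = best[end]
    chain = []
    while end is not None:       # walk predecessors back to the chain start
        chain.append(blocks[end])
        end = prev[end]
    chain.reverse()
    return top_score, chain


if __name__ == "__main__":
    demo = [
        Block(0, 100, 0, 100, 100.0),
        Block(150, 250, 160, 260, 100.0),
        Block(120, 220, 10, 60, 50.0),    # out of order in the query, so it is left out
        Block(300, 400, 300, 400, 80.0),
    ]
    score, chain = best_chain(demo)
    print(round(score, 3), [(b.t_start, b.t_end) for b in chain])
```

Netting would then place the best chains along the reference and fill remaining gaps with lower-scoring chains; that step is not shown here.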
The resulting best-in-genome pairwise alignments were progressively aligned using multiz/autoMZ, following the tree topology diagrammed above, to produce multiple alignments. The multiple alignments were post-processed to add annotations indicating alignment gaps, genomic breaks, and base quality of the component sequences. The annotated multiple alignments, in MAF format, are available for bulk download. An alignment summary table containing an entry for each alignment block in each species was generated to improve track display performance at large scales. Framing tables were constructed to enable visualization of codons in the multiple alignment display. Phylogenetic Tree Model Both phastCons and phyloP are phylogenetic methods that rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix. The nematode tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 26-way alignment (msa_view). The 4d sites were derived from the WormBase/Sanger gene set of C. elegans, filtered to select single-coverage long transcripts. PhastCons Conservation The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. 
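To make the idea of a per-site posterior concrete, here is a toy two-state HMM computed with the forward-backward algorithm. It is only a sketch: real phastCons derives each column's likelihood from the phylogenetic model, whereas this example takes those likelihoods as given inputs. The mapping from expected length and target coverage to transition probabilities (mu = 1/expected_length, nu = mu * coverage / (1 - coverage)) is an assumption based on the phastCons parameterization described in Siepel et al. 2005, and the function name is hypothetical.

```python
def conserved_posteriors(lik_cons, lik_non, expected_length=15.0, target_coverage=0.3):
    """Posterior probability of the conserved state at each column (toy sketch).

    lik_cons[i] / lik_non[i]: likelihood of column i under the conserved /
    non-conserved model (made-up inputs here, not computed from a tree).
    """
    n = len(lik_cons)
    if n == 0 or len(lik_non) != n:
        raise ValueError("need two non-empty likelihood lists of equal length")
    mu = 1.0 / expected_length                            # conserved -> non-conserved
    nu = mu * target_coverage / (1.0 - target_coverage)   # non-conserved -> conserved
    trans = {"c": {"c": 1.0 - mu, "n": mu}, "n": {"c": nu, "n": 1.0 - nu}}
    start = {"c": target_coverage, "n": 1.0 - target_coverage}  # stationary distribution
    emit = {"c": lik_cons, "n": lik_non}

    # Forward pass, normalised per column to avoid numerical underflow.
    fwd = []
    for i in range(n):
        cur = {}
        for s in "cn":
            prior = start[s] if i == 0 else sum(fwd[-1][r] * trans[r][s] for r in "cn")
            cur[s] = prior * emit[s][i]
        total = cur["c"] + cur["n"]
        fwd.append({s: cur[s] / total for s in "cn"})

    # Backward pass, also normalised per column.
    bwd = [{"c": 1.0, "n": 1.0} for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = {s: sum(trans[s][r] * emit[r][i + 1] * bwd[i + 1][r] for r in "cn") for s in "cn"}
        total = cur["c"] + cur["n"]
        bwd[i] = {s: cur[s] / total for s in "cn"}

    # Combine: the per-column scaling cancels when normalising over the two states.
    post = []
    for f, b in zip(fwd, bwd):
        c = f["c"] * b["c"]
        non = f["n"] * b["n"]
        post.append(c / (c + non))
    return post


if __name__ == "__main__":
    # Ten columns that favour the conserved model, then ten that favour the other.
    scores = conserved_posteriors([0.9] * 10 + [0.1] * 10, [0.1] * 10 + [0.9] * 10)
    print([round(p, 3) for p in scores])
```

Because each posterior depends on neighbouring columns through the transition probabilities, the resulting curve is smoother than a column-by-column score.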
These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly-conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al. 2005. The phastCons parameters were tuned to produce approximately 70% conserved elements in the C. elegans WormBase/Sanger gene coding regions. This parameter set (expected-length=15, target-coverage=0.3, rho=0.3) was then used to generate the nematode and caenorhabditis conservation scoring. PhyloP Conservation The phyloP program supports several different methods for computing p-values of conservation or acceleration, for individual nucleotides or larger elements (http://compgen.cshl.edu/phast/). Here it was used to produce separate scores at each base (--wig-scores option), considering all branches of the phylogeny rather than a particular subtree or lineage (i.e., the --subtree option was not used). The scores were computed by performing a likelihood ratio test at each alignment column (--method LRT), and scores for both conservation and acceleration were produced (--mode CONACC). Conserved Elements The conserved elements were predicted by running phastCons with the --most-conserved (aka --viterbi) option. The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. 
The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created using the following programs: Alignment tools: lastz (formerly blastz) and multiz by Bob Harris, Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC The phylogenetic tree is based on Kiontke et al. (2007). References Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. 
Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Blastz: Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 Phylogenetic Tree: Kiontke K, Barrière A, Kolotuev I, Podbilewicz B, Sommer R, Fitch DH, Félix MA. Trends, stasis, and drift in the evolution of nematode vulva development. Curr Biol. 2007 Nov 20;17(22):1925-37. 
PMID: 18024125 cons26wayViewalign Multiz Alignments Nematode Multiz Alignment & Conservation (26 Species) Comparative Genomics multiz26way Multiz Align Multiz Alignments of 26 nematode assemblies Comparative Genomics cons26wayViewphastcons Element Conservation (phastCons) Nematode Multiz Alignment & Conservation (26 Species) Comparative Genomics phastCons26way 26-way Cons 26 nematodes conservation by PhastCons Comparative Genomics cons26wayViewelements Conserved Elements Nematode Multiz Alignment & Conservation (26 Species) Comparative Genomics phastConsElements26way 26-way El 26 nematodes Conserved Elements Comparative Genomics cons26wayViewphyloP Basewise Conservation (phyloP) Nematode Multiz Alignment & Conservation (26 Species) Comparative Genomics phyloP26way 26 nematodes Cons 26 nematodes Basewise Conservation by PhyloP Comparative Genomics cpgIslandExt CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Expression and Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. 
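One way to quantify "higher than typical" is the observed-to-expected CpG ratio given in the Methods section below. The following sketch computes that ratio, the percentage CpG and the GC content for a single sequence, and applies the three filtering criteria stated in Methods. It does not implement the cpg_lh segment search itself, and the function names are hypothetical.

```python
def cpg_island_stats(seq):
    """Compute length, CpG count, GC%, CpG% and Obs/Exp CpG for one sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")   # CG cannot overlap itself, so count() is exact
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = cpg_count * n / (c_count * g_count) if c_count and g_count else 0.0
    return {
        "length": n,
        "cpg_count": cpg_count,
        "gc_percent": 100.0 * (c_count + g_count) / n,
        "cpg_percent": 100.0 * 2 * cpg_count / n,   # CpG bases = twice the CpG count
        "obs_exp": obs_exp,
    }


def passes_island_criteria(stats):
    """Apply the criteria from Methods: GC >= 50%, length > 200 bp, Obs/Exp > 0.6."""
    return (
        stats["gc_percent"] >= 50.0
        and stats["length"] > 200
        and stats["obs_exp"] > 0.6
    )


if __name__ == "__main__":
    stats = cpg_island_stats("CG" * 150)
    print(stats, passes_island_criteria(stats))
```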
The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria:
GC content of 50% or greater
length greater than 200 bp
ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment
The entire genome sequence, masking areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)):
    Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
where N = length of sequence. The calculation of the track data is performed by the following command sequence:
    twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
      | cpg_lh /dev/stdin 2> cpg_lh.err \
      | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
      | sort -k1,1 -k2,2n > cpgIsland.bed
The unmasked track data is constructed from twoBitToFa -noMask output for the twoBitToFa command. Data access The CpG Islands track and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. 
All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447
cons135way Cons 135 species Multiz Alignment & Conservation (135 species: 112 nematodes, 22 flatworms and Ciona intestinalis) Comparative Genomics Description This track shows multiple alignments of 135 species: 112 nematodes, 22 flatworms and one Ciona intestinalis sequence and measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package, for all 135 species. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track. The phylogenetic tree was derived from kmers in common counting between the sequences to obtain a 'distance' matrix, then using the phylip command 'neighbors' operation for the simple neighbor joining algorithm to establish this binary tree. This tree is not necessarily biologically correct, but it does serve as a useful guide tree for the multiz alignment procedure. See also: Phylip distance operations, assembly and alignment-free phylogeny reconstruction, and recapitulating phylogenies using k-mers. PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. 
As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites. The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution). In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data. See also: lastz parameters and other details, and chain minimum score and gap parameters used in these alignments. Missing sequence in the assemblies is highlighted in the track display by regions of yellow when zoomed out and Ns displayed at base level (see Gap Annotation, below).

Organism | Species | Assembly name | Browser or NCBI source | Alignment type
C. elegans | Caenorhabditis elegans | Feb. 2013 (WBcel235/ce11) | Feb. 2013 (WBcel235/ce11) | reference
A. ceylanicum | Ancylostoma ceylanicum | Mar. 2014 (WS243/Acey_2013.11.30.genDNA/ancCey1) | Mar. 2014 (WS243/Acey_2013.11.30.genDNA/ancCey1) | net
Acrobeloides_nanus | Acrobeloides nanus | Jun. 2018 (v1) | GCA_900406225.1 | net
Ancylostoma_caninum | Ancylostoma caninum | Jul. 2018 (A_caninum_9.3.2.ec.cg.pg) | GCA_003336725.1 | net
Ancylostoma_duodenale | Ancylostoma duodenale | Jan. 2015 (A_duodenale_2.2.ec.cg.pg) | GCA_000816745.1 | net
Angiostrongylus_cantonensis | Angiostrongylus cantonensis | Nov. 2016 (ASM188428v1) | GCA_001884285.1 | net
Ascaris_suum | Ascaris suum | Nov. 2017 (ASM18702v3) | GCA_000187025.3 | net
Barber pole worm | Haemonchus contortus | Jul. 2013 (WormBase WS239/haeCon2) | Jul. 2013 (WormBase WS239/haeCon2) | net
Brugia_malayi | Brugia malayi | Mar. 2008 (ASM299v2) | GCF_000002995.3 | net
Brugia_pahangi | Brugia pahangi | Sep. 2015 (Brugia_pa_1.0) | GCA_001280985.1 | net
Bursaphelenchus_xylophilus | Bursaphelenchus xylophilus | Oct. 2011 (ASM23113v1) | GCA_000231135.1 | net
C. angaria | Caenorhabditis angaria | Apr. 2012 (WS232/ps1010rel8/caeAng2) | Apr. 2012 (WS232/ps1010rel8/caeAng2) | net
C. brenneri | Caenorhabditis brenneri | Nov. 2010 (C. brenneri 6.0.1b/caePb3) | Nov. 2010 (C. brenneri 6.0.1b/caePb3) | net
C. briggsae | Caenorhabditis briggsae | Apr. 2011 (WS225/cb4) | Apr. 2011 (WS225/cb4) | net
C. intestinalis | Ciona intestinalis | Apr. 2011 (Kyoto KH/ci3) | Apr. 2011 (Kyoto KH/ci3) | net
C. japonica | Caenorhabditis japonica | Aug. 2010 (WUSTL 7.0.1/caeJap4) | Aug. 2010 (WUSTL 7.0.1/caeJap4) | net
C. remanei | Caenorhabditis remanei | Jul. 2007 (WS220/caeRem4) | Jul. 2007 (WS220/caeRem4) | net
C. sp. 5 ju800 | Caenorhabditis sp5 ju800 | Jan. 2012 (WS230/Caenorhabditis_sp_5-JU800-1.0/caeSp51) | Jan. 2012 (WS230/Caenorhabditis_sp_5-JU800-1.0/caeSp51) | net
C. tropicalis | Caenorhabditis tropicalis | Nov. 2010 (WS226/WUSTL 3.0.1/caeSp111) | Nov. 2010 (WS226/WUSTL 3.0.1/caeSp111) | net
C_briggsae | Caenorhabditis briggsae | Jul. 2014 (CB4) | GCA_000004555.3 | net
C_latens | Caenorhabditis latens | Aug. 2017 (CaeLat1.0) | GCA_002259235.1 | net
C_nigoni | Caenorhabditis nigoni | Nov. 2017 (nigoni.pc_2016.07.14) | GCA_002742825.1 | net
C_sp21_LS_2015 | Caenorhabditis sp. 21 LS-2015 | Aug. 2018 (CPARV_v1) | GCA_900536235.1 | net
C_sp26_LS_2015 | Caenorhabditis sp. 26 LS-2015 | Aug. 2018 (CZANZ_v1) | GCA_900536285.1 | net
C_sp31_LS_2015 | Caenorhabditis sp. 31 LS-2015 | Aug. 2018 (CUTEL_v1) | GCA_900536295.1 | net
C_sp32_LS_2015 | Caenorhabditis sp. 32 LS-2015 | Aug. 2018 (CSULS_v1) | GCA_900536325.1 | net
C_sp34_TK_2017 | Caenorhabditis sp. 34 TK-2017 | Jun. 2017 (Sp34_v7) | GCA_003052745.1 | net
C_sp38_MB_2015 | Caenorhabditis sp. 38 MB-2015 | Aug. 2018 (CQUIO_v1) | GCA_900536415.1 | net
C_sp39_LS_2015 | Caenorhabditis sp. 39 LS-2015 | Aug. 2018 (CWAIT_v1) | GCA_900536345.1 | net
C_sp40_LS_2015 | Caenorhabditis sp. 40 LS-2015 | Aug. 2018 (CTRIB_v1) | GCA_900536305.1 | net
Clonorchis_sinensis | Clonorchis sinensis | Nov. 2011 (C_sinensis-2.0) | GCA_000236345.1 | net
Dicrocoelium_dendriticum | Dicrocoelium dendriticum | Sep. 2014 (D_dendriticum_Leon_v1_0_4) | GCA_000950715.1 | net
Dictyocaulus_viviparus | Dictyocaulus viviparus | Mar. 2015 (D_viviparus_9.2.1.ec.pg) | GCA_000816705.1 | net
Diploscapter_coronatus | Diploscapter coronatus | Jun. 2017 (ASM220778v1) | GCA_002207785.1 | net
Diploscapter_pachys | Diploscapter pachys | Sep. 2017 (DipSp1Ass11Ann3) | GCA_002287525.1 | net
Dirofilaria_immitis | Dirofilaria immitis | Aug. 2013 (ASM107739v1) | GCA_001077395.1 | net
Ditylenchus_destructor | Ditylenchus destructor | Mar. 2016 (ASM157970v1) | GCA_001579705.1 | net
Dog heartworm | Dirofilaria immitis | Sep. 2013 (WS240/D. immitis v2.2/dirImm1) | Sep. 2013 (WS240/D. immitis v2.2/dirImm1) | net
Dugesia_japonica | Dugesia japonica | Jan. 2017 (Djap_assembly_v1) | GCA_001938525.1 | net
Echinococcus_canadensis | Echinococcus canadensis | May 2016 (ECANG7) | GCA_900004735.1 | net
Echinococcus_granulosus | Echinococcus granulosus | Jan. 2014 (ASM52419v1) | GCA_000524195.1 | net
Echinococcus_multilocularis | Echinococcus multilocularis | Dec. 2015 (EMULTI002) | GCA_000469725.3 | net
Elaeophora_elaphi | Elaeophora elaphi | Nov. 2013 (EEL001) | GCA_000499685.1 | net
Eye worm | Loa loa | Jul. 2012 (WS235/L_loa_Cameroon_isolate/loaLoa1) | Jul. 2012 (WS235/L_loa_Cameroon_isolate/loaLoa1) | net
Fasciola_gigantica | Fasciola gigantica | Jan. 2018 (ASM286751v1) | GCA_002867515.1 | net
Fasciola_hepatica | Fasciola hepatica | Apr. 2018 (Fasciola_10x_pilon) | GCA_900302435.1 | net
Filarial worm | Brugia malayi | May 2014 (WS244/B_malayi-3.1/bruMal2) | May 2014 (WS244/B_malayi-3.1/bruMal2) | net
Girardia_tigrina | Girardia tigrina | Jan. 2017 (gtig.1) | GCA_001938485.1 | net
Globodera_ellingtonae | Globodera ellingtonae | Sep. 2016 (ASM172322v1) | GCA_001723225.1 | net
Globodera_pallida | Globodera pallida | May 2014 (GPAL001) | GCA_000724045.1 | net
Globodera_rostochiensis | Globodera rostochiensis | Apr. 2016 (nGr) | GCA_900079975.1 | net
Gyrodactylus_salaris | Gyrodactylus salaris | Jun. 2014 (Gsalaris_v1) | GCA_000715275.1 | net
H. bacteriophora/m31e | Heterorhabditis bacteriophora | Aug. 2011 (WS229/H. bacteriophora 7.0/hetBac1) | Aug. 2011 (WS229/H. bacteriophora 7.0/hetBac1) | net
Haemonchus_contortus | Haemonchus contortus | Aug. 2013 (HCON) | GCA_000469685.1 | net
Heligmosomoides_polygyrus_bakeri | Heligmosomoides polygyrus bakeri | Sep. 2016 (nHp_v2.0) | GCA_900096555.1 | net
Heterodera_glycines | Heterodera glycines | Apr. 2008 (HG2) | GCA_000150805.1 | net
Hymenolepis_microstoma | Hymenolepis microstoma | Dec. 2015 (HMIC002) | GCA_000469805.2 | net
Loa_loa | Loa loa | Jul. 2012 (Loa_loa_V3.1) | GCF_000183805.2 | net
M. hapla | Meloidogyne hapla | Sep. 2008 (M. hapla VW9 WS210/melHap1) | Sep. 2008 (M. hapla VW9 WS210/melHap1) | net
M. incognita | Meloidogyne incognita | Feb. 2008 (M. incognita WS245/PRJEA28837/melInc2) | Feb. 2008 (M. incognita WS245/PRJEA28837/melInc2) | net
Macrostomum_lignano | Macrostomum lignano | Aug. 2017 (Mlig_3_7) | GCA_002269645.1 | net
Meloidogyne_arenaria | Meloidogyne arenaria | May 2018 (ASM313380v1) | GCA_003133805.1 | net
Meloidogyne_floridensis | Meloidogyne floridensis | Jun. 2014 (nMf_1_1) | GCA_000751915.1 | net
Meloidogyne_graminicola | Meloidogyne graminicola | Nov. 2017 (Mgraminicola_V1) | GCA_002778205.1 | net
Meloidogyne_incognita | Meloidogyne incognita | May 2017 (Meloidogyne_incognita_V3) | GCA_900182535.1 | net
Meloidogyne_javanica | Meloidogyne javanica | Apr. 2017 (ASM90000394v1) | GCA_900003945.1 | net
Microworm | Panagrellus redivivus | Feb. 2013 (WS240/Pred3/panRed1) | Feb. 2013 (WS240/Pred3/panRed1) | net
N. americanus | Necator americanus | Dec. 2013 (WS242/N_americanus_v1/necAme1) | Dec. 2013 (WS242/N_americanus_v1/necAme1) | net
Necator_americanus | Necator americanus | Dec. 
2013 (N_americanus_v1) GCF_000507365.1 net Nippostrongylus_brasiliensisNippostrongylus brasiliensis Aug. 2017 (NbL5_MIMR_Canu1.5) GCA_900200055.1 net O. volvulusOnchocerca volvulus Nov. 2013 (WS241/O_volvulus_Cameroon_v3/oncVol1) Nov. 2013 (WS241/O_volvulus_Cameroon_v3/oncVol1) net Oesophagostomum_dentatumOesophagostomum dentatum Dec. 2014 (O_dentatum_10.0.ec.cg.pg) GCA_000797555.1 net Onchocerca_flexuosaOnchocerca flexuosa Aug. 2017 (O_flexuosa_1.0.allpaths.pg.lrna) GCA_002249935.1 net Onchocerca_ochengiOnchocerca ochengi Mar. 2016 (O_ochengi_Ngaoundere) GCA_000950515.2 net Onchocerca_volvulusOnchocerca volvulus Feb. 2014 (ASM49940v2) GCA_000499405.2 net Opisthorchis_viverriniOpisthorchis viverrini Jul. 2014 (OpiViv1.0) GCA_000715545.1 net Oscheius_MCBOscheius sp. MCB Feb. 2015 (ASM93487v1) GCA_000934875.1 net Oscheius_TEL_2014Oscheius sp. TEL-2014 Jan. 2016 (ASM151353v1) GCA_001513535.1 net Oscheius_tipulaeOscheius tipulae May 2017 (Oscheius_tipulae_assembly_v2) GCA_900184235.1 net P. exspectatusPristionchus exspectatus Mar. 2014 (WS243/P_exspectatus_v1/priExs1) Mar. 2014 (WS243/P_exspectatus_v1/priExs1) net P. pacificusPristionchus pacificus Aug. 2014 (WS221/P_pacificus-v2/priPac3) Aug. 2014 (WS221/P_pacificus-v2/priPac3) net Parapristionchus_giblindavisiParapristionchus giblindavisi Jun. 2018 (Parapristionchus_genome) GCA_900491355.1 net Parascaris_univalensParascaris univalens Aug. 2017 (ASM225921v1) GCA_002259215.1 net Parastrongyloides_trichosuriParastrongyloides trichosuri Sep. 2014 (P_trichosuri_KNP) GCA_000941615.1 net Pig roundwormAscaris suum Sep. 2012 (WS229/AscSuum_1.0/ascSuu1) Sep. 2012 (WS229/AscSuum_1.0/ascSuu1) net Pine wood nematodeBursaphelenchus xylophilus Nov. 2011 (WS229/B. xylophilus Ka4C1/burXyl1) Nov. 2011 (WS229/B. xylophilus Ka4C1/burXyl1) net Plectus_sambesiiPlectus sambesii Nov. 2017 (Psam_v1.0) GCA_002796945.1 net Pristionchus_arcanusPristionchus arcanus Jun. 
2018 (P._arcanus_genome) GCA_900490705.1 net Pristionchus_entomophagusPristionchus entomophagus Jun. 2018 (P._entomophagus_genome) GCA_900490825.1 net Pristionchus_exspectatusPristionchus exspectatus May 2018 (Pristionchus_exspectatus_de_novo_assembly) GCA_900380275.1 net Pristionchus_maxplanckiPristionchus maxplancki Jun. 2018 (Prisstionchus_maxplancki_genome) GCA_900490775.1 net Pristionchus_pacificusPristionchus pacificus Oct. 2017 (El_Paco) GCA_000180635.3 net Rhabditophanes_KR3021Rhabditophanes sp. KR3021 Sep. 2014 (Rhabditophanes_sp_KR3021) GCA_000944355.1 net Romanomermis_culicivoraxRomanomermis culicivorax Jan. 2014 (nRc.2.0) GCA_001039655.1 net Rotylenchulus_reniformisRotylenchulus reniformis Jun. 2015 (RREN1.0) GCA_001026735.1 net Schistosoma_haematobiumSchistosoma haematobium Jun. 2014 (SchHae_1.0) GCA_000699445.1 net Schistosoma_japonicumSchistosoma japonicum Apr. 2009 (ASM15177v1) GCA_000151775.1 net Schistosoma_mansoniSchistosoma mansoni Dec. 2011 (ASM23792v2) GCA_000237925.2 net Schmidtea_mediterraneaSchmidtea mediterranea Oct. 2017 (ASM260089v1) GCA_002600895.1 net Setaria_digitataSetaria digitata Jan. 2018 (Sdigitata) GCA_900083525.1 net Setaria_equinaSetaria equina Mar. 2018 (Setequ3.0) GCA_003012265.1 net Spirometra_erinaceieuropaeiSpirometra erinaceieuropaei Sep. 2014 (S_erinaceieuropaei) GCA_000951995.1 net Steinernema_carpocapsaeSteinernema carpocapsae Sep. 2014 (S_carpo_v1) GCA_000757645.1 net Steinernema_feltiaeSteinernema feltiae Sep. 2014 (S_felt_v1) GCA_000757705.1 net Steinernema_glaseriSteinernema glaseri Sep. 2014 (S_glas_v1) GCA_000757755.1 net Steinernema_monticolumSteinernema monticolum Dec. 2013 (S_monti_v1) GCA_000505645.1 net Steinernema_scapterisciSteinernema scapterisci Sep. 2014 (S_scapt_v1) GCA_000757745.1 net Strongyloides_papillosusStrongyloides papillosus Nov. 2014 (S_papillosus_LIN) GCA_000936265.1 net Strongyloides_stercoralisStrongyloides stercoralis Nov. 
2014 (S_stercoralis_PV0001) GCA_000947215.1 net Strongyloides_venezuelensisStrongyloides venezuelensis Jun. 2015 (S_venezuelensis_HH1) GCA_001028725.1 net Subanguina_moxaeSubanguina moxae Apr. 2015 (SAMX_assembly_v0.8) GCA_000981365.1 net Taenia_asiaticaTaenia asiatica Sep. 2016 (Taenia_asiatica_TASYD01_v1) GCA_001693035.2 net Taenia_multicepsTaenia multiceps Jul. 2018 (T_multiceps_v2.0) GCA_001923025.2 net Taenia_saginataTaenia saginata Oct. 2016 (ASM169307v2) GCA_001693075.2 net Taenia_soliumTaenia solium Nov. 2016 (MEX_genome_complete.1-6-13) GCA_001870725.1 net Teladorsagia_circumcinctaTeladorsagia circumcincta Sep. 2017 (T_circumcincta.14.0.ec.cg.pg) GCA_002352805.1 net ThreadwormStrongyloides ratti Sep. 2014 (S. ratti ED321/strRat2) Sep. 2014 (S. ratti ED321/strRat2) net Toxocara_canisToxocara canis Dec. 2014 (Toxocara_canis_adult_r1.0) GCA_000803305.1 net TrichinellaTrichinella spiralis Jan. 2011 (WS225/Trichinella_spiralis-3.7.1/triSpi1) Jan. 2011 (WS225/Trichinella_spiralis-3.7.1/triSpi1) net Trichinella_T6Trichinella sp. T6 Nov. 2015 (T6_ISS34_r1.0) GCA_001447435.1 net Trichinella_T8Trichinella sp. T8 Nov. 2015 (T8_ISS272_r1.0) GCA_001447745.1 net Trichinella_T9Trichinella sp. T9 Nov. 2015 (T9_ISS409_r1.0) GCA_001447505.1 net Trichinella_britoviTrichinella britovi Nov. 2015 (T3_ISS120_r1.0) GCA_001447585.1 net Trichinella_murrelliTrichinella murrelli Jul. 2017 (ASM222148v1) GCA_002221485.1 net Trichinella_nativaTrichinella nativa Nov. 2015 (T2_ISS10_r1.0) GCA_001447565.1 net Trichinella_nelsoniTrichinella nelsoni Nov. 2015 (T7_ISS37_r1.0) GCA_001447455.1 net Trichinella_papuaeTrichinella papuae Nov. 2015 (T10_ISS1980_r1.0) GCA_001447755.1 net Trichinella_patagoniensisTrichinella patagoniensis Nov. 2015 (T12_ISS2496_r1.0) GCA_001447655.1 net Trichinella_pseudospiralisTrichinella pseudospiralis Nov. 2015 (T4_ISS588_r1.0) GCA_001447725.1 net Trichinella_spiralisTrichinella spiralis Jan. 
2011 (Trichinella_spiralis-3.7.1) GCF_000181795.1 net Trichinella_zimbabwensisTrichinella zimbabwensis Nov. 2015 (T11_ISS1029_r1.0) GCA_001447665.1 net Trichuris_murisTrichuris muris Mar. 2014 (TMUE2.2) GCA_000612645.1 net Trichuris_trichiuraTrichuris trichiura Mar. 2014 (TTRE2.1) GCA_000613005.1 net WhipwormTrichuris suis Jul. 2014 (WS243/T. suis DCEP-RM93M male/triSui1) Jul. 2014 (WS243/T. suis DCEP-RM93M male/triSui1) net Wuchereria_bancroftiWuchereria bancrofti Feb. 2016 (Wb_PNG_Genome_assembly_pt22) GCA_001555675.1 net Table 1. Genome assemblies included in the 135-way Conservation track. Downloads for data in this track are available: Multiz alignments (MAF format), and phylogenetic trees PhyloP conservation (WIG format) PhastCons conservation (WIG format) Display Conventions and Configuration The track configuration options allow the user to display the three different sets of scores by all, subclass, individually, or any combination of these. In full and pack display modes, conservation scores are displayed as a wiggle track (histogram) in which the height reflects the value of the score. The conservation wiggles can be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Pairwise alignments of each species to the C. elegans genome are displayed below the conservation histogram as a grayscale density plot (in pack mode) or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, conservation is shown in grayscale using darker values to indicate higher levels of overall conservation as scored by phastCons. Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (Set all), deselect all of the species (Clear all), or use the default settings (Set defaults). 
Note that excluding species from the pairwise display does not alter the conservation score display.

To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.

Gap Annotation

The Display chains between alignments configuration option enables display of gaps between alignment blocks in the pairwise alignments in a manner similar to the Chain track display. The following conventions are used:
Single line: No bases in the aligned species. Possibly due to a lineage-specific insertion between the aligned blocks in the C. elegans genome or a lineage-specific deletion between the aligned blocks in the aligning species.
Double line: Aligning species has one or more unalignable bases in the gap region. Possibly due to excessive evolutionary distance between species or independent indels in the region between the aligned blocks in both species.
Pale yellow coloring: Aligning species has Ns in the gap region. Reflects uncertainty in the relationship between the DNA of both species, due to lack of sequence in relevant portions of the aligning species.

Genomic Breaks

Discontinuities in the genomic context (chromosome, scaffold or region) of the aligned DNA in the aligning species are shown as follows:
Vertical blue bar: Represents a discontinuity that persists indefinitely on either side, e.g. a large region of DNA on either side of the bar comes from a different chromosome in the aligned species due to a large scale rearrangement.
Green square brackets: Enclose shorter alignments consisting of DNA from one genomic context in the aligned species nested inside a larger chain of alignments from a different genomic context. The alignment within the brackets may represent a short misalignment, a lineage-specific insertion of a transposon in the C. elegans genome that aligns to a paralogous copy somewhere else in the aligned species, or other similar occurrence.
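The gap annotation conventions described above can be read as a small decision rule. The following is only an illustrative sketch, not browser code: the function name and its two inputs are hypothetical, and the precedence (Ns first, then unalignable bases) is an assumption made for the example rather than something stated in this description.

```python
def gap_display_style(aligning_bases: int, has_n: bool) -> str:
    """Pick a display convention for the gap between two alignment blocks.

    aligning_bases: number of bases the aligning species has in the gap region.
    has_n: whether the aligning species has Ns (missing sequence) in the gap.
    The ordering of the checks is an assumption for illustration.
    """
    if has_n:
        # Missing sequence in the aligning species: relationship is uncertain.
        return "pale yellow"
    if aligning_bases > 0:
        # One or more unalignable bases in the aligning species.
        return "double line"
    # No bases at all in the aligning species.
    return "single line"


for bases, ns in [(0, False), (12, False), (12, True)]:
    print(bases, ns, gap_display_style(bases, ns))
```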
Base Level

When zoomed in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the C. elegans sequence at those alignment positions relative to the longest non-C. elegans sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+".

Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes:
No codon translation: The gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.

Codon translation uses the following gene tracks as the basis for translation, depending on the species chosen (Table 2).

Gene Track | Species
Ensembl Genes v92 | C. elegans, Ciona intestinalis
WormBase WS245 genes | C. angaria, C. japonica, C. briggsae, C. sp. 5 ju800, C. remanei, C. brenneri, C. tropicalis, P. exspectatus, P. pacificus, Pine wood nematode, N. americanus, A. ceylanicum, Pig roundworm, Barber pole worm, Whipworm, Microworm, Filarial worm, Dog heartworm, O. volvulus, Eye worm, M. incognita, M. hapla, H. bacteriophora/m31e, Trichinella
no annotations | all others

Table 2. Gene tracks used for codon translation.

Methods

Pairwise alignments with the C. elegans genome were generated for each species using lastz from repeat-masked genomic sequence. Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. Please note the specific parameters for the alignments. High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net. For more information about the chaining and netting process and parameters for each species, see the description pages for the Chain and Net tracks.

The resulting best-in-genome pairwise alignments were progressively aligned using multiz/autoMZ, following the tree topology diagrammed above, to produce multiple alignments. The multiple alignments were post-processed to add annotations indicating alignment gaps, genomic breaks, and base quality of the component sequences. The annotated multiple alignments, in MAF format, are available for bulk download. An alignment summary table containing an entry for each alignment block in each species was generated to improve track display performance at large scales. Framing tables were constructed to enable visualization of codons in the multiple alignment display.

Phylogenetic Tree Model

Both phastCons and phyloP are phylogenetic methods that rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix.
The all species tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 135-way alignment (msa_view). The 4d sites were derived from the NCBI RefSeq gene set, filtered to select single-coverage long transcripts. This same tree model was used in the phyloP calculations; however, its background frequencies were modified to maintain reversibility. The resulting tree model for all species.

PhastCons Conservation

The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al. 2005.

The phastCons parameters used were: expected-length=45, target-coverage=0.3, rho=0.3.
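To make the "posterior probability of the conserved state" concrete, here is a minimal sketch of a two-state HMM evaluated with the forward-backward algorithm. It is not phastCons itself. It assumes the parameterization described in Siepel et al. 2005, in which the conserved state is left with probability mu = 1 / expected-length and entered with probability nu chosen so that nu / (mu + nu) equals target-coverage. The per-column likelihoods passed in are hypothetical toy numbers; in phastCons they would come from the conserved and non-conserved phylogenetic models (the rho parameter, which scales the conserved model, is not modelled here). All function names are illustrative.

```python
def transition_probs(expected_length, target_coverage):
    """Return (mu, nu): P(conserved -> non-conserved), P(non-conserved -> conserved)."""
    mu = 1.0 / expected_length
    nu = mu * target_coverage / (1.0 - target_coverage)
    return mu, nu


def posterior_conserved(lik_cons, lik_non, expected_length=45.0, target_coverage=0.3):
    """Posterior probability of the conserved state at each column.

    lik_cons[t], lik_non[t]: likelihood of column t under each state (toy inputs).
    State 0 is conserved, state 1 is non-conserved.
    """
    mu, nu = transition_probs(expected_length, target_coverage)
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]
    # Start from the stationary distribution, which equals the target coverage.
    start = [target_coverage, 1.0 - target_coverage]
    emit = list(zip(lik_cons, lik_non))
    n = len(emit)

    # Forward pass, rescaled at every column to avoid underflow.
    fwd = []
    for t in range(n):
        if t == 0:
            cur = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            cur = [emit[t][s] * (prev[0] * trans[0][s] + prev[1] * trans[1][s])
                   for s in (0, 1)]
        total = cur[0] + cur[1]
        fwd.append([cur[0] / total, cur[1] / total])

    # Backward pass, also rescaled (the scale cancels in the posterior).
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = bwd[t + 1]
        cur = [sum(trans[s][r] * emit[t + 1][r] * nxt[r] for r in (0, 1))
               for s in (0, 1)]
        total = cur[0] + cur[1]
        bwd[t] = [cur[0] / total, cur[1] / total]

    return [f[0] * b[0] / (f[0] * b[0] + f[1] * b[1]) for f, b in zip(fwd, bwd)]


# Toy run: five columns that favor the conserved model, then five that do not.
scores = posterior_conserved([0.9] * 5 + [0.1] * 5, [0.1] * 5 + [0.9] * 5)
print([round(s, 3) for s in scores])
```

Because each site's score depends on its neighbors through the transition probabilities, the output changes gradually across the boundary between the two runs, which is the smoothing behavior that distinguishes phastCons plots from per-column phyloP plots.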
PhyloP Conservation The phyloP program supports several different methods for computing p-values of conservation or acceleration, for individual nucleotides or larger elements ( http://compgen.cshl.edu/phast/). Here it was used to produce separate scores at each base (--wig-scores option), considering all branches of the phylogeny rather than a particular subtree or lineage (i.e., the --subtree option was not used). The scores were computed by performing a likelihood ratio test at each alignment column (--method LRT), and scores for both conservation and acceleration were produced (--mode CONACC). Conserved Elements The conserved elements were predicted by running phastCons with the --viterbi option. The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created using the following programs: Alignment tools: lastz (formerly blastz) and multiz by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). 
MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community as of March 2007. References Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 Lastz (formerly Blastz): Chiaromonte F, Yap VB, Miller W. 
Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 Phylogenetic Tree: Bernard G, Ragan MA, Chan CX. Recapitulating phylogenies using k-mers: from trees to networks. F1000Res. 2016;5:2789. PMID: 28105314 Fan H, Ives AR, Surget-Groba Y, Cannon CH. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics. 2015;16(1):522. PMID: 26169061 Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51. PMID: 11743200 cons135wayViewalign Multiz Alignments Multiz Alignment & Conservation (135 species: 112 nematodes, 22 flatworms and Ciona intestinalis) Comparative Genomics multiz135way Multiz Align Multiz Alignments of 135 species Comparative Genomics cons135wayViewphastcons Element Conservation (phastCons) Multiz Alignment & Conservation (135 species: 112 nematodes, 22 flatworms and Ciona intestinalis) Comparative Genomics phastCons135way Cons 135 species 135 species conservation by PhastCons Comparative Genomics cons135wayViewelements Conserved Elements Multiz Alignment & Conservation (135 species: 112 nematodes, 22 flatworms and Ciona intestinalis) Comparative Genomics phastConsElements135way 135 species El 135 species Conserved Elements Comparative Genomics cons135wayViewphyloP Basewise Conservation (phyloP) Multiz Alignment & Conservation (135 species: 112 nematodes, 22 flatworms and Ciona intestinalis) Comparative Genomics phyloP135way Cons 135 species 135 species Basewise Conservation by PhyloP 
Comparative Genomics refSeqComposite NCBI RefSeq RefSeq genes from NCBI Genes and Gene Predictions Description The NCBI RefSeq Genes composite track shows C. elegans protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). All subtracks use coordinates provided by RefSeq, except for the UCSC RefSeq track, which UCSC produces by realigning the RefSeq RNAs to the genome. This realignment may result in occasional differences between the annotation coordinates provided by UCSC and NCBI. For RNA-seq analysis, we advise using NCBI aligned tables like RefSeq All or RefSeq Curated. See the Methods section for more details about how the different tracks were created. Please visit NCBI's Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track is a composite track that contains differing data sets. To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to hide. Note: Not all subtracks are available on all assemblies. The possible subtracks include: RefSeq aligned annotations and UCSC alignment of RefSeq annotations RefSeq All – all curated and predicted annotations provided by RefSeq. RefSeq Curated – subset of RefSeq All that includes only those annotations whose accessions begin with NM, NR, NP or YP. (NP and YP are used only for protein-coding genes on the mitochondrion; YP is used for human only.) They were manually curated, based on publications describing transcripts and manual reviews of evidence which includes EST and full-length cDNA alignments, protein sequences, splice sites and any other evidence available in databases or the scientific literature. 
The resulting sequences can differ from the genome; they exist independently of a particular human genome build, and so must be aligned to the genome to create a track. The "RefSeq Curated" track is NCBI's mapping of these transcripts to the genome. Another alignment track exists for these, the "UCSC RefSeq" track (see below).
RefSeq Predicted – subset of RefSeq All that includes those annotations whose accessions begin with XM or XR. They were predicted based on protein, cDNA, EST and RNA-seq alignments to the genome assembly by the NCBI Gnomon prediction software.
RefSeq Other – all other annotations produced by the RefSeq group that do not fit the requirements for inclusion in the RefSeq Curated or the RefSeq Predicted tracks. Examples are untranscribed pseudogenes or gene clusters, such as HOX or protocadherin alpha. They were manually curated from publications or databases but are not typical transcribed genes.
RefSeq Alignments – alignments of RefSeq RNAs to the C. elegans genome provided by the RefSeq group, following the display conventions for PSL tracks.
RefSeq Diffs – alignment differences between the C. elegans reference genome(s) and RefSeq curated transcripts. (Track not currently available for every assembly.)
UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the C. elegans genome. This track was previously known as the "RefSeq Genes" track.
RefSeq Select (subset, only on hg38) – Subset of RefSeq Curated, transcripts marked as part of the RefSeq Select dataset. A single Select transcript is chosen as representative for each protein-coding gene. See NCBI RefSeq Select.
RefSeq HGMD (subset) – Subset of RefSeq Curated, transcripts annotated by the Human Gene Mutation Database. This track is only available on the human genomes hg19 and hg38. It is the most restricted RefSeq subset, targeting clinical diagnostics.
NCBI Orthologs – Orthologous genes were identified by NCBI's Eukaryotic Genome Annotation Pipeline for the NCBI Gene dataset using a combination of protein sequence similarity and local synteny analysis. Orthology is determined between the genome being annotated and a reference genome, such as human or zebrafish, and pairs of orthologs are grouped together. Transitive relationships are inferred within each group, for example, zebrafish <-> human <-> mouse. For more information on how NCBI calculates orthologs, see the details provided here. This track is available for the following assemblies: hg38, mm39, danRer11, canFam6, and bosTau9. The RefSeq All, RefSeq Curated, RefSeq Predicted, and UCSC RefSeq tracks follow the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), or reviewed (dark), as defined by RefSeq. Color Level of review Reviewed: the RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information. Provisional: the RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff. Predicted: the RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted. The item labels and codon display properties for features within this track can be configured through the check-box controls at the top of the track description page. To adjust the settings for an individual subtrack, click the wrench icon next to the track name in the subtrack list . Label: By default, items are labeled by gene name. 
Click the appropriate Label option to display the accession name or OMIM identifier instead of the gene name, show all or a subset of these labels including the gene name, OMIM identifier and accession names, or turn off the label completely. Codon coloring: This track has an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. The RefSeq Diffs track contains five different types of inconsistency between the reference genome sequence and the RefSeq transcript sequences. The five types of differences are as follows: mismatch – aligned but mismatching bases, plus HGVS g. to show the genomic change required to match the transcript and HGVS c./n. to show the transcript change required to match the genome. short gap – genomic gaps that are too small to be introns (arbitrary cutoff of < 45 bp), most likely insertions/deletion variants or errors, with HGVS g. and c./n. showing differences. shift gap – shortGap items whose placement could be shifted left and/or right on the genome due to repetitive sequence, with HGVS c./n. position range of ambiguous region in transcript. Here, thin and thick lines are used -- the thin line shows the span of the repetitive sequence, and the thick line shows the rightmost shifted gap. double gap – genomic gaps that are long enough to be introns but that skip over transcript sequence (invisible in default setting), with HGVS c./n. deletion. skipped – sequence at the beginning or end of a transcript that is not aligned to the genome (invisible in default setting), with HGVS c./n. deletion HGVS Terminology (Human Genome Variation Society): g. = genomic sequence ; c. = coding DNA sequence ; n. = non-coding RNA reference sequence. 
When reporting HGVS with RefSeq sequences, to make sure that results from research articles can be mapped to the genome unambiguously, please specify the RefSeq annotation release displayed on the transcript's Genome Browser details page and also the RefSeq transcript ID with version (e.g. NM_012309.4 not NM_012309). Methods Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and converted to the genePred and PSL table formats for display in the Genome Browser. Information about the NCBI annotation pipeline can be found here. The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments. The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks. RefSeq RNAs were aligned against the C. elegans genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. The NCBI Orthologs track was generated using the latest NCBI files (gene2accession and gene_orthologs). NCBI chromosome identifiers were mapped to UCSC-compatible IDs using species-specific chromosome alias files, and genes were filtered to include only those located on valid NCBI chromosomes. A custom Python script processed the ortholog relationships and created bed files for each species. The bed files were then converted to BigBed format, with indexing for search functionality. The procedure is documented in the makeDoc from our GitHub repository. Data Access The raw data for these tracks can be accessed in multiple ways. It can be explored interactively using the REST API, Table Browser or Data Integrator. 
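As a minimal sketch of programmatic access through the REST API, the snippet below builds a request URL for one region of a track and, optionally, fetches it. The endpoint layout (api.genome.ucsc.edu with /getData/track and semicolon-separated arguments) is assumed from the public API documentation, and the helper names track_url and fetch_track, along with the example region on chrI, are hypothetical and chosen only for illustration.

```python
import json
import urllib.request

# Assumed base URL of the public UCSC REST API.
API_BASE = "https://api.genome.ucsc.edu"


def track_url(genome, track, chrom=None, start=None, end=None):
    """Build a /getData/track URL (hypothetical helper, not a UCSC tool)."""
    params = [("genome", genome), ("track", track)]
    if chrom is not None:
        params.append(("chrom", chrom))
    # start and end are only meaningful together, so add them as a pair.
    if start is not None and end is not None:
        params.append(("start", start))
        params.append(("end", end))
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{API_BASE}/getData/track?{query}"


def fetch_track(genome, track, chrom=None, start=None, end=None):
    """Fetch the JSON reply for a track region (requires network access)."""
    url = track_url(genome, track, chrom, start, end)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_track(...) to actually query the server.
    print(track_url("ce11", "ncbiRefSeq", "chrI", 0, 100000))
```

The URL builder is kept separate from the network call so that the request can be inspected, or pasted into a browser, before any data is downloaded.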
The tables can also be accessed programmatically through our public MySQL server or downloaded from our downloads server for local processing. The previous track versions are available in the archives of our downloads server. You can also access any RefSeq table entries in JSON format through our JSON API. The data in the RefSeq Other, RefSeq Diffs, and NCBI Orthologs tracks are organized in bigBed file format; more information about accessing the information in these bigBed files can be found below. The other subtracks are associated with database tables as follows: genePred format: RefSeq All - ncbiRefSeq; RefSeq Curated - ncbiRefSeqCurated; RefSeq Predicted - ncbiRefSeqPredicted; UCSC RefSeq - refGene. PSL format: RefSeq Alignments - ncbiRefSeqPsl. The first column of each of these tables is "bin". This column is designed to speed up access for display in the Genome Browser, but can be safely ignored in downstream analysis. You can read more about the bin indexing system here. The annotations in the RefSeq Other, RefSeq Diffs, and NCBI Orthologs tracks are stored in bigBed files, which can be obtained from our downloads server here: ncbiRefSeqOther.bb, ncbiRefSeqDiffs.bb, and ncbiOrtho.bb. Individual regions or the whole set of genome-wide annotations can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system from the utilities directory linked below. For example, to extract only annotations in a given region, you could use the following command (the region shown is an arbitrary example on a C. elegans chromosome): bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/ce11/ncbiRefSeq/ncbiRefSeqOther.bb -chrom=chrI -start=0 -end=100000 stdout You can download a GTF format version of the RefSeq All table from the GTF downloads directory. The genePred format tracks can also be converted to GTF format using the genePredToGtf utility, available from the utilities directory on the UCSC downloads server.
The utility can be run from the command line like so: genePredToGtf ce11 ncbiRefSeqPredicted ncbiRefSeqPredicted.gtf Note that using genePredToGtf in this manner accesses our public MySQL server, and you therefore must set up your hg.conf as described on the MySQL page linked near the beginning of the Data Access section. A file containing the RNA sequences in FASTA format for all items in the RefSeq All, RefSeq Curated, and RefSeq Predicted tracks can be found on our downloads server here. Please refer to our mailing list archives for questions. Previous versions of the ncbiRefSeq set of tracks can be found on our archive download server. Credits This track was produced at UCSC from data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 refGene UCSC RefSeq UCSC annotations of RefSeq RNAs (NM_* and NR_*) Genes and Gene Predictions Description The RefSeq Genes track shows known C. elegans protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Please visit the Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. 
The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. This page is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods RefSeq RNAs were aligned against the C. elegans genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. 
PMID: 15608248; PMC: PMC539979 ncbiRefSeqGenomicDiff RefSeq Diffs Differences between NCBI RefSeq Transcripts and the Reference Genome Genes and Gene Predictions ncbiRefSeqPsl RefSeq Alignments RefSeq Alignments of RNAs Genes and Gene Predictions ncbiRefSeqOther RefSeq Other NCBI RefSeq Other Annotations (not NM_*, NR_*, XM_*, XR_*, NP_* or YP_*) Genes and Gene Predictions ncbiRefSeqCurated RefSeq Curated NCBI RefSeq genes, curated subset (NM_*, NR_*, NP_* or YP_*) Genes and Gene Predictions ncbiRefSeq RefSeq All NCBI RefSeq genes, curated and predicted (NM_*, XM_*, NR_*, XR_*, NP_*, YP_*) Genes and Gene Predictions cpgIslandExtUnmasked Unmasked CpG CpG Islands on All Sequence (Islands < 300 Bases are Light Green) Expression and Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. 
To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater length greater than 200 bp ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment The entire genome sequence, masking areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G) where N = length of sequence. The calculation of the track data is performed by the following command sequence: twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \ | cpg_lh /dev/stdin 2> cpg_lh.err \ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \ | sort -k1,1 -k2,2n > cpgIsland.bed The unmasked track data is constructed from twoBitToFa -noMask output for the twoBitToFa command. Data access CpG islands and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. 
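The island criteria and the observed/expected formula given in the Methods above can be illustrated with a short Python sketch. This is not the cpg_lh program: it only computes the summary statistics for a candidate segment that has already been chosen, and the function names cpg_stats and passes_island_criteria are hypothetical.

```python
def cpg_stats(seq):
    """Compute CpG summary statistics for one candidate segment.

    Follows the formulas quoted in the track description:
      Percentage CpG = 100 * 2 * (number of CG dinucleotides) / length
      Obs/Exp CpG    = Number of CpG * N / (Number of C * Number of G)
    """
    seq = seq.upper()
    length = len(seq)
    num_c = seq.count("C")
    num_g = seq.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    num_cpg = seq.count("CG")
    gc_pct = 100.0 * (num_c + num_g) / length if length else 0.0
    per_cpg = 100.0 * 2 * num_cpg / length if length else 0.0
    # Guard against division by zero when the segment has no C or no G.
    obs_exp = num_cpg * length / (num_c * num_g) if num_c and num_g else 0.0
    return {
        "length": length,
        "cpg_count": num_cpg,
        "gc_pct": gc_pct,
        "per_cpg": per_cpg,
        "obs_exp": obs_exp,
    }


def passes_island_criteria(stats):
    """Apply the three criteria: GC >= 50%, length > 200 bp, Obs/Exp > 0.6."""
    return (
        stats["gc_pct"] >= 50.0
        and stats["length"] > 200
        and stats["obs_exp"] > 0.6
    )


if __name__ == "__main__":
    example = "CG" * 150  # artificial 300 bp segment, used only as a demo
    stats = cpg_stats(example)
    print(stats, passes_island_criteria(stats))
```

The segment search itself (scoring +17 for CG and -1 for other dinucleotides and finding maximally scoring segments) is deliberately left out; the sketch covers only the evaluation step applied to each segment.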
The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447 ws245Genes WS245 Genes Gene predictions from WormBase WS245 release Genes and Gene Predictions Description These gene predictions were generated by WormBase. Methods These gene predictions were produced and hand-curated by WormBase. For a detailed description of the methods, refer to Howe et al. (2016) in the References section below. Credits Thanks to WormBase for providing this annotation. References Howe KL, Bolt BJ, Cain S, Chan J, Chen WJ, Davis P, Done J, Down T, Gao S, Grove C et al. WormBase 2016: expanding to enable helminth genomic research. Nucleic Acids Res. 2016 Jan 4;44(D1):D774-80. PMID: 26578572; PMC: PMC4702863 intronEst Spliced ESTs C. elegans ESTs That Have Been Spliced mRNA and EST Description This track shows alignments between C. elegans expressed sequence tags (ESTs) in GenBank and the genome that show signs of splicing when aligned against the genome. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. To be considered spliced, an EST must show evidence of at least one canonical intron, i.e. one that is at least 32 bases in length and has GT/AG ends. By requiring splicing, the level of contamination in the EST databases is drastically reduced at the expense of eliminating many genuine 3' ESTs. For a display of all ESTs (including unspliced), see the C. elegans EST track. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, darker shading indicates a larger number of aligned ESTs. 
The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. 
Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, C. elegans ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence are displayed in this track. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. 
PMID: 11932250; PMC: PMC187518 gold Assembly Assembly from Fragments Mapping and Sequencing Description This track shows the sequences used in the Aug. 2014 C. elegans genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. In dense mode, this track depicts the contigs that make up the currently viewed scaffold. Contig boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist between contigs, spaces are shown between the gold and brown blocks. The relative order and orientation of the contigs within a scaffold is always known; therefore, a line is drawn in the graphical display to bridge the blocks. Component types found in this track (with counts of that type in parentheses): F - finished sequence (3,268) augustusGene AUGUSTUS AUGUSTUS ab initio gene predictions v3.1 Genes and Gene Predictions Description This track shows ab initio predictions from the program AUGUSTUS (version 3.1). The predictions are based on the genome sequence alone. For more information on the different gene tracks, see our Genes FAQ. Methods Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. 
Alternative splicing transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2). The different models used by Augustus were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option allows one to choose the species used for training the models. Different training species were used for the --species option when generating these predictions for different groups of assemblies. Assembly Group Training Species Fish zebrafish Birds chicken Human and all other vertebrates human Nematodes caenorhabditis Drosophila fly A. mellifera honeybee1 A. gambiae culex S. cerevisiae saccharomyces This table describes which training species was used for a particular group of assemblies. When available, the closest related training species was used. Credits Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke. References Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656 Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192 est C. elegans ESTs C. elegans ESTs Including Unspliced mRNA and EST Description This track shows alignments between C. elegans expressed sequence tags (ESTs) in GenBank and the genome. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. 
In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. 
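The interaction between the "and"/"or" combination logic and the "include"/"exclude" choice described above can be sketched in a few lines of Python. This only illustrates the semantics as stated in the text and is not the browser's implementation; the record fields (acc, tissue, organism), the case-insensitive matching, and the function names are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def matches(record, filters, logic="and"):
    """Return True if a record satisfies the filter terms.

    filters maps a field name to a wildcard pattern (e.g. "gonad*").
    Empty patterns are ignored, like text boxes left blank.
    """
    tests = [
        fnmatchcase(str(record.get(field, "")).lower(), pattern.lower())
        for field, pattern in filters.items()
        if pattern
    ]
    if not tests:
        return True  # no active filter terms: everything matches
    return all(tests) if logic == "and" else any(tests)


def apply_filter(records, filters, logic="and", mode="include"):
    """Keep matching records ("include") or drop them ("exclude")."""
    if mode == "include":
        return [r for r in records if matches(r, filters, logic)]
    if mode == "exclude":
        return [r for r in records if not matches(r, filters, logic)]
    raise ValueError("mode must be 'include' or 'exclude'")


if __name__ == "__main__":
    # Hypothetical EST records, invented for the demonstration.
    ests = [
        {"acc": "A", "tissue": "gonad", "organism": "elegans"},
        {"acc": "B", "tissue": "pharynx", "organism": "elegans"},
        {"acc": "C", "tissue": "gonad arm", "organism": "briggsae"},
    ]
    terms = {"tissue": "gonad*", "organism": "elegans"}
    print([r["acc"] for r in apply_filter(ests, terms, "and", "include")])
```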
Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, C. elegans ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. 
BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 mrna C. elegans mRNAs C. elegans mRNAs from GenBank mRNA and EST Description The mRNA track shows alignments between C. elegans mRNAs in GenBank and the genome. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display. For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. 
For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Methods GenBank C. elegans mRNAs were aligned against the genome using the blat program. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 cytoBandIdeo Chromosome Band (Ideogram) Ideogram for Orientation Mapping and Sequencing crisprRanges CRISPR Regions Genome regions processed to find CRISPR/Cas9 target sites (exons +/- 200 bp) Genes and Gene Predictions Description This track shows regions of the genome within 200 bp of transcribed regions and DNA sequences targetable by CRISPR RNA guides using the Cas9 enzyme from S. pyogenes (PAM: NGG). CRISPR target sites were annotated with predicted specificity (off-target effects) and predicted efficiency (on-target cleavage) by various algorithms through the tool CRISPOR. Display Conventions and Configuration The track "CRISPR Regions" shows the regions of the genome where target sites were analyzed, i.e. within 200 bp of transcribed regions as annotated by Ensembl transcript models. The track "CRISPR Targets" shows the target sites in these regions. The target sequence of the guide is shown with a thick (exon) bar. The PAM motif match (NGG) is shown with a thinner bar. Guides are colored to reflect both predicted specificity and efficiency. 
Specificity reflects the "uniqueness" of a 20mer sequence in the genome; the less unique a sequence is, the more likely it is to cleave other locations of the genome (off-target effects). Efficiency is the frequency of cleavage at the target site (on-target efficiency). Shades of gray stand for sites that are hard to target specifically, as the 20mer is not very unique in the genome: impossible to target: target site has at least one identical copy in the genome and was not scored; hard to target: so many similar sequences in the genome that the alignment was stopped (possibly a repeat); hard to target: target site was aligned but results in a low specificity score <= 50 (see below). Colors highlight targets that are specific in the genome (MIT specificity > 50) but have different predicted efficiencies: unable to calculate Doench/Fusi 2016 efficiency score; low predicted cleavage: Doench/Fusi 2016 Efficiency percentile <= 30; medium predicted cleavage: Doench/Fusi 2016 Efficiency percentile > 30 and < 55; high predicted cleavage: Doench/Fusi 2016 Efficiency percentile > 55. Mouse over a target site to show predicted specificity and efficiency scores: The MIT Specificity score summarizes all off-targets into a single number from 0-100. The higher the number, the fewer off-target effects are expected. We recommend guides with an MIT specificity > 50. The efficiency score tries to predict whether a guide leads to rather strong or weak cleavage. According to Haeussler et al. (2016), the Doench 2016 Efficiency score should be used to select the guide with the highest cleavage efficiency when expressing guides from RNA PolIII promoters such as U6. Scores are given as percentiles, e.g. "70%" means that 70% of mammalian guides have a score equal to or lower than this guide. The raw score number is also shown in parentheses after the percentile. The Moreno-Mateos 2015 Efficiency score should be used instead of the Doench 2016 score when transcribing the guide in vitro with a T7 promoter, e.g.
for injections in mouse, zebrafish or Xenopus embryos. The Moreno-Mateos score is given in percentiles and the raw value in parentheses, see the note above. Click on a feature to show all scores and predicted off-targets with up to four mismatches. The Out-of-Frame score by Bae et al. 2014 is correlated with the probability that mutations induced by the guide RNA will disrupt the open reading frame. The authors recommend out-of-frame scores > 66 to create knock-outs with a single guide efficiently. Off-target sites are sorted by the CFD (Cutting Frequency Determination) score (Doench et al. 2016). The higher the CFD score, the more likely there is off-target cleavage at that site. Off-targets with a CFD score < 0.023 are not shown on this page, but are available when following the link to the external CRISPOR tool. When compared against experimentally validated off-targets by Haeussler et al. 2016, the large majority of predicted off-targets with CFD scores < 0.023 were false positives. Methods Relationship between predictions and experimental data Like most algorithms, the MIT specificity score is not always a perfect predictor of off-target effects. Despite low scores, many tested guides caused few and/or weak off-target cleavage events when tested with whole-genome assays (Figure 2 from Haeussler et al. 2016), and the published data contain few data points with high specificity scores. Overall though, the assays showed that the higher the specificity score, the lower the off-target effects. Similarly, efficiency scoring is not very accurate: guides with low scores can be efficient and vice versa. As a general rule, however, the higher the score, the less likely that a guide is very inefficient. Histograms of the share of inefficient guides, plotted for each type of score, show that this share drops with increasing efficiency scores. When reading these plots, keep in mind that both scores were evaluated on their own training data.
Especially for the Moreno-Mateos score, the results are too optimistic, due to overfitting. When evaluated on independent datasets, the correlation of the prediction with other assays was around 25% lower, see Haeussler et al. 2016. At the time of writing, there is no independent dataset available yet to determine the Moreno-Mateos accuracy for each score percentile range. Track methods Exons as predicted by Ensembl Gene models were used, extended by 200 basepairs on each side, searched for the -NGG motif. Flanking 20mer guide sequences were aligned to the genome with BWA and scored with MIT Specificity scores using the command-line version of crispor.org. Non-unique guide sequences were skipped. Flanking sequences were extracted from the genome and input for Crispor efficiency scoring, available from the Crispor downloads page, which includes the Doench 2016, Moreno-Mateos 2015 and Bae 2014 algorithms, among others. Data Access The raw data can be explored interactively with the Table Browser. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. The files for this track are called crispr.bb and crisprDetails.tab and are located in the /gbdb/ce11/crispr directory of our downloads server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg19/crisprRanges/crispr.bb -chrom=chr21 -start=0 -end=10000000 stdout The file crisprDetails.tab includes the details of the off-targets. The last column of the bigBed file is the offset of the respective line in crisprDetails.tab. E.g. 
if the last column is 14227033723, then the following command will extract the line with the corresponding off-target details: curl -s -r 14227033723-14227043723 http://hgdownload.soe.ucsc.edu/gbdb/hg19/crispr/crisprDetails.tab | head -n1. The off-target details cannot currently be joined with the Table Browser. The file crisprDetails.tab is a tab-separated text file with two fields. The first field contains the numbers of off-targets for each mismatch count, e.g. "0,0,1,3,49" means 0 off-targets at zero mismatches, 0 at one mismatch, 1 at two mismatches, 3 at three and 49 at four mismatches. The second field is a pipe-separated list of semicolon-separated tuples with the genome coordinates and the CFD score. E.g. "chr10;123376795+;42|chr5;148353274-;39" describes two off-targets, with the first at chr10:123376795 on the positive strand and a CFD score of 0.42. Credits Track created by Maximilian Haeussler and Hiram Clawson, with helpful input from Jean-Paul Concordet (MNHN Paris) and Alberto Stolfi (NYU). References Haeussler M, Schönig K, Eckert H, Eschstruth A, Mianné J, Renaud JB, Schneider-Maunoury S, Shkumatava A, Teboul L, Kent J et al. Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol. 2016 Jul 5;17(1):148. PMID: 27380939; PMC: PMC4934014 Bae S, Kweon J, Kim HS, Kim JS. Microhomology-based choice of Cas9 nuclease target sites. Nat Methods. 2014 Jul;11(7):705-6. PMID: 24972169 Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016 Feb;34(2):184-91. PMID: 26780180; PMC: PMC4744125 Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, Li Y, Fine EJ, Wu X, Shalem O et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013 Sep;31(9):827-32.
PMID: 23873081; PMC: PMC3969858 Moreno-Mateos MA, Vejnar CE, Beaudoin JD, Fernandez JP, Mis EK, Khokha MK, Giraldez AJ. CRISPRscan: designing highly efficient sgRNAs for CRISPR-Cas9 targeting in vivo. Nat Methods. 2015 Oct;12(10):982-8. PMID: 26322839; PMC: PMC4589495 crispr CRISPR CRISPR/Cas9 Sp. Pyog. target sites Genes and Gene Predictions Description This track shows regions of the genome within 200 bp of transcribed regions and DNA sequences targetable by CRISPR RNA guides using the Cas9 enzyme from S. pyogenes (PAM: NGG). CRISPR target sites were annotated with predicted specificity (off-target effects) and predicted efficiency (on-target cleavage) by various algorithms through the tool CRISPOR. Display Conventions and Configuration The track "CRISPR Regions" shows the regions of the genome where target sites were analyzed, i.e. within 200 bp of transcribed regions as annotated by Ensembl transcript models. The track "CRISPR Targets" shows the target sites in these regions. The target sequence of the guide is shown with a thick (exon) bar. The PAM motif match (NGG) is shown with a thinner bar. Guides are colored to reflect both predicted specificity and efficiency. Specificity reflects the "uniqueness" of a 20mer sequence in the genome; the less unique a sequence is, the more likely it is to cleave other locations of the genome (off-target effects). Efficiency is the frequency of cleavage at the target site (on-target efficiency). Shades of gray stand for sites that are hard to target specifically, as the 20mer is not very unique in the genome: impossible to target: target site has at least one identical copy in the genome and was not scored hard to target: many similar sequences in the genome that alignment stopped, repeat? 
hard to target: target site was aligned but results in a low specificity score <= 50 (see below) Colors highlight targets that are specific in the genome (MIT specificity > 50) but have different predicted efficiencies: unable to calculate Doench/Fusi 2016 efficiency score low predicted cleavage: Doench/Fusi 2016 Efficiency percentile <= 30 medium predicted cleavage: Doench/Fusi 2016 Efficiency percentile > 30 and < 55 high predicted cleavage: Doench/Fusi 2016 Efficiency > 55 Mouse-over a target site to show predicted specificity and efficiency scores: The MIT Specificity score summarizes all off-targets into a single number from 0-100. The higher the number, the fewer off-target effects are expected. We recommend guides with an MIT specificity > 50. The efficiency score tries to predict if a guide leads to rather strong or weak cleavage. According to (Haeussler et al. 2016), the Doench 2016 Efficiency score should be used to select the guide with the highest cleavage efficiency when expressing guides from RNA PolIII Promoters such as U6. Scores are given as percentiles, e.g. "70%" means that 70% of mammalian guides have a score equal or lower than this guide. The raw score number is also shown in parentheses after the percentile. The Moreno-Mateos 2015 Efficiency score should be used instead of the Doench 2016 score when transcribing the guide in vitro with a T7 promoter, e.g. for injections in mouse, zebrafish or Xenopus embryos. The Moreno-Mateos score is given in percentiles and the raw value in parentheses, see the note above. Click onto features to show all scores and predicted off-targets with up to four mismatches. The Out-of-Frame score by Bae et al. 2014 is correlated with the probability that mutations induced by the guide RNA will disrupt the open reading frame. The authors recommend out-of-frame scores > 66 to create knock-outs with a single guide efficiently. Off-target sites are sorted by the CFD (Cutting Frequency Determination) score (Doench et al. 
2016). The higher the CFD score, the more likely there is off-target cleavage at that site. Off-targets with a CFD score < 0.023 are not shown on this page, but are available when following the link to the external CRISPOR tool. When compared against experimentally validated off-targets by Haeussler et al. 2016, the large majority of predicted off-targets with CFD scores < 0.023 were false-positives. Methods Relationship between predictions and experimental data Like most algorithms, the MIT specificity score is not always a perfect predictor of off-target effects. Despite low scores, many tested guides caused few and/or weak off-target cleavage events when tested with whole-genome assays (Figure 2 from Haeussler et al. 2016), as shown below, and the published data contains few data points with high specificity scores. Overall though, the assays showed that the higher the specificity score, the lower the off-target effects. Similarly, efficiency scoring is not very accurate: guides with low scores can be efficient and vice versa. As a general rule, however, the higher the score, the less likely it is that a guide is very inefficient. The following histograms illustrate, for each type of score, how the share of inefficient guides drops with increasing efficiency scores: When reading this plot, keep in mind that both scores were evaluated on their own training data. Especially for the Moreno-Mateos score, the results are too optimistic, due to overfitting. When evaluated on independent datasets, the correlation of the prediction with other assays was around 25% lower, see Haeussler et al. 2016. At the time of writing, there is no independent dataset available yet to determine the Moreno-Mateos accuracy for each score percentile range. Track methods Exons as predicted by Ensembl gene models were extended by 200 base pairs on each side and searched for the -NGG motif.
Flanking 20mer guide sequences were aligned to the genome with BWA and scored with MIT Specificity scores using the command-line version of crispor.org. Non-unique guide sequences were skipped. Flanking sequences were extracted from the genome and used as input for Crispor efficiency scoring, available from the Crispor downloads page, which includes the Doench 2016, Moreno-Mateos 2015 and Bae 2014 algorithms, among others. Data Access The raw data can be explored interactively with the Table Browser. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. The files for this track are called crispr.bb and crisprDetails.tab and are located in the /gbdb/ce11/crispr directory of our downloads server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg19/crispr/crispr.bb -chrom=chr21 -start=0 -end=10000000 stdout The file crisprDetails.tab includes the details of the off-targets. The last column of the bigBed file is the byte offset of the respective line in crisprDetails.tab. E.g. if the last column is 14227033723, then the following command will extract the line with the corresponding off-target details: curl -s -r 14227033723-14227043723 http://hgdownload.soe.ucsc.edu/gbdb/hg19/crispr/crisprDetails.tab | head -n1. The off-target details cannot currently be joined with the Table Browser. The file crisprDetails.tab is a tab-separated text file with two fields. The first field contains the numbers of off-targets for each mismatch count, e.g. "0,0,1,3,49" means 0 off-targets at zero mismatches, 0 at one mismatch, 1 at two mismatches, 3 at three and 49 at four mismatches.
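The offset lookup and the first field of crisprDetails.tab described above can be sketched in Python. This is only a minimal sketch, assuming a local, uncompressed copy of crisprDetails.tab opened in binary mode; the function names read_details_line and parse_mismatch_counts are hypothetical and are not part of any UCSC or CRISPOR tool, and the demo content is made up.

```python
import io


def read_details_line(handle, offset):
    """Return the text line that starts at byte `offset` of a binary handle.

    `offset` is the value taken from the last column of the bigBed record.
    """
    handle.seek(offset)
    return handle.readline().decode("utf-8").rstrip("\n")


def parse_mismatch_counts(field):
    """Map each mismatch count (0, 1, 2, ...) to its number of off-targets.

    Position i of the comma-separated field holds the number of
    off-targets that have exactly i mismatches.
    """
    return {i: int(n) for i, n in enumerate(field.split(","))}


# Tiny in-memory stand-in for crisprDetails.tab (made-up content).
first_line = b"0,0,0,0,2\tplaceholder\n"
demo = io.BytesIO(first_line + b"0,0,1,3,49\tplaceholder\n")

# The second record starts right after the first one.
line = read_details_line(demo, len(first_line))
counts = parse_mismatch_counts(line.split("\t")[0])
print(counts)
```

Seeking to the offset avoids scanning the whole file, which is the same idea as the byte-range request in the curl command.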
The second field is a pipe-separated list of semicolon-separated tuples with the genome coordinates and the CFD score. E.g. "chr10;123376795+;42|chr5;148353274-;39" describes two off-targets, with the first at chr10:123376795 on the positive strand and a CFD score of 0.42. Credits Track created by Maximilian Haeussler and Hiram Clawson, with helpful input from Jean-Paul Concordet (MNHN Paris) and Alberto Stolfi (NYU). References Haeussler M, Schönig K, Eckert H, Eschstruth A, Mianné J, Renaud JB, Schneider-Maunoury S, Shkumatava A, Teboul L, Kent J et al. Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol. 2016 Jul 5;17(1):148. PMID: 27380939; PMC: PMC4934014 Bae S, Kweon J, Kim HS, Kim JS. Microhomology-based choice of Cas9 nuclease target sites. Nat Methods. 2014 Jul;11(7):705-6. PMID: 24972169 Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016 Feb;34(2):184-91. PMID: 26780180; PMC: PMC4744125 Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, Li Y, Fine EJ, Wu X, Shalem O et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013 Sep;31(9):827-32. PMID: 23873081; PMC: PMC3969858 Moreno-Mateos MA, Vejnar CE, Beaudoin JD, Fernandez JP, Mis EK, Khokha MK, Giraldez AJ. CRISPRscan: designing highly efficient sgRNAs for CRISPR-Cas9 targeting in vivo. Nat Methods. 2015 Oct;12(10):982-8. PMID: 26322839; PMC: PMC4589495 crisprTargets CRISPR Targets CRISPR/Cas9 -NGG Targets Genes and Gene Predictions Description This track shows regions of the genome within 200 bp of transcribed regions and DNA sequences targetable by CRISPR RNA guides using the Cas9 enzyme from S. pyogenes (PAM: NGG).
CRISPR target sites were annotated with predicted specificity (off-target effects) and predicted efficiency (on-target cleavage) by various algorithms through the tool CRISPOR. Display Conventions and Configuration The track "CRISPR Regions" shows the regions of the genome where target sites were analyzed, i.e. within 200 bp of transcribed regions as annotated by Ensembl transcript models. The track "CRISPR Targets" shows the target sites in these regions. The target sequence of the guide is shown with a thick (exon) bar. The PAM motif match (NGG) is shown with a thinner bar. Guides are colored to reflect both predicted specificity and efficiency. Specificity reflects the "uniqueness" of a 20mer sequence in the genome; the less unique a sequence is, the more likely it is to cleave other locations of the genome (off-target effects). Efficiency is the frequency of cleavage at the target site (on-target efficiency). Shades of gray stand for sites that are hard to target specifically, as the 20mer is not very unique in the genome: impossible to target: target site has at least one identical copy in the genome and was not scored hard to target: many similar sequences in the genome that alignment stopped, repeat? hard to target: target site was aligned but results in a low specificity score <= 50 (see below) Colors highlight targets that are specific in the genome (MIT specificity > 50) but have different predicted efficiencies: unable to calculate Doench/Fusi 2016 efficiency score low predicted cleavage: Doench/Fusi 2016 Efficiency percentile <= 30 medium predicted cleavage: Doench/Fusi 2016 Efficiency percentile > 30 and < 55 high predicted cleavage: Doench/Fusi 2016 Efficiency > 55 Mouse-over a target site to show predicted specificity and efficiency scores: The MIT Specificity score summarizes all off-targets into a single number from 0-100. The higher the number, the fewer off-target effects are expected. We recommend guides with an MIT specificity > 50. 
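The gray and colored categories listed above can be written as a small decision function. This is a hedged sketch rather than the browser's own coloring code: the function name guide_category and its text labels are hypothetical, and because the thresholds above leave a percentile of exactly 55 unassigned, the sketch counts it as high by assumption.

```python
def guide_category(mit_specificity, doench_percentile):
    """Return a text label for a guide, following the thresholds above.

    mit_specificity: MIT specificity score (0-100), or None if the guide
        was not scored (identical copy in the genome, or alignment stopped).
    doench_percentile: Doench/Fusi 2016 efficiency percentile, or None if
        it could not be calculated.
    """
    if mit_specificity is None:
        return "not scored (impossible or hard to target)"
    if mit_specificity <= 50:
        return "hard to target (low specificity)"
    if doench_percentile is None:
        return "specific, efficiency unavailable"
    if doench_percentile <= 30:
        return "specific, low predicted cleavage"
    if doench_percentile < 55:
        return "specific, medium predicted cleavage"
    # Assumption: exactly 55 is not covered by the text; treated as high here.
    return "specific, high predicted cleavage"


for spec, eff in [(None, None), (42, 80), (87, None), (87, 20), (87, 40), (87, 70)]:
    print(spec, eff, "->", guide_category(spec, eff))
```

Specificity is checked before efficiency because the efficiency colors only apply to guides with an MIT specificity above 50.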
The efficiency score tries to predict whether a guide leads to rather strong or weak cleavage. According to Haeussler et al. 2016, the Doench 2016 Efficiency score should be used to select the guide with the highest cleavage efficiency when expressing guides from RNA PolIII Promoters such as U6. Scores are given as percentiles, e.g. "70%" means that 70% of mammalian guides have a score equal to or lower than this guide. The raw score number is also shown in parentheses after the percentile. The Moreno-Mateos 2015 Efficiency score should be used instead of the Doench 2016 score when transcribing the guide in vitro with a T7 promoter, e.g. for injections in mouse, zebrafish or Xenopus embryos. The Moreno-Mateos score is given in percentiles and the raw value in parentheses, see the note above. Click on a feature to show all scores and predicted off-targets with up to four mismatches. The Out-of-Frame score by Bae et al. 2014 is correlated with the probability that mutations induced by the guide RNA will disrupt the open reading frame. The authors recommend out-of-frame scores > 66 to create knock-outs with a single guide efficiently. Off-target sites are sorted by the CFD (Cutting Frequency Determination) score (Doench et al. 2016). The higher the CFD score, the more likely there is off-target cleavage at that site. Off-targets with a CFD score < 0.023 are not shown on this page, but are available when following the link to the external CRISPOR tool. When compared against experimentally validated off-targets by Haeussler et al. 2016, the large majority of predicted off-targets with CFD scores < 0.023 were false-positives. Methods Relationship between predictions and experimental data Like most algorithms, the MIT specificity score is not always a perfect predictor of off-target effects. Despite low scores, many tested guides caused few and/or weak off-target cleavage events when tested with whole-genome assays (Figure 2 from Haeussler et al.
2016), as shown below, and the published data contains few data points with high specificity scores. Overall though, the assays showed that the higher the specificity score, the lower the off-target effects. Similarly, efficiency scoring is not very accurate: guides with low scores can be efficient and vice versa. As a general rule, however, the higher the score, the less likely that a guide is very inefficient. The following histograms illustrate, for each type of score, how the share of inefficient guides drops with increasing efficiency scores: When reading this plot, keep in mind that both scores were evaluated on their own training data. Especially for the Moreno-Mateos score, the results are too optimistic, due to overfitting. When evaluated on independent datasets, the correlation of the prediction with other assays was around 25% lower, see Haeussler et al. 2016. At the time of writing, there is no independent dataset available yet to determine the Moreno-Mateos accuracy for each score percentile range. Track methods Exons as predicted by Ensembl Gene models were used, extended by 200 basepairs on each side, searched for the -NGG motif. Flanking 20mer guide sequences were aligned to the genome with BWA and scored with MIT Specificity scores using the command-line version of crispor.org. Non-unique guide sequences were skipped. Flanking sequences were extracted from the genome and input for Crispor efficiency scoring, available from the Crispor downloads page, which includes the Doench 2016, Moreno-Mateos 2015 and Bae 2014 algorithms, among others. Data Access The raw data can be explored interactively with the Table Browser. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. The files for this track are called crispr.bb and crisprDetails.tab and are located in the /gbdb/ce11/crispr directory of our downloads server. 
Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg19/crisprTargets/crispr.bb -chrom=chr21 -start=0 -end=10000000 stdout The file crisprDetails.tab includes the details of the off-targets. The last column of the bigBed file is the byte offset of the respective line in crisprDetails.tab. E.g. if the last column is 14227033723, then the following command will extract the line with the corresponding off-target details: curl -s -r 14227033723-14227043723 http://hgdownload.soe.ucsc.edu/gbdb/hg19/crispr/crisprDetails.tab | head -n1. The off-target details cannot currently be joined with the Table Browser. The file crisprDetails.tab is a tab-separated text file with two fields. The first field contains the numbers of off-targets for each mismatch count, e.g. "0,0,1,3,49" means 0 off-targets at zero mismatches, 0 at one mismatch, 1 at two mismatches, 3 at three and 49 at four mismatches. The second field is a pipe-separated list of semicolon-separated tuples with the genome coordinates and the CFD score. E.g. "chr10;123376795+;42|chr5;148353274-;39" describes two off-targets, with the first at chr10:123376795 on the positive strand and a CFD score of 0.42. Credits Track created by Maximilian Haeussler and Hiram Clawson, with helpful input from Jean-Paul Concordet (MNHN Paris) and Alberto Stolfi (NYU). References Haeussler M, Schönig K, Eckert H, Eschstruth A, Mianné J, Renaud JB, Schneider-Maunoury S, Shkumatava A, Teboul L, Kent J et al. Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol. 2016 Jul 5;17(1):148. PMID: 27380939; PMC: PMC4934014 Bae S, Kweon J, Kim HS, Kim JS.
Microhomology-based choice of Cas9 nuclease target sites. Nat Methods. 2014 Jul;11(7):705-6. PMID: 24972169 Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016 Feb;34(2):184-91. PMID: 26780180; PMC: PMC4744125 Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, Li Y, Fine EJ, Wu X, Shalem O et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013 Sep;31(9):827-32. PMID: 23873081; PMC: PMC3969858 Moreno-Mateos MA, Vejnar CE, Beaudoin JD, Fernandez JP, Mis EK, Khokha MK, Giraldez AJ. CRISPRscan: designing highly efficient sgRNAs for CRISPR-Cas9 targeting in vivo. Nat Methods. 2015 Oct;12(10):982-8. PMID: 26322839; PMC: PMC4589495 ensGene Ensembl Genes Ensembl Genes Genes and Gene Predictions Description These gene predictions were generated by Ensembl. For more information on the different gene tracks, see our Genes FAQ. Methods For a description of the methods used in Ensembl gene predictions, please refer to Hubbard et al. (2002), also listed in the References section below. Data access Ensembl Gene data can be explored interactively using the Table Browser or the Data Integrator. For local downloads, the genePred format files for ce11 are available in our downloads directory as ensGene.txt.gz or in our genes download directory in GTF format. For programmatic access, the data can be queried from the REST API or directly from our public MySQL servers. Instructions on this method are available on our MySQL help page and on our blog. Previous versions of this track can be found on our archive download server. Credits We would like to thank Ensembl for providing these gene annotations. For more information, please see Ensembl's genome annotation page. References Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al. 
The Ensembl genome database project. Nucleic Acids Res. 2002 Jan 1;30(1):38-41. PMID: 11752248; PMC: PMC99161 evaSnpContainer EVA SNP Short Genetic Variants from European Variant Archive Variation and Repeats Description These tracks contain mappings of single nucleotide variants and small insertions and deletions (indels) from the European Variation Archive (EVA) for the C. elegans ce11 genome. The dbSNP database at NCBI no longer hosts non-human variants. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP variant corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. The display is set to automatically collapse to dense visibility when there are more than 100k variants in the window. When the window size is more than 250k bp, the display is switched to density graph mode. Searching, details, and filtering Navigation to an individual variant can be accomplished by typing or copying the variant identifier (rsID) or the genomic coordinates into the Position/Search box on the Browser. A click on an item in the graphical display displays a page with data about that variant. Data fields include the Reference and Alternate Alleles, the class of the variant as reported by EVA, the source of the data, the amino acid change, if any, and the functional class as determined by UCSC's Variant Annotation Integrator. Variants can be filtered using the track controls to show subsets of the data by either EVA Sequence Ontology (SO) term, UCSC-generated functional effect, or by color, which bins the UCSC functional effects into general classes. 
Mouse-over Mousing over an item shows the ucscClass, which is the consequence according to the Variant Annotation Integrator, and the aaChange when one is available, which is the change in amino acid in HGVS.p terms. Items may have multiple ucscClasses, which will all be shown in the mouse-over in a comma-separated list. Likewise, multiple HGVS.p terms may be shown for each rsID separated by spaces describing all possible AA changes. Multiple items may appear due to different variant predictions on multiple gene transcripts. For all organisms, the gene models used were the NCBI RefSeq curated models when available, otherwise Ensembl genes, or finally UCSC mappings of RefSeq if neither of the previous models was available. Track colors Variants are colored according to the most potentially deleterious functional effect prediction according to the Variant Annotation Integrator. Specific bins can be seen in the Methods section below. Color Variant Type Protein-altering variants and splice site variants Synonymous codon variants Non-coding transcript or Untranslated Region (UTR) variants Intergenic and intronic variants Sequence Ontology (SO) Variants are classified by EVA into one of the following sequence ontology terms: substitution — A single nucleotide in the reference is replaced by another, alternate allele. deletion — One or more nucleotides are deleted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is a deletion of an A may be represented as Ref = GA and Alt = G. insertion — One or more nucleotides are inserted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is an insertion of a T may be represented as Ref = G and Alt = GT.
delins — Similar to a tandem repeat, in that the runs of Ref and Alt Alleles are of different length, except that there is more than one type of nucleotide, e.g., Ref = CCAAAAACAAAAACA, Alt = ACAAAAAC. multipleNucleotideVariant — More than one nucleotide is substituted by an equal number of different nucleotides, e.g., Ref = AA, Alt = GC. sequence alteration — A parent term meant to signify a deviation from another sequence. Can be assigned to variants that have not been characterized yet. Methods Data were downloaded from the European Variation Archive EVA current_ids.vcf.gz files corresponding to the proper assembly. Chromosome names were converted to UCSC-style and the variants passed through the Variant Annotation Integrator to predict consequence. For every organism the NCBI RefSeq curated models were used when available, followed by ensembl genes, and finally UCSC mapping of RefSeq when neither of the previous models were possible. Variants were then colored according to their predicted consequence in the following fashion: Protein-altering variants and splice site variants - exon_loss_variant, frameshift_variant, inframe_deletion, inframe_insertion, initiator_codon_variant, missense_variant, splice_acceptor_variant, splice_donor_variant, splice_region_variant, stop_gained, stop_lost, coding_sequence_variant, transcript_ablation Synonymous codon variants - synonymous_variant, stop_retained_variant Non-coding transcript or Untranslated Region (UTR) variants - 5_prime_UTR_variant, 3_prime_UTR_variant, complex_transcript_variant, non_coding_transcript_exon_variant Intergenic and intronic variants - upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_variant, NMD_transcript_variant, no_sequence_alteration Sequence Ontology ("SO:") terms were converted to the variant classes, then the files were converted to BED, and then bigBed format. No functional annotations were provided by the EVA (e.g., missense, nonsense, etc). 
These were computed using UCSC's Variant Annotation Integrator (Hinrichs et al., 2016). Amino-acid substitutions for missense variants are based on RefSeq alignments of mRNA transcripts, which do not always match the amino acids predicted from translating the genomic sequence. Therefore, in some instances, the variant and the genomic nucleotide and associated amino acid may be reversed. E.g., a Pro > Arg change from the perspective of the mRNA would be Arg > Pro from the perspective of the genomic sequence. Also, in bosTau9, galGal5, rheMac10, and danRer11 the mitochondrial sequence was removed or renamed to match UCSC. For complete documentation of the processing of these tracks, see the makedoc corresponding to the version of interest, for example the EVA Release 7 MakeDoc. Data Access Note: It is not recommended to use LiftOver to convert SNPs between assemblies; more information about how to convert SNPs between assemblies can be found in the following FAQ entry. The data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the data may be queried from our REST API. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. For automated download and analysis, this annotation is stored in a bigBed file that can be downloaded from our download server. Use the corresponding version number for the track of interest, e.g. evaSnp7.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed https://hgdownload.soe.ucsc.edu/gbdb/ce11/bbi/evaSnp7.bb -chrom=chrI -start=0 -end=100000000 stdout Credits This track was produced from the European Variation Archive release data.
Consequences were predicted using UCSC's Variant Annotation Integrator and NCBI's RefSeq as well as Ensembl gene models. References Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, Shen A, Silva AF, Tsukanov K, Venkataraman S et al. The European Variation Archive: a FAIR resource of genomic variation for all species. Nucleic Acids Res. 2021 Oct 28:gkab960. doi:10.1093/nar/gkab960. Epub ahead of print. PMID: 34718739; PMC: PMC8728205 Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ. UCSC Data Integrator and Variant Annotation Integrator. Bioinformatics. 2016 May 1;32(9):1430-2. PMID: 26740527; PMC: PMC4848401 evaSnp7 EVA SNP Release 7 Short Genetic Variants from European Variant Archive Release 7 Variation and Repeats Description These tracks contain mappings of single nucleotide variants and small insertions and deletions (indels) from the European Variation Archive (EVA) for the C. elegans ce11 genome. The dbSNP database at NCBI no longer hosts non-human variants. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP variant corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. The display is set to automatically collapse to dense visibility when there are more than 100k variants in the window. When the window size is more than 250k bp, the display is switched to density graph mode.
Searching, details, and filtering Navigation to an individual variant can be accomplished by typing or copying the variant identifier (rsID) or the genomic coordinates into the Position/Search box on the Browser. A click on an item in the graphical display displays a page with data about that variant. Data fields include the Reference and Alternate Alleles, the class of the variant as reported by EVA, the source of the data, the amino acid change, if any, and the functional class as determined by UCSC's Variant Annotation Integrator. Variants can be filtered using the track controls to show subsets of the data by either EVA Sequence Ontology (SO) term, UCSC-generated functional effect, or by color, which bins the UCSC functional effects into general classes. Mouse-over Mousing over an item shows the ucscClass, which is the consequence according to the Variant Annotation Integrator, and the aaChange when one is available, which is the change in amino acid in HGVS.p terms. Items may have multiple ucscClasses, which will all be shown in the mouse-over in a comma-separated list. Likewise, multiple HGVS.p terms may be shown for each rsID separated by spaces describing all possible AA changes. Multiple items may appear due to different variant predictions on multiple gene transcripts. For all organisms, the gene models used were the NCBI RefSeq curated when available, if not then ensembl genes, or finally, UCSC mappings of RefSeq if neither of the previous models was possible. Track colors Variants are colored according to the most potentially deleterious functional effect prediction according to the Variant Annotation Integrator. Specific bins can be seen in the Methods section below. 
Color Variant Type
Protein-altering variants and splice site variants
Synonymous codon variants
Non-coding transcript or Untranslated Region (UTR) variants
Intergenic and intronic variants
Sequence Ontology (SO) Variants are classified by EVA into one of the following sequence ontology terms:
substitution: A single nucleotide in the reference is replaced by another, alternate allele.
deletion: One or more nucleotides are deleted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is a deletion of an A may be represented as Ref = GA and Alt = G.
insertion: One or more nucleotides are inserted. The representation in the database is to display one additional nucleotide in both the Reference field (Ref) and the Alternate Allele field (Alt). E.g. a variant that is an insertion of a T may be represented as Ref = G and Alt = GT.
delins: Similar to a tandem repeat, in that the runs of Ref and Alt Alleles are of different length, except that there is more than one type of nucleotide, e.g., Ref = CCAAAAACAAAAACA, Alt = ACAAAAAC.
multipleNucleotideVariant: More than one nucleotide is substituted by an equal number of different nucleotides, e.g., Ref = AA, Alt = GC.
sequence alteration: A parent term meant to signify a deviation from another sequence. Can be assigned to variants that have not been characterized yet.
Methods Data were downloaded from the European Variation Archive EVA current_ids.vcf.gz files corresponding to the proper assembly. Chromosome names were converted to UCSC-style and the variants passed through the Variant Annotation Integrator to predict consequence. For every organism the NCBI RefSeq curated models were used when available, followed by Ensembl genes, and finally UCSC mapping of RefSeq when neither of the previous models was possible.
Variants were then colored according to their predicted consequence in the following fashion:
Protein-altering variants and splice site variants - exon_loss_variant, frameshift_variant, inframe_deletion, inframe_insertion, initiator_codon_variant, missense_variant, splice_acceptor_variant, splice_donor_variant, splice_region_variant, stop_gained, stop_lost, coding_sequence_variant, transcript_ablation
Synonymous codon variants - synonymous_variant, stop_retained_variant
Non-coding transcript or Untranslated Region (UTR) variants - 5_prime_UTR_variant, 3_prime_UTR_variant, complex_transcript_variant, non_coding_transcript_exon_variant
Intergenic and intronic variants - upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_variant, NMD_transcript_variant, no_sequence_alteration
Sequence Ontology ("SO:") terms were converted to the variant classes, then the files were converted to BED, and then bigBed format. No functional annotations were provided by the EVA (e.g., missense, nonsense, etc). These were computed using UCSC's Variant Annotation Integrator (Hinrichs, et al., 2016). Amino-acid substitutions for missense variants are based on RefSeq alignments of mRNA transcripts, which do not always match the amino acids predicted from translating the genomic sequence. Therefore, in some instances, the variant and the genomic nucleotide and associated amino acid may be reversed. E.g., a Pro > Arg change from the perspective of the mRNA would be Arg > Pro from the perspective of the genomic sequence. Also, in bosTau9, galGal5, rheMac10, and danRer11 the mitochondrial sequence was removed or renamed to match UCSC. For complete documentation of the processing of these tracks, see the makedoc corresponding to the version of interest. For example, the EVA Release 7 MakeDoc.
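The binning described in the Methods above can be sketched as a small lookup. This is only an illustration, not the code UCSC uses: the bin names (`protein_altering_or_splice` and so on) and the function name are hypothetical, and the sketch assumes that the order in which the four bins are listed runs from most to least potentially deleterious. The consequence terms themselves are copied from the list above.

```python
# Consequence terms grouped into the four bins listed in the Methods text.
# ASSUMPTION: the bin names are made up for this sketch, and the list order
# is taken to rank bins from most to least potentially deleterious.
BINS = [
    ("protein_altering_or_splice", {
        "exon_loss_variant", "frameshift_variant", "inframe_deletion",
        "inframe_insertion", "initiator_codon_variant", "missense_variant",
        "splice_acceptor_variant", "splice_donor_variant",
        "splice_region_variant", "stop_gained", "stop_lost",
        "coding_sequence_variant", "transcript_ablation",
    }),
    ("synonymous", {"synonymous_variant", "stop_retained_variant"}),
    ("noncoding_or_utr", {
        "5_prime_UTR_variant", "3_prime_UTR_variant",
        "complex_transcript_variant", "non_coding_transcript_exon_variant",
    }),
    ("intergenic_or_intronic", {
        "upstream_gene_variant", "downstream_gene_variant", "intron_variant",
        "intergenic_variant", "NMD_transcript_variant",
        "no_sequence_alteration",
    }),
]


def most_deleterious_bin(ucsc_classes):
    """Return the highest-ranked bin for a comma-separated list of terms.

    Returns None when none of the terms belongs to a known bin.
    """
    terms = [t.strip() for t in ucsc_classes.split(",") if t.strip()]
    for name, members in BINS:
        if any(term in members for term in terms):
            return name
    return None


if __name__ == "__main__":
    # A variant annotated on two transcripts with different consequences.
    print(most_deleterious_bin("intron_variant,missense_variant"))
```

A variant that carries several ucscClass values is assigned to the first bin in the list that contains any of them, which is one way to read "most potentially deleterious".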
Data Access Note: Using LiftOver to convert SNPs between assemblies is not recommended; more information about how to convert SNPs between assemblies can be found on the following FAQ entry. The data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the data may be queried from our REST API. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. For automated download and analysis, this annotation is stored in a bigBed file that can be downloaded from our download server. Use the corresponding version number for the track of interest, e.g. evaSnp7.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g. bigBedToBed https://hgdownload.soe.ucsc.edu/gbdb/ce11/bbi/evaSnp7.bb -chrom=chrI -start=0 -end=100000000 stdout Credits This track was produced from the European Variation Archive release data. Consequences were predicted using UCSC's Variant Annotation Integrator and NCBI's RefSeq as well as Ensembl gene models. References Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, Shen A, Silva AF, Tsukanov K, Venkataraman S et al. The European Variation Archive: a FAIR resource of genomic variation for all species. Nucleic Acids Res. 2021 Oct 28:gkab960. doi:10.1093/nar/gkab960. Epub ahead of print. PMID: 34718739. PMCID: PMC8728205. Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ. UCSC Data Integrator and Variant Annotation Integrator. Bioinformatics. 2016 May 1;32(9):1430-2. PMID: 26740527; PMC: PMC4848401 gap Gap Gap Locations Mapping and Sequencing Description This track shows the gaps in the Aug.
2014 C. elegans genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of the gaps in this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. Gaps are represented as black boxes in this track. If the relative order and orientation of the contigs on either side of the gap is supported by read pair data, it is a bridged gap and a white line is drawn through the black box representing the gap. This assembly contains the following principal types of gaps: gc5BaseBw GC Percent GC Percent in 5-Base Windows Mapping and Sequencing Description The GC percent track shows the percentage of G (guanine) and C (cytosine) bases in 5-base windows. High GC content is typically associated with gene-rich areas. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options. Credits The data and presentation of this graph were prepared by Hiram Clawson. genscan Genscan Genes Genscan Gene Predictions Genes and Gene Predictions Description This track shows predictions from the Genscan program written by Chris Burge. The predictions are based on transcriptional, translational and donor/acceptor splicing signals as well as the length and compositional distributions of exons, introns and intergenic regions. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions.
Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods For a description of the Genscan program and the model that underlies it, refer to Burge and Karlin (1997) in the References section below. The splice site models used are described in more detail in Burge (1998) below. Credits Thanks to Chris Burge for providing the Genscan program. References Burge C. Modeling Dependencies in Pre-mRNA Splicing Signals. In: Salzberg S, Searls D, Kasif S, editors. Computational Methods in Molecular Biology. Amsterdam: Elsevier Science; 1998. p. 127-163. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997 Apr 25;268(1):78-94. PMID: 9149143 ucscToINSDC INSDC Accession at INSDC - International Nucleotide Sequence Database Collaboration Mapping and Sequencing Description This track associates UCSC Genome Browser chromosome names to accession names from the International Nucleotide Sequence Database Collaboration (INSDC). The data were downloaded from the NCBI assembly database. Credits The data for this track was prepared by Hiram Clawson. nestedRepeats Interrupted Rpts Fragments of Interrupted Repeats Joined by RepeatMasker ID Variation and Repeats Description This track shows joined fragments of interrupted repeats extracted from the output of the RepeatMasker program which screens DNA sequences for interspersed repeats and low complexity DNA sequences using the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka (2000) in the References section below. The detailed annotations from RepeatMasker are in the RepeatMasker track. This track shows fragments of original repeat insertions which have been interrupted by insertions of younger repeats or through local rearrangements. The fragments are joined using the ID column of RepeatMasker output. 
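Joining fragments by the RepeatMasker ID column, as described above, can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure UCSC ran: the record layout `(chrom, start, end, strand, rep_name, rm_id)`, the function name `join_fragments`, grouping by `(chrom, rm_id)`, and the use of `"."` for mixed strands are all choices made for this sketch, and the repeat names in the usage example are invented.

```python
from collections import defaultdict


def join_fragments(records):
    """Group RepeatMasker hits that share an ID into joined items.

    records: iterable of (chrom, start, end, strand, rep_name, rm_id).
    ASSUMPTION: this tuple layout is hypothetical; real RepeatMasker output
    would first have to be parsed into it.
    Only IDs seen on two or more fragments are returned, since a single
    fragment is not an interrupted repeat.
    """
    groups = defaultdict(list)
    for chrom, start, end, strand, rep_name, rm_id in records:
        groups[(chrom, rm_id)].append((start, end, strand, rep_name))

    joined = []
    for (chrom, rm_id), frags in groups.items():
        if len(frags) < 2:
            continue
        frags.sort()  # order fragments by start coordinate
        strands = {strand for _, _, strand, _ in frags}
        joined.append({
            "chrom": chrom,
            "id": rm_id,
            "name": frags[0][3],
            "start": frags[0][0],
            "end": max(end for _, end, _, _ in frags),
            # "." marks fragments on mixed strands (a choice of this sketch)
            "strand": strands.pop() if len(strands) == 1 else ".",
            "blocks": [(start, end) for start, end, _, _ in frags],
        })
    return joined


if __name__ == "__main__":
    hits = [
        ("chrI", 100, 200, "+", "repA", 7),   # first piece of repA
        ("chrI", 210, 340, "-", "repB", 8),   # younger insertion
        ("chrI", 350, 400, "+", "repA", 7),   # second piece of repA
    ]
    for item in join_fragments(hits):
        print(item)
```

In the usage example the two `repA` pieces share ID 7 and are reported as one item spanning 100 to 400 with two blocks, while `repB` stays out of the result because it has a single fragment.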
Display Conventions and Configuration In pack or full mode, each interrupted repeat is displayed as boxes (fragments) joined by horizontal lines, labeled with the repeat name. If all fragments are on the same strand, arrows are added to the horizontal line to indicate the strand. In dense or squish mode, labels and arrows are omitted and in dense mode, all items are collapsed to fit on a single row. Items are shaded according to the average identity score of their fragments. Usually, the shade of an item is similar to the shades of its fragments unless some fragments are much more diverged than others. The score displayed above is the average identity score, clipped to a range of 50% - 100% and then mapped to the range 0 - 1000 for shading in the browser. Methods UCSC has used the most current versions of the RepeatMasker software and repeat libraries available to generate these data. Note that these versions may be newer than those that are publicly available on the Internet. Data are generated using the RepeatMasker -s flag. Additional flags may be used for certain organisms. See the FAQ for more information. Credits Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track. References Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. https://www.repeatmasker.org/. 1996-2010. Repbase Update is described in: Jurka J. Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420. PMID: 10973072 For a discussion of repeats in mammalian genomes, see: Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6):657-63. PMID: 10607616 Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8. 
PMID: 8994846 nematodesChainNet Nematodes Chain/Net Nematodes Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of a query genome sequence to the C. elegans genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both the query sequence and C. elegans simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the query sequence assembly or an insertion in the C. elegans assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the C. elegans genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best query sequence/C. elegans chain for every part of the C. elegans genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement.
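The single-line versus double-line distinction in the chain track description above can be illustrated with the alignment block lines of a UCSC chain file, where each line except the last holds `size dt dq`: the ungapped block size, then the gap to the next block in the reference (C. elegans) and in the query. The sketch below is a simplified reading, not the browser's drawing code. The function name and the label `query_only` are hypothetical, and treating any non-zero `dq` alongside a non-zero `dt` as "double" is an assumption, since the text speaks of "substantial" sequence in both species without giving a threshold.

```python
def classify_gaps(block_lines):
    """Label each gap in a chain as 'single', 'double', or 'query_only'.

    block_lines: the lines that follow a chain header, each 'size dt dq',
    with a final line holding only 'size'.
    ASSUMPTION: any gap with both dt > 0 and dq > 0 is called 'double'.
    """
    kinds = []
    for line in block_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # final block of the chain: no gap follows it
        _size, dt, dq = (int(f) for f in fields)
        if dt > 0 and dq > 0:
            kinds.append("double")      # sequence present on both sides
        elif dt > 0:
            kinds.append("single")      # extra sequence in the reference only
        else:
            kinds.append("query_only")  # blocks adjacent on the reference
    return kinds


if __name__ == "__main__":
    body = ["50 10 0", "30 5 7", "20 0 4", "40"]
    print(classify_gaps(body))
```

A gap with `dt` equal to zero leaves the two blocks adjacent in reference coordinates, so it is kept as a separate label here instead of being counted as a single-line gap.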
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the query sequence/C. elegans split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single query sequence chromosome and a single C.
elegans chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
-linearGap=loose
tablesize 11
smallSize 111
position  1    2    3    11   111  2111  12111  32111  72111  152111  252111
qGap      325  360  400  450  600  1100  3600   7600   15600  31600   56600
tGap      325  360  400  450  600  1100  3600   7600   15600  31600   56600
bothGap   625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits LASTZ was developed at Miller Lab at Pennsylvania State University by Bob Harris. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler.
The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 nematodesChainNetViewnet Nets Nematodes Chain and Net Alignments Comparative Genomics netCi3 C. intestinalis Net C. intestinalis (Apr. 2011 (Kyoto KH/ci3)) Alignment Net Comparative Genomics netTriSpi1 triSpi1 Net Trichinella (Jan. 2011 (WS225/Trichinella_spiralis-3.7.1/triSpi1)) Alignment Net Comparative Genomics netTriSui1 triSui1 Net Whipworm (Jul. 2014 (WS243/T. suis DCEP-RM93M male/triSui1)) Alignment Net Comparative Genomics netHaeCon2 haeCon2 Net Barber pole worm (Jul. 2013 (WormBase WS239/haeCon2)) Alignment Net Comparative Genomics netMelInc2 melInc2 Net M. incognita (Feb. 2008 (M. incognita WS245/PRJEA28837/melInc2)) Alignment Net Comparative Genomics netMelHap1 melHap1 Net M. hapla (Sep. 2008 (M. hapla VW9 WS210/melHap1)) Alignment Net Comparative Genomics netBurXyl1 burXyl1 Net Pine wood nematode (Nov. 2011 (WS229/B. xylophilus Ka4C1/burXyl1)) Alignment Net Comparative Genomics netPanRed1 panRed1 Net Microworm (Feb. 2013 (WS240/Pred3/panRed1)) Alignment Net Comparative Genomics netStrRat2 strRat2 Net Threadworm (Sep. 2014 (S. 
ratti ED321/strRat2)) Alignment Net Comparative Genomics netDirImm1 dirImm1 Net Dog heartworm (Sep. 2013 (WS240/D. immitis v2.2/dirImm1)) Alignment Net Comparative Genomics netLoaLoa1 loaLoa1 Net Eye worm (Jul. 2012 (WS235/L_loa_Cameroon_isolate/loaLoa1)) Alignment Net Comparative Genomics netBruMal2 bruMal2 Net Filarial worm (May. 2014 (WS244/B_malayi-3.1/bruMal2)) Alignment Net Comparative Genomics netOncVol1 oncVol1 Net O. volvulus (Nov. 2013 (WS241/O_volvulus_Cameroon_v3/oncVol1)) Alignment Net Comparative Genomics netAscSuu1 ascSuu1 Net Pig roundworm (Sep. 2012 (WS229/AscSuum_1.0/ascSuu1)) Alignment Net Comparative Genomics netPriExs1 priExs1 Net P. exspectatus (Mar. 2014 (WS243/P_exspectatus_v1/priExs1)) Alignment Net Comparative Genomics netPriPac3 priPac3 Net P. pacificus (Aug. 2014 (WS221/P_pacificus-v2/priPac3)) Alignment Net Comparative Genomics netNecAme1 necAme1 Net N. americanus (Dec. 2013 (WS242/N_americanus_v1/necAme1)) Alignment Net Comparative Genomics netHetBac1 hetBac1 Net H. bacteriophora/m31e (Aug. 2011 (WS229/H. bacteriophora 7.0/hetBac1)) Alignment Net Comparative Genomics netAncCey1 ancCey1 Net A. ceylanicum (Mar. 2014 (WS243/Acey_2013.11.30.genDNA/ancCey1)) Alignment Net Comparative Genomics netCb4 cb4 Net C. briggsae (Apr. 2011 (WS225/cb4)) Alignment Net Comparative Genomics netCaeAng2 caeAng2 Net C. angaria (Apr. 2012 (WS232/ps1010rel8/caeAng2)) Alignment Net Comparative Genomics netCaeSp51 caeSp51 Net C. sp. 5 ju800 (Jan. 2012 (WS230/Caenorhabditis_sp_5-JU800-1.0/caeSp51)) Alignment Net Comparative Genomics netCaeJap4 caeJap4 Net C. japonica (Aug. 2010 (WUSTL 7.0.1/caeJap4)) Alignment Net Comparative Genomics netCaeSp111 caeSp111 Net C. tropicalis (Nov. 2010 (WS226/WUSTL 3.0.1/caeSp111)) Alignment Net Comparative Genomics netCaePb3 caePb3 Net C. brenneri (Nov. 2010 (C. brenneri 6.0.1b/caePb3)) Alignment Net Comparative Genomics netCaeRem4 caeRem4 Net C. remanei (Jul. 
2007 (WS220/caeRem4)) Alignment Net Comparative Genomics nematodesChainNetViewchain Chains Nematodes Chain and Net Alignments Comparative Genomics chainCi3 C. intestinalis Chain C. intestinalis (Apr. 2011 (Kyoto KH/ci3)) Chained Alignments Comparative Genomics chainDicrocoelium_dendriticum Dicrocoelium_dendriticum Chain Dicrocoelium_dendriticum (Dicrocoelium_dendriticum) Chained Alignments Comparative Genomics chainSpirometra_erinaceieuropaei Spirometra_erinaceieuropaei Chain Spirometra_erinaceieuropaei (Spirometra_erinaceieuropaei) Chained Alignments Comparative Genomics chainEchinococcus_granulosus Echinococcus_granulosus Chain Echinococcus_granulosus (Echinococcus_granulosus) Chained Alignments Comparative Genomics chainTaenia_solium Taenia_solium Chain Taenia_solium (Taenia_solium) Chained Alignments Comparative Genomics chainEchinococcus_canadensis Echinococcus_canadensis Chain Echinococcus_canadensis (Echinococcus_canadensis) Chained Alignments Comparative Genomics chainTaenia_asiatica Taenia_asiatica Chain Taenia_asiatica (Taenia_asiatica) Chained Alignments Comparative Genomics chainEchinococcus_multilocularis Echinococcus_multilocularis Chain Echinococcus_multilocularis (Echinococcus_multilocularis) Chained Alignments Comparative Genomics chainOpisthorchis_viverrini Opisthorchis_viverrini Chain Opisthorchis_viverrini (Opisthorchis_viverrini) Chained Alignments Comparative Genomics chainClonorchis_sinensis Clonorchis_sinensis Chain Clonorchis_sinensis (Clonorchis_sinensis) Chained Alignments Comparative Genomics chainTaenia_saginata Taenia_saginata Chain Taenia_saginata (Taenia_saginata) Chained Alignments Comparative Genomics chainFasciola_gigantica Fasciola_gigantica Chain Fasciola_gigantica (Fasciola_gigantica) Chained Alignments Comparative Genomics chainTaenia_multiceps Taenia_multiceps Chain Taenia_multiceps (Taenia_multiceps) Chained Alignments Comparative Genomics chainFasciola_hepatica Fasciola_hepatica Chain Fasciola_hepatica (Fasciola_hepatica) 
Chained Alignments Comparative Genomics chainHymenolepis_microstoma Hymenolepis_microstoma Chain Hymenolepis_microstoma (Hymenolepis_microstoma) Chained Alignments Comparative Genomics chainGyrodactylus_salaris Gyrodactylus_salaris Chain Gyrodactylus_salaris (Gyrodactylus_salaris) Chained Alignments Comparative Genomics chainSchistosoma_japonicum Schistosoma_japonicum Chain Schistosoma_japonicum (Schistosoma_japonicum) Chained Alignments Comparative Genomics chainSchistosoma_haematobium Schistosoma_haematobium Chain Schistosoma_haematobium (Schistosoma_haematobium) Chained Alignments Comparative Genomics chainSchistosoma_mansoni Schistosoma_mansoni Chain Schistosoma_mansoni (Schistosoma_mansoni) Chained Alignments Comparative Genomics chainMacrostomum_lignano Macrostomum_lignano Chain Macrostomum_lignano (Macrostomum_lignano) Chained Alignments Comparative Genomics chainDugesia_japonica Dugesia_japonica Chain Dugesia_japonica (Dugesia_japonica) Chained Alignments Comparative Genomics chainGirardia_tigrina Girardia_tigrina Chain Girardia_tigrina (Girardia_tigrina) Chained Alignments Comparative Genomics chainSchmidtea_mediterranea Schmidtea_mediterranea Chain Schmidtea_mediterranea (Schmidtea_mediterranea) Chained Alignments Comparative Genomics chainTriSpi1 triSpi1 Chain Trichinella (Jan. 2011 (WS225/Trichinella_spiralis-3.7.1/triSpi1)) Chained Alignments Comparative Genomics chainTriSui1 triSui1 Chain Whipworm (Jul. 2014 (WS243/T. 
suis DCEP-RM93M male/triSui1)) Chained Alignments Comparative Genomics chainTrichuris_muris Trichuris_muris Chain Trichuris_muris (Trichuris_muris) Chained Alignments Comparative Genomics chainTrichuris_trichiura Trichuris_trichiura Chain Trichuris_trichiura (Trichuris_trichiura) Chained Alignments Comparative Genomics chainRomanomermis_culicivorax Romanomermis_culicivorax Chain Romanomermis_culicivorax (Romanomermis_culicivorax) Chained Alignments Comparative Genomics chainTrichinella_spiralis Trichinella_spiralis Chain Trichinella_spiralis (Trichinella_spiralis) Chained Alignments Comparative Genomics chainTrichinella_nativa Trichinella_nativa Chain Trichinella_nativa (Trichinella_nativa) Chained Alignments Comparative Genomics chainTrichinella_nelsoni Trichinella_nelsoni Chain Trichinella_nelsoni (Trichinella_nelsoni) Chained Alignments Comparative Genomics chainTrichinella_patagoniensis Trichinella_patagoniensis Chain Trichinella_patagoniensis (Trichinella_patagoniensis) Chained Alignments Comparative Genomics chainTrichinella_T9 Trichinella_T9 Chain Trichinella_T9 (Trichinella_T9) Chained Alignments Comparative Genomics chainTrichinella_T8 Trichinella_T8 Chain Trichinella_T8 (Trichinella_T8) Chained Alignments Comparative Genomics chainTrichinella_britovi Trichinella_britovi Chain Trichinella_britovi (Trichinella_britovi) Chained Alignments Comparative Genomics chainTrichinella_T6 Trichinella_T6 Chain Trichinella_T6 (Trichinella_T6) Chained Alignments Comparative Genomics chainTrichinella_papuae Trichinella_papuae Chain Trichinella_papuae (Trichinella_papuae) Chained Alignments Comparative Genomics chainTrichinella_pseudospiralis Trichinella_pseudospiralis Chain Trichinella_pseudospiralis (Trichinella_pseudospiralis) Chained Alignments Comparative Genomics chainTrichinella_zimbabwensis Trichinella_zimbabwensis Chain Trichinella_zimbabwensis (Trichinella_zimbabwensis) Chained Alignments Comparative Genomics chainTrichinella_murrelli Trichinella_murrelli Chain 
Trichinella_murrelli (Trichinella_murrelli) Chained Alignments Comparative Genomics chainPlectus_sambesii Plectus_sambesii Chain Plectus_sambesii (Plectus_sambesii) Chained Alignments Comparative Genomics chainHaeCon2 haeCon2 Chain Barber pole worm (Jul. 2013 (WormBase WS239/haeCon2)) Chained Alignments Comparative Genomics chainAngiostrongylus_cantonensis Angiostrongylus_cantonensis Chain Angiostrongylus_cantonensis (Angiostrongylus_cantonensis) Chained Alignments Comparative Genomics chainHaemonchus_contortus Haemonchus_contortus Chain Haemonchus_contortus (Haemonchus_contortus) Chained Alignments Comparative Genomics chainTeladorsagia_circumcincta Teladorsagia_circumcincta Chain Teladorsagia_circumcincta (Teladorsagia_circumcincta) Chained Alignments Comparative Genomics chainHeligmosomoides_polygyrus_bakeri Heligmosomoides_polygyrus_bakeri Chain Heligmosomoides_polygyrus_bakeri (Heligmosomoides_polygyrus_bakeri) Chained Alignments Comparative Genomics chainNippostrongylus_brasiliensis Nippostrongylus_brasiliensis Chain Nippostrongylus_brasiliensis (Nippostrongylus_brasiliensis) Chained Alignments Comparative Genomics chainDictyocaulus_viviparus Dictyocaulus_viviparus Chain Dictyocaulus_viviparus (Dictyocaulus_viviparus) Chained Alignments Comparative Genomics chainMelInc2 melInc2 Chain M. incognita (Feb. 2008 (M. incognita WS245/PRJEA28837/melInc2)) Chained Alignments Comparative Genomics chainMelHap1 melHap1 Chain M. hapla (Sep. 2008 (M. hapla VW9 WS210/melHap1)) Chained Alignments Comparative Genomics chainBurXyl1 burXyl1 Chain Pine wood nematode (Nov. 2011 (WS229/B. xylophilus Ka4C1/burXyl1)) Chained Alignments Comparative Genomics chainPanRed1 panRed1 Chain Microworm (Feb. 
2013 (WS240/Pred3/panRed1)) Chained Alignments Comparative Genomics chainHeterodera_glycines Heterodera_glycines Chain Heterodera_glycines (Heterodera_glycines) Chained Alignments Comparative Genomics chainGlobodera_pallida Globodera_pallida Chain Globodera_pallida (Globodera_pallida) Chained Alignments Comparative Genomics chainSteinernema_glaseri Steinernema_glaseri Chain Steinernema_glaseri (Steinernema_glaseri) Chained Alignments Comparative Genomics chainGlobodera_rostochiensis Globodera_rostochiensis Chain Globodera_rostochiensis (Globodera_rostochiensis) Chained Alignments Comparative Genomics chainGlobodera_ellingtonae Globodera_ellingtonae Chain Globodera_ellingtonae (Globodera_ellingtonae) Chained Alignments Comparative Genomics chainSubanguina_moxae Subanguina_moxae Chain Subanguina_moxae (Subanguina_moxae) Chained Alignments Comparative Genomics chainSteinernema_feltiae Steinernema_feltiae Chain Steinernema_feltiae (Steinernema_feltiae) Chained Alignments Comparative Genomics chainSteinernema_scapterisci Steinernema_scapterisci Chain Steinernema_scapterisci (Steinernema_scapterisci) Chained Alignments Comparative Genomics chainSteinernema_carpocapsae Steinernema_carpocapsae Chain Steinernema_carpocapsae (Steinernema_carpocapsae) Chained Alignments Comparative Genomics chainMeloidogyne_floridensis Meloidogyne_floridensis Chain Meloidogyne_floridensis (Meloidogyne_floridensis) Chained Alignments Comparative Genomics chainSteinernema_monticolum Steinernema_monticolum Chain Steinernema_monticolum (Steinernema_monticolum) Chained Alignments Comparative Genomics chainBursaphelenchus_xylophilus Bursaphelenchus_xylophilus Chain Bursaphelenchus_xylophilus (Bursaphelenchus_xylophilus) Chained Alignments Comparative Genomics chainDitylenchus_destructor Ditylenchus_destructor Chain Ditylenchus_destructor (Ditylenchus_destructor) Chained Alignments Comparative Genomics chainRotylenchulus_reniformis Rotylenchulus_reniformis Chain Rotylenchulus_reniformis 
(Rotylenchulus_reniformis) Chained Alignments Comparative Genomics chainMeloidogyne_graminicola Meloidogyne_graminicola Chain Meloidogyne_graminicola (Meloidogyne_graminicola) Chained Alignments Comparative Genomics chainRhabditophanes_KR3021 Rhabditophanes_KR3021 Chain Rhabditophanes_KR3021 (Rhabditophanes_KR3021) Chained Alignments Comparative Genomics chainAcrobeloides_nanus Acrobeloides_nanus Chain Acrobeloides_nanus (Acrobeloides_nanus) Chained Alignments Comparative Genomics chainMeloidogyne_javanica Meloidogyne_javanica Chain Meloidogyne_javanica (Meloidogyne_javanica) Chained Alignments Comparative Genomics chainMeloidogyne_incognita Meloidogyne_incognita Chain Meloidogyne_incognita (Meloidogyne_incognita) Chained Alignments Comparative Genomics chainParastrongyloides_trichosuri Parastrongyloides_trichosuri Chain Parastrongyloides_trichosuri (Parastrongyloides_trichosuri) Chained Alignments Comparative Genomics chainStrongyloides_papillosus Strongyloides_papillosus Chain Strongyloides_papillosus (Strongyloides_papillosus) Chained Alignments Comparative Genomics chainMeloidogyne_arenaria Meloidogyne_arenaria Chain Meloidogyne_arenaria (Meloidogyne_arenaria) Chained Alignments Comparative Genomics chainStrongyloides_venezuelensis Strongyloides_venezuelensis Chain Strongyloides_venezuelensis (Strongyloides_venezuelensis) Chained Alignments Comparative Genomics chainStrongyloides_stercoralis Strongyloides_stercoralis Chain Strongyloides_stercoralis (Strongyloides_stercoralis) Chained Alignments Comparative Genomics chainStrRat2 strRat2 Chain Threadworm (Sep. 2014 (S. ratti ED321/strRat2)) Chained Alignments Comparative Genomics chainElaeophora_elaphi Elaeophora_elaphi Chain Elaeophora_elaphi (Elaeophora_elaphi) Chained Alignments Comparative Genomics chainDirImm1 dirImm1 Chain Dog heartworm (Sep. 2013 (WS240/D. immitis v2.2/dirImm1)) Chained Alignments Comparative Genomics chainLoaLoa1 loaLoa1 Chain Eye worm (Jul. 
2012 (WS235/L_loa_Cameroon_isolate/loaLoa1)) Chained Alignments Comparative Genomics chainBruMal2 bruMal2 Chain Filarial worm (May. 2014 (WS244/B_malayi-3.1/bruMal2)) Chained Alignments Comparative Genomics chainOncVol1 oncVol1 Chain O. volvulus (Nov. 2013 (WS241/O_volvulus_Cameroon_v3/oncVol1)) Chained Alignments Comparative Genomics chainAscSuu1 ascSuu1 Chain Pig roundworm (Sep. 2012 (WS229/AscSuum_1.0/ascSuu1)) Chained Alignments Comparative Genomics chainToxocara_canis Toxocara_canis Chain Toxocara_canis (Toxocara_canis) Chained Alignments Comparative Genomics chainParascaris_univalens Parascaris_univalens Chain Parascaris_univalens (Parascaris_univalens) Chained Alignments Comparative Genomics chainAscaris_suum Ascaris_suum Chain Ascaris_suum (Ascaris_suum) Chained Alignments Comparative Genomics chainSetaria_equina Setaria_equina Chain Setaria_equina (Setaria_equina) Chained Alignments Comparative Genomics chainOnchocerca_flexuosa Onchocerca_flexuosa Chain Onchocerca_flexuosa (Onchocerca_flexuosa) Chained Alignments Comparative Genomics chainSetaria_digitata Setaria_digitata Chain Setaria_digitata (Setaria_digitata) Chained Alignments Comparative Genomics chainLoa_loa Loa_loa Chain Loa_loa (Loa_loa) Chained Alignments Comparative Genomics chainBrugia_malayi Brugia_malayi Chain Brugia_malayi (Brugia_malayi) Chained Alignments Comparative Genomics chainOnchocerca_ochengi Onchocerca_ochengi Chain Onchocerca_ochengi (Onchocerca_ochengi) Chained Alignments Comparative Genomics chainBrugia_pahangi Brugia_pahangi Chain Brugia_pahangi (Brugia_pahangi) Chained Alignments Comparative Genomics chainDirofilaria_immitis Dirofilaria_immitis Chain Dirofilaria_immitis (Dirofilaria_immitis) Chained Alignments Comparative Genomics chainWuchereria_bancrofti Wuchereria_bancrofti Chain Wuchereria_bancrofti (Wuchereria_bancrofti) Chained Alignments Comparative Genomics chainOnchocerca_volvulus Onchocerca_volvulus Chain Onchocerca_volvulus (Onchocerca_volvulus) Chained Alignments 
Comparative Genomics chainPriExs1 priExs1 Chain P. exspectatus (Mar. 2014 (WS243/P_exspectatus_v1/priExs1)) Chained Alignments Comparative Genomics chainPriPac3 priPac3 Chain P. pacificus (Aug. 2014 (WS221/P_pacificus-v2/priPac3)) Chained Alignments Comparative Genomics chainOscheius_MCB Oscheius_MCB Chain Oscheius_MCB (Oscheius_MCB) Chained Alignments Comparative Genomics chainNecAme1 necAme1 Chain N. americanus (Dec. 2013 (WS242/N_americanus_v1/necAme1)) Chained Alignments Comparative Genomics chainHetBac1 hetBac1 Chain H. bacteriophora/m31e (Aug. 2011 (WS229/H. bacteriophora 7.0/hetBac1)) Chained Alignments Comparative Genomics chainPristionchus_exspectatus Pristionchus_exspectatus Chain Pristionchus_exspectatus (Pristionchus_exspectatus) Chained Alignments Comparative Genomics chainPristionchus_entomophagus Pristionchus_entomophagus Chain Pristionchus_entomophagus (Pristionchus_entomophagus) Chained Alignments Comparative Genomics chainPristionchus_maxplancki Pristionchus_maxplancki Chain Pristionchus_maxplancki (Pristionchus_maxplancki) Chained Alignments Comparative Genomics chainPristionchus_pacificus Pristionchus_pacificus Chain Pristionchus_pacificus (Pristionchus_pacificus) Chained Alignments Comparative Genomics chainPristionchus_arcanus Pristionchus_arcanus Chain Pristionchus_arcanus (Pristionchus_arcanus) Chained Alignments Comparative Genomics chainOesophagostomum_dentatum Oesophagostomum_dentatum Chain Oesophagostomum_dentatum (Oesophagostomum_dentatum) Chained Alignments Comparative Genomics chainAncylostoma_duodenale Ancylostoma_duodenale Chain Ancylostoma_duodenale (Ancylostoma_duodenale) Chained Alignments Comparative Genomics chainOscheius_TEL_2014 Oscheius_TEL_2014 Chain Oscheius_TEL_2014 (Oscheius_TEL_2014) Chained Alignments Comparative Genomics chainNecator_americanus Necator_americanus Chain Necator_americanus (Necator_americanus) Chained Alignments Comparative Genomics chainParapristionchus_giblindavisi Parapristionchus_giblindavisi Chain 
Parapristionchus_giblindavisi (Parapristionchus_giblindavisi) Chained Alignments Comparative Genomics chainAncCey1 ancCey1 Chain A. ceylanicum (Mar. 2014 (WS243/Acey_2013.11.30.genDNA/ancCey1)) Chained Alignments Comparative Genomics chainOscheius_tipulae Oscheius_tipulae Chain Oscheius_tipulae (Oscheius_tipulae) Chained Alignments Comparative Genomics chainAncylostoma_caninum Ancylostoma_caninum Chain Ancylostoma_caninum (Ancylostoma_caninum) Chained Alignments Comparative Genomics chainC_sp21_LS_2015 C_sp21_LS_2015 Chain C_sp21_LS_2015 (C_sp21_LS_2015) Chained Alignments Comparative Genomics chainDiploscapter_pachys Diploscapter_pachys Chain Diploscapter_pachys (Diploscapter_pachys) Chained Alignments Comparative Genomics chainDiploscapter_coronatus Diploscapter_coronatus Chain Diploscapter_coronatus (Diploscapter_coronatus) Chained Alignments Comparative Genomics chainC_sp32_LS_2015 C_sp32_LS_2015 Chain C_sp32_LS_2015 (C_sp32_LS_2015) Chained Alignments Comparative Genomics chainCb4 cb4 Chain C. briggsae (Apr. 2011 (WS225/cb4)) Chained Alignments Comparative Genomics chainCaeAng2 caeAng2 Chain C. angaria (Apr. 2012 (WS232/ps1010rel8/caeAng2)) Chained Alignments Comparative Genomics chainCaeSp51 caeSp51 Chain C. sp. 5 ju800 (Jan. 2012 (WS230/Caenorhabditis_sp_5-JU800-1.0/caeSp51)) Chained Alignments Comparative Genomics chainC_sp31_LS_2015 C_sp31_LS_2015 Chain C_sp31_LS_2015 (C_sp31_LS_2015) Chained Alignments Comparative Genomics chainC_sp38_MB_2015 C_sp38_MB_2015 Chain C_sp38_MB_2015 (C_sp38_MB_2015) Chained Alignments Comparative Genomics chainC_sp39_LS_2015 C_sp39_LS_2015 Chain C_sp39_LS_2015 (C_sp39_LS_2015) Chained Alignments Comparative Genomics chainCaeJap4 caeJap4 Chain C. japonica (Aug. 
2010 (WUSTL 7.0.1/caeJap4)) Chained Alignments Comparative Genomics chainC_sp40_LS_2015 C_sp40_LS_2015 Chain C_sp40_LS_2015 (C_sp40_LS_2015) Chained Alignments Comparative Genomics chainC_sp34_TK_2017 C_sp34_TK_2017 Chain C_sp34_TK_2017 (C_sp34_TK_2017) Chained Alignments Comparative Genomics chainCaeSp111 caeSp111 Chain C. tropicalis (Nov. 2010 (WS226/WUSTL 3.0.1/caeSp111)) Chained Alignments Comparative Genomics chainC_latens C_latens Chain C_latens (C_latens) Chained Alignments Comparative Genomics chainCaePb3 caePb3 Chain C. brenneri (Nov. 2010 (C. brenneri 6.0.1b/caePb3)) Chained Alignments Comparative Genomics chainC_sp26_LS_2015 C_sp26_LS_2015 Chain C_sp26_LS_2015 (C_sp26_LS_2015) Chained Alignments Comparative Genomics chainC_briggsae C_briggsae Chain C_briggsae (C_briggsae) Chained Alignments Comparative Genomics chainCaeRem4 caeRem4 Chain C. remanei (Jul. 2007 (WS220/caeRem4)) Chained Alignments Comparative Genomics chainC_nigoni C_nigoni Chain C_nigoni (C_nigoni) Chained Alignments Comparative Genomics xenoMrna Other mRNAs Non-C. elegans mRNAs from GenBank mRNA and EST Description This track displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank from organisms other than C. elegans. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) for this track is in two parts. The first + indicates the orientation of the query sequence whose translated protein produced the match (here always 5' to 3', hence +). The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --, nor is +- the same as -+. 
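The point about strand combinations can be illustrated with a minimal Python sketch. It is not part of the browser or of blat; the toy sequence and the helper names are assumptions made for illustration. It translates a short DNA string in the first reading frame of each orientation, using the standard genetic code, and shows that the two orientations yield different peptides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite-strand sequence, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate the first reading frame; trailing partial codons are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


toy = "ATGGCCAAA"  # hypothetical sequence, for illustration only
plus_peptide = translate(toy)
minus_peptide = translate(reverse_complement(toy))
print(plus_peptide, minus_peptide)
```

Because each orientation of the query and of the genomic sequence produces its own translation, a match reported as ++ compares a different pair of peptides than one reported as --.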
The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display. For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Methods The mRNAs were aligned against the C. elegans genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept. 
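The retention rule in the Methods paragraph above can be sketched as follows. This is a hedged illustration, not the UCSC implementation: the function name, the tuple layout, and the reading of "within 1% of the best" as one percentage point of identity are all assumptions.

```python
def keep_near_best(alignments, near_best=1.0, min_identity=25.0):
    """Keep alignments close to the best one for a single mRNA.

    alignments: list of (label, percent_identity) tuples (hypothetical layout).
    near_best: allowed gap to the best identity, in percentage points (assumed).
    min_identity: absolute floor on percent identity with the genome.
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [
        (label, identity)
        for label, identity in alignments
        if identity >= best - near_best and identity >= min_identity
    ]


# Hypothetical placements of one mRNA on the genome.
hits = [("chrI:1000", 92.0), ("chrII:500", 91.4), ("chrV:42", 80.0)]
kept = keep_near_best(hits)
print(kept)
```

With these made-up values the first two placements are kept and the third is dropped because it falls more than one point below the best identity.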
Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoRefGene Other RefSeq Non-C. elegans RefSeq Genes Genes and Gene Predictions Description This track shows known protein-coding and non-protein-coding genes for organisms other than C. elegans, taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods The RNAs were aligned against the C. 
elegans genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 25% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 ucscToRefSeq RefSeq Acc RefSeq Accession Mapping and Sequencing Description This track associates UCSC Genome Browser chromosome names to accession identifiers from the NCBI Reference Sequence Database (RefSeq). The data were downloaded from the NCBI assembly database. Credits The data for this track was prepared by Hiram Clawson. simpleRepeat Simple Repeats Simple Tandem Repeats by TRF Variation and Repeats Description This track displays simple tandem repeats (possibly imperfect repeats) located by Tandem Repeats Finder (TRF) which is specialized for this purpose. These repeats can occur within coding regions of genes and may be quite polymorphic. Repeat expansions are sometimes associated with specific diseases. Methods For more information about the TRF program, see Benson (1999). Credits TRF was written by Gary Benson. References Benson G. 
Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 uniprot UniProt UniProt SwissProt/TrEMBL Protein Annotations Genes and Gene Predictions Description This track shows protein sequences and annotations on them from the UniProt/SwissProt database, mapped to genomic coordinates. UniProt/SwissProt data has been curated from scientific publications by the UniProt staff; UniProt/TrEMBL data has been predicted by various computational algorithms. The annotations are divided into multiple subtracks, based on their "feature type" in UniProt. The first two subtracks below (one for SwissProt, one for TrEMBL) show the alignments of protein sequences to the genome; all other tracks below are the protein annotations mapped through these alignments to the genome.
Track Name: Description
UCSC Alignment, SwissProt = curated protein sequences: Protein sequences from SwissProt mapped to the genome. All other tracks are (start,end) SwissProt annotations on these sequences mapped through this alignment. Even protein sequences without a single curated annotation (splice isoforms) are visible in this track. Each UniProt protein has one main isoform, which is colored dark blue. Alternative isoforms are sequences that do not have annotations on them and are colored light blue. They can be hidden with the TrEMBL/Isoform filter (see below).
UCSC Alignment, TrEMBL = predicted protein sequences: Protein sequences from TrEMBL mapped to the genome. All other tracks below are (start,end) TrEMBL annotations mapped to the genome using this track. This track is hidden by default. To show it, click its checkbox on the track configuration page.
UniProt Signal Peptides: Regions found in proteins destined to be secreted, generally cleaved from mature protein.
UniProt Extracellular Domains: Protein domains with the comment "Extracellular".
UniProt Transmembrane Domains: Protein domains of the type "Transmembrane".
UniProt Cytoplasmic Domains: Protein domains with the comment "Cytoplasmic".
UniProt Polypeptide Chains: Polypeptide chain in mature protein after post-processing.
UniProt Regions of Interest: Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process.
UniProt Domains: Protein domains, zinc finger regions and topological domains.
UniProt Disulfide Bonds: Disulfide bonds.
UniProt Amino Acid Modifications: Glycosylation sites, modified residues and lipid moiety-binding regions.
UniProt Amino Acid Mutations: Mutagenesis sites and sequence variants.
UniProt Protein Primary/Secondary Structure Annotations: Beta strands, helices, coiled-coil regions and turns.
UniProt Sequence Conflicts: Differences between GenBank sequences and the UniProt sequence.
UniProt Repeats: Regions of repeated sequence motifs or repeated domains.
UniProt Other Annotations: All other annotations, e.g. compositional bias.
For consistency and convenience for users of mutation-related tracks, the subtrack "UniProt/SwissProt Variants" is a copy of the track "UniProt Variants" in the track group "Phenotype and Literature", or "Variation and Repeats", depending on the assembly. Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse over will show the full name of the UniProt disease acronym.
The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. Features in the "UniProt Modifications" (modified residues) track are drawn in light green. Disulfide bonds are shown in dark grey. Topological domains are shown in maroon and zinc finger regions in olive green. Duplicate annotations are removed as far as possible: if a TrEMBL annotation has the same genome position and same feature type, comment, disease and mutated amino acids as a SwissProt annotation, it is not shown again. Two annotations mapped through different protein sequence alignments but with the same genome coordinates are only shown once. On the configuration page of this track, you can choose to hide any TrEMBL annotations. This filter will also hide the UniProt alternative isoform protein sequences because both types of information are less relevant to most users. Please contact us if you want more detailed filtering features. Note that for the human hg38 assembly and SwissProt annotations, there also is a public track hub prepared by UniProt itself, with genome annotations maintained by UniProt using their own mapping method based on those Gencode/Ensembl gene models that are annotated in UniProt for a given protein. For proteins that differ from the genome, UniProt's mapping method will, in most cases, map a protein and its annotations to an unexpected location (see below for details on UCSC's mapping method). Methods Briefly, UniProt protein sequences were aligned to the transcripts associated with the protein, the top-scoring alignments were retained, and the result was projected to the genome through a transcript-to-genome alignment. Depending on the genome, the transcript-genome alignment was either provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or derived from the transcripts (Ensembl/Augustus).
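The projection from protein to genome described above can be pictured with a small Python sketch. It is a simplification under stated assumptions, not the pslMap algorithm: it assumes an ungapped protein-to-transcript alignment, a plus-strand transcript, and 0-based half-open coordinates, and every name and number in it is hypothetical.

```python
def protein_to_genome(aa_start, aa_end, cds_offset, exons):
    """Project a protein interval onto genomic blocks via a transcript.

    aa_start, aa_end: residue interval on the protein (0-based, half-open).
    cds_offset: transcript position where the coding sequence starts (assumed known).
    exons: genomic (start, end) blocks of the transcript, in transcript order.
    """
    # Each residue covers three transcript bases.
    tx_start = cds_offset + 3 * aa_start
    tx_end = cds_offset + 3 * aa_end
    blocks = []
    tx_pos = 0  # transcript coordinate of the current exon's first base
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        lo = max(tx_start, tx_pos)
        hi = min(tx_end, tx_pos + exon_len)
        if lo < hi:  # the annotation overlaps this exon
            blocks.append((g_start + lo - tx_pos, g_start + hi - tx_pos))
        tx_pos += exon_len
    return blocks


# Hypothetical two-exon transcript with a 4-base 5' UTR.
blocks = protein_to_genome(0, 3, cds_offset=4, exons=[(100, 110), (200, 220)])
print(blocks)
```

In this toy case the three residues span an exon junction, so the annotation is drawn as two genomic blocks that together cover nine bases.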
The transcript set is NCBI RefSeq for hg38, UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus are tried, in this order. The resulting protein-genome alignments of this process are available in the file formats for liftOver or pslMap from our data archive (see "Data Access" section below). An important step of the mapping process protein -> transcript -> genome is filtering the alignment from protein to transcript. Due to differences between the UniProt proteins and the transcripts (proteins were made many years before the transcripts were made, and human genomes have variants), the transcript with the highest BLAST score when aligning the protein to all transcripts is not always the correct transcript for a protein sequence. Therefore, the protein sequence is aligned to only a very short list of one or sometimes more transcripts, selected by a three-step procedure: Use transcripts directly annotated by UniProt: for organisms that have a RefSeq transcript track, proteins are aligned to the RefSeq transcripts that are annotated by UniProt for this particular protein. Use transcripts for NCBI Gene ID annotated by UniProt: If no transcripts are annotated on the protein, or the annotated ones have been deprecated by NCBI, but a NCBI Gene ID is annotated, the RefSeq transcripts for this Gene ID are used. This can result in multiple matching transcripts for a protein. Use best matching transcript: If no NCBI Gene is annotated, then BLAST scores are used to pick the transcripts. There can be multiple transcripts for one protein, as their coding sequences can be identical. All transcripts within 1% of the highest observed BLAST score are used. For strategy 2 and 3, many of the transcripts found do not differ in coding sequence, so the resulting alignments on the genome will be identical. Therefore, any identical alignments are removed in a final filtering step. 
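The three-step procedure above can be summarized in a short Python sketch. This is an illustration under assumptions, not the code used to build the track: the function, its arguments, the accession strings, and the way the 1% cutoff is applied to BLAST scores are all hypothetical.

```python
def select_transcripts(annotated, gene_id, gene_to_transcripts,
                       blast_scores, known_transcripts, near_top=0.01):
    """Pick the transcripts a protein should be aligned to.

    annotated: transcript IDs that UniProt lists for the protein.
    gene_id: NCBI Gene ID annotated by UniProt, or None.
    gene_to_transcripts: dict mapping Gene ID to transcript IDs.
    blast_scores: dict mapping transcript ID to score against all transcripts.
    known_transcripts: IDs present in the transcript track in use.
    """
    # Step 1: transcripts directly annotated by UniProt and still available.
    direct = [t for t in annotated if t in known_transcripts]
    if direct:
        return direct
    # Step 2: transcripts of the annotated NCBI Gene ID.
    if gene_id is not None:
        by_gene = [t for t in gene_to_transcripts.get(gene_id, [])
                   if t in known_transcripts]
        if by_gene:
            return by_gene
    # Step 3: best-scoring transcripts, within 1% of the top score.
    if not blast_scores:
        return []
    top = max(blast_scores.values())
    return sorted(t for t, s in blast_scores.items() if s >= top * (1 - near_top))


known = {"NM_1", "NM_2", "NM_3"}  # made-up accessions
step1 = select_transcripts(["NM_1"], "g1", {"g1": ["NM_2"]}, {}, known)
step2 = select_transcripts(["NM_9"], "g1", {"g1": ["NM_2", "NM_3"]}, {}, known)
step3 = select_transcripts([], None, {}, {"NM_1": 1000, "NM_2": 995, "NM_3": 900}, known)
print(step1, step2, step3)
```

Steps 2 and 3 can return several transcripts whose coding sequences are identical, which is why the text describes a final pass that removes identical genome alignments.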
The details page of these alignments will contain a list of all transcripts that result in the same protein-genome alignment. On hg38, only a handful of edge cases (pseudogenes, very recently added proteins) remain in 2023 where strategy 3 has to be used. In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a protein sequence to the correct transcript, we use a three stage process: If UniProt has annotated a given RefSeq transcript for a given protein sequence, the protein is aligned to this transcript. Any difference in the version suffix is tolerated in this comparison. If no transcript is annotated or the transcript cannot be found in the NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a set of NCBI RefSeq transcript IDs via the most current version of NCBI genes tables. Only the top match of the resulting alignments and all others within 1% of its score are used for the mapping. If no transcript can be found after step (2), the protein is aligned to all transcripts, the top match, and all others within 1% of its score are used. This system was designed to resolve the problem of incorrect mappings of proteins, mostly on hg38, due to differences between the SwissProt sequences and the genome reference sequence, which has changed since the proteins were defined. The problem is most pronounced for gene families composed of either very repetitive or very similar proteins. To make sure that the alignments always go to the best chromosome location, all _alt and _fix reference patch sequences are ignored for the alignment, so the patches are entirely free of UniProt annotations. Please contact us if you have feedback on this process or example edge cases. We are not aware of a way to evaluate the results completely and in an automated manner. 
Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome positions with pslMap and filtered again with pslReps. UniProt annotations were obtained from the UniProt XML file. The UniProt annotations were then mapped to the genome through the alignment described above using the pslMap program. This approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Older releases This track is automatically updated on an ongoing basis, every 2-3 months. The current version name is always shown on the track details page; it includes the release of UniProt, the version of the transcript set and a unique MD5 that is based on the protein sequences, the transcript sequences, the mapping file between both and the transcript-genome alignment. The exact transcript that was used for the alignment is shown when clicking a protein alignment in one of the two alignment tracks. For reproducibility of older analysis results and for manual inspection, previous versions of this track are available for browsing in the form of the UCSC UniProt Archive Track Hub. The underlying data of all releases of this track (past and current) can be obtained from our downloads server, including the UniProt protein-to-genome alignment. Data Access The raw data of the current track can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system.
Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example (ce11 chromosomes are named with Roman numerals, e.g. chrI):
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/ce11/uniprot/unipStruct.bb -chrom=chrI -start=0 -end=1000000 stdout
Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Lifting from UniProt to genome coordinates in pipelines To facilitate mapping protein coordinates to the genome, we provide the alignment files in formats that are suitable for our command line tools. Our command line programs liftOver or pslMap can be used to map coordinates on protein sequences to genome coordinates. The filenames are unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap). Example commands:
wget -q https://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/uniprot/2022_03/unipToGenome.over.chain.gz
wget -q https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chmod a+x liftOver
echo 'Q99697 1 10 annotationOnProtein' > prot.bed
liftOver prot.bed unipToGenome.over.chain.gz genome.bed
cat genome.bed
Credits This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipConflict Seq.
Conflicts UniProt Sequence Conflicts Genes and Gene Predictions unipRepeat Repeats UniProt Repeats Genes and Gene Predictions unipStruct Structure UniProt Protein Primary/Secondary Structure Annotations Genes and Gene Predictions unipOther Other Annot. UniProt Other Annotations Genes and Gene Predictions unipMut Mutations UniProt Amino Acid Mutations Genes and Gene Predictions unipModif AA Modifications UniProt Amino Acid Modifications Genes and Gene Predictions unipDomain Domains UniProt Domains Genes and Gene Predictions unipDisulfBond Disulf. Bonds UniProt Disulfide Bonds Genes and Gene Predictions unipChain Chains UniProt Mature Protein Products (Polypeptide Chains) Genes and Gene Predictions unipLocCytopl Cytoplasmic UniProt Cytoplasmic Domains Genes and Gene Predictions unipLocTransMemb Transmembrane UniProt Transmembrane Domains Genes and Gene Predictions unipInterest Interest UniProt Regions of Interest Genes and Gene Predictions unipLocExtra Extracellular UniProt Extracellular Domain Genes and Gene Predictions unipLocSignal Signal Peptide UniProt Signal Peptides Genes and Gene Predictions unipAliTrembl TrEMBL Aln. UCSC alignment of TrEMBL proteins to genome Genes and Gene Predictions unipAliSwissprot SwissProt Aln. UCSC alignment of SwissProt proteins to genome (dark blue: main isoform, light blue: alternative isoforms) Genes and Gene Predictions spMut UniProt Variants UniProt/SwissProt Amino Acid Substitutions Variation and Repeats Description NOTE: This track is intended for use primarily by physicians and other professionals concerned with genetic disorders, by genetics researchers, and by advanced students in science and medicine. While the genome browser database is open to the public, users seeking information about a personal medical or genetic condition are urged to consult with a qualified physician for diagnosis and for answers to personal questions. 
This track shows the genomic positions of natural and artificial amino acid variants in the UniProt/SwissProt database. The data has been curated from scientific publications by the UniProt staff. Display Conventions and Configuration Genomic locations of UniProt/SwissProt variants are labeled with the amino acid change at a given position and, if known, the abbreviated disease name. A "?" is used if there is no disease annotated at this location, but the protein is described as being linked to only a single disease in UniProt. Mouse over a mutation to see the UniProt comments. Artificially-introduced mutations are colored green and naturally-occurring variants are colored red. For full information about a particular variant, click the "UniProt variant" linkout. The "UniProt record" linkout lists all variants of a particular protein sequence. The "Source articles" linkout lists the articles in PubMed that originally described the variant(s) and were used as evidence by the UniProt curators. Methods UniProt sequences were aligned to RefSeq sequences first with BLAT, then lifted to genome positions with pslMap. UniProt variants were parsed from the UniProt XML file. The variants were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. The complete script is part of the kent source tree and is located in src/hg/utils/uniprotMutations. Data Access The raw data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The underlying data file for this track is called spMut.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system.
Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example (ce11 chromosomes are named with Roman numerals, e.g. chrI):
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/ce11/bbi/uniprot/spMut.bb -chrom=chrI -start=0 -end=1000000 stdout
Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler, with advice from Mark Diekhans and Brian Raney. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 windowmaskerSdust WM + SDust Genomic Intervals Masked by WindowMasker + SDust Variation and Repeats Description This track depicts masked sequence as determined by WindowMasker. The WindowMasker tool is included in the NCBI C++ toolkit. The source code for the entire toolkit is available from the NCBI FTP site. Methods To create this track, WindowMasker was run with the following parameters:
windowmasker -mk_counts true -input ce11.fa -output wm_counts
windowmasker -ustat wm_counts -sdust true -input ce11.fa -output repeats.bed
The repeats.bed (BED3) file was loaded into the "windowmaskerSdust" table for this track. References Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006 Jan 15;22(2):134-41. PMID: 16287941 rmsk RepeatMasker Repeating Elements by RepeatMasker Variation and Repeats Description This track was created by using Arian Smit's RepeatMasker program which screens DNA sequences for interspersed repeats and low complexity DNA sequences.
The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track) as well as a modified version of the query sequence in which all the annotated repeats have been masked (generally available on the Downloads page). RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka, J. (2000) in the References section below.

Display Conventions and Configuration

In full display mode, this track displays up to ten different classes of repeats:

  Short interspersed nuclear elements (SINE), which include ALUs
  Long interspersed nuclear elements (LINE)
  Long terminal repeat elements (LTR), which include retroposons
  DNA repeat elements (DNA)
  Simple repeats (micro-satellites)
  Low complexity repeats
  Satellite repeats
  RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA)
  Other repeats, which includes class RC (Rolling Circle)
  Unknown

The level of color shading in the graphical display reflects the amount of base mismatch, base deletion, and base insertion associated with a repeat element. The higher the combined number of these, the lighter the shading.

Methods

UCSC has used the most current versions of the RepeatMasker software and repeat libraries available to generate these data. Note that these versions may be newer than those that are publicly available on the Internet.

Data are generated using the RepeatMasker -s flag. Additional flags may be used for certain organisms. Repeats are soft-masked. Alignments may extend through repeats, but are not permitted to initiate in them. See the FAQ for more information.

Credits

Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track.

References

Repbase Update is described in Jurka J. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420.

chainSelf Self Chain C.
elegans Chained Self Alignments Variation and Repeats

Description

This track shows alignments of the C. elegans genome with itself, using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. The system can also tolerate gaps in both sets of sequence simultaneously. After filtering out the "trivial" alignments produced when identical locations of the genome map to one another (e.g. chrN mapping to chrN), the remaining alignments point out areas of duplication within the C. elegans genome.

The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the query assembly or an insertion in the target assembly. Double lines represent more complex gaps that involve substantial sequence in both the query and target assemblies. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one of the assemblies. In cases where multiple chains align over a particular region of the C. elegans genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes.

In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment.

Display Conventions and Configuration

By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome.

To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chrIV) in the box next to: Filter by chromosome.

Methods

The genome was aligned to itself using blastz.
Trivial alignments were filtered out, and the remaining alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single target chromosome and a single query chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. Chains scoring below a threshold were discarded; the remaining chains are displayed in this track.

Credits

Blastz was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.

Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program.

The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler.

The browser display and database storage of the chains were generated by Robert Baertsch and Jim Kent.

References

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961
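
Illustrative sketch of the chaining step

The Self Chain Methods section describes a dynamic program that finds maximally scoring chains of gapless blocks. The Python sketch below illustrates that idea only and is not the axtChain implementation. It replaces the kd-tree with a plain quadratic scan over candidate predecessors, and the Block class, the gap_cost function, and its 0.5 weight are hypothetical placeholders rather than the gap scoring system used for this track.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    """A gapless aligned block (hypothetical type, for illustration only)."""
    t_start: int   # start on the target sequence
    q_start: int   # start on the query sequence
    size: int      # block length, assumed > 0
    score: float   # alignment score of the block

    @property
    def t_end(self) -> int:
        return self.t_start + self.size

    @property
    def q_end(self) -> int:
        return self.q_start + self.size


def gap_cost(dt: int, dq: int) -> float:
    """Placeholder penalty, linear in the gap on both sequences.

    Assumption: this is NOT the axtChain gap scoring system.
    """
    return 0.5 * (dt + dq)


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return (score, chain) for the highest-scoring colinear chain."""
    if not blocks:
        return 0.0, []
    # A valid predecessor ends before the current block starts on the
    # target, so sorting by target start puts it earlier in the list.
    ordered = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in ordered]   # best chain score ending at i
    prev = [-1] * len(ordered)          # back-pointer for traceback
    for i, cur in enumerate(ordered):
        for j in range(i):
            p = ordered[j]
            # p may precede cur only if it ends first on both sequences.
            if p.t_end <= cur.t_start and p.q_end <= cur.q_start:
                cand = best[j] + cur.score - gap_cost(
                    cur.t_start - p.t_end, cur.q_start - p.q_end)
                if cand > best[i]:
                    best[i] = cand
                    prev[i] = j
    # Trace back from the best-scoring end block.
    k = max(range(len(ordered)), key=best.__getitem__)
    top = best[k]
    chain: List[Block] = []
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return top, chain
```

Calling best_chain on the blocks from one target and query chromosome pair returns a single best chain. A fuller pipeline would extract further chains from the remaining blocks and discard any chain scoring below a chosen threshold, as the Methods section describes.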