Introduction ^^^^^^^^^^^^ The Dec. 2013 assembly of the human genome (GRCh38 Genome Reference Consortium Human Reference 38), is called hg38 at UCSC. This directory contains the genome as released by UCSC, selected annotation files and updates. The directory "genes/" contains GTF/GFF files for the main gene transcript sets. For more information about this assembly, see these NCBI resources: http://www.ncbi.nlm.nih.gov/genome/51 http://www.ncbi.nlm.nih.gov/genome/assembly/883148 http://www.ncbi.nlm.nih.gov/bioproject/31257 These files are used by the UCSC Genome Browser for display and analysis. If you want to do analysis and show it later on the browser, it is usually easiest to run your analysis on the UCSC hg38 file. For most users, this will be the file "latest/hg38.fa.gz" in this directory. However, if you need a genome file for alignment or variant calling, please read the section "Analysis set" below. The sequences of the main chromosomes are identical to the genome files distributed by NCBI and the EBI, but the sequence names are different. For example, the name of chromosome 1 is called "chr1" at UCSC, "NC_000001.11" at NCBI, and "1" at the EBI. Also, the lowercasing in the files is not exactly identical, as UCSC, NCBI and EBI run Repeatmasker with slightly different settings. The NCBI accession of the UCSC hg38 genome is GCA_000001405.15. The version that includes the updates for patch release 14 GRCh38.p14 has the NCBI accession GCA_000001405.29. Analysis set ^^^^^^^^^^^^ The GRCh38 assembly contains more than just the chromosome sequences, but also a mitochondrial genome, unplaced sequences, centromeric sequences and alternates. To better capture variation in the human genome across the world it contains more copies of some loci than hg19. Some of these additions, like the EBV genome, are mostly relevant for genomic analysis, i.e. alignment. For an overview of the different types and reasons for the additions see https://software.broadinstitute.org/gatk/documentation/article?id=11010 This means that if you want to use the genome sequence for alignment and especially for variant calling, you should use the optimal genome file for your aligner. The genome file can make a big difference, especially for variant calling. In most cases, the authors of your alignment program will provide advice on which hg38 genome version to use and usually they recommend one of the files in our analysisSet/ directory, like the GATK link above. These special genome files sometimes remove the alternate sequences, sometimes they add decoys or change single nucleotides towards the major allele, but they never insert or delete sequences, so the annotation coordinates remain the same. - for BWA see also https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use - for Novoalign see its manual at http://www.novocraft.com/userfiles/file/Novocraft.pdf - For Bowtie, see the different versions of the human genome that the Bowtie authors provide: http://bowtie-bio.sourceforge.net/index.shtml Also see analysisSet/README.txt for further details Patches ^^^^^^^ Like hg19, hg38 has been updated with patches since its release in 2013. GRC patch releases do not change any previously existing sequences; they simply add small, new sequences for fix patches or alternate haplotypes that correspond to specific regions of the main chromosome sequences (see below). For most users, the patches are unlikely to make a difference and may complicate the analysis as they introduce more duplication. If you want a version of the genome without these complexities, look at the analysisSet/ subdirectory. The initial/ subdirectory contains files for the initial release of GRCh38, which includes the original alternate sequences (261) and no fix sequences. The p11/ subdirectory contains files for GRCh38.p11 (patch release 11). The p12/ subdirectory contains files for GRCh38.p12 (patch release 12). The p13/ subdirectory contains files for GRCh38.p13 (patch release 13). The p14/ subdirectory contains files for GRCh38.p14 (patch release 14). The "latest/" symbolic link points to the subdirectory for the most recent patch version. hg38.* files in this directory are the same as files in the initial/ subdirectory, i.e. they are from the initial GRCh38 release and do not include the patch sequences that are now included in the Genome Browser. (The recently added hg38.gc5Base.* files are an exception to the rule; they do include patch sequences.) Sequence names ^^^^^^^^^^^^^^ For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and NCBI calls "NC_000067.6". The sequences are identical though. To map between UCSC, Ensembl and NCBI names, use our table "chromAlias", available via our Table Browser or as file: https://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/chromAlias.txt.gz We also provide a Python command line tool to convert sequence names in the most common genomics file formats: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc During genome assembly, reads are assembled into "contigs" (a few kbp long), which are then joined into longer "scaffolds" of a few hundred kbp. These are finally placed, often manually e.g. with FISH assays, onto chromosomes. As a result, the hg38 genome sequence files contains different types of sequences: Chromosomes: - made from scaffolds placed onto chromosome locations, 95% of the genome file - format: chr{chromosome number or name} - e.g. chr1 or chrX, chrM for the mitochondrial genome. Unlocalized scaffolds: - a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome. - format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_random - e.g. chr17_GL000205v2_random Unplaced scaffolds: - a sequence found in an assembly that is not associated with any chromosome. - format: chrUn_{sequence_accession}v{sequence_version} - e.g. chrUn_GL000220v1 Alternate loci scaffolds: - a scaffold that provides an alternate representation of a locus found in the primary assembly. These sequences do not represent a complete chromosome sequence although there is no hard limit on the size of the alternate locus; currently these are less than 1 Mb. These could either be NOVEL patch sequences, added through patch releases, or present in the initial assembly release. - format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_alt - e.g. chr6_GL000250v2_alt Fix loci scaffolds: - a patch that corrects sequence or reduces an assembly gap in a given major release. FIX patch sequences are meant to be incorporated into the primary or existing alt-loci assembly units at the next major release. - these sequences are not part of the files in the initial/ directory - format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_fix - e.g. chr2_KN538362v1_fix Files ^^^^^ Files in this directory reflect the initial 2013 release of the genome, the most current versions are in the "latest/" subdirectory: hg38.fa.gz - "Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. (again, the most current version of this file is latest/hg38.fa.gz) hg38.2bit - contains the complete human/hg38 genome sequence in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. The utility program, twoBitToFa (available from the kent src tree), can be used to extract .fa file(s) from this file. A pre-compiled version of the command line tool can be found at: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ See also: http://genome.ucsc.edu/admin/git.html http://genome.ucsc.edu/admin/jk-install.html hg38.agp.gz - Description of how the assembly was generated from fragments. hg38.chromFa.tar.gz - The assembly sequence in one file per chromosome. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. hg38.chromFaMasked.tar.gz - The assembly sequence in one file per chromosome. Repeats are masked by capital Ns; non-repeating sequence is shown in upper case. hg38.fa.masked.gz - "Hard-masked" assembly sequence in one file. Repeats are masked by capital Ns; non-repeating sequence is shown in upper case. hg38.fa.out.gz - RepeatMasker .out file. RepeatMasker was run with the -s (sensitive) setting. June 20 2013 (open-4-0-3) version of RepeatMasker RepBase library: RELEASE 20130422 hg38.fa.align.gz - RepeatMasker .align file. RepeatMasker was run with the -s (sensitive) setting. June 20 2013 (open-4-0-3) version of RepeatMasker RepBase library: RELEASE 20130422 hg38.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats with period less than or equal to 12, and translated into UCSC's BED format. md5sum.txt - checksums of files in this directory mrna.fa.gz - Human mRNA from GenBank. This sequence data is updated regularly via automatic GenBank updates. refMrna.fa.gz - RefSeq mRNA from the same species as the genome. This sequence data is updated regularly via automatic GenBank updates. upstream1000.fa.gz - Sequences 1000 bases upstream of annotated transcription starts of RefSeq genes with annotated 5' UTRs. This file is updated regularly. It might be slightly out of sync with the RefSeq data shown on the browser, as is it updated daily for most assemblies. upstream2000.fa.gz - Same as upstream1000, but 2000 bases. upstream5000.fa.gz - Same as upstream1000, but 5000 bases. xenoMrna.fa.gz - GenBank mRNAs from species other than that of the genome. This sequence data is updated regularly via automatic GenBank updates. hg38.chrom.sizes - Two-column tab-separated text file containing assembly sequence names and sizes. hg38.gc5Base.wigVarStep.gz - ascii data wiggle variable step values used - to construct the GC Percent track hg38.gc5Base.bw - binary bigWig data for the gc5Base track. hg38.chromAlias.txt - sequence name alias file, one line for each sequence name. First column is sequence name followed by tab separated alias names. hg38.chromAlias.bb - bigBed file for alias sequence names, one line for each sequence name. The first three columns are the sequence in BED format, followed by tab-separated alias names. The .bb file is used by bedToBigBed as a URL to avoid having to download the entire chromAlias.txt file. From the usage message: -sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed to be a chromAlias bigBed file or a URL to a such a file (see above). More documentation is found here: https://genomewiki.ucsc.edu/index.php?title=Chrom_Alias Dropped in Genbank and Refseq official releases patch14 since these 2 old versions are obsolete and no longer needed. This patch contains their v2 replacements. chr11_KQ759759v1_fix chr22_KQ759762v1_fix Dropped in Refseq official release Patch14. These 3 are contamination or obsolete. chr10_KI270825v1_alt chr22_KI270734v1_random chr11_KI270721v1_random Because of the difficulty of removing the old chroms chr11_KQ759759v1_fix and chr22_KQ759762v1_fix from all of the database tables and bigData files, custom tracks, and hubs, we are not dropping them from the UCSC hg38 patch 14 .2bit and chromInfo. However, we have dropped them from chromAlias to accord with the Genbank and Refseq official releases for patch14. Dropped in Patch13 from Refseq chrUn_KI270752v1 HSCHRUN_RANDOM_CTG29 KI270752.1 KI270752.1 is no longer part of the RefSeq assembly because it is hamster sequence derived from the human-hamster CHO cell line. https://www.ncbi.nlm.nih.gov/grc/human/issues/HG-2587 ------------------------------------------------------------------ How to Download ^^^^^^^^^^^^^^^ If you plan to download a large file or multiple files from this directory, we recommend that you use ftp rather than downloading the files via our website. To do so, ftp to hgdownload.cse.ucsc.edu [username: anonymous, password: your email address], then cd to the directory goldenPath/hg38/bigZips. To download multiple files, use the "mget" command: mget ... - or - mget -a (to download all the files in the directory) Alternate methods to ftp access. Using an rsync command to download the entire directory: rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/ . For a single file, e.g. chromFa.tar.gz rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz . Or with wget, all files: wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/*' With wget, a single file: wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz' -O chromFa.tar.gz To unpack the *.tar.gz files: tar xvzf .tar.gz To uncompress the fa.gz files: gunzip .fa.gz GenBank Data Usage ^^^^^^^^^^^^^^^^^^ The GenBank database is designed to provide and encourage access within the scientific community to the most up to date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank. -----------------------------------------------------------------------------