This file is from:

This directory contains compressed multiple alignments of 44 virus sequences.
These 44 sequences represent coronavirus strains in bat populations.

The 'reference' sequence for this collection is the sequence:

  NC_045512v2 - 2019-12-30 - Wuhan-Hu-1

Description files in this directory:

  md5sum.txt - md5 sums to verify copied files
  wuhCor1.44way.nameList.txt - relates the accession name to the
                              sequence name and sample collection date

  wuhCor1.44way.nh - Phylogenetic tree used for the multiz alignment.
           The tree was computed from 31-mer frequency similarity: the
           resulting distance matrix was clustered by neighbor joining
           with the 'neighbor' command of the PHYLIP toolset:

  wuhCor1.multiz44way.maf.gz - alignments with gap annotation with
                                accession identifiers

  sequences/ - directory with files:

  sequences/dnaFasta44.tgz - gzipped tar file for the DNA fasta, 44 sequences

  sequences/proteinFasta44.tgz - gzipped tar file of the proteins as obtained
                     from the GenBank records, for example in:
                     one .faa.gz for each sequence

  sequences/proteinTab44.tgz - the same proteins arranged in single lines
                     of the format:

  sequenceName.proteinName<tab>amino acids . . .

  One file for each of the sequences.

  This format is convenient for extracting a protein of similar length
  from all the sequences.  For example, the longest protein (the ORF1ab
  polyprotein; the 'spike' protein is only about 1,300 AAs) is over 6000
  AAs.  After unpacking the tgz file into a directory:

  zcat * | awk -F$'\t' 'length($2) > 6000' \
     | awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > orf1abProtein.faa

  You can drop that orf1abProtein.faa file into a multiple aligner such
  as 'COBALT' to obtain a multiple alignment of that protein across the
  sequences that contain it.
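
  Choosing a length cutoff such as the one above is easier with a list
  of protein lengths in hand.  The following is a minimal sketch, not
  part of the distributed files: the function name protein_lengths is
  made up for illustration, and it assumes only the two-column
  name<tab>amino-acids layout described above.

```shell
# Hypothetical helper: read "name<tab>amino acids" lines on stdin and
# print "name<tab>length", longest protein first.
protein_lengths() {
  awk -F'\t' 'NF >= 2 { printf "%s\t%d\n", $1, length($2) }' \
    | sort -t "$(printf '\t')" -k2,2nr
}

# Possible use, run inside the directory where the tgz was unpacked:
#   zcat * | protein_lengths | head
```

  The first lines of the output show which proteins are long enough to
  pass a given length filter.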

For a description of multiple alignment format (MAF), see
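
In MAF, each alignment block starts with an "a" line and is followed by
one "s" line per aligned sequence, with the fields: source name, start,
size, strand, source length, and alignment text.  As a rough sketch
under that assumption, the function below (maf_source_counts, a name
invented here) counts how many blocks each source appears in and how
many of its bases are aligned in total.

```shell
# Hypothetical helper: summarize the "s" lines of a MAF stream on stdin.
# Output columns: source name, number of blocks, total aligned bases.
maf_source_counts() {
  awk '$1 == "s" { rows[$2]++; bases[$2] += $4 }
       END { for (src in rows)
               printf "%s\t%d\t%d\n", src, rows[src], bases[src] }' \
    | sort
}

# Possible use:
#   zcat wuhCor1.multiz44way.maf.gz | maf_source_counts
```

The size field counts bases without gap characters, so the third column
is the number of bases of each source that take part in the alignment.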

PhastCons conservation scores for these alignments are available at:

PhyloP conservation scores for these alignments are available at:

To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.

Via rsync:
rsync -avz --progress \
        rsync:// ./

Via FTP:
    user name: anonymous
    password: <your email address>
    go to the directory goldenPath/wuhCor1/multiz44way

To download multiple files from the UNIX command line, use the "mget" command.
    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)
Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.
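
After the files are copied, the md5sum.txt file listed above can be
used to check them.  The sketch below assumes GNU md5sum is installed
and that md5sum.txt is in the usual "checksum  filename" layout written
by md5sum itself; the function name verify_downloads is made up for
illustration.

```shell
# Hypothetical helper: check the files in directory $1 against the
# md5sum.txt stored in that same directory.  Returns non-zero if any
# listed file is missing or does not match its checksum.
verify_downloads() {
  ( cd "$1" && md5sum -c md5sum.txt )
}

# Possible use, from the directory holding the downloaded files:
#   verify_downloads .
```

If only some of the listed files were downloaded, GNU md5sum accepts
the --ignore-missing option together with -c to skip the absent ones.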

All the files in this directory are freely usable for any
purpose. For data use restrictions regarding the individual
genome assemblies, see

      Name                                Last modified      Size
      md5sum.txt                          2020-03-18 11:11   483
      sequences/                          2020-03-26 15:37   -
      wuhCor1.44way.descriptiveName.nh    2020-03-18 09:34   2.9K
      wuhCor1.44way.nameList.txt          2020-03-14 13:55   1.9K
      wuhCor1.44way.phyloDistance.txt     2020-03-18 10:43   1.8K
      wuhCor1.multiz44way.maf.gz          2020-03-17 13:02   334K