This directory contains phylogenetic trees of fully public SARS-CoV-2 sequences
and associated files, updated daily:

* public-latest.all.masked.pb[.gz]
  Protobuf file for use with usher --load-mutation-annotated-tree

* public-latest.all.masked.vcf.gz
  Variant Call Format (VCF) file containing mutations in public sequences,
  generated from public-latest.all.masked.pb with matUtils extract.
  Missing or ambiguous bases have been imputed by usher to the most parsimonious
  base value [ACGT] at the time each sequence was placed in the tree.

* public-latest.all.nwk.gz
  Newick tree file (usher's uncondensed-final-tree.nh output)

* public-latest.metadata.tsv.gz
  Information about each public sequence, e.g. collection date, location,
  Nextstrain clade and Pango lineage.  Dates and locations are not available
  for some sequences.

* public-latest.version.txt
  A brief description including date, sources and number of sequences.
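
As a rough illustration of working with the metadata file, the Python sketch
below tallies the values found in one column of a gzipped TSV.  The function
name count_column_values is hypothetical, and no column names are assumed here:
inspect the header line of the downloaded file to find the actual names.

```python
import csv
import gzip
from collections import Counter


def count_column_values(path, column):
    """Count how often each value appears in one column of a gzipped TSV."""
    counts = Counter()
    # Text mode ("rt") lets the csv module read the decompressed lines directly.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not in header {reader.fieldnames}")
        for row in reader:
            # Missing or empty cells are tallied under the empty string,
            # since dates and locations are not available for every sequence.
            counts[row[column] or ""] += 1
    return counts
```

Calling count_column_values("public-latest.metadata.tsv.gz", NAME), where NAME
is a column taken from the file's header, returns a Counter whose most_common()
method lists the most frequent values first.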

Previous versions of the files are available in year/month/day directories:
  http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2021/

The trees encoded in the Newick and protobuf files are derived from releases of
Rob Lanfear's sarscov2phylo (https://github.com/roblanf/sarscov2phylo), pruned
to include only public sequences aggregated from GenBank, COG-UK, and the
China National Center for Bioinformation, mapped to GISAID EPI_ISL_* IDs used
in the sarscov2phylo tree files.  The trees have also been re-rooted to
Wuhan/Hu-1 (GenBank MN908947.3, RefSeq NC_045512.2), and nodes with no
associated mutations have been collapsed.  Sequences released after the final
sarscov2phylo release (Nov. 13, 2020) were added to the tree using UShER.
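
To make the collapsing step concrete, here is a minimal Python sketch on a toy
tree.  It is not the code used to build these files: the Node class, the
function name and the mutation strings are all hypothetical, and the sketch
assumes that only internal nodes are collapsed while leaves and the root are
kept.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def collapse_empty_internal_nodes(node):
    """Remove internal descendants of `node` that carry no mutations.

    Children of a removed node are attached to its nearest kept ancestor.
    Leaves are always kept, and `node` itself (the root) is never removed.
    """
    new_children = []
    for child in node.children:
        collapse_empty_internal_nodes(child)
        if child.children and not child.mutations:
            # Internal node with no mutations: promote its children.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


# Toy example: x and y are internal nodes without mutations.
y = Node("y", [], [Node("b"), Node("c", ["C2T"])])
x = Node("x", [], [Node("a", ["A1T"]), y])
z = Node("z", ["G3A"], [Node("d")])
root = collapse_empty_internal_nodes(Node("root", [], [x, z]))
print([child.name for child in root.children])  # ['a', 'b', 'c', 'z']
```

The recursion is fine for a toy tree, but a tree with millions of sequences
would need an iterative traversal to stay within Python's recursion limit.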

A file that maps GISAID EPI_ISL_* IDs to public sequence IDs may be downloaded
from
  https://github.com/CDCgov/SARS-CoV-2_Sequencing/blob/master/files/epiToPublic.tsv.gz
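
Once downloaded, a mapping file of this kind can be loaded into a dictionary
for lookups.  The Python sketch below assumes, without having verified it, that
the file is a tab-separated table with the EPI_ISL_* ID in the first column and
the public sequence ID in the second; the column positions are parameters so
that they can be adjusted, and the function name load_id_map is hypothetical.

```python
import csv
import gzip


def load_id_map(path, key_col=0, value_col=1):
    """Read a gzipped TSV into a dict mapping one column to another."""
    mapping = {}
    with gzip.open(path, "rt", newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            # Skip blank or short lines rather than failing on them.
            if len(fields) <= max(key_col, value_col):
                continue
            mapping[fields[key_col]] = fields[value_col]
    return mapping
```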

GenBank sequences and metadata may be downloaded from
  https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049

COG-UK sequences and metadata may be downloaded from the "Latest Sequence Data"
section of
  https://www.cogconsortium.uk/tools-analysis/public-data-analysis-2/

The China National Center for Bioinformation offers additional sequences and
metadata:
  https://bigd.big.ac.cn/ncov/release_genome

A more extensive archive of public sequence trees is available in year/month/day
directories (note that methods changed slightly over time and the trees may be
lower quality than the more recent trees on hgdownload.soe.ucsc.edu):
  https://hgwdev.gi.ucsc.edu/~angie/UShER_SARS-CoV-2/2020/
  https://hgwdev.gi.ucsc.edu/~angie/UShER_SARS-CoV-2/2021/

The DD-MM-YY release labels used in 2020/10/ and 2020/11/ subdirectories
correspond to sarscov2phylo releases:

  https://github.com/roblanf/sarscov2phylo/releases
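
Because DD-MM-YY labels are easy to misread as MM-DD-YY, a small Python sketch
for converting them to dates may help.  The function name parse_release_label
is hypothetical; the example label corresponds to the Nov. 13, 2020 release
mentioned above.

```python
from datetime import datetime


def parse_release_label(label):
    """Convert a DD-MM-YY release label such as '13-11-20' to a date."""
    return datetime.strptime(label, "%d-%m-%y").date()


print(parse_release_label("13-11-20"))  # 2020-11-13
```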

This work is made possible by the open sharing of genetic data by research
groups from all over the world.  We gratefully acknowledge the authors and the
originating laboratories where the clinical specimen or virus isolate was first
obtained and the submitting laboratories, where sequence data have been
generated and submitted to public databases, on which this research is based.

If you use usher with these files, please cite:

Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID.
Zenodo DOI: 10.5281/zenodo.3958883

Turakhia et al. (2021).  Ultrafast Sample Placement on Existing Trees (UShER)
Empowers Real-Time Phylogenetics for the SARS-CoV-2 Pandemic.
https://www.nature.com/articles/s41588-021-00862-7

Please also acknowledge the submitters of SARS-CoV-2 sequences to public
databases.

Name                              Last modified      Size  Description
Parent Directory                                     -
2021/                             2021-10-01 18:54   -
public-latest.all.masked.pb       2021-10-24 19:13   179M
public-latest.all.masked.pb.gz    2021-10-24 19:36   42M
public-latest.all.masked.vcf.gz   2021-10-24 19:19   149M
public-latest.all.nwk.gz          2021-10-24 19:36   29M
public-latest.metadata.tsv.gz     2021-10-24 19:34   29M
public-latest.version.txt         2021-10-24 19:34   126