Description

This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]).

Required metadata are:

Analysis is performed on public Galaxy servers using only open-source tools, orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore. It includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignment. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC Genome Browser tracks. The project website has more information about the available results data.

Display Conventions and Configuration

Track structure

The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest (current) quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update.
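
The rolling quarter definition can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the half-open interval convention, and the clamping of day-of-month values (e.g. 31 May minus 3 months becoming 28 February) are assumptions. Note that the end date displayed on the current quarter track is the most recent sample collection date, which may be earlier than the update day used here.

```python
from datetime import date
import calendar


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month
    (an assumption; the source does not specify this edge case).
    """
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def quarter_windows(update_day: date, n_quarters: int = 4):
    """List (start, end) pairs, most recent quarter first.

    Each window is treated as start-inclusive and end-exclusive
    (hypothetical convention for this sketch).
    """
    windows = []
    for i in range(n_quarters):
        end = months_before(update_day, 3 * i)
        start = months_before(update_day, 3 * (i + 1))
        windows.append((start, end))
    return windows


if __name__ == "__main__":
    for start, end in quarter_windows(date(2022, 5, 31)):
        print(start.isoformat(), "to", end.isoformat())
```

Because the windows are anchored to the update day rather than to calendar quarters, the same sample can fall into a different subtrack after the next update.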

Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter.

Together, the tracks can be used to explore how the dominant lineages (and their associated mutation patterns) change over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations.

Mutation feature display

To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter. Lineage-defining mutations should therefore be displayed in dark grey or black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey.
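
One way such frequency-based shading could be derived is by converting the within-lineage frequency into a BED-style score in the range 0 to 1000, which the genome browser can render as shades of grey. The sketch below assumes a simple linear mapping; the function name and the mapping itself are illustrative assumptions, since the track's actual scoring scheme is not described here.

```python
def shade_score(samples_with_mutation: int, samples_in_lineage: int) -> int:
    """Map an observed within-lineage frequency to a 0-1000 score.

    Higher scores correspond to darker shading. The linear mapping is
    an assumption made for this sketch.
    """
    if samples_in_lineage <= 0:
        raise ValueError("samples_in_lineage must be positive")
    if not 0 <= samples_with_mutation <= samples_in_lineage:
        raise ValueError("samples_with_mutation must lie in [0, samples_in_lineage]")
    return round(1000 * samples_with_mutation / samples_in_lineage)


if __name__ == "__main__":
    # A mutation seen in nearly all samples of a lineage scores high (dark) ...
    print(shade_score(980, 1000))
    # ... while a rare, possibly emerging or artefactual one scores low (light).
    print(shade_score(12, 1000))
```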

Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base that is changed by the mutation. Deletions are rendered thick along their whole length, while insertions are rendered thin throughout.
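
The SNV layout described above can be illustrated with a small coordinate calculation. This sketch assumes a single contiguous coding sequence read in the forward direction without frameshifts, and it returns 0-based, half-open BED-style coordinates; the function name and the returned field layout are hypothetical. The example uses the spike gene start in the NC_045512.2 reference (position 21563) and the A23403G substitution behind S:D614G.

```python
def snv_feature_coords(pos: int, cds_start: int) -> dict:
    """Compute codon-wide and base-precise extents for an SNV.

    `pos` and `cds_start` are 1-based genome positions. The result uses
    0-based, half-open coordinates: chromStart/chromEnd span the whole
    codon (thin part), thickStart/thickEnd span only the changed base.
    """
    offset = pos - cds_start
    if offset < 0:
        raise ValueError("pos lies upstream of cds_start")
    codon_index = offset // 3  # 0-based index of the affected codon
    codon_start0 = (cds_start - 1) + 3 * codon_index
    return {
        "chromStart": codon_start0,
        "chromEnd": codon_start0 + 3,
        "thickStart": pos - 1,
        "thickEnd": pos,
        "aa_position": codon_index + 1,  # 1-based amino acid position
    }


if __name__ == "__main__":
    # A23403G in the spike gene changes the middle base of codon 614.
    print(snv_feature_coords(pos=23403, cds_start=21563))
```

Genes translated with a programmed ribosomal frameshift, such as ORF1ab, would need the frame offset handled separately and are outside the scope of this sketch.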

Mutation details

Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular:

Filtering Mutations

Mutation features displayed in each subtrack can be filtered by

Methods

For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data are processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, and variant calling and annotation, and which produces a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch.

Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked roughly weekly for newly finished analysis histories. When new histories are found:

  1. those histories are made publicly accessible on their server
  2. batch information, i.e., samples analyzed and their metadata, links to the histories, etc. are added to
    ftp://xfer13.crg.eu/gx-surveillance.json
  3. pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed
  4. the genome browser tracks get recalculated by
    1. parsing all analyzed data on the FTP server
    2. determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date
    3. extracting all mutations seen in each quarter for each of the five top lineages in that quarter
    4. rebuilding the bigBed files and track files
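
The lineage ranking and mutation extraction in the track recalculation step can be sketched as follows for the samples of a single quarter. This is only an illustration under stated assumptions: the per-sample record layout (`lineage` and `mutations` keys) and the function names are hypothetical and do not reflect the actual structure of the data on the FTP server.

```python
from collections import Counter, defaultdict


def top_lineages(samples, n=5):
    """Return the `n` most frequently observed lineages among `samples`."""
    counts = Counter(sample["lineage"] for sample in samples)
    return [lineage for lineage, _ in counts.most_common(n)]


def mutation_frequencies(samples, lineages):
    """For each selected lineage, compute the fraction of its samples
    in which each mutation was observed."""
    totals = Counter()
    seen = defaultdict(Counter)
    for sample in samples:
        lineage = sample["lineage"]
        if lineage not in lineages:
            continue
        totals[lineage] += 1
        # Count each mutation at most once per sample.
        seen[lineage].update(set(sample["mutations"]))
    return {
        lineage: {mut: n / totals[lineage] for mut, n in seen[lineage].items()}
        for lineage in lineages
        if totals[lineage]
    }


if __name__ == "__main__":
    # Hypothetical records for one quarter.
    quarter_samples = [
        {"lineage": "BA.2", "mutations": ["S:D614G", "S:N501Y"]},
        {"lineage": "BA.2", "mutations": ["S:D614G"]},
        {"lineage": "BA.1", "mutations": ["S:D614G"]},
    ]
    best = top_lineages(quarter_samples, n=5)
    print(best)
    print(mutation_frequencies(quarter_samples, best))
```

Running the two functions once per quarter window yields the per-lineage mutation sets from which the track files would then be rebuilt.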

Credits

The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.

The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world.

For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel.

The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular we would like to thank:

References

  1. Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643
  2. Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1