Description

LiftOver alignments are used to map annotations from one human assembly to another one. The subtracks of this track show two different types of liftOver: the first type was created by the T2T consortium using the minimap2 aligner and strong filters; it maps CHM13 coordinates to the human assemblies hg19 and hg38. The second type was created by the UCSC Genome Browser Group using the lastz aligner and relaxed filters; it maps CHM13 coordinates to hg19, hg38 and HG002 maternal and HG002 paternal. (HG002 is one of the first Human Pangenome Reference Consortium (HPRC) assemblies and both haplotypes are available for this genome.)

The approaches of the two pipelines are different: T2T used the minimap2 aligner which outputs long alignments that do not require "chaining" of alignments into longer ones, then removed alignments that go to other chromosomes and removed all alignments to alternate haplotypes, fixes (corrections to the assembly) and unplaced contig sequences. UCSC used the lastz aligner with their chains/net pipeline (see below), and from these kept all possible liftOver alignments, without a chromosome or sequence type filter.

This means that the T2T alignments are tuned for high specificity, where as the UCSC alignments are much more sensitive. The T2T alignments are probably best used for mapping annotations to hg38 in automated pipelines and in cases where the final processing on hg38 does not use alts/fixes/unplaced sequences and when one wants to be sure that annotations that are mapped are as reliable as possible. The UCSC alignments are probably best used for manual inspection, when one is wondering "is there anything that could map to this in hg38?" or when trying to get an idea where a particular sequence in CHM13 could come from or when the final analysis is using the entire hg38 assembly, including alts/fixes/unplaced sequences.

Here are some examples to illustrate the differences:

Example 1: The acrocentric arms of chromosomes 13, 14, 15, 21, 22 and Y where not sequenced in hg38 at all but they are present in CHM13. The T2T liftOver shows that little is mappable there, as the sequence is entirely new. The UCSC liftOver pipeline shows a lot of alignable pieces in them, because these repeats and rDNA sequences have similar copies in hg38. It depends on the particular analysis which approach is more appropriate.
A note on genes: As the link above shows, even though the sequence is new, the T2T group mapped hg38 Gencode 35 gene models into these regions using CAT/LiftOff. This is because CAT and liftOff are using approaches for their lifting of genes / mapping that are not based on the liftOver alignments but sequence homology, like the UCSC liftOver pipeline.
Example 2: The ABO gene in CHM13 is a longer haplotype than in hg38. In the T2T liftOver alignments, it looks as if the sequence was entirely missing in hg38. The UCSC liftOver alignments show that this sequence has been patched post-release into hg38 via chr9_KN196479v1_fix. Analyses ignoring alts/fixes/unplaced sequences should use the T2T alignments here.

Also, we created dot plots from these alignments as an experiment to further demonstrate the differences betweem them:

Display Conventions

The track displays boxes joined together by either single or double lines, with the boxes represent aligning regions, single lines indicating gaps that are largely due to a deletion in the CHM13 v2.0 assembly or an insertion in the GRCh38 or GRCh37, and double lines representing more complex gaps that involve substantial sequence in both assembly.

LiftOver chain file downloads

One-to-one liftOver chain files to and from GRCh38/hg38 and GRCh37/hg19 are available here:

T2T CHM13 v2.0 to GRCh38/hg38 chm13v2-hg38.over.chain.gz
T2T CHM13 v2.0 to GRCh37/hg19 chm13v2-hg19.over.chain.gz
GRCh38/hg38 to T2T CHM13 v2.0 hg38-chm13v2.over.chain.gz
GRCh37/hg19 to T2T CHM13 v2.0 hg19-chm13v2.over.chain.gz

The mask file for GRCh38/hg38 is hg38.liftover-mask.bed.

Methods

T2T GRCh38/hg38 pre-processing

To prevent ambiguous alignments, all false duplications, as determined by the Genome in a Bottle Consortium (GCA_000001405.15_GRCh38_GRC_exclusions_T2Tv2.bed), as well as the GRCh38 modeled centromeres, were masked from the GRCh38/hg38 primary assembly. In addition, unlocalized and unplaced (random) contigs were removed.

T2T GRCh37/hg19 pre-processing

Unlocalized and unplaced (random) contigs were removed from the GRCh37/hg19 assembly.

T2T Alignment and Chain Creation

For the minimap2-based pipeline, the initial chain file was generated using nf-LO v1.5.1 with minimap2 v2.24 alignments. These chains were then split at all locations that contained unaligned segments greater than 1kbp or gaps greater than 10kbp. Split chain files were then converted to PAF format with extended CIGAR strings using chaintools (https://doi.org/10.5281/zenodo.6342391, v0.1), and alignments between nonhomologous chromosomes were removed. The trim-paf operation of rustybam (https://zenodo.org/record/6342176, v0.1.29) was next used to remove overlapping alignments in the query sequence, and then the target sequence, to create 1:1 alignments. PAF alignments were converted back to the chain format with paf2chain commit f68eeca, and finally, chaintools was used to generate the inverted chain file.

Full commands with parameters used were:


    nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
    python chaintools/src/split.py -c input.chain -o input-split.chain
    python chaintools/src/to_paf.py -c input-split.chain -t target.fa -q query.fa -o input-split.paf
    awk '$1==$6' input-split.paf | rb break-paf --max-size 10000  | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
    paf2chain -i out.paf > out.chain
    python chaintools/src/invert.py -c out.chain -o out_inverted.chain

The above process does not add chain ids or scores. The UCSC utilities chainMergeSort and chainScore are used to update the chains:


    chainMergeSort out.chain | chainScore stdin chm13v2.0.2bit hg38.2bit chm13v2.0-hg38.chain
    chainMergeSort out_inverted.chain | chainScore stdin hg38.2bit chm13v2.0.2bit hg38-chm13v2.0.chain

Rustybam trim-paf uses dynamic programming and the CIGAR string to find an optimal splitting point between overlapping alignments in the query sequence. It starts its trimming with the largest overlap and then recursively trims smaller overlaps.

Results were validated by using chaintools to confirm that there were no overlapping sequences with respect to both CHM13v2.0 and GRCh38 in the released chain file. In addition, trimmed alignments were visually inspected with SafFire to confirm their quality.

UCSC pipeline methods

The UCSC pipeline is driven by DoBlastzChainNet.pl, a script used for all current human assemblies and hundreds of other genome comparisons. It has been developed over 20 years and goes through these steps: Alignment and chaining of alignments: The query genome was aligned to target genome with lastz. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single query chromosome and a single target chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks.

Netting of alignments: Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.

LiftOver alignment extraction: Finally, the chains referred to by the top-level nets are used to create a liftOver alignment file using netChainSubset and chainStitchId.

Credits

The T2T v1_nflo liftOver chains were generated by Nae-Chyun Chen<naechyun.chen@gmail.com> and Mitchell Vollger<mvollger@uw.edu>. The UCSC liftOver chains and the dot-plots were created by Hiram Clawson.

lastz was developed by Robert Harris, Pennsylvania State University.

The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler.

The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent.

The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent.

References

Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.

Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961