Description

TOGA2 (Tool to infer Orthologs from Genome Alignments 2) [1] is the next-generation version of the original TOGA method [2].
TOGA2 is a homology-based method that integrates gene annotation with orthology inference and classifies genes as intact or lost.

Methods

Like TOGA, TOGA2 takes as input the gene annotation of a well-annotated reference species and a pairwise whole-genome alignment (alignment chains) between the reference and query genomes. Orthologous genomic loci are inferred primarily from alignments of intronic and intergenic regions, using machine learning to distinguish orthologous loci from paralogous or processed pseudogene loci.

To annotate genes, CESAR 2.0 [3] is used to determine the positions and boundaries of coding exons of a reference transcript in the orthologous genomic locus in the query species.

TOGA2 differs from the original TOGA in the following major aspects:

  1. It introduces an exon-wise annotation procedure that leverages exon-level orthology. This increases annotation accuracy, especially for very short exons, and reduces memory usage and runtime.
  2. It leverages pre-computed deep learning-based splice site predictions generated by SpliceAI [4] to identify exon boundaries with higher precision. These splice site predictions also enable TOGA2 to handle evolutionary changes in exon–intron structure, including splice site shifts, intron deletions, and “exonization of introns”.
  3. A new gene tree–based reconciliation step refines orthology inference and identifies additional 1:1 orthologs.
  4. It not only identifies coding exons but also predicts untranslated exons and exonic regions.

Reference species used by TOGA2

For placental mammals, TOGA2 uses as references

For birds, TOGA2 uses as references

Display Conventions and Configuration

Each annotated transcript is named after the reference transcript, the gene symbol, and the chain identifier: transcriptID#geneID#chainID.
Transcripts ending with #retro are retrogene candidates (processed pseudogenes retaining an intact reading frame).
Transcripts ending with #paralog are classified as paralogous by TOGA2’s machine learning classifier; they are annotated only if the respective query locus does not have an orthologous projection.
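
For users who process the track items programmatically, the naming convention above can be split into its parts with a few lines of Python. This is only a minimal sketch and not part of TOGA2: the function and class names are hypothetical, the example identifiers are made up, and it assumes that "#" never occurs inside an identifier and that a #retro or #paralog suffix, when present, is always the last field.

```python
from typing import NamedTuple, Optional

# Suffixes mentioned in the track description; treated here as optional trailing fields.
SPECIAL_SUFFIXES = ("retro", "paralog")


class ProjectionName(NamedTuple):
    transcript_id: str
    gene_id: str
    chain_id: Optional[str]  # None if the name carries no chain field
    suffix: Optional[str]    # "retro", "paralog", or None


def parse_projection_name(name: str) -> ProjectionName:
    """Split a '#'-delimited transcript name into its fields.

    Assumption: '#' is used only as the field separator.
    """
    fields = name.split("#")
    suffix = None
    # Peel off a trailing "retro"/"paralog" marker before reading the other fields.
    if fields[-1] in SPECIAL_SUFFIXES:
        suffix = fields.pop()
    if not 2 <= len(fields) <= 3 or not all(fields):
        raise ValueError(f"unexpected transcript name: {name!r}")
    chain_id = fields[2] if len(fields) == 3 else None
    return ProjectionName(fields[0], fields[1], chain_id, suffix)


# Hypothetical example names, for illustration only.
print(parse_projection_name("ENST00000000001#GENE1#42"))
print(parse_projection_name("ENST00000000001#GENE1#42#paralog"))
```

The parser accepts names with or without a chain field before the suffix, because the description does not state which form the suffixed names take.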

Each annotated transcript is shown in a color-coded classification as

Clicking on a transcript provides additional information about the orthology classification, inactivating mutations, the query's nucleotide/protein sequence, and protein/exon alignments.

Credits

This data was prepared by Michael Hiller's lab.

References

The TOGA2 software is available from github.com/hillerlab/TOGA2

[1] Malovichko Y, Bein B, Hilgers L, Stephens A, Yi X, Stadager T, Hoppach L, Koch L, Maschiner M, Hiller M. TOGA2 improves speed and accuracy of comparative gene annotation and orthology inference. In preparation

[2] Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M. Integrating gene annotation with orthology inference at scale. Science. 2023 Apr 28;380(6643):eabn3107. PMID: 37104600; PMC: PMC10193443

[3] Sharma V, Schwede P, Hiller M. CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Bioinformatics. 2017 Dec 15;33(24):3985-3987. PMID: 28961744

[4] Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB, Chow ED, Kanterakis E, Gao H, Kia A, Batzoglou S, Sanders SJ, Farh KK-H. Predicting splicing from primary sequence with deep learning. Cell. 2019 Jan 24;176(3):535-548.e24. PMID: 30661751