Description
TOGA2
(Tool to infer Orthologs from Genome Alignments 2) [1]
is the next-generation version of the original TOGA method [2].
TOGA2 is a homology-based method that integrates gene annotation with
orthology inference and the classification of genes as intact or lost.
Methods
Like TOGA, TOGA2 takes as input the gene annotation of a well-annotated reference species and
a pairwise whole genome alignment (alignment chains) between the reference and query genome.
Orthologous genomic loci are inferred primarily from alignments of intronic
and intergenic regions, using machine learning to distinguish
orthologous loci from paralogous or processed pseudogene loci.
To annotate genes, CESAR 2.0 [3] is used to determine the positions and boundaries of coding exons of a
reference transcript in the orthologous genomic locus in the query species.
TOGA2 differs from TOGA1 in the following major aspects:
- It introduces an exon-wise annotation procedure that leverages exon-level orthology. This increases
annotation accuracy, especially for very short exons, and reduces memory usage and runtime.
- It leverages pre-computed deep learning-based splice site predictions generated by SpliceAI [4]
to identify the correct exon boundaries with higher precision. These splice site predictions
also enable TOGA2 to handle evolutionary changes in exon–intron structure, including splice site
shifts, intron deletions, and “exonization of introns”.
- A new gene tree–based reconciliation step refines orthology inference and identifies additional
1:1 orthologs.
- It not only identifies coding exons but also predicts untranslated exons and exonic regions.
Reference species used by TOGA2
For placental mammals, TOGA2 uses as references
- human (hg38 assembly)
- mouse (mm10 assembly)
- cow (HLbosTau10=GCF_002263795.3 assembly)
- elephant (HLeleMaxInd3A=GCF_024166365.1 assembly)
For birds, TOGA2 uses as references
- chicken (HLgalGal7=GCF_016699485.2 assembly)
- crow (HLcorHaw3=GCF_020740725.1 assembly)
- zebrafinch (HLtaeGut5=GCF_003957565.2 assembly)
- kittiwake (HLrisTri2=GCF_028500815.1 assembly)
- emu (HLdroNov3=GCF_036370855.1 assembly)
Display Conventions and Configuration
Each annotated transcript is named after the reference transcript, the gene symbol, and the chain identifier: transcriptID#geneID#chainID.
Transcripts ending with #retro are retrogene candidates (processed pseudogenes retaining an intact reading frame).
Transcripts ending with #paralog are classified as paralogous by TOGA2’s machine learning classifier; they are only annotated if the respective query locus does not have an orthologous projection.
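The naming convention above can be unpacked programmatically. The following is a minimal sketch, not part of TOGA2 itself: the function and field names are hypothetical, the example identifiers are made up, and it assumes that a #retro or #paralog tag either replaces or follows the chain identifier (the text does not say which).

```python
from typing import NamedTuple, Optional

# Suffix tags named in the text. How they combine with the chain ID
# is an assumption of this sketch, so both layouts are accepted.
SPECIAL_TAGS = {"retro", "paralog"}


class ProjectionName(NamedTuple):
    transcript_id: str
    gene_id: Optional[str]
    chain_id: Optional[str]
    tag: Optional[str]  # "retro", "paralog", or None


def parse_projection_name(name: str) -> ProjectionName:
    """Split a transcriptID#geneID#chainID name into its parts."""
    fields = name.split("#")
    tag = None
    # A trailing "retro" or "paralog" field is treated as a tag,
    # not as a chain identifier.
    if len(fields) > 1 and fields[-1] in SPECIAL_TAGS:
        tag = fields.pop()
    transcript_id = fields[0]
    gene_id = fields[1] if len(fields) > 1 else None
    chain_id = fields[2] if len(fields) > 2 else None
    return ProjectionName(transcript_id, gene_id, chain_id, tag)


# Hypothetical identifiers, for illustration only.
print(parse_projection_name("ENST00000000001#GENE1#42"))
print(parse_projection_name("ENST00000000001#GENE1#retro"))
```

Splitting on "#" is safe only if the transcript and gene identifiers themselves never contain that character, which is assumed here.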
Each annotated transcript is shown in a color-coded classification as
- "fully intact": This status is new in TOGA2 and indicates that the
projection has a completely intact reading frame, without any inactivating mutations. These transcripts likely
encode functional proteins.
- "intact": the middle 80% of the CDS
(coding sequence) is present and exhibits no gene-inactivating mutation. However, mutations can be
present in the N- or C-terminal 10% of the reading frame and potentially indicate alterations in the
protein's termini. These transcripts likely encode functional proteins.
- "partially intact": >50% of the CDS
is present in the query genome and the middle 80% of the CDS exhibits no
inactivating mutation. These transcripts may also encode functional
proteins, but the evidence is weaker as parts of the CDS are missing,
often due to assembly gaps.
- "missing": <50% of the CDS is present
in the query and the middle 80% of the CDS exhibits no inactivating
mutation. There is currently no evidence for transcript loss; however, the uncertainty is higher
as more than half of the CDS is missing. Note that missing transcripts can also arise if no genome alignment
chain spans the transcript.
- "uncertain loss": there is at least one
inactivating mutation in the middle 80% of the CDS, but the evidence is not
strong enough to classify the transcript as lost. These transcripts may
or may not encode a functional protein.
- "lost": typically several inactivating
mutations are present, so there is strong evidence that the transcript
is unlikely to encode a functional protein.
- "paralogous": Special category. The transcript is classified as paralogous
by TOGA2’s machine learning classifier; such transcripts are only retained if the respective query locus does not have
an orthologous projection. Transcripts in this color have enough inactivating mutations or missing sequence
that their loss status is "missing" or "deleted".
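The decision logic behind these categories can be summarized in a short sketch. This is an illustration of the rules as described above, not TOGA2's actual implementation: the function name and parameters are hypothetical, "fully intact" is assumed to require the complete CDS, exactly 50% coverage (which the text leaves undefined) is arbitrarily assigned to "missing", and the criterion for strong loss evidence is not specified here, so the caller supplies it as a flag. The "paralogous" category is not modeled.

```python
def classify_projection(cds_fraction_present: float,
                        middle_present: bool,
                        middle_mutations: int,
                        terminal_mutations: int,
                        strong_loss_evidence: bool = False) -> str:
    """Assign a status following the category descriptions in the text.

    cds_fraction_present: fraction of the CDS found in the query (0 to 1).
    middle_present:       whether the middle 80% of the CDS is present.
    middle_mutations:     inactivating mutations in the middle 80%.
    terminal_mutations:   inactivating mutations in the terminal 10% regions.
    strong_loss_evidence: caller-supplied flag (assumption of this sketch).
    """
    if not 0.0 <= cds_fraction_present <= 1.0:
        raise ValueError("cds_fraction_present must be between 0 and 1")

    # Any inactivating mutation in the middle 80% points toward loss.
    if middle_mutations > 0:
        return "lost" if strong_loss_evidence else "uncertain loss"

    # From here on, the middle 80% carries no inactivating mutation.
    if cds_fraction_present == 1.0 and terminal_mutations == 0:
        return "fully intact"  # assumes the complete CDS is required
    if middle_present:
        return "intact"
    if cds_fraction_present > 0.5:
        return "partially intact"
    return "missing"  # includes exactly 50%, an arbitrary choice


print(classify_projection(0.7, False, 0, 0))
print(classify_projection(0.9, True, 1, 0))
```

Checking for mutations in the middle 80% first mirrors the text, where every non-loss category requires that region to be free of inactivating mutations.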
Clicking on a transcript provides additional information about the orthology
classification, inactivating mutations, the query's nucleotide/protein sequence, and protein/exon
alignments.
Credits
These data were prepared by Michael Hiller's lab.
References
The TOGA2 software is available from github.com/hillerlab/TOGA2
[1] Malovichko Y, Bein B, Hilgers L, Stephens A, Yi X, Stadager T, Hoppach L, Koch L, Maschiner M, Hiller M. TOGA2 improves speed and accuracy of comparative gene annotation and orthology inference. In preparation
[2] Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M. Integrating gene annotation with orthology inference at scale. Science. 2023 Apr 28;380(6643):eabn3107. PMID: 37104600; PMC: PMC10193443
[3] Sharma V, Schwede P, Hiller M. CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Bioinformatics. 2017 Dec 15;33(24):3985-3987. PMID: 28961744
[4] Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB, Chow ED, Kanterakis E, Gao H, Kia A, Batzoglou S, Sanders SJ, Farh KK-H. Predicting splicing from primary sequence with deep learning. Cell. 2019 Jan 24;176(3):535-548.e24. PMID: 30661751