The Variant Annotation Integrator (VAI) is a research tool for associating annotations from the UCSC database with your uploaded set of variant calls. It uses gene annotations to predict functional effects of variants on transcripts. For example, a variant might be located in the coding sequence of one transcript, but in the intron of an alternatively spliced transcript of the same gene; the VAI will return the predicted functional effect for each transcript. The VAI can optionally add several other types of relevant information: the dbSNP identifier if the variant is found in dbSNP, protein damage scores for missense variants from the Database of Non-synonymous Functional Predictions (dbNSFP), and conservation scores computed from multi-species alignments. The VAI can optionally filter results to retain only specific functional effect categories, variant properties and multi-species conservation status.
NOTE:
The VAI is only a research tool, meant to be used by those who have been
properly trained in the interpretation of genetic data,
and should never be used to make any kind of medical decision.
We urge users seeking information about a personal medical or genetic
condition to consult with a qualified physician for diagnosis and for
answers to personal questions.
To use the VAI, you must provide variant calls in either the Personal Genome SNP (pgSnp) or VCF format. pgSnp-formatted variants may be uploaded as a Custom Track. Compressed and indexed VCF files must be hosted on a web server (HTTP, HTTPS or FTP) and configured as Custom Tracks or, if you have a Track Hub, as hub tracks.
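As an illustration of the pgSnp input format, the following minimal Python sketch formats a single-base variant as one pgSnp line. It assumes the standard pgSnp column order (chrom, chromStart, chromEnd, name, alleleCount, alleleFreq, alleleScores) with 0-based, half-open coordinates; the function name `pgsnp_line` and its parameters are hypothetical and not part of the VAI.

```python
def pgsnp_line(chrom, pos_1based, alleles, freqs=None, scores=None):
    """Format one single-base variant as a tab-separated pgSnp line.

    pos_1based is the 1-based position of the variant base; pgSnp itself
    uses a 0-based start and an exclusive end. Unknown frequencies and
    scores are written as 0 (an assumption about how to mark "unknown").
    """
    n = len(alleles)
    freqs = list(freqs) if freqs is not None else [0] * n
    scores = list(scores) if scores is not None else [0] * n
    if len(freqs) != n or len(scores) != n:
        raise ValueError("need one frequency and one score per allele")
    fields = [
        chrom,
        str(pos_1based - 1),          # chromStart (0-based)
        str(pos_1based),              # chromEnd (exclusive)
        "/".join(alleles),            # name: alleles separated by '/'
        str(n),                       # alleleCount
        ",".join(map(str, freqs)),    # alleleFreq
        ",".join(map(str, scores)),   # alleleScores
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    print(pgsnp_line("chr21", 31812008, ["T", "G"], [21, 70], [90, 70]))
```

Lines produced this way can be pasted into the Custom Track upload form under a `track type=pgSnp` line.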
Any gene prediction track in the UCSC Genome Browser database or in a track hub can be selected as the VAI's source of transcript annotations for prediction of functional effects. Sequence Ontology (SO) terms are used to describe the effect of each variant on genes in terms of transcript structure as follows:
SO term | description |
---|---|
intergenic_variant | A sequence variant located in the intergenic region, between genes. |
upstream_gene_variant | A sequence variant located 5' of a gene. (VAI searches within 5,000 bases.) |
downstream_gene_variant | A sequence variant located 3' of a gene. (VAI searches within 5,000 bases.) |
5_prime_UTR_variant | A variant located in the 5' untranslated region (UTR) of a gene. |
3_prime_UTR_variant | A variant located in the 3' untranslated region (UTR) of a gene. |
synonymous_variant | A sequence variant where there is no resulting change to the encoded amino acid. |
missense_variant | A sequence variant that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved. |
inframe_insertion | An inframe non-synonymous variant that inserts bases into the coding sequence. |
inframe_deletion | An inframe non-synonymous variant that deletes bases from the coding sequence. |
frameshift_variant | A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three. |
initiator_codon_variant | A codon variant that changes at least one base of the first codon of a transcript. |
incomplete_terminal_codon_variant | A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed. |
stop_lost | A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript. |
stop_retained_variant | A sequence variant where at least one base in the terminator codon is changed, but the terminator remains. |
exon_loss | A sequence variant whereby an exon is lost from the transcript. (VAI assigns this term when an entire exon is deleted.) |
stop_gained | A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript. |
NMD_transcript_variant | A variant in a transcript that is already the target of nonsense-mediated decay (NMD), i.e., its stop codon is neither in the last exon nor within 50 bases of the end of the second-to-last exon. |
intron_variant | A transcript variant occurring within an intron. |
splice_donor_variant | A splice variant that changes the 2-base region at the 5' end of an intron. |
splice_acceptor_variant | A splice variant that changes the 2-base region at the 3' end of an intron. |
splice_region_variant | A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron. |
complex_transcript_variant | A transcript variant with a complex insertion or deletion (indel) that spans an exon/intron border or a coding sequence/UTR border. |
non_coding_exon_variant | A sequence variant that changes exon sequence of a non-coding gene. |
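A few of the positional rules in the table can be written down compactly. The following is a minimal Python sketch, not the VAI's actual implementation. The function names, the 0-based half-open gene coordinates, and the treatment of boundary cases (for example, counting a distance of exactly 5,000 bases as inside the window) are assumptions made for illustration. The splice helper covers only intronic positions and ignores the exonic 1-3 base part of `splice_region_variant`.

```python
FLANK_WINDOW = 5000  # bases searched 5' and 3' of a gene, per the table


def flank_term(pos, gene_start, gene_end, strand):
    """Term for a position outside a gene spanning [gene_start, gene_end).

    Returns None when the position lies inside the gene.
    """
    if gene_start <= pos < gene_end:
        return None
    if pos < gene_start:
        distance = gene_start - pos
        is_5prime_side = strand == "+"
    else:
        distance = pos - gene_end + 1
        is_5prime_side = strand == "-"
    if distance > FLANK_WINDOW:
        return "intergenic_variant"
    return "upstream_gene_variant" if is_5prime_side else "downstream_gene_variant"


def coding_indel_term(ref, alt):
    """Classify a coding insertion/deletion by its net length change."""
    delta = len(alt) - len(ref)
    if delta == 0:
        return None  # not a net insertion or deletion
    if delta % 3 != 0:
        return "frameshift_variant"
    return "inframe_insertion" if delta > 0 else "inframe_deletion"


def intronic_splice_term(offset_from_5p, intron_length):
    """Classify an intronic base by its distance from the intron ends.

    offset_from_5p is 1-based, counted from the 5' end of the intron on
    the transcript strand.
    """
    offset_from_3p = intron_length - offset_from_5p + 1
    if offset_from_5p <= 2:
        return "splice_donor_variant"
    if offset_from_3p <= 2:
        return "splice_acceptor_variant"
    if 3 <= offset_from_5p <= 8 or 3 <= offset_from_3p <= 8:
        return "splice_region_variant"
    return "intron_variant"


if __name__ == "__main__":
    print(flank_term(900, 1000, 2000, "-"))
    print(coding_indel_term("A", "ATTT"))
    print(intronic_splice_term(5, 100))
```

Note that the upstream/downstream call depends on the gene's strand: a position to the left of a minus-strand gene is 3' of it, hence downstream.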
dbNSFP provides scores and predictions from several tools that use various machine learning techniques to estimate the likelihood that a single-nucleotide missense variant would damage a protein's structure and function:
In addition, dbNSFP provides InterPro protein domains where available (Hunter et al., 2012) and two measures of conservation computed by GERP++ (Davydov et al., 2010).
If the selected genome assembly has a SNPs track (derived from dbSNP) and a variant has the same start and end coordinates as a variant in dbSNP, the VAI includes the reference SNP (rs#) identifier in the output. Currently, the VAI does not compare alleles due to the frequency of strand anomalies in dbSNP.
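The coordinate-only matching described here can be pictured as a lookup keyed on position alone. This is a minimal Python sketch under that reading, not the VAI's code; the tuple layouts and the names `build_rs_index` and `rs_ids_for` are hypothetical. Alleles are deliberately left out of the key, so a variant with different alleles at the same coordinates still receives the rs identifier.

```python
def build_rs_index(dbsnp_rows):
    """Index dbSNP rows of (chrom, start, end, rs_id) by exact coordinates."""
    index = {}
    for chrom, start, end, rs_id in dbsnp_rows:
        index.setdefault((chrom, start, end), []).append(rs_id)
    return index


def rs_ids_for(variant, index):
    """Return rs identifiers whose start and end both equal the variant's.

    variant is (chrom, start, end, alleles); alleles are ignored on purpose.
    """
    chrom, start, end = variant[:3]
    return index.get((chrom, start, end), [])


if __name__ == "__main__":
    # Made-up identifiers and coordinates, for illustration only.
    idx = build_rs_index([("chr1", 100, 101, "rs0000001")])
    print(rs_ids_for(("chr1", 100, 101, "A/G"), idx))
    print(rs_ids_for(("chr1", 100, 102, "A/G"), idx))
```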
If the selected genome assembly has a Conservation track with phyloP scores and/or phastCons scores and conserved elements, those can be included in the output. Both phastCons and phyloP are part of the PHAST package; see the Conservation track description in the Genome Browser for more details.
The volume of unrestricted output can be quite large, making it difficult to identify variants of particular interest. Several filters can be applied to keep only those variants that have specific properties.
By default, all variants are included in the output regardless of predicted functional effect. If you would like to keep only variants that have a particular type of effect, you can uncheck the checkboxes of other effect types. The detailed functional effect predictions are categorized as follows:
(applicable only to assemblies that have "Common SNPs" and "Mult SNPs" tracks) By default, all variants appear in the output, even if they overlap known dbSNP variants that map to multiple locations (a possible red flag) or that have a global minor allele frequency (MAF) of 1% or higher. To exclude variants that overlap either category of known variants, uncheck the corresponding checkbox.
(applicable only to assemblies that have "Conservation" tracks) If desired, output can be restricted to only those variants that overlap conserved elements computed by phastCons.
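Both position-based filters reduce to an interval-overlap test: one drops variants that overlap a set of known variants, the other keeps only variants that overlap conserved elements. The sketch below shows that logic in Python under assumed data structures (per-chromosome lists of 0-based, half-open intervals). The names `overlaps` and `filter_variants` are hypothetical, and the linear scan is chosen for clarity rather than speed.

```python
def overlaps(start, end, intervals):
    """True if [start, end) overlaps any (i_start, i_end) in intervals."""
    return any(start < i_end and i_start < end for i_start, i_end in intervals)


def filter_variants(variants, conserved=None, excluded=None):
    """Filter (chrom, start, end, name) tuples by overlap.

    conserved: dict chrom -> intervals; if given, keep only overlapping variants.
    excluded:  dict chrom -> intervals; if given, drop overlapping variants.
    Passing None for either argument leaves that filter switched off.
    """
    kept = []
    for chrom, start, end, name in variants:
        if excluded is not None and overlaps(start, end, excluded.get(chrom, [])):
            continue
        if conserved is not None and not overlaps(start, end, conserved.get(chrom, [])):
            continue
        kept.append((chrom, start, end, name))
    return kept


if __name__ == "__main__":
    variants = [("chr1", 10, 11, "v1"), ("chr1", 50, 51, "v2"), ("chr1", 90, 91, "v3")]
    conserved = {"chr1": [(0, 60)]}
    common = {"chr1": [(50, 51)]}
    print(filter_variants(variants, conserved=conserved, excluded=common))
```

For genome-scale inputs, a sorted-interval or interval-tree lookup would replace the linear scan.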
Currently, the VAI produces output comparable to Ensembl's Variant Effect Predictor (VEP), in either tab-separated text format or HTML. Columns are described here. When text output is selected, entering an output file name causes output to be saved in a local file instead of appearing in the browser, optionally compressed by gzip (compression reduces file size and network traffic, which results in faster downloads). When HTML is selected, output always appears in the browser window and the output file name is ignored.
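Saved text output can be loaded for downstream analysis with a few lines of Python. The sketch below assumes a VEP-style layout, in which metadata lines begin with `##` and a single header row begins with `#`. That layout is an assumption to check against an actual output file. The function names are hypothetical, and gzip compression is detected from a `.gz` file extension.

```python
import gzip


def read_vai_text(stream):
    """Parse VEP-style tab-separated text into a list of dicts.

    Lines starting with '##' are treated as metadata and skipped; the first
    line starting with a single '#' supplies the column names.
    """
    header = None
    rows = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue
        if line.startswith("#"):
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data line found before the header row")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def read_vai_file(path):
    """Open a plain or gzip-compressed output file and parse it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_vai_text(handle)


if __name__ == "__main__":
    import io

    # Column names here are placeholders, not the VAI's real header.
    sample = "## example metadata\n#ColA\tColB\nx\ty\n"
    print(read_vai_text(io.StringIO(sample)))
```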
Anyone familiar with Ensembl's Variant Effect Predictor (VEP) will doubtless notice similarities in options and interface. In collaboration with our colleagues at Ensembl, we have made an effort to limit the differences between the tools by using Sequence Ontology terms to describe variants' functional effects and by creating a "VEP" output format. Any bugs in the VAI, however, are in the VAI only.