================================================================
To download all of the files from one of these admin/exe/ directories,
for example admin/exe/linux.x86_64/, to your current directory,
use the rsync command:

   rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./

================================================================
======== addCols ====================================
================================================================
### kent source version 462 ###
addCols - Sum columns in a text file.
usage:
   addCols
adds all columns in the given file, outputs the sum of each column.
The file name can be stdin, to accept input from stdin.
Options:
   -maxCols=N - maximum number of columns (defaults to 16)

================================================================
======== ameme ====================================
================================================================
ameme - find common patterns in DNA
usage
   ameme good=goodIn.fa [bad=badIn.fa] [numMotifs=2] [background=m1]
         [maxOcc=2] [motifOutput=fileName] [html=output.html]
         [gif=output.gif] [rcToo=on] [controlRun=on] [startScanLimit=20]
         [outputLogo] [constrainer=1]
where goodIn.fa is a multi-sequence fa file containing instances
of the motif you want to find, badIn.fa is a file containing similar
sequences but lacking the motif, numMotifs is the number of motifs
to scan for, background is m0, m1, or m2 for various levels of Markov
models, maxOcc is the maximum occurrences of the motif you expect to
find in a single sequence and motifOutput is the name of a file to
store just the motifs in. rcToo=on searches both strands.
If you include controlRun=on in the command line, a random set of
sequences will be generated that match your foreground data set in
size, and your background data set in nucleotide probabilities. The
program will then look for motifs in this random set.
If the scores you get in a real run are about the same as those you
get in a control run, then the motifs Improbizer has found are
probably not significant.

================================================================
======== autoDtd ====================================
================================================================
### kent source version 462 ###
autoDtd - Give this an XML document to look at and it will come up with a DTD
to describe it.
usage:
   autoDtd in.xml out.dtd out.stats
options:
   -tree=out.tree - Output tag tree.
   -atree=out.atree - Output attributed tag tree.

================================================================
======== autoSql ====================================
================================================================
### kent source version 462 ###
autoSql - create SQL and C code for permanently storing
a structure in database and loading it back into memory
based on a specification file
usage:
   autoSql specFile outRoot {optional: -dbLink -withNull -json}
This will create outRoot.sql outRoot.c and outRoot.h based
on the contents of specFile.
options:
   -dbLink - optionally generates code to execute queries and
             updates of the table.
   -addBin - Add an initial bin field and index it as (chrom,bin)
   -withNull - optionally generates code and .sql to enable
               applications to accept and load data into objects
               with potential 'missing data' (NULL in SQL)
               situations.
   -defaultZeros - will put zero and or empty string as default value
   -django - generate method to output object as django model Python code
   -json - generate method to output the object in JSON (JavaScript) format.

================================================================
======== autoXml ====================================
================================================================
autoXml - Generate structures code and parser for XML file from DTD-like spec
usage:
   autoXml file.dtdx root
This will generate root.c, root.h
options:
   -textField=xxx what to name text between start/end tags. Default 'text'
   -comment=xxx Comment to appear at top of generated code files
   -picky Generate parser that rejects stuff it doesn't understand
   -main Put in a main routine that's a test harness
   -prefix=xxx Prefix to add to structure names. By default same as root
   -positive Don't write out optional attributes with negative values

================================================================
======== ave ====================================
================================================================
ave - Compute average and basic stats
usage:
   ave file
options:
   -col=N Which column to use. Default 1
   -tableOut - output by columns (default output in rows)
   -noQuartiles - only calculate min,max,mean,standard deviation
                - for large data sets that will not fit in memory.

================================================================
======== aveCols ====================================
================================================================
aveCols - average together columns
usage:
   aveCols file
adds all columns (up to 16 columns) in the given file,
outputs the average (sum/#ofRows) of each column.
The file name can be stdin, to accept input from stdin.

================================================================
======== axtChain ====================================
================================================================
axtChain - Chain together axt alignments.
usage:
   axtChain [options] -linearGap=loose in.axt tNibDir qNibDir out.chain
Where tNibDir/qNibDir are either directories full of nib files,
the name of a .2bit file, or a single fasta file with additional
-faQ or -faT options.
options:
   -psl Use psl instead of axt format for input
   -faQ The specified qNibDir is a fasta file with multiple sequences for query
   -faT The specified tNibDir is a fasta file with multiple sequences for target
        NOTE: will not work with gzipped fasta files
   -minScore=N Minimum score for chain, default 1000
   -details=fileName Output some additional chain details
   -scoreScheme=fileName Read the scoring matrix from a blastz-format file
   -linearGap= Specify type of linearGap to use.
        *Must* specify this argument to one of these choices.
        loose is chicken/human linear gap costs.
        medium is mouse/human linear gap costs.
        Or specify a piecewise linearGap tab delimited file.
   sample linearGap file (loose)
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

================================================================
======== axtSort ====================================
================================================================
axtSort - Sort axt files
usage:
   axtSort in.axt out.axt
options:
   -query - Sort by query position, not target
   -byScore - Sort by score

================================================================
======== axtSwap ====================================
================================================================
axtSwap - Swap source and query in an axt file
usage:
   axtSwap source.axt target.sizes query.sizes dest.axt
options:
   -xxx=XXX

================================================================
======== axtToMaf ====================================
================================================================
### kent source version 462 ###
axtToMaf - Convert from axt to maf format
usage:
   axtToMaf in.axt tSizes qSizes out.maf
Where tSizes and qSizes are files that contain
the sizes of the target and query sequences.
Very often this will be a chrom.sizes file
Options:
   -qPrefix=XX. - add XX. to start of query sequence name in maf
   -tPrefex=YY. - add YY. to start of target sequence name in maf
   -tSplit Create a separate maf file for each target sequence.
           In this case output is a dir rather than a file.
           In this case in.maf must be sorted by target.
   -score - recalculate score
   -scoreZero - recalculate score if zero

================================================================
======== axtToPsl ====================================
================================================================
axtToPsl - Convert axt to psl format
usage:
   axtToPsl in.axt tSizes qSizes out.psl
Where tSizes and qSizes are tab-delimited files with columns.
options:
   -xxx=XXX

================================================================
======== bamToPsl ====================================
================================================================
### kent source version 462 ###
bamToPsl - Convert a bam file to a psl and optionally also a fasta file that contains the reads.
usage:
   bamToPsl [options] in.bam out.psl
options:
   -fasta=output.fa - output query sequences to specified file
   -chromAlias=file - specify a two-column file: 1: alias, 2: other name
          for target name translation from column 1 name to column 2 name
          names not found are passed through intact
   -nohead - do not output the PSL header, default has header output
   -dots=N - output progress dot(.)
             every N alignments processed
Note: a chromAlias file can be obtained from a UCSC database, e.g.:
   hgsql -N -e 'select alias,chrom from chromAlias;' hg38 > hg38.chromAlias.tab
Or from the downloads server:
   wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromAlias.txt.gz
See also our tool chromToUcsc

================================================================
======== barChartMaxLimit ====================================
================================================================
Can't open file '-verbose=2' for reading

================================================================
======== bedClip ====================================
================================================================
### kent source version 462 ###
bedClip - Remove lines from bed file that refer to off-chromosome locations.
usage:
   bedClip [options] input.bed chrom.sizes output.bed
chrom.sizes is a two-column file/URL:
If the assembly is hosted by UCSC, chrom.sizes can be a URL like
   http://hgdownload.soe.ucsc.edu/goldenPath//bigZips/.chrom.sizes
or you may use the script fetchChromSizes to download the chrom.sizes file.
If not hosted by UCSC, a chrom.sizes file can be generated by running
twoBitInfo on the assembly .2bit file.
options:
   -truncate - truncate items that span ends of chrom instead of the
               default of dropping the items
   -verbose=2 - set to get list of lines clipped and why

================================================================
======== bedCommonRegions ====================================
================================================================
### kent source version 462 ###
bedCommonRegions - Create a bed file (just bed3) that contains the regions common to all inputs.
Regions are common only if they have exactly the same chromosome, start, and end.
Overlap is not enough.
Each region must be in each input at most once. Output is stdout.
usage:
   bedCommonRegions file1 file2 file3 ... fileN

================================================================
======== bedCoverage ====================================
================================================================
bedCoverage - Analyse coverage by bed files - chromosome by
chromosome and genome-wide.
usage:
   bedCoverage database bedFile
Note bed file must be sorted by chromosome
   -restrict=restrict.bed Restrict to parts in restrict.bed

================================================================
======== bedExtendRanges ====================================
================================================================
### kent source version 462 ###
bedExtendRanges - extend length of entries in bed 6+ data to be at least the given length,
taking strand directionality into account.
usage:
   bedExtendRanges database length file(s)
options:
   -host mysql host
   -user mysql user
   -password mysql password
   -tab Separate by tabs rather than space
   -verbose=N - verbose level for extra information to STDERR
example:
   bedExtendRanges hg18 250 stdin
   bedExtendRanges -user=genome -host=genome-mysql.soe.ucsc.edu hg18 250 stdin
will transform:
   chr1 500 525 . 100 +
   chr1 1000 1025 . 100 -
to:
   chr1 500 750 . 100 +
   chr1 775 1025 . 100 -

================================================================
======== bedGeneParts ====================================
================================================================
### kent source version 462 ###
bedGeneParts - Given a bed, spit out promoter, first exon, or all introns.
usage:
   bedGeneParts part in.bed out.bed
Where part is either 'exons' or 'firstExon' or 'introns' or 'promoter' or
'firstCodingSplice' or 'secondCodingSplice'
options:
   -proStart=NN - start of promoter relative to txStart, default -100
   -proEnd=NN - end of promoter relative to txStart, default 50

================================================================
======== bedGraphPack ====================================
================================================================
### kent source version 462 ###
bedGraphPack v1 - Pack together adjacent records representing same value.
usage:
   bedGraphPack in.bedGraph out.bedGraph
The input needs to be sorted by chrom and this is checked.
To put in a pipe use stdin and stdout in the command line in place
of file names.

================================================================
======== bedGraphToBigWig ====================================
================================================================
### kent source version 462 ###
bedGraphToBigWig v 2.10 - Convert a bedGraph file to bigWig format (bbi version: 4).
usage:
   bedGraphToBigWig in.bedGraph chrom.sizes out.bw
where in.bedGraph is a four column file in the format:
and chrom.sizes is a two-column file/URL:
and out.bw is the output indexed big wig file.
If the assembly is hosted by UCSC, chrom.sizes can be a URL like
   http://hgdownload.soe.ucsc.edu/goldenPath//bigZips/.chrom.sizes
or you may use the script fetchChromSizes to download the chrom.sizes file.
If not hosted by UCSC, a chrom.sizes file can be generated by running
twoBitInfo on the assembly .2bit file.
The input bedGraph file must be sorted, use the unix sort command:
   LC_ALL=C sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph
The LC_ALL=C variable activates case-sensitive sorting.
options:
   -blockSize=N - Number of items to bundle in r-tree. Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level.
                     Default 1024
   -sizesIsBb -- If set, the chrom.sizes file is assumed to be a bigBed file.
   -unc - If set, do not use compression.

================================================================
======== bedIntersect ====================================
================================================================
### kent source version 462 ###
bedIntersect - Intersect two bed files
usage:
bed columns four(name) and five(score) are optional
   bedIntersect a.bed b.bed output.bed
options:
   -aHitAny output all of a if any of it is hit by b
   -minCoverage=0.N min coverage of b to output match (or if -aHitAny, of a).
                Not applied to 0-length items. Default 0.000010
   -bScore output score from b.bed (must be at least 5 field bed)
   -tab chop input at tabs not spaces
   -allowStartEqualEnd Don't discard 0-length items of a or b
                (e.g. point insertions)

================================================================
======== bedItemOverlapCount ====================================
================================================================
### kent source version 462 ###
bedItemOverlapCount - count number of times a base is overlapped by the
items in a bed file. Output is bedGraph 4 to stdout.
usage:
   sort bedFile.bed | bedItemOverlapCount [options] stdin
To create a bigWig file from this data to use in a custom track:
   sort -k1,1 bedFile.bed | bedItemOverlapCount [options] stdin \
        > bedFile.bedGraph
   bedGraphToBigWig bedFile.bedGraph chrom.sizes bedFile.bw
where the chrom.sizes is obtained with the script: fetchChromSizes
See also:
   http://genome-test.gi.ucsc.edu/~kent/src/unzipped/utils/userApps/fetchChromSizes
options:
   -zero add blocks with zero count, normally these are omitted
   -bed12 expect bed12 and count based on blocks
          Without this option, only the first three fields are used.
   -max if counts per base overflows set to max (4294967295) instead of exiting
   -outBounds output min/max to stderr
   -chromSize=sizefile Read chrom sizes from file instead of database
             sizefile contains two white space separated fields per line:
             chrom name and size
   -host=hostname mysql host used to get chrom sizes
   -user=username mysql user
   -password=password mysql password
Notes:
 * You may want to separate your + and - strand
   items before sending into this program as it only looks at
   the chrom, start and end columns of the bed file.
 * Program requires a connection to look up chrom sizes for a sanity
   check of the incoming data. Even when the -chromSize argument is used
   the must be present, but it will not be used.
 * The bed file *must* be sorted by chrom
 * Maximum count per base is 4294967295. Recompile with new unitSize to increase this

================================================================
======== bedJoinTabOffset ====================================
================================================================
bedJoinTabOffset - Add file offset and length of line in a text file with the same name as the BED name to each row of BED.
usage:
   bedJoinTabOffset inTabFile inBedFile outBedFile
Given a bed file and tab file where each have a column with matching values:
1. first get the value of column0, the offset and line length from inTabFile.
2. Then go over the bed file, use the -bedKey (defaults to the name field)
   field and append its offset and length to the bed file as two separate
   fields. Write the new bed file to outBed.
options:
   -bedKey=integer 0-based index key of the bed file to use to match up with
                   the tab file. Default is 3 for the name field.
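
The two steps described for bedJoinTabOffset can be sketched in Python as
follows. This is only an illustration of the idea, not the tool's own code:
the function names index_tab_lines and join_offsets are hypothetical, and the
sketch assumes that the line length counts bytes including the trailing
newline, which the help text does not state.

```python
def index_tab_lines(tab_bytes):
    """Step 1: map column0 of each tab-file line to (byte offset, line length).

    Assumption: the length includes the line terminator.
    """
    index = {}
    offset = 0
    for line in tab_bytes.splitlines(keepends=True):
        key = line.split(b"\t", 1)[0].rstrip(b"\r\n").decode()
        index[key] = (offset, len(line))
        offset += len(line)
    return index


def join_offsets(bed_lines, index, bed_key=3):
    """Step 2: append offset and length to each BED row as two extra fields.

    bed_key is the 0-based column used for matching (3 is the name field).
    """
    joined = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        offset, length = index[fields[bed_key]]
        joined.append("\t".join(fields + [str(offset), str(length)]))
    return joined


tab = b"geneA\tfoo\ngeneB\tbar baz\n"
bed = ["chr1\t10\t20\tgeneB\n"]
print(join_offsets(bed, index_tab_lines(tab)))
```

With the offset and length stored in the BED row, a reader can seek directly
to the matching line of the tab file instead of scanning it.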

================================================================
======== bedJoinTabOffset.py ====================================
================================================================
Usage: bedJoinTabOffset.py [options] inTabFile inBedFile outBedFile - given a bed file and tab file where each have a column with matching values:
first get the value of column0, the offset and line length from inTabFile.
Then go over the bed file, use the name field and append its offset and length
to the bed file as two separate fields. Write the new bed file to outBed.

bedJoinTabOffset.py: error: no such option: -v

================================================================
======== bedMergeAdjacent ====================================
================================================================
### kent source version 462 ###
bedMergeAdjacent - merge adjacent blocks in a BED 12
usage:
   bedMergeAdjacent inBed outBed
options:

================================================================
======== bedPartition ====================================
================================================================
### kent source version 462 ###
bedPartition - split BED ranges into non-overlapping ranges
usage:
   bedPartition [options] bedFile rangesBed
Split ranges in a BED into non-overlapping sets for use in cluster jobs.
Output is a BED 3 of the ranges.
The bedFile may be compressed and no ordering is assumed.
options:
   -partSize=1 - will combine non-overlapping partitions, up to
                 this number of ranges.
                 per set of overlapping records.
   -parallel=n - use this many cores for parallel sorting

================================================================
======== bedPileUps ====================================
================================================================
### kent source version 462 ###
bedPileUps - Find (exact) overlaps if any in bed input
usage:
   bedPileUps in.bed
Where in.bed is in one of the ascii bed formats.
The in.bed file must be sorted by chromosome,start,
to sort a bed file, use the unix sort command:
   sort -k1,1 -k2,2n unsorted.bed > sorted.bed
Options:
   -name - include BED name field 4 when evaluating uniqueness
   -tab - use tabs to parse fields
   -verbose=2 - show the location and size of each pileUp

================================================================
======== bedRemoveOverlap ====================================
================================================================
### kent source version 462 ###
bedRemoveOverlap - Remove overlapping records from a (sorted) bed file. Gets rid of
the smaller of overlapping records.
usage:
   bedRemoveOverlap in.bed out.bed
options:
   -xxx=XXX

================================================================
======== bedRestrictToPositions ====================================
================================================================
### kent source version 462 ###
bedRestrictToPositions - Filter bed file, restricting to only ones that match chrom/start/ends specified in restrict.bed file.
usage:
   bedRestrictToPositions in.bed restrict.bed out.bed
options:
   -xxx=XXX

================================================================
======== bedSort ====================================
================================================================
bedSort - Sort a .bed file by chrom,chromStart
usage:
   bedSort in.bed out.bed
in.bed and out.bed may be the same.

================================================================
======== bedToBigBed ====================================
================================================================
### kent source version 462 ###
bedToBigBed v. 2.10 - Convert bed file to bigBed. (bbi version: 4)
usage:
   bedToBigBed in.bed chrom.sizes out.bb
Where in.bed is in one of the ascii bed formats, but not including track lines
and chrom.sizes is a two-column file/URL:
and out.bb is the output indexed big bed file.

If the assembly is hosted by UCSC, chrom.sizes can be a URL like
   http://hgdownload.soe.ucsc.edu/goldenPath//bigZips/.chrom.sizes
or you may use the script fetchChromSizes to download the chrom.sizes file.
If you have bed annotations on patch sequences from NCBI, a more inclusive
chrom.sizes file can be found using a URL like
   http://hgdownload.soe.ucsc.edu/goldenPath//database/chromInfo.txt.gz
If not hosted by UCSC, a chrom.sizes file can be generated by running
twoBitInfo on the assembly .2bit file, or the 2bit file can be used directly
if the -sizesIs2Bit option is specified.

The chrom.sizes file may also be a chromAlias bigBed file, or a URL to
such a file, by specifying the -sizesIsChromAliasBb option. When using
a chromAlias bigBed file, the input BED file may have chromosome names
matching any of the sequence name aliases in the chromAlias file.

For UCSC provided genomes, the chromAlias files can be found under:
   https://hgdownload.soe.ucsc.edu/goldenPath//bigZips/.chromAlias.bb
For UCSC GenArk assembly hubs, the chrom aliases are named in the form:
   https://hgdownload.soe.ucsc.edu/hubs/GCF/006/542/625/GCF_006542625.1/GCF_006542625.1.chromAlias.bb
For a description of generating chromAlias files for your own assembly hub, see:
   http://genomewiki.ucsc.edu/index.php/Chrom_Alias

The in.bed file must be sorted by chromosome,start,
to sort a bed file, use the unix sort command:
   sort -k1,1 -k2,2n unsorted.bed > sorted.bed
Sequences must be sorted by name so all sequences with the same name
are collected together, but they don't need to be in any particular order.

options:
   -type=bedN[+[P]] :
        N is between 3 and 15,
        optional (+) if extra "bedPlus" fields,
        optional P specifies the number of extra fields. Not required, but preferred.
        Examples: -type=bed6 or -type=bed6+ or -type=bed6+3
        (see http://genome.ucsc.edu/FAQ/FAQformat.html#format1)
   -as=fields.as - If you have non-standard "bedPlus" fields, it's great to put a definition
        of each field in a row in AutoSql format here.
   -blockSize=N - Number of items to bundle in r-tree. Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 512
   -unc - If set, do not use compression.
   -tab - If set, expect fields to be tab separated, normally
        expects white space separator.
   -extraIndex=fieldList - If set, make an index on each field in a comma separated list
        extraIndex=name and extraIndex=name,id are commonly used.
   -sizesIs2Bit -- If set, the chrom.sizes file is assumed to be a 2bit file.
   -sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed to be a chromAlias
        bigBed file or a URL to such a file (see above).
   -sizesIsBb -- Obsolete name for -sizesIsChromAliasBb.
   -udcDir=/path/to/udcCacheDir -- sets the UDC cache dir for caching of remote files.
   -allow1bpOverlap -- allow exons to overlap by at most one base pair
   -maxAlloc=N -- Set the maximum memory allocation size to N bytes

================================================================
======== bedToExons ====================================
================================================================
### kent source version 462 ###
bedToExons - Split a bed up into individual beds.
One for each internal exon.
usage:
   bedToExons originalBeds.bed splitBeds.bed
options:
   -cdsOnly - Only output the coding portions of exons.

================================================================
======== bedToGenePred ====================================
================================================================
### kent source version 462 ###
bedToGenePred - convert bed format files to genePred format
usage:
   bedToGenePred bedFile genePredFile
Convert a bed file to a genePred file. If BED has at least 12 columns,
then a genePred with blocks is created. Otherwise single-exon genePreds are
created.
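
The distinction between BED input with and without 12 columns can be
illustrated with a small Python sketch. It is not taken from bedToGenePred;
the function name bed_to_exons is hypothetical. It only applies the standard
BED 12 convention, in which blockStarts (column 12) are relative to
chromStart and blockSizes (column 11) give the block lengths, to produce
absolute exon starts and ends.

```python
def bed_to_exons(fields):
    """Return (exon_starts, exon_ends) in absolute coordinates for one BED row.

    fields is the row already split into columns. Rows with fewer than
    12 columns are treated as a single exon spanning chromStart..chromEnd.
    """
    chrom_start = int(fields[1])
    chrom_end = int(fields[2])
    if len(fields) < 12:
        return [chrom_start], [chrom_end]
    block_count = int(fields[9])
    # Block lists are comma separated and may carry a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x][:block_count]
    rel_starts = [int(x) for x in fields[11].split(",") if x][:block_count]
    exon_starts = [chrom_start + s for s in rel_starts]
    exon_ends = [start + size for start, size in zip(exon_starts, sizes)]
    return exon_starts, exon_ends


row = "chr1 1000 2000 tx1 0 + 1000 2000 0 2 100,200, 0,800,".split()
print(bed_to_exons(row))
```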

================================================================
======== bedToPsl ====================================
================================================================
### kent source version 462 ###
bedToPsl - convert bed format files to psl format
usage:
   bedToPsl [options] chromSizes bedFile pslFile
Convert a BED file to a PSL file. The result is an alignment.
It is intended to allow processing by tools that operate on PSL.
If the BED has at least 12 columns, then a PSL with blocks is created.
Otherwise single-exon PSLs are created.
Options:
   -tabs - use tab as a separator
   -keepQuery - instead of creating a fake query, create PSL with identical query and
                target specs. Useful if bed features are to be lifted with pslMap and one
                wants to keep the source location in the lift result.

================================================================
======== bedWeedOverlapping ====================================
================================================================
### kent source version 462 ###
bedWeedOverlapping - Filter out beds that overlap a 'weed.bed' file.
usage:
   bedWeedOverlapping weeds.bed input.bed output.bed
options:
   -maxOverlap=0.N - maximum overlapping ratio, default 0 (any overlap)
   -invert - keep the overlapping and get rid of everything else

================================================================
======== bigBedInfo ====================================
================================================================
### kent source version 462 ###
bigBedInfo - Show information about a bigBed file.
usage:
   bigBedInfo file.bb
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -chroms - list all chromosomes and their sizes
   -zooms - list all zoom levels and their sizes
   -as - get autoSql spec
   -extraIndex - list all the extra indexes

================================================================
======== bigBedNamedItems ====================================
================================================================
### kent source version 462 ###
bigBedNamedItems - Extract item of given name from bigBed
usage:
   bigBedNamedItems file.bb name output.bed
options:
   -nameFile - if set, treat name parameter as file full of space delimited names
   -field=fieldName - use index on field name, default is "name"
   -header - output an autoSql-style header (starts with '#').

================================================================
======== bigBedSummary ====================================
================================================================
### kent source version 462 ###
bigBedSummary - Extract summary information from a bigBed file.
usage:
   bigBedSummary file.bb chrom start end dataPoints
Get summary data from bigBed for indicated region, broken into
dataPoints equal parts. (Use dataPoints=1 for simple summary.)
options:
   -type=X where X is one of:
         coverage - % of region that is covered (default)
         mean - average depth of covered regions
         min - minimum depth of covered regions
         max - maximum depth of covered regions
   -fields - print out information on fields in file.
      If fields option is used, the chrom, start, end, dataPoints
      parameters may be omitted
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
======== bigBedToBed ====================================
================================================================
### kent source version 462 ###
bigBedToBed v1 - Convert from bigBed to ascii bed format.
usage:
   bigBedToBed input.bb output.bed
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -bed=in.bed - restrict output to all regions in a BED file
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -header - output an autoSql-style header (starts with '#').
   -tsv - output a TSV header (without '#').

================================================================
======== bigChainBreaks ====================================
================================================================
### kent source version 462 ###
bigChainBreaks - output a set of rearrangement breakpoints
usage:
   bigChainBreaks bigChain.bb label breaks.txt
options:
   -xxx=XXX

================================================================
======== bigChainToChain ====================================
================================================================
### kent source version 462 ###
bigChainToChain - convert bigChain files back into a chain file
usage:
   bigChainToChain bigChain.bb bigLinks.bb output.chain
options:
   -xxx=XXX

================================================================
======== bigGenePredToGenePred ====================================
================================================================
### kent source version 462 ###
bigGenePredToGenePred - convert bigGenePred file to genePred.
usage:
   bigGenePredToGenePred bigGenePred.bb genePred.gp

================================================================
======== bigGuessDb ====================================
================================================================
Usage: bigGuessDb [options] inFile - given a bigBed or bigWig file or URL,
guess the assembly based on the chrom names and sizes. Must have bigBedInfo and
bigWigInfo in PATH.
Also requires a bigGuessDb.txt.gz, an alpha version of which can be
downloaded at https://hgwdev.gi.ucsc.edu/~max/bigGuessDb/bigGuessDb.txt.gz

Example run:
   $ wget https://hgwdev.gi.ucsc.edu/~max/bigGuessDb/bigGuessDb.txt.gz
   $ bigGuessDb --best https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1014nnn/GSM1014177/suppl/GSM1014177_mm9_wgEncodeUwDnaseNih3t3NihsMImmortalSigRep2.bigWig
   mm9

bigGuessDb: error: no such option: -v

================================================================
======== bigHeat ====================================
================================================================
Usage: bigHeat [options] locationBed locationMatrixFnames chromSizes outDir - create one feature
Duplicate BED features and color them by the values in locationMatrix.
Creates new bigBed files in outDir and creates a basic trackDb.ra file there.

BED file looks like this:
   chr1 1 1000 myGene 0 + 1 1000 0,0,0
   chr2 1 1000 myGene2 0 + 1 1000 0,0,0
locationMatrix looks like this:
   gene sample1 sample2 sample3
   myGene 1 2 3
   myGene2 0.1 3 10
   myGene2_probe2 0.1 3 10
This will create a composite with three subtracks (sample1, sample2, sample3).
Each subtrack will have myGene, and colored in intensity with sample3 more
intense than sample1 and sample2. Same for myGene2.
Also can add a bigWig with a summary of all these values, one per nucleotide. ================================================================ ======== bigMafToMaf ==================================== ================================================================ ### kent source version 462 ### bigMafToMaf - convert bigMaf to maf file usage: bigMafToMaf bigMaf.bb file.maf options: ================================================================ ======== bigPslToPsl ==================================== ================================================================ ### kent source version 462 ### bigPslToPsl - convert bigPsl file to psl usage: bigPslToPsl bigPsl.bb output.psl options: -collapseStrand if target strand is '+', don't output it ================================================================ ======== bigWigAverageOverBed ==================================== ================================================================ ### kent source version 462 ### bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns. usage: bigWigAverageOverBed in.bw in.bed out.tab The output columns are: name - name field from bed, which should be unique size - size of bed (sum of exon sizes) covered - # bases within exons covered by bigWig sum - sum of values over all bases covered mean0 - average over bases with non-covered bases counting as zeroes mean - average over just covered bases Options: -stats=stats.ra - Output a collection of overall statistics to stats.ra file -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather than the usual sample in the bed item. -minMax - include two additional columns containing the min and max observed in the area.
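The difference between the mean0 and mean columns of bigWigAverageOverBed is easiest to see with numbers. The following is a minimal Python sketch, not the tool's implementation: the function name average_over_bed and the use of None to mark a base without bigWig data are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def average_over_bed(values: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summarize per-base values for one bed item.

    `values` holds one entry per exon base; None marks a base that has no
    bigWig data (a hypothetical representation chosen for this sketch).
    """
    size = len(values)                                   # sum of exon sizes
    covered_values = [v for v in values if v is not None]
    covered = len(covered_values)                        # bases with data
    total = sum(covered_values)                          # sum over covered bases
    mean0 = total / size if size else 0.0                # uncovered bases count as zero
    mean = total / covered if covered else 0.0           # covered bases only
    return {"size": size, "covered": covered, "sum": total,
            "mean0": mean0, "mean": mean}


row = average_over_bed([2.0, 4.0, None, None])
print(row)
```

With two of four bases covered, mean0 divides the sum by 4 while mean divides it by 2, so mean0 is half of mean here.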
================================================================ ======== bigWigCat ==================================== ================================================================ ### kent source version 462 ### bigWigCat v 4 - merge non-overlapping bigWig files directly into bigWig format usage: bigWigCat out.bw in1.bw in2.bw ... Where in*.bw is in big wig format and out.bw is the output indexed big wig file. options: -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024 Note: must use wigToBigWig -fixedSummaries -keepAllChromosomes (perhaps in parallel cluster jobs) to create the input files. Note: By non-overlapping we mean the entire span of each file, from first data point to last data point, must not overlap with that of other files. ================================================================ ======== bigWigCluster ==================================== ================================================================ ### kent source version 462 ### bigWigCluster - Cluster bigWigs using a hacTree usage: bigWigCluster input.list chrom.sizes output.json output.tab where: input.list is a list of bigWig file names chrom.sizes is tab separated for assembly for bigWigs output.json is json formatted output suitable for graphing with D3 output.tab is tab-separated file of items ordered by tree with the fields label - label from -labels option or from file name with no dir or extension pos - number from 0-1 representing position according to tree and distance red - number from 0-255 representing recommended red component of color green - number from 0-255 representing recommended green component of color blue - number from 0-255 representing recommended blue component of color path - file name from input.list including directory and extension options: -labels=fileName - label files from tabSeparated file with fields path - path to bigWig file label - a string with no tabs -precalc=precalc.tab - tab separated file with columns.
-threads=N - number of threads to use, default 10 -tmpDir=/tmp/path - place to put temp files, default current dir ================================================================ ======== bigWigCorrelate ==================================== ================================================================ ### kent source version 462 ### bigWigCorrelate - Correlate bigWig files, optionally only on target regions. usage: bigWigCorrelate a.bigWig b.bigWig or bigWigCorrelate listOfFiles options: -restrict=restrict.bigBed - restrict correlation to parts covered by this file -threshold=N.N - clip values to this threshold -rootNames - if set just report the root (minus directory and suffix) of file names when using listOfFiles -ignoreMissing - if set do not correlate where either side is missing data Normally missing data is treated as zeros ================================================================ ======== bigWigInfo ==================================== ================================================================ ### kent source version 462 ### bigWigInfo - Print out information about bigWig file. usage: bigWigInfo file.bw options: -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs -chroms - list all chromosomes and their sizes -zooms - list all zoom levels and their sizes -minMax - list the min and max on a single line ================================================================ ======== bigWigMerge ==================================== ================================================================ ### kent source version 462 ### bigWigMerge v2 - Merge together multiple bigWigs into a single output bedGraph. You'll have to run bedGraphToBigWig to make the output bigWig. The signal values are just added together to merge them usage: bigWigMerge in1.bw in2.bw .. inN.bw out.bedGraph options: -threshold=0.N - don't output values at or below this threshold. 
Default is 0.0 -adjust=0.N - add adjustment to each value -clip=NNN.N - values higher than this are clipped to this value -inList - input files are lists of file names of bigWigs -max - merged value is maximum from input files rather than sum ================================================================ ======== bigWigSummary ==================================== ================================================================ ### kent source version 462 ### bigWigSummary - Extract summary information from a bigWig file. usage: bigWigSummary file.bigWig chrom start end dataPoints Get summary data from bigWig for indicated region, broken into dataPoints equal parts. (Use dataPoints=1 for simple summary.) NOTE: start and end coordinates are in BED format (0-based) options: -type=X where X is one of: mean - average value in region (default) min - minimum value in region max - maximum value in region std - standard deviation in region coverage - % of region that is covered -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== bigWigToBedGraph ==================================== ================================================================ ### kent source version 462 ### bigWigToBedGraph - Convert from bigWig to bedGraph format. usage: bigWigToBedGraph in.bigWig out.bedGraph options: -chrom=chr1 - if set restrict output to given chromosome -start=N - if set, restrict output to only that over start -end=N - if set, restrict output to only that under end -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== bigWigToWig ==================================== ================================================================ ### kent source version 462 ### bigWigToWig - Convert bigWig to wig.
This will keep more of the same structure of the original wig than bigWigToBedGraph does, but still will break up large stepped sections into smaller ones. usage: bigWigToWig in.bigWig out.wig options: -chrom=chr1 - if set restrict output to given chromosome -start=N - if set, restrict output to only that over start -end=N - if set, restrict output to only that under end -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== binFromRange ==================================== ================================================================ ### kent source version 462 ### binFromRange - Translate a 0-based half open start and end into a bin range sql expression. usage: binFromRange start end ================================================================ ======== blastToPsl ==================================== ================================================================ ### kent source version 462 ### blastToPsl - Convert blast alignments to PSLs. usage: blastToPsl [options] blastOutput psl Options: -scores=file - Write score information to this file. Format is: strands qName qStart qEnd tName tStart tEnd bitscore eVal -verbose=n - n >= 3 prints each line of file after parsing. n >= 4 dumps the result of each query -eVal=n n is e-value threshold to filter results. Format can be either an integer, double or 1e-10. Default is no filter. -pslx - create PSLX output (includes sequences for blocks) Output only results of last round from PSI BLAST ================================================================ ======== blastXmlToPsl ==================================== ================================================================ ### kent source version 462 ### blastXmlToPsl - convert blast XML output to PSLs usage: blastXmlToPsl [options] blastXml psl options: -scores=file - Write score information to this file.
Format is: strands qName qStart qEnd tName tStart tEnd bitscore eVal qDef tDef -verbose=n - n >= 3 prints each line of file after parsing. n >= 4 dumps the result of each query -eVal=n n is e-value threshold to filter results. Format can be either an integer, double or 1e-10. Default is no filter. -pslx - create PSLX output (includes sequences for blocks) -convertToNucCoords - convert protein to nucleic alignments to nucleic to nucleic coordinates -qName=src - define element used to obtain the qName. The following values are supported: o query-ID - use contents of the element if it exists, otherwise use o query-def0 - use the first white-space separated word of the element if it exists, otherwise the first word of . Default is query-def0. -tName=src - define element used to obtain the tName. The following values are supported: o Hit_id - use contents of the element. o Hit_def0 - use the first white-space separated word of the element. o Hit_accession - contents of the element. Default is Hit_def0. -forcePsiBlast - treat as output of PSI-BLAST. blast-2.2.16 and maybe others identify psiblast as blastp. Output only results of last round from PSI BLAST ================================================================ ======== blat ==================================== ================================================================ ### kent source version 462 ### blat - Standalone BLAT v. 37x1 fast sequence search command line tool usage: blat database query [-ooc=11.ooc] output.psl where: database and query are each either a .fa, .nib or .2bit file, or a list of these files with one file name per line. -ooc=11.ooc tells the program to load over-occurring 11-mers from an external file. This will increase the speed by a factor of 40 in many cases, but is not required. output.psl is the name of the output file.
Subranges of .nib and .2bit files may be specified using the syntax: /path/file.nib:seqid:start-end or /path/file.2bit:seqid:start-end or /path/file.nib:start-end With the second form, a sequence id of file:start-end will be used. options: -t=type Database type. Type is one of: dna - DNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein The default is dna. -q=type Query type. Type is one of: dna - DNA sequence rna - RNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein rnax - DNA sequence translated in three frames to protein The default is dna. -prot Synonymous with -t=prot -q=prot. -ooc=N.ooc Use overused tile file N.ooc. N should correspond to the tileSize. -tileSize=N Sets the size of match that triggers an alignment. Usually between 8 and 12. Default is 11 for DNA and 5 for protein. -stepSize=N Spacing between tiles. Default is tileSize. -oneOff=N If set to 1, this allows one mismatch in tile and still triggers an alignment. Default is 0. -minMatch=N Sets the number of tile matches. Usually set from 2 to 4. Default is 2 for nucleotide, 1 for protein. -minScore=N Sets minimum score. This is the matches minus the mismatches minus some sort of gap penalty. Default is 30. -minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide searches, 25 for protein or translated protein searches. -maxGap=N Sets the size of maximum gap between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevant for minMatch > 1. -noHead Suppresses .psl header (so it's just a tab-separated file). -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome. -repMatch=N Sets the number of repetitions of a tile allowed before it is marked as overused. Typically this is 256 for tileSize 12, 1024 for tile size 11, 4096 for tile size 10. Default is 1024. Typically comes into play only with makeOoc. 
Also affected by stepSize: when stepSize is halved, repMatch is doubled to compensate. -noSimpRepMask Suppresses simple repeat masking. -mask=type Mask out repeats. Alignments won't be started in masked region but may extend through it in nucleotide searches. Masked areas are ignored entirely in protein or translated searches. Types are: lower - mask out lower-cased sequence upper - mask out upper-cased sequence out - mask according to database.out RepeatMasker .out file file.out - mask database according to RepeatMasker file.out -qMask=type Mask out repeats in query sequence. Similar to -mask above, but for query rather than target sequence. -repeats=type Type is same as mask types above. Repeat bases will not be masked in any way, but matches in repeat areas will be reported separately from matches in other areas in the psl output. -minRepDivergence=NN Minimum percent divergence of repeats to allow them to be unmasked. Default is 15. Only relevant for masking using RepeatMasker .out files. -dots=N Output dot every N sequences to show program's progress. -trimT Trim leading poly-T. -noTrimA Don't trim trailing poly-A. -trimHardA Remove poly-A tail from qSize as well as alignments in psl output. -fastMap Run for fast DNA/DNA remapping - not allowing introns, requiring high %ID. Query sizes must not exceed 5000. -out=type Controls output file format. Type is one of: psl - Default. Tab-separated format, no sequence pslx - Tab-separated format with sequence axt - blastz-associated axt format maf - multiz-associated maf format sim4 - similar to sim4 format wublast - similar to wublast format blast - similar to NCBI blast format blast8- NCBI blast tabular format blast9 - NCBI blast tabular format with comments -fine For high-quality mRNAs, look harder for small initial and terminal exons. Not recommended for ESTs. -maxIntron=N Sets maximum intron size. Default is 750000. -extendThroughN Allows extension of alignment through large blocks of Ns. 
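The subrange syntax accepted by blat for .nib and .2bit files (/path/file.2bit:seqid:start-end, or /path/file.nib:start-end without a sequence id) can be split into its parts as in the Python sketch below. This only illustrates the syntax quoted in the usage text; the names Subrange and parse_subrange are hypothetical, and the sketch assumes the sequence id itself contains no colon.

```python
import re
from typing import NamedTuple, Optional


class Subrange(NamedTuple):
    path: str
    seqid: Optional[str]   # None for the /path/file.nib:start-end form
    start: int
    end: int


# path ending in .nib or .2bit, optional ":seqid", then ":start-end"
_SPEC = re.compile(
    r"^(?P<path>.+?\.(?:nib|2bit))"
    r"(?::(?P<seqid>[^:]+))?"
    r":(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_subrange(spec: str) -> Subrange:
    """Split a blat-style subrange spec into path, seqid, start and end."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a subrange spec: {spec!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is after end in {spec!r}")
    return Subrange(match["path"], match["seqid"], start, end)


with_id = parse_subrange("/path/file.2bit:chr1:100-200")
without_id = parse_subrange("/path/file.nib:100-200")
print(with_id)
print(without_id)
```

A plain file name with no subrange raises ValueError here, so a caller would fall back to treating the argument as a whole file.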
================================================================ ======== calc ==================================== ================================================================ ### kent source version 462 ### calc - Little command line calculator usage: calc this + that * theOther / (a + b) Options: -h - output result as human-readable integer numbers, with k/m/g/t suffix ================================================================ ======== catDir ==================================== ================================================================ catDir - concatenate files in directory to stdout. For those times when too many files for cat to handle. usage: catDir dir(s) options: -r Recurse into subdirectories -suffix=.suf This will restrict things to files ending in .suf '-wild=*.???' This will match wildcards. -nonz Prints file name of non-zero length files ================================================================ ======== catUncomment ==================================== ================================================================ catUncomment - Concatenate input removing lines that start with '#' Output goes to stdout usage: catUncomment file(s) ================================================================ ======== chainAntiRepeat ==================================== ================================================================ ### kent source version 462 ### chainAntiRepeat - Get rid of chains that are primarily the results of repeats and degenerate DNA usage: chainAntiRepeat tNibDir qNibDir inChain outChain options: -minScore=N - minimum score (after repeat stuff) to pass -noCheckScore=N - score that will pass without checks (speed tweak) ================================================================ ======== chainBridge ==================================== ================================================================ ### kent source version 462 ### chainBridge - Attempt to extend alignments through double-sided gaps of
similar size usage: chainBridge in.chain target.2bit query.2bit out.chain options: -diffTolerance=D Don't try to extend when a 2-sided gap's longer side is Dx longer than its shorter side (default: 0.3, i.e. 30% longer) -maxGap=N Maximum size of double-sided gap to try to bridge (default: 6000) Note: there is no size limit for exact sequence matches -scoreScheme=fileName Read the scoring matrix from a blastz-format file -linearGap= Specify type of linearGap to use. loose is chicken/human linear gap costs. medium is mouse/human linear gap costs. Or specify a piecewise linearGap tab delimited file. (default: loose) ================================================================ ======== chainCleaner ==================================== ================================================================ ### kent source version 462 ### chainCleaner - Remove chain-breaking alignments from chains that break nested chains. NOTATION: The "breaking chain" contains a local alignment block (called "chain-breaking alignment" (CBA) or "suspect") that breaks a nested chain ("broken chain") into two nets. usage: chainCleaner in.chain tNibDir qNibDir out.chain out.bed -net=in.net OR chainCleaner in.chain tNibDir qNibDir out.chain out.bed -tSizes=/dir/to/target/chrom.sizes -qSizes=/dir/to/query/chrom.sizes First option: you have netted the chains and specify the net file via -net=netFile Second option: you have not netted the chains. Then chainCleaner will net them. In this case, you must specify the chrom.sizes file for the target and query with -tSizes/-qSizes tNibDir/qNibDir are either directories with nib files, or the name of a .2bit file output: out.chain output file in chain format containing the untouched chains, the original broken chain and the modified breaking chains. NOTE: this file is chainSort-ed. out.bed output file in bed format containing the coords and information about the removed chain-breaking alignments. 
Most important options for deciding which chain-breaking alignments (CBA) to remove: -LRfoldThreshold=N threshold for removing local alignment blocks if the score of the left and right fill of brokenChain / CBA score is at least this fold threshold. Default 2.5 -doPairs flag: if set, do test if pairs of CBAs can be removed -LRfoldThresholdPairs=N threshold for removing local alignment blocks if the score of the left and right fill of brokenChain / CBA score is at least this fold threshold. Default 10.0 -maxPairDistance=N only consider pairs of CBAs where the distance between the end of the upstream CBA and the start of the downstream CBA is at most that many bp (Default 10000) -scoreScheme=fileName Read the scoring matrix from a blastz-format file -linearGap= Specify type of linearGap to use. *Must* specify this argument to one of these choices. loose is chicken/human linear gap costs. medium is mouse/human linear gap costs. Or specify a piecewise linearGap tab delimited file. sample linearGap file (loose) tablesize 11 smallSize 111 position 1 2 3 11 111 2111 12111 32111 72111 152111 252111 qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600 tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600 bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000 Other options for deciding which suspects to remove: -foldThreshold=N threshold for removing local alignment blocks if the brokenChain score / suspect score is at least this fold threshold. Default 0.0 -maxSuspectBases=N threshold for number of target bases in aligning blocks of the suspect subChain. If higher, do not remove suspect. Default 2147483647 -maxSuspectScore=N threshold for score of suspect subChain. If higher, do not remove suspect. Default 100000 -minBrokenChainScore=N threshold for minimum score of the entire broken chain. If the broken chain scores lower, it is less likely to be a real alignment and we will not remove the suspect. 
Default 50000 -minLRGapSize=N threshold for min size of left/right gap (how far the suspect is away from other blocks in the breaking chain). If lower, do not remove suspect (suspect too close to left or right part of breaking chain). Default 0 Debug and testing options: -newChainIDDict=fileName output 'newChainID{tab}breakingChainID' to this file. Gives a dictionary of the new IDs of chains representing removed suspects and the chain ID of the breaking chain that had the suspect before. -suspectDataFile=fileName output all the data for suspects to this file in bed format. If set, we do not clean any suspect as this would lead to updating the suspect values (updating the L/R fill region). -debug produces output chain files with the suspect and broken chains, and a bed file with information about all possible suspects. For debugging. ================================================================ ======== chainFilter ==================================== ================================================================ ### kent source version 462 ### chainFilter - Filter chain files. Output goes to standard out.
usage: chainFilter file(s) options: -q=chr1,chr2 - restrict query side sequence to those named -notQ=chr1,chr2 - restrict query side sequence to those not named -t=chr1,chr2 - restrict target side sequence to those named -notT=chr1,chr2 - restrict target side sequence to those not named -id=N - only get one with ID number matching N -minScore=N - restrict to those scoring at least N -maxScore=N - restrict to those scoring less than N -qStartMin=N - restrict to those with qStart at least N -qStartMax=N - restrict to those with qStart less than N -qEndMin=N - restrict to those with qEnd at least N -qEndMax=N - restrict to those with qEnd less than N -tStartMin=N - restrict to those with tStart at least N -tStartMax=N - restrict to those with tStart less than N -tEndMin=N - restrict to those with tEnd at least N -tEndMax=N - restrict to those with tEnd less than N -qOverlapStart=N - restrict to those where the query overlaps a region starting here -qOverlapEnd=N - restrict to those where the query overlaps a region ending here -tOverlapStart=N - restrict to those where the target overlaps a region starting here -tOverlapEnd=N - restrict to those where the target overlaps a region ending here -strand=? 
-restrict strand (to + or -) -long -output in long format -zeroGap -get rid of gaps of length zero -minGapless=N - pass those with minimum gapless block of at least N -qMinGap=N - pass those with minimum gap size of at least N -tMinGap=N - pass those with minimum gap size of at least N -qMaxGap=N - pass those with maximum gap size no larger than N -tMaxGap=N - pass those with maximum gap size no larger than N -qMinSize=N - minimum size of spanned query region -qMaxSize=N - maximum size of spanned query region -tMinSize=N - minimum size of spanned target region -tMaxSize=N - maximum size of spanned target region -noRandom - suppress chains involving '_random' chromosomes -noHap - suppress chains involving '_hap|_alt' chromosomes ================================================================ ======== chainMergeSort ==================================== ================================================================ ### kent source version 462 ### chainMergeSort - Combine sorted files into larger sorted file usage: chainMergeSort file(s) Output goes to standard output options: -saveId - keep the existing chain ids. -inputList=somefile - somefile contains list of input chain files. 
-tempDir=somedir/ - somedir has space for temporary sorting data, default ./ ================================================================ ======== chainNet ==================================== ================================================================ ### kent source version 462 ### chainNet - Make alignment nets out of chains usage: chainNet in.chain target.sizes query.sizes target.net query.net where: in.chain is the chain file sorted by score target.sizes contains the size of the target sequences query.sizes contains the size of the query sequences target.net is the output over the target genome query.net is the output over the query genome options: -minSpace=N - minimum gap size to fill, default 25 -minFill=N - default half of minSpace -minScore=N - minimum chain score to consider, default 2000.0 -verbose=N - Alter verbosity (default 1) -inclHap - include query sequences name in the form *_hap*|*_alt*. Normally these are excluded from nets as being haplotype pseudochromosomes ================================================================ ======== chainPreNet ==================================== ================================================================ ### kent source version 462 ### chainPreNet - Remove chains that don't have a chance of being netted usage: chainPreNet in.chain target.sizes query.sizes out.chain options: -dots=N - output a dot every so often -pad=N - extra to pad around blocks to decrease trash (default 1) -inclHap - include query sequences name in the form *_hap*|*_alt*. 
Normally these are excluded from nets as being haplotype pseudochromosomes ================================================================ ======== chainScore ==================================== ================================================================ ### kent source version 462 ### chainScore - score chains usage: chainScore in.chain t.2bit q.2bit out.chain options: -faQ q.2bit is read as a fasta file -minScore=N Minimum score for chain, default 1000 -scoreScheme=fileName Read the scoring matrix from a blastz-format file -linearGap=filename Read piecewise linear gap from tab delimited file sample linearGap file tablesize 11 smallSize 111 position 1 2 3 11 111 2111 12111 32111 72111 152111 252111 qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900 tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900 bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300 ================================================================ ======== chainSort ==================================== ================================================================ ### kent source version 462 ### chainSort - Sort chains. By default sorts by score. Note this loads all chains into memory, so it is not suitable for large sets. Instead, run chainSort on multiple small files, followed by chainMergeSort. usage: chainSort inFile outFile Note that inFile and outFile can be the same options: -target sort on target start rather than score -query sort on query start rather than score -index=out.tab build simple two column index file where is score, target, or query depending on the sort. 
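The chainSort note recommends sorting several small files separately and then combining them with chainMergeSort. The Python sketch below illustrates that sort-then-merge pattern on toy data; it is not how either tool is implemented. The (score, id) tuples stand in for full chain records, and highest-score-first ordering is an assumption of this sketch.

```python
import heapq
from typing import Iterable, List, Tuple

# (score, chain id): a hypothetical stand-in for a full chain record
Chain = Tuple[int, str]


def sort_batch(chains: Iterable[Chain]) -> List[Chain]:
    """Sort one small batch in memory, highest score first."""
    return sorted(chains, key=lambda chain: chain[0], reverse=True)


def merge_sorted(batches: Iterable[List[Chain]]) -> List[Chain]:
    """K-way merge of batches that are each already sorted by score."""
    return list(heapq.merge(*batches, key=lambda chain: chain[0], reverse=True))


batches = [
    [(500, "a"), (9000, "b")],
    [(3000, "c"), (100, "d")],
]
merged = merge_sorted(sort_batch(batch) for batch in batches)
print(merged)
```

Only one batch has to fit in memory during the sort step, and the merge step reads the sorted batches in order, which is the reason for splitting the work this way.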
================================================================ ======== chainSplit ==================================== ================================================================ ### kent source version 462 ### chainSplit - Split chains up by target or query sequence usage: chainSplit outDir inChain(s) options: -q - Split on query (default is on target) -lump=N Lump together so have only N split files. ================================================================ ======== chainStitchId ==================================== ================================================================ ### kent source version 462 ### chainStitchId - Join chain fragments with the same chain ID into a single chain per ID. Chain fragments must be from same original chain but must not overlap. Chain fragment scores are summed. usage: chainStitchId in.chain out.chain ================================================================ ======== chainSwap ==================================== ================================================================ chainSwap - Swap target and query in chain usage: chainSwap in.chain out.chain ================================================================ ======== chainToAxt ==================================== ================================================================ ### kent source version 462 ### chainToAxt - Convert from chain to axt file usage: chainToAxt in.chain tNibDirOr2bit qNibDirOr2bit out.axt options: -maxGap=maximum gap size allowed without breaking, default 100 -maxChain=maximum chain size allowed without breaking, default 1073741823 -minScore=minimum score of chain -minId=minimum percentage ID within blocks -bed Output bed instead of axt ================================================================ ======== chainToPsl ==================================== ================================================================ chainToPsl - Convert chain file to psl format usage: chainToPsl in.chain tSizes qSizes target.lst
query.lst out.psl Where tSizes and qSizes are tab-delimited files with columns. The target and query lists can either be fasta files, nib files, 2bit files or a list of fasta, 2bit and/or nib files one per line ================================================================ ======== chainToPslBasic ==================================== ================================================================ ### kent source version 462 ### chainToPslBasic - Basic conversion chain file to psl format usage: chainToPslBasic in.chain out.psl If you need match and mismatch stats updated, pipe output through pslRecalcMatch ================================================================ ======== checkAgpAndFa ==================================== ================================================================ ### kent source version 462 ### checkAgpAndFa - takes a .agp file and .fa file and ensures that they are in sync usage: checkAgpAndFa in.agp in.fa options: -exclude=seq - Ignore seq (e.g. chrM for which we usually get sequence from GenBank but don't have AGP) in.fa can be a .2bit file. If it is .fa then sequences must appear in the same order in .agp and .fa. ================================================================ ======== checkCoverageGaps ==================================== ================================================================ ### kent source version 462 ### checkCoverageGaps - Check for biggest gap in coverage for a list of tracks. For most tracks coverage of 10,000,000 or more will indicate that there was a mistake in generating the track. usage: checkCoverageGaps database track1 ...
trackN Note: for bigWig and bigBeds, the biggest gap is rounded to the nearest 10,000 or so options: -allParts If set then include _hap and _random and other weird chroms -female If set then don't check chrY -noComma - Don't put commas in biggest gap output ================================================================ ======== checkHgFindSpec ==================================== ================================================================ ### kent source version 462 ### checkHgFindSpec - test and describe search specs in hgFindSpec tables. usage: checkHgFindSpec database [options | termToSearch] If given a termToSearch, displays the list of tables that will be searched and how long it took to figure that out; then performs the search and the time it took. options: -showSearches Show the order in which tables will be searched in general. [This will be done anyway if no termToSearch or options are specified.] -checkTermRegex For each search spec that includes a regular expression for terms, make sure that all values of the table field to be searched match the regex. (If not, some of them could be excluded from searches.) -checkIndexes Make sure that an index is defined on each field to be searched. ================================================================ ======== checkTableCoords ==================================== ================================================================ ### kent source version 462 ### checkTableCoords - check invariants on genomic coords in table(s). usage: checkTableCoords database [tableName] Searches for illegal genomic coordinates in all tables in database unless narrowed down using options. Uses ~/.hg.conf to determine genome database connection info. For psl/alignment tables, checks target coords only. options: -table=tableName Check this table only. (Default: all tables) -daysOld=N Check tables that have been modified at most N days ago. -hoursOld=N Check tables that have been modified at most N hours ago.
(days and hours are additive) -exclude=patList Exclude tables matching any pattern in comma-separated patList. patList can contain wildcards (*?) but should be escaped or single-quoted if it does. patList can contain "genbank" which will be expanded to all tables generated by the automated genbank build process. -ignoreBlocks To save time (but lose coverage), skip block coord checks. -verboseBlocks Print out more details about illegal block coords, since they can't be found by simple SQL queries. ================================================================ ======== chopFaLines ==================================== ================================================================ chopFaLines - Read in FA file with long lines and rewrite it with shorter lines usage: chopFaLines in.fa out.fa ================================================================ ======== chromGraphFromBin ==================================== ================================================================ ### kent source version 462 ### chromGraphFromBin - Convert chromGraph binary to ascii format. usage: chromGraphFromBin in.chromGraph out.tab options: -chrom=chrX - restrict output to single chromosome ================================================================ ======== chromGraphToBin ==================================== ================================================================ ### kent source version 462 ### chromGraphToBin - Make binary version of chromGraph. usage: chromGraphToBin in.tab out.chromGraph options: -xxx=XXX ================================================================ ======== chromToUcsc ==================================== ================================================================ Usage: chromToUcsc [options] filename - change NCBI or Ensembl chromosome names to UCSC names in tabular or wiggle files, using a chromAlias table. Supports these UCSC file formats: BED, genePred, PSL, wiggle (all formats), bedGraph, VCF, SAM, GTF, Chain ... 
or any other csv or tsv format where the sequence (chromosome) name is a separate field. Requires a .chromAlias.tsv file which can be downloaded like this: chromToUcsc --get hg19 # download the file hg19.chromAlias.tsv into current directory This also works for GenArk assemblies and can take an output directory: chromToUcsc --get GCF_000001735.3 -o /tmp/ # for GenArk assemblies, will translate to NCBI sequence names (accessions) If you do not want to use the --get option to retrieve the mapping tables, you can also download the alias mapping files yourself, e.g. for mm10 with 'wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/chromAlias.txt.gz' Then the script can be run like this: chromToUcsc -i in.bed -o out.bed -a hg19.chromAlias.tsv chromToUcsc -i in.bed -o out.bed -a https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/chromAlias.txt.gz Or in pipes, like this: cat test.bed | chromToUcsc -a mm10.chromAlias.tsv > test.ucsc.bed For BAM files use this program in a pipe with samtools: samtools view -h in.bam | ./chromToUcsc -a mm10.chromAlias.tsv | samtools view -bS - > out.bam By default, this script expects the chromosome name in the first field. The default works for BED, bedGraph, GTF, wiggle, VCF. For the following file formats, you will need to set the -k option to these values manually: genePred: 2 -- PSL: 10 (query) or 14 (target) -- chain: 2 (target) or 7 (query) -- SAM: 2 (If a line starts with @ (SAM format), -k is automatically set to 2.)
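The renaming described above can be sketched in a few lines of Python. This is only an illustration, not the chromToUcsc source: the function names are hypothetical, and the alias file is assumed to be tab-separated with the alias in the first column and the UCSC name in the second. Header handling (track, browser, SAM @ lines) is left out.

```python
def load_alias_map(lines):
    """Build {alias: ucscName} from tab-separated lines.

    Assumed layout (not confirmed by the help text): alias, UCSC name, ...
    Comment lines starting with '#' and blank lines are skipped.
    """
    alias_to_ucsc = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        alias_to_ucsc[fields[0]] = fields[1]
    return alias_to_ucsc


def rename_chroms(lines, alias_to_ucsc, field_no=1, skip_unknown=False):
    """Yield lines with the chromosome field (1-based field_no) renamed.

    Unknown names raise KeyError unless skip_unknown is set, which
    mirrors the documented default and the -s/--skipUnknown option.
    """
    idx = field_no - 1
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        name = fields[idx]
        if name not in alias_to_ucsc:
            if skip_unknown:
                continue
            raise KeyError(f"unknown sequence name: {name}")
        fields[idx] = alias_to_ucsc[name]
        yield "\t".join(fields)
```

For a genePred-like input the same helper would be called with field_no=2, matching the -k values listed above.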
Options: -h, --help show this help message and exit --get=DOWNLOADDB download a chrom alias table from UCSC for the genomeDb into the current directory or directory provided by -o and exit -a ALIASFNAME, --chromAlias=ALIASFNAME a UCSC chromAlias file in tab-sep format or the http/https URL to one -i INFNAME, --in=INFNAME input filename, default: /dev/stdin -o OUTFNAME, --out=OUTFNAME output filename, default: /dev/stdout -d, --debug show debug messages -s, --skipUnknown skip unknown sequence rather than generate an error. -k FIELDNO, --field=FIELDNO Index of field (1-based) that contains the chromosome name. No other field is touched by this program, unless the SAM format is detected. Default is 1 (first field). ================================================================ ======== clusterGenes ==================================== ================================================================ ### kent source version 462 ### clusterGenes - Cluster genes from genePred tracks usage: clusterGenes [options] outputFile database table1 ... tableN clusterGenes [options] -trackNames outputFile database track1 table1 ... trackN tableN Where outputFile is a tab-separated file describing the clustering, database is a genome database such as mm4 or hg16, and the table parameters are either tables in genePred format in that database or genePred tab-separated files. If the input is all from files, the database argument can be `no'. options: -verbose=N - Print copious debugging info. 0 for none, 3 for loads -chrom=chrN - Just work this chromosome, may be repeated. -cds - cluster only on CDS exons. Non-coding genes are dropped. -trackNames - If specified, input are pairs of track names and files. This is useful when the file names don't reflect the desired track names. -ignoreStrand - cluster positive and negative strand together -clusterBed=bed - output BED file for each cluster.
If -cds is specified, this only contains bounds based on CDS -clusterTxBed=bed - output BED file for each cluster. If -cds is specified, this contains bounds based on full transcript range of genes, not just CDS -flatBed=bed - output BED file that contains the exons of all genes flattened into a single record. If -cds is specified, this only contains bounds based on CDS -joinContained - join genes that are contained within a larger locus into that locus. Intended as a way to handle fragments and exon-level predictions, as genes-in-introns on the same strand are very rare. -conflicted - detect conflicted loci. Conflicted loci are loci that contain genes that share no sequence. This option greatly increases size of output file. -ignoreBases=N - ignore this many bases at the start and end of each transcript when determining overlap. This prevents small amounts of overlap from merging transcripts. If -cds is specified, this amount of the CDS is ignored. The default is 0. The cdsConflicts and exonConflicts columns contain `y' if the cluster has conflicts. A conflict is a cluster where all of the genes don't share exons. Conflicts may be either internal to a table or between tables. ================================================================ ======== clusterMatrixToBarChartBed ==================================== ================================================================ ### kent source version 462 ### clusterMatrixToBarChartBed - Compute a barchart bed file from a gene matrix and a gene bed file and a way to cluster samples. NOTE: consider using matrixClusterColumns and matrixToBarChartBed instead usage: clusterMatrixToBarChartBed sampleClusters.tsv geneMatrix.tsv geneset.bed output.bed where: sampleClusters.tsv is a two column tab separated file with sampleId and clusterId geneMatrix.tsv has a row for each gene.
The first row uses the same sampleId as above geneset.bed maps the genes in the matrix (from its first column) to the genome geneset.bed needs 6 standard bed fields. Unless -name2 is set it also needs a name2 field as the last field output.bed is the resulting bar chart, with one column per cluster options: -simple - don't store the position of gene in geneMatrix.tsv file in output -median - use median (instead of mean) -name2=twoColFile.tsv - get name2 from file where first col is same as geneset.bed's name ================================================================ ======== colTransform ==================================== ================================================================ colTransform - Add and/or multiply column by constant. usage: colTransform column input.tab addFactor mulFactor output.tab where: column is the column to transform, starting with 1 input.tab is the tab delimited input file addFactor is what to add. Use 0 here to not change anything mulFactor is what to multiply by.
Use 1 here to not change anything output.tab is the tab delimited output file ================================================================ ======== countChars ==================================== ================================================================ countChars - Count the number of occurrences of a particular char usage: countChars char file(s) Char can either be a two digit hexadecimal value or a single letter literal character ================================================================ ======== cpg_lh ==================================== ================================================================ cpg_lh - calculate CpG Island data for cpgIslandExt tracks usage: cpg_lh where is fasta sequence, must be more than 200 bases of legitimate sequence, not all N's To process the output into the UCSC bed file format: cpglh fastaInput.fa \ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \ | sort -k1,1 -k2,2n > output.bed The original cpg.c was written by Gos Miklem from the Sanger Center. LaDeana Hillier added some modifications --> cpg_lh.c, and UCSC has added some further modifications to cpg_lh.c, so that its expected number of CpGs in an island is calculated as described in Gardiner-Garden, M. and M. Frommer, 1987 CpG islands in vertebrate genomes. J. Mol. Biol. 196:261-282 Expected = (Number of C's * Number of G's) / Length Instead of a sliding-window search for CpG islands, this cpg program uses a running-sum score where a 'C' followed by a 'G' increases the score by 17 and anything else decreases the score by 1. When the score transitions from positive to 0 (and at the end of the sequence), the sequence in the current span is evaluated to see if it qualifies as a CpG island (>200 bp length, >50% GC, >0.6 ratio of observed CpG to expected).
Then the search recurses on the span from the position with the max running score up to the current position. ================================================================ ======== crTreeIndexBed ==================================== ================================================================ ### kent source version 462 ### crTreeIndexBed - Create an index for a bed file. usage: crTreeIndexBed in.bed out.cr options: -blockSize=N - number of children per node in index tree. Default 1024 -itemsPerSlot=N - number of items per index slot. Default is half block size -noCheckSort - Don't check sorting order of in.bed ================================================================ ======== crTreeSearchBed ==================================== ================================================================ ### kent source version 462 ### crTreeSearchBed - Search a crTree indexed bed file and print all items that overlap query. usage: crTreeSearchBed file.bed index.cr chrom start end ================================================================ ======== dbDbToHubTxt ==================================== ================================================================ ### kent source version 462 ### dbDbToHubTxt - Reformat dbDb line to hub and genome stanzas for assembly hubs usage: dbDbToHubTxt database email groups hubAndGenome.txt options: -xxx=XXX ================================================================ ======== dbSnoop ==================================== ================================================================ ### kent source version 462 ### dbSnoop - Produce an overview of a database.
usage: dbSnoop database output options: -unsplit - if set will merge together tables split by chromosome -noNumberCommas - if set will leave out commas in big numbers -justSchema - only schema parts, no contents -skipTable=tableName - if set skip a given table name -profile=profileName - use profile for connection settings, default = 'db' -sortAlpha - if set changes output order of fields to make comparisons between databases which have field order swapped easier. ================================================================ ======== dbTrash ==================================== ================================================================ ### kent source version 462 ### dbTrash - drop tables from a database older than specified N hours usage: dbTrash -age=N [-drop] [-historyToo] [-db=] [-verbose=N] options: -age=N - number of hours old to qualify for drop. N can be a float. -drop - actually drop the tables, default is merely to display tables. -dropLimit=N - ERROR out if number of tables to drop is greater than limit, - default is to drop all expired tables -db= - Specify a database to work with, default is customTrash. -historyToo - also consider the table called 'history' for deletion. - default is to leave 'history' alone no matter how old. - this applies to the table 'metaInfo' also. -extFile - check extFile for lines that reference files - no longer in trash -extDel - delete lines in extFile that fail file check - otherwise just verbose(2) lines that would be deleted -topDir - directory name to prepend to file names in extFile - default is /usr/local/apache/trash - file names in extFile are typically: "../trash/ct/..." -tableStatus - use 'show table status' to get size data, very inefficient -delLostTable - delete tables that exist but are missing from metaInfo - this operation can be even slower than -tableStatus - if there are many tables to check. -verbose=N - 2 == show arguments, dates, and dropped tables, - 3 == show date information for all tables. 
================================================================ ======== endsInLf ==================================== ================================================================ endsInLf - Check that last letter in files is end of line usage: endsInLf file(s) options: -zeroOk ================================================================ ======== estOrient ==================================== ================================================================ ### kent source version 462 ### estOrient - convert ESTs so that orientation matches direction of transcription usage: estOrient [options] db estTable outPsl Read ESTs from a database and determine orientation based on estOrientInfo table or direction in gbCdnaInfo table. Update PSLs so that the strand reflects the direction of transcription. By default, PSLs where the direction can't be determined are dropped. Options: -chrom=chr - process this chromosome, may be repeated -keepDisoriented - don't drop ESTs where orientation can't be determined. -disoriented=psl - output ESTs where orientation can't be determined to this file. -inclVer - add NCBI version number to accession if not already present. -fileInput - estTable is a psl file -estOrientInfo=file - instead of getting the orientation information from the estOrientInfo table, load it from this file. This data is the output of polyInfo command. If this option is specified, the direction will not be looked up in the gbCdnaInfo table and db can be `no'. -info=infoFile - write information about each EST to this tab separated file qName tName tStart tEnd origStrand newStrand orient where orient is < 0 if PSL was reverse, > 0 if it was left unchanged and 0 if the orientation couldn't be determined (and was left unchanged).
================================================================ ======== expMatrixToBarchartBed ==================================== ================================================================ usage: expMatrixToBarchartBed [-h] [--autoSql AUTOSQL] [--groupOrderFile GROUPORDERFILE] [--useMean] [--verbose] sampleFile matrixFile bedFile outputFile Generate a barChart bed6+5 file from a matrix, meta data, and coordinates. positional arguments: sampleFile Two column no header, the first column is the samples which should match the matrix, the second is the grouping (cell type, tissue, etc) matrixFile The input matrix file. The samples in the first row should exactly match the ones in the sampleFile. The labels (ex ENST*****) in the first column should exactly match the ones in the bed file. bedFile Bed6+1 format. File that maps the column labels from the matrix to coordinates. Tab separated; chr, start coord, end coord, label, score, strand, gene name. The score column is ignored. outputFile The output file, bed 6+5 format. See the schema in kent/src/hg/lib/barChartBed.as. optional arguments: -h, --help show this help message and exit --autoSql AUTOSQL Optional autoSql description of extra fields in the input bed. --groupOrderFile GROUPORDERFILE Optional file to define the group order, list the groups in a single column in the order desired. The default ordering is alphabetical. --useMean Calculate the group values using mean rather than median. --verbose Show runtime messages. 
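The per-group summary that the help text describes (median by default, mean with --useMean, groups in alphabetical order by default) might look roughly like the following Python sketch. The function name and argument layout are hypothetical and are not taken from the expMatrixToBarchartBed script itself.

```python
from statistics import mean, median


def group_values(sample_to_group, sample_ids, row_values, use_mean=False):
    """Summarize one matrix row per group.

    sample_to_group: {sampleId: group}, as read from the two-column sampleFile
    sample_ids:      sample names from the first row of the matrix
    row_values:      numeric values of one matrix row, in the same order
    Returns {group: value} with groups in alphabetical order.
    """
    by_group = {}
    for sample, value in zip(sample_ids, row_values):
        by_group.setdefault(sample_to_group[sample], []).append(value)
    summarize = mean if use_mean else median
    return {group: summarize(vals) for group, vals in sorted(by_group.items())}
```

A call such as group_values(groups, ["s1", "s2", "s3"], [1.0, 3.0, 5.0]) returns one number per group, which corresponds to one bar per group in the resulting barChart item.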
================================================================ ======== faAlign ==================================== ================================================================ ### kent source version 462 ### faAlign - Align two fasta files usage: faAlign target.fa query.fa output.axt options: -dna - use DNA scoring scheme ================================================================ ======== faCmp ==================================== ================================================================ ### kent source version 462 ### faCmp - Compare two .fa files usage: faCmp [options] a.fa b.fa options: -softMask - use the soft masking information during the compare Differences will be noted if the masking is different. -sortName - sort input files by name before comparing -peptide - read as peptide sequences default: no masking information is used during compare. It is as if both sequences were not masked. Exit codes: - 0 if files are the same - 1 if files differ - 255 on an error ================================================================ ======== faCount ==================================== ================================================================ ### kent source version 462 ### faCount - count base statistics and CpGs in FA files. usage: faCount file(s).fa -summary show only summary statistics -dinuc include statistics on dinucleotide frequencies -strands count bases on both strands ================================================================ ======== faFilter ==================================== ================================================================ ### kent source version 462 ### faFilter - Filter fa records, selecting ones that match the specified conditions usage: faFilter [options] in.fa out.fa Options: -name=wildCard - Only pass records where name matches wildcard * matches any string or no character. ? matches any single character.
anything else must match the character exactly (these will need to be quoted for the shell) -namePatList=filename - A list of regular expressions, one per line, that will be applied to the fasta name the same as -name -v - invert match, select non-matching records. -minSize=N - Only pass sequences at least this big. -maxSize=N - Only pass sequences this size or smaller. -maxN=N Only pass sequences with fewer than this number of N's -uniq - Removes duplicate sequence ids, keeping the first. -i - make -uniq ignore case so sequence IDs ABC and abc count as dupes. All specified conditions must pass to pass a sequence. If no conditions are specified, all records will be passed. ================================================================ ======== faFilterN ==================================== ================================================================ faFilterN - Get rid of sequences with too many N's usage: faFilterN in.fa out.fa maxPercentN options: -out=in.fa.out -uniq=self.psl ================================================================ ======== faFrag ==================================== ================================================================ faFrag - Extract a piece of DNA from a .fa file.
usage: faFrag in.fa start end out.fa options: -mixed - preserve mixed-case in FASTA file ================================================================ ======== faNoise ==================================== ================================================================ faNoise - Add noise to .fa file usage: faNoise inName outName transitionPpt transversionPpt insertPpt deletePpt chimeraPpt options: -upper - output in upper case ================================================================ ======== faOneRecord ==================================== ================================================================ faOneRecord - Extract a single record from a .FA file usage: faOneRecord in.fa recordName ================================================================ ======== faPolyASizes ==================================== ================================================================ ### kent source version 462 ### faPolyASizes - get poly A sizes usage: faPolyASizes in.fa out.tab output file has four columns: id seqSize tailPolyASize headPolyTSize options: ================================================================ ======== faRandomize ==================================== ================================================================ ### kent source version 462 ### faRandomize - Program to create random fasta records usage: faRandomize [-seed=N] in.fa randomized.fa Use optional -seed argument to specify seed (integer) for random number generator (rand). Generated sequence has the same base frequency as seen in original fasta records. ================================================================ ======== faRc ==================================== ================================================================ faRc - Reverse complement a FA file usage: faRc in.fa out.fa In.fa and out.fa may be the same file. options: -keepName - keep name identical (don't prepend RC) -keepCase - works well for ACGTUN in either case. bizarre for other letters. 
without it bases are turned to lower, all else to n's -justReverse - prepends R unless asked to keep name -justComplement - prepends C unless asked to keep name (cannot appear together with -justReverse) ================================================================ ======== faSize ==================================== ================================================================ ### kent source version 462 ### faSize - print total base count in fa files. usage: faSize file(s).fa Command flags -detailed outputs name and size of each record has the side effect of printing nothing else -tab output statistics in a tab separated format -veryDetailed outputs name, size, #Ns, #real, #upper, #lower of each record ================================================================ ======== faSomeRecords ==================================== ================================================================ ### kent source version 462 ### faSomeRecords - Extract multiple fa records usage: faSomeRecords in.fa listFile out.fa options: -exclude - output sequences not in the list file. ================================================================ ======== faSplit ==================================== ================================================================ ### kent source version 462 ### faSplit - Split an fa file into several files. usage: faSplit how input.fa count outRoot where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'. Files split by sequence will be broken at the nearest fa record boundary. Files split by base will be broken at any base. Files broken by size will be broken every count bases. Examples: faSplit sequence estAll.fa 100 est This will break up estAll.fa into 100 files (numbered est001.fa est002.fa, ... 
est100.fa). Files will only be broken at fa record boundaries faSplit base chr1.fa 10 1_ This will break up chr1.fa into 10 files faSplit size input.fa 2000 outRoot This breaks up input.fa into 2000 base chunks faSplit about est.fa 20000 outRoot This will break up est.fa into files of about 20000 bytes each by record. faSplit byname scaffolds.fa outRoot/ This breaks up scaffolds.fa using sequence names as file names. Use the terminating / on the outRoot to get it to work correctly. faSplit gap chrN.fa 20000 outRoot This breaks up chrN.fa into files of at most 20000 bases each, at gap boundaries if possible. If the sequence ends in N's, the last piece, if larger than 20000, will be all one piece. Options: -verbose=2 - Write names of each file created (=3 more details) -maxN=N - Suppress pieces with more than maxN n's. Only used with size. default is size-1 (only suppresses pieces that are all N). -oneFile - Put output in one file. Only used with size -extra=N - Add N extra bytes at the end to form overlapping pieces. Only used with size. -out=outFile Get masking from outfile. Only used with size. -lift=file.lft Put info on how to reconstruct sequence from pieces in file.lft. Only used with size and gap. -minGapSize=X Consider a block of Ns to be a gap if block size >= X. Default value 1000. Only used with gap. -noGapDrops - include all N's when splitting by gap. -outDirDepth=N Create N levels of output directory under current dir. This helps prevent NFS problems with a large number of files in a directory. Using -outDirDepth=3 would produce ./1/2/3/outRoot123.fa. -prefixLength=N - used with byname option. create a separate output file for each group of sequence names with same prefix of length N.
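To make the 'size' mode and the -extra option more concrete, here is a minimal Python sketch of cutting one sequence every count bases with an optional overlap. It is an illustration under stated assumptions, not faSplit's code: the function name is hypothetical, and how faSplit names its output files or writes lift records is not modelled.

```python
def split_by_size(seq, size, extra=0):
    """Yield (offset, piece) pairs for one sequence.

    A new piece starts every `size` bases. Each piece carries up to
    `extra` additional bases from the following piece, so neighbouring
    pieces overlap when extra > 0. The 0-based offset is the kind of
    information needed to put pieces back in place later.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(seq), size):
        yield start, seq[start:start + size + extra]
```

With extra=0 the pieces concatenate back to the original sequence; with extra > 0 the offsets are still enough to place each piece, and the overlapping bases agree between neighbours.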
================================================================ ======== faToFastq ==================================== ================================================================ ### kent source version 462 ### faToFastq - Convert fa to fastq format, just faking quality values. usage: faToFastq in.fa out.fastq options: -qual=X quality letter to use. Default is '<' which is good I think.... ================================================================ ======== faToTab ==================================== ================================================================ faToTab - convert fa file to tab separated file usage: faToTab infileName outFileName options: -type=seqType sequence type, dna or protein, default is dna -keepAccSuffix - don't strip dot version off of sequence id, keep as is ================================================================ ======== faToTwoBit ==================================== ================================================================ ### kent source version 462 ### faToTwoBit - Convert DNA from fasta to 2bit format usage: faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit options: -long use 64-bit offsets for index. Allow for twoBit to contain more than 4Gb of sequence. NOT COMPATIBLE WITH OLDER CODE. -noMask Ignore lower-case masking in fa file. -stripVersion Strip off version number after '.' for GenBank accessions. -ignoreDups Convert first sequence only if there are duplicate sequence names. Use 'twoBitDup' to find duplicate sequences. ================================================================ ======== faToVcf ==================================== ================================================================ ### kent source version 462 ### faToVcf - Convert a FASTA alignment file to Variant Call Format (VCF) single-nucleotide diffs usage: faToVcf in.fa out.vcf options: -ambiguousToN Treat all IUPAC ambiguous bases (N, R, V etc) as N (no call). 
-excludeFile=file Exclude sequences named in file which has one sequence name per line -includeNoAltN Include base positions with no alternate alleles observed, but at least one N (missing base / no-call) -includeRef Include the reference in the genotype columns (default: omitted as redundant) -maskSites=file Exclude variants in positions recommended for masking in file (typically https://github.com/W-L/ProblematicSites_SARS-CoV2/raw/master/problematic_sites_sarsCov2.vcf) -maxDiff=N Exclude sequences with more than N mismatches with the reference (if -windowSize is used, sequences are masked accordingly first) -minAc=N Ignore alternate alleles observed fewer than N times -minAf=F Ignore alternate alleles observed in less than F of non-N bases -minAmbigInWindow=N When -windowSize is provided, mask any base for which there are at least this many N, ambiguous or gap characters within the window. (default: 2) -noGenotypes Output 8-column VCF, without the sample genotype columns -ref=seqName Use seqName as the reference sequence; must be present in faFile (default: first sequence in faFile) -resolveAmbiguous For IUPAC ambiguous characters like R (A or G), if the character represents two bases and one is the reference base, convert it to the non-reference base. Otherwise convert it to N. -startOffset=N Add N bases to each position (for trimmed alignments) -vcfChrom=seqName Use seqName for the CHROM column in VCF (default: ref sequence) -windowSize=N Mask any base for which there are at least -minAmbigInWindow bases in a window of +-N bases around the base. Masking approach adapted from https://github.com/roblanf/sarscov2phylo/ file scripts/mask_seq.py Use -windowSize=7 for same results. in.fa must contain a series of sequences with different names and the same length. Both N and - are treated as missing information.
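The core idea of turning an alignment into single-nucleotide differences can be sketched as follows. This Python fragment is a simplified illustration, not faToVcf's implementation: the function name is hypothetical, it only lists variant sites rather than writing VCF, and skipping positions where the reference itself is N or - is an assumption of this sketch.

```python
def snv_sites(aligned, ref_name=None):
    """List single-nucleotide differences in an alignment.

    aligned:  {name: sequence}; all sequences must have the same length
    ref_name: reference sequence name (default: the first sequence)
    Returns [(1-based position, ref base, sorted alt bases), ...].
    N and - are treated as missing information and never count as alleles.
    """
    names = list(aligned)
    if ref_name is None:
        ref_name = names[0]
    ref = aligned[ref_name].upper()
    if any(len(seq) != len(ref) for seq in aligned.values()):
        raise ValueError("all sequences must have the same length")
    missing = {"N", "-"}
    sites = []
    for i, ref_base in enumerate(ref):
        if ref_base in missing:
            continue  # assumption of this sketch: no call without a reference base
        observed = {aligned[n][i].upper() for n in names if n != ref_name}
        alts = observed - missing - {ref_base}
        if alts:
            sites.append((i + 1, ref_base, sorted(alts)))
    return sites
```

Options such as -minAc, -maskSites or -windowSize would act as further filters on top of a site list like this one.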
================================================================ ======== faTrans ==================================== ================================================================ ### kent source version 462 ### faTrans - Translate DNA .fa file to peptide usage: faTrans in.fa out.fa options: -stop stop at first stop codon (otherwise puts in Z for stop codons) -offset=N start at a particular offset. -cdsUpper - cds is in upper case ================================================================ ======== fastqStatsAndSubsample ==================================== ================================================================ ### kent source version 462 ### fastqStatsAndSubsample v2 - Go through a fastq file doing sanity checks and collecting stats and also producing a smaller fastq out of a sample of the data. The fastq input may be compressed with gzip or bzip2. Paired-end samples: run on both files, the seed is fixed so it will choose the paired reads usage: fastqStatsAndSubsample in.fastq out.stats out.fastq options: -sampleSize=N - default 100000 -seed=N - Use given seed for random number generator. Default 0. -smallOk - Not an error if less than sampleSize reads. out.fastq will be entire in.fastq -json - out.stats will be in json rather than text format Use /dev/null for out.fastq and/or out.stats if not interested in these outputs ================================================================ ======== fastqToFa ==================================== ================================================================ ### kent source version 462 ### # no name checks will be made on lines beginning with @ # ignore quality scores # using default Phred quality score algorithm # all errors will cause exit fastqToFa - Convert from fastq to fasta format.
usage: fastqToFa [options] in.fastq out.fa options: -nameVerify='string' - for multi-line fastq files, 'string' must match somewhere in the sequence names in order to correctly identify the next sequence block (e.g.: -nameVerify='Supercontig_') -qual=file.qual.fa - output quality scores to specified file (default: quality scores are ignored) -qualSizes=qual.sizes - write sizes file for the quality scores -noErrors - warn only on problems, do not error out (specify -verbose=3 to see warnings) -solexa - use Solexa/Illumina quality score algorithm (instead of Phred quality) -verbose=2 - set warning level to get some stats output during processing ================================================================ ======== featureBits ==================================== ================================================================ ### kent source version 462 ### featureBits - Correlate tables via bitmap projections. usage: featureBits database table(s) This will return the number of bits in all the tables bitwise ANDed together Pipe warning: output goes to stderr. Options: -bed=output.bed Put intersection into bed format. Can use stdout. -fa=output.fa Put sequence in intersection into .fa file -faMerge For fa output merge overlapping features. -minSize=N Minimum size to output (default 1) -chrom=chrN Restrict to one chromosome -chromSize=sizefile Read chrom sizes from file instead of database. (chromInfo three column format) -or Bitwise OR tables together instead of ANDing them. -not Output negation of resulting bit set. -countGaps Count gaps in denominator -countBlocks Count blocks in bed12 files rather than entire extent.
   -noRandom Don't include _random (or Un) chromosomes
   -noHap Don't include _hap|_alt chromosomes
   -primaryChroms Primary assembly (chroms without '_' in name)
   -dots=N Output dot every N chroms (scaffolds) processed
   -minFeatureSize=n Don't include bits of the track that are smaller than
        minFeatureSize, useful for differentiating between
        alignment gaps and introns.
   -bin=output.bin Put bin counts in output file
   -binSize=N Bin size for generating counts in bin file (default 500000)
   -binOverlap=N Bin overlap for generating counts in bin file (default 250000)
   -bedRegionIn=input.bed Read in a bed file for bin counts in specific regions
        and write to bedRegionOut
   -bedRegionOut=output.bed Write a bed file of bin counts in specific regions
        from bedRegionIn
   -enrichment Calculates coverage and enrichment assuming first table
        is reference gene track and second track something else.
        Enrichment is the amount of table1 that covers table2 vs. the
        amount of table1 that covers the genome. It's how much denser
        table1 is in table2 than it is genome-wide.
   '-where=some sql pattern' Restrict to features matching some sql pattern
You can include a '!' before a table name to negate it.
To prevent your shell from interpreting the '!' you will need to use the
backslash \!, for example the gap table: \!gap
Some table names can be followed by modifiers such as:
   :exon:N Break into exons and add N to each end of each exon
   :cds Break into coding exons
   :intron:N Break into introns, remove N from each end
   :utr5, :utr3 Break into 5' or 3' UTRs
   :upstream:N Consider the region of N bases before region
   :end:N Consider the region of N bases after region
   :score:N Consider records with score >= N
   :upstreamAll:N Like upstream, but doesn't filter out genes that
        have txStart==cdsStart or txEnd==cdsEnd
   :endAll:N Like end, but doesn't filter out genes that
        have txStart==cdsStart or txEnd==cdsEnd
The tables can be bed, psl, or chain files, or a directory full of
such files as well as actual database tables.
To count the bits used in dir/chrN_something*.bed you'd do:
   featureBits database dir/_something.bed
File types supported are BED, bigBed, PSL, and chain. The suffix of the file
is used to determine the type and MUST be .bed, .bb, .psl, or .chain respectively.
NB: by default, featureBits omits gap regions from its calculation of the total
number of bases. This requires connecting to a database server using credentials
from a .hg.conf file (or similar). If such a connection is not available, you will
need to specify -countGaps (which skips the database connection) in addition to
providing all tables as files or directories.

================================================================
======== fetchChromSizes ====================================
================================================================
fetchChromSizes - script to grab chrom.sizes from UCSC via either of:
    mysql, wget or ftp
usage: fetchChromSizes <db> > <db>.chrom.sizes
    used to fetch chrom.sizes information from UCSC for the given <db>
<db> - name of UCSC database, e.g.: hg38, hg18, mm9, etc ...
This script expects to find one of the following commands:
    wget, mysql, or ftp in order to fetch information from UCSC.
Route the output to the file <db>.chrom.sizes as indicated above.
This data is available at the URL:
    http://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chrom.sizes
Example: fetchChromSizes hg38 > hg38.chrom.sizes

================================================================
======== findMotif ====================================
================================================================
### kent source version 462 ###
findMotif - find specified motif in sequence
usage:
   findMotif [options] -motif= sequence
where:
   sequence is a .fa, .nib or .2bit file or a file which is a list of sequence files.
options:
   -motif= - search for this specified motif (case ignored, [acgt] only)
        NOTE: motif must be at least 4 characters, less than 32
   -chr= - process only this one chrN from the sequence
   -strand=<+|-> - limit to only one strand. Default is both.
   -bedOutput - output bed format (this is the default)
   -wigOutput - output wiggle data format instead of bed file
   -misMatch=N - allow N mismatches (0 default == perfect match)
   -verbose=N - set information level [1-4]
   -verbose=4 - will display gaps as bed file data lines to stderr
 * libpopcnt.h - C/C++ library for counting the number of 1 bits (bit
 * population count) in an array as quickly as possible using
 * specialized CPU instructions i.e. POPCNT, AVX2, AVX512, NEON.
 *
 * Copyright (c) 2016 - 2020, Kim Walisch
 * Copyright (c) 2016 - 2018, Wojciech Muła
 *
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 *    this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright notice,
 *    this list of conditions and the following disclaimer in the documentation
 *    and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS'
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.
 * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.

================================================================
======== fixStepToBedGraph.pl ====================================
================================================================
fixStepToBedGraph.pl - read fixedStep wiggle input data,
    output four column bedGraph format data
usage: fixStepToBedGraph.pl
    run in a pipeline like this:
usage: zcat fixedStepData.gz | fixStepToBedGraph.pl | gzip > bedGraph.gz
    reading input data from stdin ...

================================================================
======== fixTrackDb ====================================
================================================================
### kent source version 462 ###
fixTrackDb - check for data accessible for everything in trackDb table
usage:
   fixTrackDb database trackDbTable
options:
   -gbdbList=list - list of files to confirm existence of bigDataUrl files

================================================================
======== gapToLift ====================================
================================================================
### kent source version 462 ###
gapToLift - create lift file from gap table(s)
usage:
   gapToLift [options] db liftFile.lft
       uses gap table(s) from specified db. Writes to liftFile.lft.
       generates lift file segments separated by non-bridged gaps.
options:
   -chr=chrN - work only on given chrom
   -minGap=M - examine only gaps >= M
   -insane - do *not* perform coordinate sanity checks on gaps
   -bedFile=fileName.bed - output segments to fileName.bed
   -allowBridged - consider any type of gap not just the non-bridged gaps
   -verbose=N - N > 1 see more information about procedure

================================================================
======== gencodeVersionForGenes ====================================
================================================================
### kent source version 462 ###
gencodeVersionForGenes - Figure out which version of a gencode gene set a set of gene
identifiers best fits
usage:
   gencodeVersionForGenes genes.txt geneSymVer.tsv
where:
   genes.txt is a list of gene symbols or identifiers, one per line
   geneSymVer.tsv is output of gencodeGeneSymVer, usually /hive/data/inside/geneSymVerTx.tsv
options:
   -bed=output.bed - Create bed file for mapping genes to genome via best gencode fit
   -upperCase - Force genes to be upper case
   -allBed=outputDir - Output beds for all versions in geneSymVer.tsv
   -geneToId=geneToId.tsv - Output two column file with symbol from genes.txt and gencode
        gene names as second. Symbols with no gene found are omitted
   -miss=output.txt - unassigned genes are put here, one per line
   -target=ucscDb - something like hg38 or hg19. If set this will use most recent version
        of each gene that exists for the assembly in symbol mode

================================================================
======== genePredCheck ====================================
================================================================
### kent source version 462 ###
genePredCheck - validate genePred files or tables
usage:
   genePredCheck [options] fileTbl ..
If fileTbl is an existing file, then it is checked. Otherwise, if -db
is provided, then a table by this name in db is checked.
options:
   -db=db - If specified, then this database is used to
        get chromosome sizes, and perhaps the table to check.
   -chromSizes=file.chrom.sizes - use chrom sizes from tab separated
        file (name,size) instead of from chromInfo table in specified db.

================================================================
======== genePredFilter ====================================
================================================================
### kent source version 462 ###
genePredFilter - filter a genePred file
usage:
   genePredFilter [options] genePredIn genePredOut
Filter a genePredFile, dropping invalid entries
options:
   -db=db - If specified, then this database is used to
        get chromosome sizes.
   -chromSizes=file.chrom.sizes - use chrom sizes from tab separated
        file (name,size) instead of from chromInfo table in specified db.
   -verbose=2 - level >= 2 prints out errors for each problem found.

================================================================
======== genePredHisto ====================================
================================================================
### kent source version 462 ###
genePredHisto - get data for generating histograms from a genePred file.
usage:
   genePredHisto [options] what genePredFile histoOut
Options:
   -ids - a second column with the gene name, useful for finding outliers.
The what argument indicates the type of output.
The output file is a list of numbers suitable for input to textHistogram or similar.
The following values are currently implemented:
   exonLen - length of exons
   5utrExonLen - length of 5'UTR regions of exons
   cdsExonLen - length of CDS regions of exons
   3utrExonLen - length of 3'UTR regions of exons
   exonCnt - count of exons
   5utrExonCnt - count of exons containing 5'UTR
   cdsExonCnt - count of exons containing CDS
   3utrExonCnt - count of exons containing 3'UTR

================================================================
======== genePredSingleCover ====================================
================================================================
### kent source version 462 ###
genePredSingleCover - create single-coverage genePred files
usage:
   genePredSingleCover [options] inGenePred outGenePred
Create a genePred file that has single CDS coverage of the genome.
UTR is allowed to overlap. The default is to keep the gene with the
largest number of CDS bases.
Options:
   -scores=file - read scores used in selecting genes from this file.
        It consists of tab separated lines of
            name chrom txStart score
        where score is a real or integer number. Higher scoring genes will
        be chosen over lower scoring ones. Equally scoring genes are
        chosen by number of CDS bases. If this option is supplied, all
        genes must be in the file

================================================================
======== genePredToBed ====================================
================================================================
### kent source version 462 ###
genePredToBed - Convert from genePred to bed format.
Does not yet handle genePredExt
usage:
   genePredToBed in.genePred out.bed
options:
   -tab - genePred fields are separated by tab instead of just white space
   -fillSpace - when tab input, fill space chars in 'name' with underscore: _
   -score=N - set score to N in bed output (default 0)

================================================================
======== genePredToBigGenePred ====================================
================================================================
### kent source version 462 ###
genePredToBigGenePred - converts genePred or genePredExt to bigGenePred input (bed format with extra fields)
usage:
   genePredToBigGenePred [-known] [-score=scores] [-geneNames=geneNames] [-colors=colors] file.gp stdout | sort -k1,1 -k2,2n > file.bgpInput
NOTE: In order to visualize on Genome Browser, the bigGenePred file needs to be converted to a bigBed such as the following:
   wget https://genome.ucsc.edu/goldenpath/help/examples/bigGenePred.as
   bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as file.bgpInput chrom.sizes output.bb
options:
   -known input file is a genePred in knownGene format
   -score=scores scores is two column file with id's mapping to scores
   -geneNames=geneNames geneNames is a three column file with id's mapping to two gene names
   -colors=colors colors is a four column file with id's mapping to r,g,b
   -cds=cds cds is a five column file with id's mapping to cds status codes and exonFrames (see knownCds.as)
   -geneType=geneType geneType is a two column file with id's mapping to geneType

================================================================
======== genePredToFakePsl ====================================
================================================================
### kent source version 462 ###
genePredToFakePsl - Create a psl of fake-mRNA aligned to gene-preds from a file or table.
usage:
   genePredToFakePsl [options] db fileTbl pslOut cdsOut
If fileTbl is an existing file, then it is used.
Otherwise, the table by this name is used.
pslOut specifies the fake-mRNA output psl filename.
cdsOut specifies the output cds tab-separated file which contains
genbank-style CDS records showing cdsStart..cdsEnd
e.g. NM_123456 34..305
options:
   -chromSize=sizefile Read chrom sizes from file instead of database
        sizefile contains two white space separated fields per line:
        chrom name and size
   -qSizes=qSizesFile Read in query sizes to fixup qSize and qStarts

================================================================
======== genePredToGtf ====================================
================================================================
### kent source version 462 ###
genePredToGtf - Convert genePred table or file to gtf.
usage:
   genePredToGtf database genePredTable output.gtf
If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr - Add 5UTR and 3UTR features
   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
        codon records
   -source=src set source name to use
   -addComments - Add comments before each set of transcript records.
        allows for easier visual inspection
Note: use a refFlat table or extended genePred table or file to include
the gene_name attribute in the output. This will not work with a refFlat
table dump file. If you are using a genePred file that starts with a numeric
bin column, drop it using the UNIX cut command:
   cut -f 2- in.gp | genePredToGtf file stdin out.gp

================================================================
======== genePredToMafFrames ====================================
================================================================
### kent source version 462 ###
genePredToMafFrames - create mafFrames tables from a genePreds
usage:
   genePredToMafFrames [options] targetDb maf mafFrames geneDb1 genePred1 [geneDb2 genePred2...]
Create frame annotations for one or more components of a MAF.
It is significantly faster to process multiple gene sets in the same
run, as 95% of the CPU time is spent reading the MAF
Arguments:
   o targetDb - db of target genome
   o maf - input MAF file
   o mafFrames - output file
   o geneDb1 - db in MAF that corresponds to genePred's organism.
   o genePred1 - genePred file. Overlapping annotations should have
     been removed. This file may optionally include frame annotations
Options:
   -bed=file - output a bed for each mafFrame region, useful for debugging.
   -verbose=level - enable verbose tracing, the following levels are implemented:
        3 - print information about data used to compute each record.
        4 - dump information about the gene mappings that were constructed
        5 - dump information about the gene mappings after split processing
        6 - dump information about the gene mappings after frame linking

================================================================
======== genePredToProt ====================================
================================================================
### kent source version 462 ###
genePredToProt - create protein sequences by translating gene annotations
usage:
   genePredToProt genePredFile genomeSeqs protFa
This honors frame if genePred has frames, dropping partial codons.
genomeSeqs is a 2bit or directory of nib files.
options:
   -cdsFa=fasta - output FASTA with CDS that was used to generate protein.
        This will not include dropped partial codons.
   -protIdSuffix=str - add this string to the end of the name for protein FASTA
   -cdsIdSuffix=str - add this string to the end of the name for CDS FASTA
   -translateSeleno - assume internal TGA code for selenocysteine and translate to `U'.
   -includeStop - If the CDS ends with a stop codon, represent it as a `*'
   -starForInframeStops - use `*' instead of `X' for in-frame stop codons.
        This will result in selenocysteine's being `*', with only codons
        containing `N' being translated to `X'.
        This doesn't include terminal stop

================================================================
======== gensub2 ====================================
================================================================
gensub2 - version 12.19
Generate condor submission file from template and two file lists.
Usage:
   gensub2