This directory contains data from the December 2007 ENCODE Multi-Species
Sequence Analysis (MSA) sequence freeze, along with multiple sequence
alignments based on these sequences. The freeze consists of sequence from
regions orthologous to the human ENCODE regions in 36 vertebrate species,
and is based on comparative sequence data generated at the NHGRI Intramural
Sequencing Center (NISC) for the ENCODE project, as well as whole-genome
assemblies residing at UCSC, as listed below.

New species in this freeze:
    orangutan

Species from previous freezes not present in this freeze:
    xenopus, fugu, zebrafish, tetraodon

NISC sequences are present in additional regions, and the WGS genomes have
all been updated to the most current assembly.

  * human (March 2006, hg18)
  * armadillo (NISC)
  * baboon (NISC)
  * cat (NISC)
  * chicken (galGal3)
  * chimp (Mar 2006, panTro2)
  * colobus_monkey (NISC)
  * cow (Aug 2006, bosTau3)
  * dog (May 2005, canFam2)
  * dusky_titi (NISC)
  * elephant (NISC)
  * flying_fox (NISC)
  * galago (NISC)
  * gibbon (NISC)
  * guinea_pig (NISC)
  * hedgehog (NISC)
  * horse (NISC)
  * macaque (Jan 2006, rheMac2)
  * marmoset (NISC)
  * monodelphis (Jan 2006, monDom4)
  * mouse (Jul 2007, mm9)
  * mouse_lemur (NISC)
  * orangutan (Jul 2007, ponAbe2)
  * owl_monkey (NISC)
  * platypus (NISC)
  * rabbit (NISC)
  * rat (Nov 2004, rn4)
  * rfbat (NISC)
  * rock_hyrax (NISC)
  * sbbat (NISC)
  * shrew (NISC)
  * squirrel_monkey (NISC)
  * st_squirrel (NISC)
  * tenrec (NISC)
  * tree_shrew (NISC)
  * vervet (NISC)

DIRECTORY STRUCTURE:

  sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa
  sequences/metadata.txt   description of all of the sequences; same as
                           the FASTA header lines
  DEC-2007.tar.gz          tarfile of the contents of the sequences directory
  encode_36way.gif         phylogenetic tree image
  species36.nh             phylogeny in newick tree format
  tree_4d.tba.nh           phylogeny with branch lengths, based on 4-fold
                           degenerate sites
  alignments/              multiple sequence alignments

Each FASTA file contains all the sequence entries for a given species/region.
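As an illustration of the layout above, the following Python sketch builds the path to one species/region file and iterates over its sequence entries. The function names (fasta_path, read_fasta) and the local "root" directory are hypothetical and not part of this freeze; the sketch only assumes that each entry starts with a '>' header line followed by one or more sequence lines.

```python
from pathlib import Path
from typing import Iterator, Tuple


def fasta_path(root: Path, common_name: str, region: str) -> Path:
    """Build sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa under root.

    'root' is a hypothetical local copy of this directory.
    """
    return root / "sequences" / region / f"{common_name}.{region}.fa"


def read_fasta(path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header is returned without '>'.

    Letter case is preserved, so lowercase (repeat-masked) bases stay lowercase.
    """
    header = None
    chunks = []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                # A new entry begins: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif line and header is not None:
                chunks.append(line)
    # Emit the final entry in the file.
    if header is not None:
        yield header, "".join(chunks)
```

For example, fasta_path(Path("."), "baboon", "ENm001") gives sequences/ENm001/baboon.ENm001.fa, and a file with several contigs for that species/region yields one (header, sequence) pair per contig.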
Description of the FASTA header lines and the metadata.txt file:

>${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID}|${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME}|${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROMOSOME_LENGTH}|${STRAND}|${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM}|${TOTAL_NUM_CONTIGS}|${COMMENT}

Where:

  ${COMMON_NAME}            like 'baboon' or 'dusky_titi'
  ${ENCODE_REGION}          like 'ENm001' or 'ENr223'
  ${FREEZE_DATE}            like 'AUG-2004'; latest date for inclusion in this
                            freeze of the set of sequences encompassing the
                            ENCODE regions
  ${NCBI_TAXON_ID}          like '9555' or '9523'
  ${ASSEMBLY_PROVIDER}      like 'NISC' or 'RGSC'
  ${ASSEMBLY_DATE}          like 'NOV-2003' or '21-JUN-2003'; date associated
                            with the specific sequence assembly represented
                            in this ENCODE freeze
  ${ASSEMBLY_ID}            like 'rn4' or 'panTro2'
  ${CHROMOSOME}             like 'chr1' or 'chr19_random'
  ${CHROMOSOME_START}       [1 based]
  ${CHROMOSOME_END}         [1 based]
  ${CHROMOSOME_LENGTH}      length of the entire ${CHROMOSOME}
  ${STRAND}                 '+' or '-', indicating whether the sequence came
                            from the top or bottom DNA strand
  ${ACCESSION}.${VERSION}   like 'NT_107546.1', or an internal identifier for
                            assemblies that have not been accessioned yet
  ${NUM_BASES}              total number of called bases in the sequence
                            entry, including N's
  ${NUM_N}                  total number of N's in the sequence entry
  ${THIS_CONTIG_NUM}        ID of this sequence contig (see next variable)
  ${TOTAL_NUM_CONTIGS}      total number of sequence contigs syntenic to a
                            human region
  ${COMMENT}                currently '.' for all entries

Example header line:

>rat|ENm001|May-2005|10116|Baylor HGSC v. 3.1|01-Jun-2003|rn3|chr4|42742602|44711183|187371129|+|NT_107460.3|1968582|143786|1|1|.

Some fields are optional. For example, when ${ASSEMBLY_PROVIDER} == NISC,
there will be no ${ASSEMBLY_ID} or chrom:start-stop coordinates. Unused
fields are filled with a period ('.') or zero ('0') for ease in parsing.
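The header layout described above can be split into named fields with a few lines of Python. This is a minimal sketch, not a tool shipped with the freeze: the function name parse_header and the key ACCESSION_VERSION are hypothetical, and mapping '.' to None is an assumption about how a reader might want to treat unused fields. A '0' placeholder is left unchanged because it cannot be told apart from a real zero value.

```python
from typing import Dict, Optional

# Field names in the order given by the header description.
FIELDS = [
    "COMMON_NAME", "ENCODE_REGION", "FREEZE_DATE", "NCBI_TAXON_ID",
    "ASSEMBLY_PROVIDER", "ASSEMBLY_DATE", "ASSEMBLY_ID", "CHROMOSOME",
    "CHROMOSOME_START", "CHROMOSOME_END", "CHROMOSOME_LENGTH", "STRAND",
    "ACCESSION_VERSION",  # hypothetical key for ${ACCESSION}.${VERSION}
    "NUM_BASES", "NUM_N", "THIS_CONTIG_NUM", "TOTAL_NUM_CONTIGS", "COMMENT",
]


def parse_header(line: str) -> Dict[str, Optional[str]]:
    """Split one '|'-delimited header line into a dict keyed by field name.

    Fields holding '.' (unused) become None; all other values stay strings.
    """
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    # Limit the number of splits so a '|' inside the comment stays in COMMENT.
    parts = text.split("|", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(parts)}")
    return {name: (None if value == "." else value)
            for name, value in zip(FIELDS, parts)}


example = (">rat|ENm001|May-2005|10116|Baylor HGSC v. 3.1|01-Jun-2003|rn3|chr4|"
           "42742602|44711183|187371129|+|NT_107460.3|1968582|143786|1|1|.")
record = parse_header(example)
span = int(record["CHROMOSOME_END"]) - int(record["CHROMOSOME_START"]) + 1
```

In the rat example, span equals ${NUM_BASES} (1968582), which is consistent with the start and end coordinates being 1-based and inclusive. For NISC entries the coordinate fields are unused, so they should be checked for None (or '0') before being converted to integers.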
The FASTA sequences have been repeat masked with default RepeatMasker options
and with the Tandem Repeat Finder. Repeat sequences are indicated in
lowercase, while non-repeat sequences are in uppercase. These are the
RepeatMasker library options that were used here:

  armadillo => mammal
  baboon => mammal
  cat => cat
  chicken => chicken
  chimp => mammal
  colobus_monkey => mammal
  cow => cow
  dog => dog
  dusky_titi => cow
  elephant => mammal
  flying_fox => mammal
  galago => mammal
  gibbon => mammal
  guinea_pig => mammal
  hedgehog => mammal
  horse => mammal
  human => human
  macaque => mammal
  marmoset => mammal
  monodelphis => mammal
  mouse => mus
  mouse_lemur => mammal
  orangutan => mammal
  owl_monkey => mammal
  platypus => mammal
  rabbit => rodentia
  rat => rat
  rfbat => mammal
  rock_hyrax => mammal
  sbbat => mammal
  shrew => mammal
  squirrel_monkey
  st_squirrel => mammal
  tenrec => mammal
  tree_shrew => mammal
  vervet => mammal

There is also a set of RECON libraries that have been prepared by Damian
Keefe at EBI:

  ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/repeat_libraries/

Data Release Terms
------------------

All data in this directory and any subdirectories are subject to the terms
of the ENCODE Project Data Release Policy of the National Human Genome
Research Institute. This policy is posted at:

  http://www.genome.gov/12513440
  http://genome.ucsc.edu/encode/terms.html

  Name               Last modified      Size
  DEC-2007.tar.gz    2008-02-20 09:51   296M
  encode_36way.gif   2008-10-22 16:33   6.4K
  tree_4d.tba.nh     2008-08-21 10:53   1.0K
  species36.nh       2008-10-22 16:23   376
  sequences/         2008-06-25 10:05   -
  alignments/        2008-08-21 12:14   -