DIRECTORY STRUCTURE:

    sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa
    RepeatMasker/${ACCESSION}.${VERSION}.out
    trf/${ACCESSION}.${VERSION}.bed
    alignments/${ALIGNER}/*.maf
    alignments/${ALIGNER}/${CONSERVATION}/

Each FASTA file contains all the sequence entries for a given species/region.

HEADER STRUCTURE:

>${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID}|${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME}|${CHROMOSOME_START}|${CHROMOSOME_END}|${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM}|${TOTAL_NUM_CONTIGS}|${COMMENT}

Where:

    ${COMMON_NAME}            like 'baboon' or 'dusky_titi'
    ${ENCODE_REGION}          like 'ENm001' or 'ENr223'
    ${FREEZE_DATE}            like 'AUG-2004'; latest date for inclusion in this
                              freeze of the set of sequences encompassing the
                              ENCODE regions
    ${NCBI_TAXON_ID}          like '9555' or '9523'
    ${ASSEMBLY_PROVIDER}      like 'NISC' or 'RGSC'
    ${ASSEMBLY_DATE}          like 'NOV-2003' or '21-JUN-2003'; date associated
                              with the specific sequence assembly represented
                              in this ENCODE freeze
    ${ASSEMBLY_ID}            like 'rn3' or 'panTro1'
    ${CHROMOSOME}             like 'chr1' or 'chr19_random'
    ${CHROMOSOME_START}       [1 based]
    ${CHROMOSOME_END}         [1 based]
    ${ACCESSION}.${VERSION}   like 'NT_107546.1'
    ${NUM_BASES}              total number of called bases in the sequence
                              entry, including N's
    ${NUM_N}                  total number of N's in the sequence entry
    ${THIS_CONTIG_NUM}        ID of sequence contig (see next variable)
    ${TOTAL_NUM_CONTIGS}      total number of sequence contigs syntenic to a
                              human region
    ${COMMENT}                like 'This is an example I hope we all agree on.'

Example:

>rat|ENr223|AUG-2004|10116|RGSC|NOV-2003|rn3|chr8|83281297|83487179|NT_107495.1|205883|133587|1|2|This is an example I hope we all agree on.

Not all fields need to contain information. For example, when ${ASSEMBLY_PROVIDER} = NISC, there will be no ${ASSEMBLY_ID} or chrom:start-stop coordinates.
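
PARSING SKETCH:

The header layout described above can be split on '|' into named fields. The Python sketch below is only an illustration, not an official ENCODE tool. The names parse_header, FIELDS, INT_FIELDS and EXAMPLE are made up for this example. It assumes the sixth field is ${ASSEMBLY_DATE}, as the field list and the rat example suggest, and that fields left empty (as for NISC assemblies) appear as empty strings between the '|' separators.

```python
from typing import Dict, Optional, Union

# Field names in header order. Hypothetical labels chosen for this sketch;
# the sixth field is assumed to be ASSEMBLY_DATE.
FIELDS = [
    "COMMON_NAME", "ENCODE_REGION", "FREEZE_DATE", "NCBI_TAXON_ID",
    "ASSEMBLY_PROVIDER", "ASSEMBLY_DATE", "ASSEMBLY_ID", "CHROMOSOME",
    "CHROMOSOME_START", "CHROMOSOME_END", "ACCESSION_VERSION",
    "NUM_BASES", "NUM_N", "THIS_CONTIG_NUM", "TOTAL_NUM_CONTIGS", "COMMENT",
]

# Fields converted to int when present.
INT_FIELDS = {
    "CHROMOSOME_START", "CHROMOSOME_END", "NUM_BASES", "NUM_N",
    "THIS_CONTIG_NUM", "TOTAL_NUM_CONTIGS",
}

Value = Optional[Union[str, int]]


def parse_header(line: str) -> Dict[str, Value]:
    """Split one FASTA header line into a dict keyed by field name."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    # Limit the split so a '|' inside the trailing comment is preserved.
    parts = line[1:].rstrip("\r\n").split("|", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record: Dict[str, Value] = {}
    for name, raw in zip(FIELDS, parts):
        value = raw.strip()
        if value == "":
            record[name] = None          # field left empty (e.g. NISC entries)
        elif name in INT_FIELDS:
            record[name] = int(value)
        else:
            record[name] = value
    return record


EXAMPLE = (
    ">rat|ENr223|AUG-2004|10116|RGSC|NOV-2003|rn3|chr8|83281297|83487179|"
    "NT_107495.1|205883|133587|1|2|This is an example I hope we all agree on."
)
record = parse_header(EXAMPLE)
```

For the rat example, the 1-based inclusive coordinates agree with ${NUM_BASES}: 83487179 - 83281297 + 1 = 205883.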

Data Release Terms
------------------

All data in this directory and any subdirectories is subject to the terms of the ENCODE Project Data Release Policy of the National Human Genome Research Institute. This policy is posted at:

    http://www.genome.gov/12513440
    http://genome.ucsc.edu/encode/terms.html