File Formats
This page serves as a simple reference on file formats commonly used by sequence analysis programs. Each entry is accompanied by properly formatted DNA and amino acid sequences where appropriate.
ASN.1
Abstract Syntax Notation One (ASN.1), the computer-readable form of the data used by NCBI. All database entries are available from Entrez in this format.
EMBL and Swiss-Prot
EMBL is the nucleotide sequence database of EBI. Swiss-Prot is the amino acid sequence database of EBI.
FASTA
The definition line and sequence character format used by NCBI. All database entries from Entrez are available in this format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters to facilitate viewing and editing.
Sequences are expected to be represented in the standard
IUB/IUPAC amino acid and nucleic acid codes, with these
exceptions: lower-case letters are accepted and are mapped
into upper-case; a single hyphen or dash can be used to represent
a gap of indeterminate length; and in amino acid sequences, U and *
are acceptable letters (see below). Before submitting a request,
any numerical digits in the query sequence should either be
removed or replaced by appropriate letter codes (e.g., N for
unknown nucleic acid residue or X for unknown amino acid residue).
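The rules above can be illustrated with a minimal Python sketch. It assumes the whole file fits in memory; the function name `parse_fasta` and the example records are hypothetical and are not part of any NCBI tool. It only splits records at the ">" marker and maps lower-case letters to upper-case; it does not validate the IUB/IUPAC codes or remove digits.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Lower-case letters are accepted and mapped to upper-case.
            chunks.append(line.strip().upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical input: one nucleotide record with a gap, one protein record.
example = """>seq1 hypothetical example
acgtnn-acgt
ACGT
>seq2 another hypothetical example
MKV*u
"""

for description, sequence in parse_fasta(example):
    print(description, sequence)
```

Sequence lines that appear before the first description line are silently skipped in this sketch; a stricter reader might report them as an error instead.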
GCG
GCG-MSF format is recognised by one of the following:
- the word PileUp at the start of the file.
- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT at the start of the file.
- the word MSF on the first line of the file, and the characters .. at the end of this line.
These file names usually have a .msf extension.
GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of the file. These files have a .rsf extension.
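The recognition rules for the two GCG formats could be checked along the following lines. This is a sketch under stated assumptions, not the logic of any particular program: the function name `detect_gcg_format` is hypothetical, leading whitespace is ignored, "the word MSF" is tested as a simple substring match, and the characters at the end of the first line are assumed to be two dots (`..`).

```python
def detect_gcg_format(text):
    """Return "msf", "rsf", or None based on how the file begins."""
    stripped = text.lstrip()
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "rsf"
    msf_markers = ("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")
    if stripped.startswith(msf_markers):
        return "msf"
    # Fall back to the header-line rule: "MSF" on the first line,
    # with ".." (assumed) closing that line.
    first_line = stripped.splitlines()[0].rstrip() if stripped else ""
    if "MSF" in first_line and first_line.endswith(".."):
        return "msf"
    return None


# Hypothetical header line in the style of an MSF file.
print(detect_gcg_format(" example.msf  MSF: 10  Type: N  Check: 0  ..\n"))
```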
GenBank/GenPept
The nucleotide (GenBank) and protein (GenPept) database entries are available from Entrez in this format.
NEXUS
This is the file format used by many popular programs, such as GARLI, GDA, MacClade, Mesquite, ModelTest, MrBayes and PAUP*. NEXUS file names often have a .nxs or .nex extension.
A formal description of the NEXUS format can be found in Maddison et al. (1997).
To convert an interleaved NEXUS file to a non-interleaved NEXUS file, execute the file in PAUP* and export it as a non-interleaved NEXUS file. Alternatively, type the command:
export file=yourfile.nex format=nexus interleaved=no;
PHYLIP
The PHYLIP format originated with Joe Felsenstein's phylogeny inference package and is now used by several phylogenetics programs. PHYLIP file names often have a .phy or .ph extension.
NBRF and PIR
The National Biomedical Research Foundation (NBRF) maintains nucleotide and protein sequence databases. PIR file names often have a .pir extension. The header line of this file begins with a greater-than sign (">") followed by DL.
The Protein Information Resource (PIR) is an annotated, non-redundant and cross-referenced database of protein sequences at the NBRF. PIR file names often have a .pir extension. The header line of this file begins with a greater-than sign (">") followed by P1.
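Because both kinds of entry share the .pir extension, the header line is what tells them apart. A minimal sketch of that check follows; the function name `classify_pir_header` and the example identifiers are hypothetical, only the two codes mentioned above (DL and P1) are handled, and the semicolon after the code in the examples is an assumption about the rest of the header.

```python
def classify_pir_header(line):
    """Return "nucleotide", "protein", or None from an NBRF/PIR header line."""
    if line.startswith(">DL"):
        return "nucleotide"  # header begins with ">" followed by DL
    if line.startswith(">P1"):
        return "protein"  # header begins with ">" followed by P1
    return None


# Hypothetical header lines for illustration only.
print(classify_pir_header(">P1;HYPOTHETICAL_ID"))
print(classify_pir_header(">DL;HYPOTHETICAL_ID"))
```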