DNA Sequence formats


Plain sequence format

A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).

Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file.

An example sequence in plain format is:

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
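
The rule above can be checked with a few lines of Python. This is only a sketch under some assumptions: the function name is made up for illustration, line breaks are treated like spaces, and lower-case letters are accepted and converted to upper case.

```python
# Letters from the IUPAC nucleic acid code table in this document.
IUPAC_LETTERS = set("ACGTURYKMSWBDHVN")

def read_plain_sequence(text):
    """Return the sequence without whitespace, or raise ValueError."""
    # Assumption: any whitespace (spaces, tabs, line breaks) is ignored.
    seq = "".join(text.split()).upper()
    bad = sorted(set(seq) - IUPAC_LETTERS)
    if bad:
        raise ValueError("characters not allowed in plain format: " + "".join(bad))
    return seq
```

A call such as read_plain_sequence("AACC TGCG") returns the cleaned sequence, while input containing digits is rejected.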

EMBL format

A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").

An example sequence in EMBL format is:

ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
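
The entry structure described above (an "ID" line, an "SQ" line, then sequence lines up to "//") could be read roughly as follows. This is a minimal sketch, not a full EMBL parser: the function name is hypothetical, annotation lines are skipped, and the entry name is assumed to be the first word after "ID".

```python
def parse_embl(text):
    """Yield (entry_name, sequence) for each entry in EMBL-formatted text."""
    identifier, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if in_sequence:
            if line.startswith("//"):
                # Two slashes close the current entry.
                yield identifier, "".join(chunks)
                identifier, chunks, in_sequence = None, [], False
            else:
                # Keep letters only; this drops blanks and the trailing position count.
                chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            # Assumption: the entry name is the first word after "ID".
            identifier = line[2:].split()[0]
        elif line.startswith("SQ"):
            in_sequence = True
```

Because the function is a generator, list(parse_embl(text)) collects all entries of a multi-sequence file.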

FASTA format

A sequence file in FASTA format can contain several sequences.
One sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.

An example sequence in FASTA format is:

>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
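
As an illustration of the layout described above, the following sketch splits FASTA text into records. The function name is made up for this example, and lines appearing before the first ">" line are assumed to be ignorable.

```python
def parse_fasta(text):
    """Yield (description, sequence) for each record in FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line.strip())
    # Emit the last record, which has no following ">" line.
    if description is not None:
        yield description, "".join(chunks)
```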

GCG format

A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package.

An example sequence in GCG format is:

ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514 ..

  1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
 61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
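
The line ending in ".." can be located with a regular expression, as in the sketch below. The pattern is an assumption based on the example above (name, "Length:", "Check:", then ".."); the function name is hypothetical, and the checksum is only read, not recomputed.

```python
import re

# Assumed layout: "<name> Length: <n> [other fields] Check: <n> .."
GCG_DIVIDER = re.compile(r"^\s*(\S+)\s+Length:\s*(\d+)\b.*?Check:\s*(\d+)\s*\.\.\s*$")

def parse_gcg(text):
    """Return (name, checksum, sequence) from GCG-formatted text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        match = GCG_DIVIDER.match(line)
        if match:
            name = match.group(1)
            length, check = int(match.group(2)), int(match.group(3))
            # Everything after the divider is sequence; keep letters only.
            seq = "".join(ch for ch in "".join(lines[index + 1:]) if ch.isalpha())
            if len(seq) != length:
                raise ValueError("sequence length does not match the Length field")
            return name, check, seq
    raise ValueError("no line ending in '..' with Length and Check fields found")
```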

GCG-RSF (rich sequence format)

The new GCG-RSF can contain several sequences in one file. This format should only be used if the file was created with the GCG package.

GenBank format

A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word LOCUS, followed by a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//"). (ref: Keys)

An example sequence in GenBank format is:

LOCUS       AAU03518 237 bp DNA PLN 04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S 
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
ORIGIN  
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
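
A rough sketch of reading this layout is shown below. Unlike EMBL, each GenBank sequence line starts with the position of its first base, so the sketch uses that number as a consistency check. The function name is made up, and the locus name is assumed to be the second word of the LOCUS line.

```python
def parse_genbank(text):
    """Yield (locus_name, sequence) for each entry in GenBank-formatted text."""
    locus, seq, in_sequence = None, "", False
    for line in text.splitlines():
        if in_sequence:
            if line.startswith("//"):
                yield locus, seq
                locus, seq, in_sequence = None, "", False
                continue
            fields = line.split()
            if not fields:
                continue
            # Each sequence line starts with the 1-based position of its first base.
            if int(fields[0]) != len(seq) + 1:
                raise ValueError("unexpected position number: " + fields[0])
            seq += "".join(fields[1:])
        elif line.startswith("LOCUS"):
            # Assumption: the locus name is the second word on the line.
            locus = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
```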

IG format

A sequence file in IG format can contain several sequences. Each entry consists of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself, terminated with the character '1' for linear or '2' for circular sequences.

An example sequence in IG format is:

; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC1
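
The entry layout described above can be sketched in Python as follows. The function name is hypothetical; the sketch assumes blank lines can be skipped and that the terminator is the last character of the final sequence line.

```python
def parse_ig(text):
    """Yield (name, sequence, topology) for each entry in IG-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if name is None:
            name = line  # first non-comment line is the sequence name
            continue
        if line[-1] in "12":
            # '1' ends a linear sequence, '2' ends a circular one.
            chunks.append(line[:-1])
            topology = "linear" if line[-1] == "1" else "circular"
            yield name, "".join(chunks), topology
            name, chunks = None, []
        else:
            chunks.append(line)
```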

IUPAC nucleic acid codes

To represent ambiguity in DNA sequences, the following letters can be used (following the rules of the International Union of Pure and Applied Chemistry (IUPAC)):

 A = adenine 
 C = cytosine 
 G = guanine 
 T = thymine 
 U = uracil
 R = G A (purine) 
 Y = T C (pyrimidine) 
 K = G T (keto) 
 M = A C (amino)
 S = G C (strong)
 W = A T (weak)
 B = G T C
 D = G A T
 H = A C T
 V = G C A
 N = A G C T (any)
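
One common use of this table is searching a sequence for a motif that contains ambiguity codes. The sketch below turns such a motif into a regular expression; the dictionary simply restates the table above, and the function name is made up for illustration.

```python
import re

# Ambiguity codes and the bases they stand for, as listed in the table above.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regular expression over plain bases."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_BASES[code]  # KeyError for letters outside the table
        # Single bases stay literal; ambiguity codes become character classes.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))
```

For example, iupac_to_regex("GAYW") compiles the pattern GA[TC][AT].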

NCBI accession ID conventions

Prefixes    Description

 GenBank:
  AE | CP | CY : Genome projects (nucleotide)
   U | AF | AY : Direct submissions (nucleotide)
  DQ | EF | EU
  FJ | GQ | GU
  HM | HQ | JF
  JN | JQ | JX
  KC | KF | KJ
  KM | KP | KR
  KT | KU | KX

  AAAA - AZZZ : Whole genome shotgun sequences (nucleotide)
  JAAA - JZZZ,
  LAAA - LZZZ,
  MAAA - MZZZ, 
  NAAA - NZZZ,
  PAAA - PZZZ,
  QAAA - QZZZ,
  RAAA - RZZZ
  AAA-AZZ          : Protein ID
  EAA-EZZ, KAA-KZZ : WGS protein ID
  O/P/Q            : Swiss-Prot (protein)

 RefSeq:
   AC_ : Genomic, complete genomic molecule, usually alternate assembly
   AP_ : Protein, annotated on AC_ alternate assembly
   NC_ : Curated, genomic, complete genomic molecule, usually reference assembly
   NG_ : Curated, incomplete genomic region
   NM_ : Curated, mRNA
   NR_ : Curated, ncRNA
   NP_ : Curated, protein, associated with an NM_ or NC_ accession
   NS_ : Genomic, environmental sequence
   NT_ : Automated, genomic, contig or scaffold, clone-based or WGS
   NZ_ : Genomic, unfinished WGS
   NW_ : Automated, genomic, contig or scaffold, primarily WGS
   XM_ : Automated, predicted mRNA model
   XP_ : Automated, predicted protein model, associated with an XM_ accession
   XR_ : Automated, predicted ncRNA model
   YP_ : Protein
   ZP_ : Protein, predicted model, annotated on NZ_ genomic records
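
Because RefSeq accessions carry a two-letter prefix followed by an underscore, the molecule type can be looked up directly from the list above. The following is a sketch only: the function name and return values are made up, the molecule types are condensed from the RefSeq list, and GenBank prefixes are deliberately not classified.

```python
# Molecule type per RefSeq prefix, condensed from the RefSeq list above.
REFSEQ_TYPES = {
    "AC_": "genomic", "NC_": "genomic", "NG_": "genomic", "NS_": "genomic",
    "NT_": "genomic", "NW_": "genomic", "NZ_": "genomic",
    "NM_": "mRNA", "XM_": "mRNA",
    "NR_": "ncRNA", "XR_": "ncRNA",
    "AP_": "protein", "NP_": "protein", "XP_": "protein",
    "YP_": "protein", "ZP_": "protein",
}

def classify_accession(accession):
    """Return ("RefSeq", molecule_type), or ("other", None) for anything else."""
    prefix = accession[:3].upper()
    if prefix in REFSEQ_TYPES:
        return "RefSeq", REFSEQ_TYPES[prefix]
    # Assumption: everything else (GenBank, Swiss-Prot, ...) is left unclassified.
    return "other", None
```

For instance, an accession written as NM_ followed by digits is reported as RefSeq mRNA, while U03518 from the examples above falls into the unclassified group.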