DNA Sequence formats
[Plain]
[EMBL]
[FASTA]
[GCG]
[GenBank]
[IG]
[IUPAC]
A sequence in plain format may contain only IUPAC characters and
spaces (no numbers!).
Note: A file in plain sequence format may contain only one sequence,
while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
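The plain-format rule above (IUPAC letters and whitespace only, one sequence per file) can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `read_plain_sequence` is hypothetical, and treating lower-case letters as valid is an assumption.

```python
# IUPAC nucleotide letters as listed in the IUPAC section of this document.
IUPAC_DNA = set("ACGTURYKMSWBDHVN")

def read_plain_sequence(text):
    """Return the sequence held in plain-format text, or raise ValueError."""
    # Spaces and line breaks are allowed between residues, so drop them.
    # Assumption: lower-case input is accepted and normalised to upper case.
    sequence = "".join(text.split()).upper()
    invalid = sorted(set(sequence) - IUPAC_DNA)
    if invalid:
        raise ValueError(f"characters not allowed in plain format: {invalid}")
    return sequence

print(read_plain_sequence("AACCTG CGGAAG\nGATCAT"))  # prints AACCTGCGGAAGGATCAT
```

Digits such as position counts make the check fail, which matches the "no numbers" rule.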
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by
further annotation lines. The start of the sequence is marked by a line
starting with "SQ" and the end of the sequence is marked by two slashes
("//").
An example sequence in EMBL format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
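The entry structure described above ("ID" line, "SQ" line, "//" terminator) can be read with a small state machine. This is a sketch under assumptions: `parse_embl` is a hypothetical helper, only the identifier and the sequence are kept, and every non-letter character on a sequence line (spaces, position counts) is discarded.

```python
import re

def parse_embl(text):
    """Return a list of (identifier, sequence) pairs from EMBL-format text."""
    records = []
    identifier, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: store it and reset the state for the next one.
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            identifier, chunks, in_sequence = None, [], False
        elif in_sequence:
            # Keep letters only, which drops spaces and position counts.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("ID"):
            # The identifier is the second whitespace-separated field.
            identifier = line.split()[1].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True
    return records
```

Because the state is reset at every "//", the same loop handles files that contain several entries.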
A sequence file in FASTA format can contain several sequences.
One sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line must begin
with a greater-than (">") symbol in the first column.
An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
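Reading FASTA as described above comes down to splitting on description lines. The following sketch assumes the whole file fits in memory; `parse_fasta` is a hypothetical name, not part of any particular library.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-format text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Sequence data may be wrapped over many lines; join them.
            chunks.append("".join(line.split()))
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```

Lines that appear before the first ">" are ignored here; a stricter reader might reject such a file instead.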
A sequence file in GCG format contains exactly one sequence. It begins
with annotation lines, and the start of the sequence is marked by a line
ending with two dots (".."). This line also contains the sequence
identifier, the sequence length and a checksum. This format
should only be used if the file was created with the GCG package.
An example sequence in GCG format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514 ..
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
The newer GCG-RSF (rich sequence format) can contain several sequences
in one file. This format should also only be used if the file was
created with the GCG package.
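The single-sequence GCG layout can be read by locating the line that ends in ".." and taking everything after it as sequence data. This is a sketch only: `parse_gcg` and the regular expression are assumptions based on the example above, the checksum is returned but not verified, and non-letter characters in the sequence block are dropped.

```python
import re

# Matches e.g. "AA03518 Length: 237 Check: 4514 .." (other fields may sit between).
GCG_HEADER = re.compile(r"^\s*(\S+)\s+Length:\s*(\d+)\b.*\bCheck:\s*(\d+)\s*\.\.\s*$")

def parse_gcg(text):
    """Return (identifier, sequence, checksum) from single-sequence GCG text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        match = GCG_HEADER.match(line)
        if match is None:
            continue  # still inside the annotation lines
        identifier = match.group(1)
        length, checksum = int(match.group(2)), int(match.group(3))
        # Everything after the ".." line is sequence; drop position numbers.
        sequence = re.sub(r"[^A-Za-z]", "", "".join(lines[index + 1:]))
        if len(sequence) != length:
            raise ValueError(f"expected {length} residues, found {len(sequence)}")
        return identifier, sequence, checksum
    raise ValueError("no header line ending in '..' found")
```

Comparing the declared length with the number of residues read is a cheap way to notice a truncated file.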
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word
LOCUS and a number of annotation lines. The start of the sequence is
marked by a line containing "ORIGIN" and the end of the sequence is
marked by two slashes ("//"). (ref: Keys)
An example sequence in GenBank format is:
LOCUS AAU03518 237 bp DNA PLN 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
ORIGIN
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
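A GenBank reader can follow the same LOCUS / ORIGIN / "//" landmarks and additionally keep the length declared on the LOCUS line. The sketch below is illustrative: `parse_genbank` is a hypothetical name, and it assumes the LOCUS line is laid out as in the example above, with the name in the second field and the length in the third.

```python
import re

def parse_genbank(text):
    """Return one dict per record with name, declared length and sequence."""
    records = []
    current, in_origin = None, False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: store it and wait for the next LOCUS line.
            if current is not None:
                records.append(current)
            current, in_origin = None, False
        elif in_origin:
            # Sequence lines carry a leading position number; keep letters only.
            current["sequence"] += re.sub(r"[^A-Za-z]", "", line)
        elif line.startswith("LOCUS"):
            fields = line.split()
            current = {"name": fields[1],
                       "declared_length": int(fields[2]),
                       "sequence": ""}
        elif current is not None and line.startswith("ORIGIN"):
            in_origin = True
    return records
```

A caller can compare `declared_length` with `len(record["sequence"])` to detect incomplete records.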
A sequence file in IG format can contain several sequences. Each
consists of a number of comment lines that must begin with a
semicolon (";"), a line with the sequence name (which may not contain
spaces!) and the sequence itself, terminated with the character
'1' for linear or '2' for circular sequences.
An example sequence in IG format is:
; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC1
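The IG rules above (comment lines, a name line, then sequence data ending in '1' or '2') translate into a short reader. This is a sketch with a hypothetical function name `parse_ig`; it assumes the termination character is the last character on the final sequence line.

```python
def parse_ig(text):
    """Return a list of (name, sequence, topology) tuples from IG-format text."""
    records = []
    name, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if name is None:
            name = line  # first non-comment line of a record is its name
            continue
        if line[-1] in "12":
            # '1' ends a linear sequence, '2' ends a circular one.
            chunks.append("".join(line[:-1].split()))
            topology = "linear" if line[-1] == "1" else "circular"
            records.append((name, "".join(chunks), topology))
            name, chunks = None, []
        else:
            chunks.append("".join(line.split()))
    return records
```

A record whose terminator is missing is silently dropped by this sketch; a stricter reader could raise an error instead.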
To represent ambiguity in DNA sequences, the following letters can be
used (following the rules of the International Union of Pure and
Applied Chemistry (IUPAC)):
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
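The table above maps directly onto a lookup dictionary. As an illustration, the sketch below expands an ambiguous sequence into every concrete sequence it can stand for; the names `IUPAC_BASES` and `expand` are hypothetical, and the number of results grows quickly with the number of ambiguous positions.

```python
from itertools import product

# Each IUPAC letter and the bases it can represent, copied from the table above.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def expand(sequence):
    """Return every unambiguous sequence an IUPAC string can represent."""
    # Unknown letters raise KeyError, which doubles as input validation.
    choices = [IUPAC_BASES[letter] for letter in sequence.upper()]
    return ["".join(combination) for combination in product(*choices)]

print(sorted(expand("ARY")))  # prints ['AAC', 'AAT', 'AGC', 'AGT']
```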
Prefixes : Description
GenBank:
AE | CP | CY : Genome projects (nucleotide)
U | AF | AY | DQ | EF | EU | FJ | GQ | GU | HM | HQ | JF |
JN | JQ | JX | KC | KF | KJ | KM | KP | KR | KT | KU | KX : Direct submissions (nucleotide)
AAAA - AZZZ, JAAA - JZZZ, LAAA - LZZZ, MAAA - MZZZ,
NAAA - NZZZ, PAAA - PZZZ, QAAA - QZZZ, RAAA - RZZZ : Whole genome shotgun sequences (nucleotide)
AAA - AZZ : Protein ID
EAA - EZZ, KAA - KZZ : WGS protein ID
O | P | Q : Swiss-Prot (protein)
RefSeq:
AC_ : Genomic; complete genomic molecule, usually alternate assembly
AP_ : Protein; annotated on AC_ alternate assembly
NC_ : Genomic; curated, complete genomic molecule, usually reference assembly
NG_ : Genomic; curated, incomplete genomic region
NM_ : mRNA; curated
NR_ : ncRNA; curated
NP_ : Protein; curated, associated with an NM_ or NC_ accession
NS_ : Genomic; environmental sequence
NT_ : Genomic; automated, contig or scaffold, clone-based or WGS
NW_ : Genomic; automated, contig or scaffold, primarily WGS
NZ_ : Genomic; unfinished WGS
XM_ : mRNA; automated, predicted model
XP_ : Protein; automated, predicted model, associated with an XM_ accession
XR_ : ncRNA; automated, predicted model
YP_ : Protein
ZP_ : Protein; predicted model, annotated on NZ_ genomic records
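The RefSeq part of the table lends itself to a prefix lookup. The sketch below maps a RefSeq-style accession to the molecule type listed above; `REFSEQ_TYPES` and `classify_accession` are hypothetical names, the accession pattern (two letters, an underscore, letters or digits, optional ".version") is an assumption, and GenBank-style accessions are simply reported as `None`.

```python
import re

# Molecule type per RefSeq prefix, taken from the table above.
REFSEQ_TYPES = {
    "AC": "genomic", "NC": "genomic", "NG": "genomic", "NS": "genomic",
    "NT": "genomic", "NW": "genomic", "NZ": "genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "ncRNA", "XR": "ncRNA",
    "AP": "protein", "NP": "protein", "XP": "protein",
    "YP": "protein", "ZP": "protein",
}

# Assumed shape of a RefSeq accession, e.g. "NM_000546.6".
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_[A-Z0-9]+(?:\.\d+)?")

def classify_accession(accession):
    """Return the RefSeq molecule type for an accession, or None if unknown."""
    match = REFSEQ_PATTERN.fullmatch(accession.strip().upper())
    if match is None:
        return None  # not in RefSeq form, e.g. a GenBank accession
    return REFSEQ_TYPES.get(match.group(1))

print(classify_accession("NM_000546.6"))  # prints mRNA
print(classify_accession("U03518"))       # prints None
```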