European Molecular Biology Computing Network - Biocomputing Tutorials - DNA Sequence Analysis

Sequence Databases

Table of Contents

Databases Available
Sequence Formats
E/GCG Data Libraries
Database Subsections

Databases Available

The most commonly used sequence databases can be accessed from within the E/GCG packages. Databases are regularly updated where possible.

The sequence database compilers cooperate extensively: EMBL, DDBJ (DNA DataBank of Japan), and GenBank exchange new sequences daily. The vast majority of the sequences in GenBank are also in EMBL.

Nucleic Acid Sequences

Peptide Sequences


Sequence Formats

Each sequence database has its own distinctive format, and all database formats are different in detail from the E/GCG sequence file format. Broadly speaking, though, ALL sequence files consist of commentary (header information), followed by sequence data. This similarity makes the inter-conversion of sequences relatively straightforward.

The DNA databases, in particular, hold essentially the same information for each sequence but organise it differently. Compare the header information for the HSHEPSH sequence as stored in EMBL vs. GenBank.

EMBL Format

ID   HSHEPSH    standard; RNA; PRI; 2363 BP.
AC   X07732; M18930;
DT   16-JUL-1988 (Rel. 16, Created)
DT   22-SEP-1995 (Rel. 45, Last updated, Version 9)
DE   Human hepatoma mRNA for serine protease hepsin
KW   hepsin; membrane protein; serine protease; zymogen.
OS   Homo sapiens (human)
OC   Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
OC   Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.

GenBank Format

LOCUS       HSHEPSH      2363 bp    RNA             PRI       22-SEP-1995
DEFINITION  Human hepatoma mRNA for serine protease hepsin.
ACCESSION   X07732 M18930
KEYWORDS    hepsin; membrane protein; serine protease; zymogen.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Primates;
            Catarrhini; Hominidae; Homo.
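
The correspondence between the two layouts can be checked with a short script. What follows is a minimal Python sketch, not part of E/GCG: the function names and the field mapping are illustrative assumptions drawn only from the two excerpts above, and real entries contain many more line types.

```python
# Illustrative only: parse abbreviated copies of the two header excerpts
# above into dictionaries so that equivalent fields can be compared.

EMBL_HEADER = """\
ID   HSHEPSH    standard; RNA; PRI; 2363 BP.
AC   X07732; M18930;
DE   Human hepatoma mRNA for serine protease hepsin
KW   hepsin; membrane protein; serine protease; zymogen.
"""

GENBANK_HEADER = """\
LOCUS       HSHEPSH      2363 bp    RNA             PRI       22-SEP-1995
DEFINITION  Human hepatoma mRNA for serine protease hepsin.
ACCESSION   X07732 M18930
KEYWORDS    hepsin; membrane protein; serine protease; zymogen.
"""

# Assumed mapping from EMBL two-letter line codes to GenBank keywords.
EMBL_TO_GENBANK = {
    "ID": "LOCUS",
    "AC": "ACCESSION",
    "DE": "DEFINITION",
    "KW": "KEYWORDS",
}


def parse_embl(text):
    """Group EMBL lines by the two-letter code at the start of each line."""
    fields = {}
    for line in text.splitlines():
        if len(line) < 3:
            continue
        code, value = line[:2], line[5:].strip()
        # Repeated codes (e.g. several OC lines) are joined with a space.
        fields[code] = (fields[code] + " " + value) if code in fields else value
    return fields


def parse_genbank(text):
    """Group GenBank lines by the keyword held in the first 12 columns."""
    fields, current = {}, None
    for line in text.splitlines():
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
            fields[current] = value
        elif current is not None:
            # A line with no keyword continues the previous field.
            fields[current] += " " + value
    return fields


embl = parse_embl(EMBL_HEADER)
genbank = parse_genbank(GENBANK_HEADER)
for code, keyword in EMBL_TO_GENBANK.items():
    print(f"{code} / {keyword}")
    print(f"  EMBL:    {embl[code]}")
    print(f"  GenBank: {genbank[keyword]}")
```

Printing the paired fields side by side shows that the content matches while punctuation and layout differ, which is why conversion between the formats is mostly a matter of re-labelling.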


Accession Numbers

These are the unique, and therefore absolutely reliable, identifiers assigned to sequences in the databases. Each sequence has a unique accession number, which is used for that sequence in all the databases containing it and is permanently associated with it. On occasion, two or more sequences are merged; the merged sequence is likely to be given a new accession number. All the old accession numbers are retained with the new sequence, becoming secondary accession numbers.

In the example above, the accession number for HSHEPSH is X07732. It also has a secondary accession number, M18930, probably indicating another sequence was combined with HSHEPSH.
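
Separating the primary accession number from the secondary ones is simple to automate. The sketch below is a hypothetical helper, not an E/GCG tool. It assumes, as in the HSHEPSH example, that the first number on an EMBL AC line is the primary one and that the numbers are separated by semicolons.

```python
def split_accessions(ac_line):
    """Return (primary, [secondary, ...]) from an EMBL 'AC' line.

    Assumption: the first accession number listed is the primary one.
    """
    # Drop the 'AC' line code if present; the trailing space in the test
    # avoids mistaking an accession number that begins with 'AC' for it.
    body = ac_line[2:] if ac_line.startswith("AC ") else ac_line
    numbers = [item.strip() for item in body.split(";") if item.strip()]
    if not numbers:
        raise ValueError("no accession numbers found")
    return numbers[0], numbers[1:]


primary, secondary = split_accessions("AC   X07732; M18930;")
print(primary)    # X07732
print(secondary)  # ['M18930']
```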

Data Libraries

E/GCG converts the commentary and sequence information available in these databases into E/GCG format, and organises it into data libraries. Thus, any sequence you obtain using E/GCG programmes will automatically be in E/GCG format.

Each sequence database has a corresponding data library, usually named after the database. For example, EMBL, SwissProt, and GenBank are the names of databases, and are also the logical names of E/GCG data libraries. The GenEMBL data library represents a fusion of EMBL with GenBank.

All these data library names have short forms to save typing: em refers to EMBL, gb refers to GenBank, ge refers to GenEMBL, etc.

To specify a particular sequence in a particular data library, you give the logical name (or short form) of the data library together with the sequence identifier, separated by a colon. "gb:humrep2" specifies the humrep2 sequence from GenBank.
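
The library:entry convention can be illustrated with a few lines of Python. This is a sketch only: `parse_spec` and the `SHORT_FORMS` table are hypothetical names, and the table lists just the three short forms mentioned above.

```python
# Short forms taken from the text above; the real list is longer.
SHORT_FORMS = {"em": "EMBL", "gb": "GenBank", "ge": "GenEMBL"}


def parse_spec(spec):
    """Split 'library:entry' and expand a known short form of the library."""
    library, colon, entry = spec.partition(":")
    if not colon or not library or not entry:
        raise ValueError(f"expected 'library:entry', got {spec!r}")
    # Unknown names are passed through unchanged.
    return SHORT_FORMS.get(library.lower(), library), entry


print(parse_spec("gb:humrep2"))    # ('GenBank', 'humrep2')
print(parse_spec("EMBL:hshepsh"))  # ('EMBL', 'hshepsh')
```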

Subsections of the Databases

DNA Databases

The EMBL and GenBank sequence databases are split into many different subsections, or divisions, in the E/GCG data libraries. The main purpose of this is to let you restrict a search to the most relevant sequences. These divisions may contain certain taxonomic categories, individual species, or even special classes of loci. What are the advantages?

Divisions ("Taxonomic" Categories) of EMBL and GenBank

Logical Name         Abbreviation   Subsection Accessed
Bacterial:*          ba:*           Bacterial (prokaryotes)
Eukaryote:*          or:*           Eukaryote organelles
Organelle:*          or:*           Organelle sequences
Fungal:*             fun:*          Fungal (EMBL only)
Plant:*              pl:*           Plant (includes fungi in GenBank)
Human:*              hu:*           Human sequences
Rodent:*             ro:*           Rodent sequences
Primate:*            pr:*           Primate sequences
other_mammalian:*    om:*           Other Mammalian (not primate or rodent)
Other_vertebrate:*   ov:*           Other Vertebrate
sts:*                sts:*          Sequence-tagged site sequences (NEW)
est:*                est:*          Expressed sequence tags (NEW)
tags:*               tags:*         STSs and ESTs (NEW)
Structural:*         st:*           Structural RNA
Patent:*             pat:*          Patented sequences

There are three relatively new DNA database divisions available as E/GCG data libraries: sequence-tagged sites, expressed sequence tags, and the union of these two, called simply "tags". These subsections have grown so quickly in number that if you wish to include these sequences in a database search, you must now ask for them explicitly.

DNA Data Library Logical Names - A Quick Reminder

Data Accessed               GenEMBL         EMBL         GenBank
Entire sequence             GenEMBLPlus:*   EMBLPlus:*   GenBankPlus:*
All sequences except tags   genembl:*       embl:*       genbank:*
                            ge:*            em:*         gb:*
Only tags                   tags:*          em_tags:*    gb_tags:*


Protein Databases

The protein sequence databases SwissProt and PIR have not been as extensively sub-divided. The SwissProt data library may be queried for several taxonomic categories (e.g., human, chick, or mouse entries, via the logical names swissprot:*_human, swiss:*_chick, sw:*_mouse). Check the current possibilities. In PIR there are four subsections, corresponding to sequences of different "quality assurance".
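
The effect of a taxonomic wildcard such as sw:*_human can be imitated with ordinary pattern matching. The sketch below is an illustration rather than a description of how E/GCG resolves such names: `select_entries` is a hypothetical function, the entry names are sample values chosen for the example, and matching is assumed to ignore case.

```python
from fnmatch import fnmatchcase


def select_entries(spec, entry_names):
    """Return the entry names matched by the wildcard part of 'library:pattern'."""
    _library, _, pattern = spec.partition(":")
    # Compare in lower case so that 'sw:*_human' matches 'HEPS_HUMAN'.
    return [name for name in entry_names
            if fnmatchcase(name.lower(), pattern.lower())]


# Sample entry names, used only to demonstrate the pattern.
names = ["HEPS_HUMAN", "OPSD_HUMAN", "OPSD_CHICK", "OPSD_MOUSE"]
print(select_entries("sw:*_human", names))  # ['HEPS_HUMAN', 'OPSD_HUMAN']
print(select_entries("sw:*", names))        # all four names
```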

Protein Data Library Logical Names

Data Accessed                     SwissProt     PIR         TREMBL
Entire sequence database          swissprot:*   protein:*   not avail
(Annotated in PIR)                swiss:*       prot:*      not avail
                                  sw:*          pir1:*      not avail
PIR Preliminary sequences                       pir2:*
PIR Unverified seqs                             pir3:*
PIR Unencoded/untranslated seqs                 pir4:*


Obtaining sequences

To find one or more sequences from the E/GCG data libraries, use the lookup programme. It first presents you with a menu of databases to search in, and then gives the list of searchable fields in which you can specify your query. The arrow keys move between the fields in this list, and <CTRL>D starts the search. Below is an example search for goldfish mRNA sequences in EMBL.

prompt> lookup
LookUp identifies sequences by name, accession number, author, organism,
keyword, title, reference, feature, definition, length, or date.  The output
is a list of sequences. 
The LookUp program is experimental in this release--please look carefully at
your results. 
 LOOKUP in what sequence libraries:
   a) sw_release
   b) pir
   c) embl
   d) genbank
   e) em_tags
   f) gb_tags
   g) gb_new
   h) em_new
   i) sw_new
   j) epd
   k) All libraries
   q) quit
 Please choose one or more (* k *):  c

... a new screen is written ...

 Complete the query form below:
                 All text:
               Definition:  mRNA
            Sequence name:
         Accession number:
                 Organism:  Carassius auratus
  On or after (dd-mmm-yy):               On or before (dd-mmm-yy):
 Shortest sequence length:                Longest sequence length:
     Inter-field operator:  AND             Form of output list:  Whole Entries
 Press <Ctrl>D to continue.
 Searching embl
 53 entries were found.
 Do you wish to:
   1) write out this list to a file
   2) preview the results
   3) refine the query
   4) choose different libraries
   q) quit
 Please choose one (* 1 *):  
 What should I call the output file (* lookup.list *) ?  
 53 entries were written to "lookup.list"


The resulting file "lookup.list" contains the set of EMBL database sequence entries, with comments describing the sequences indicated by an exclamation mark:

prompt> more lookup.list 

LOOKUP in: embl  of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"
 53 entries  October 27, 1995 11:05 ..
EM_OV:CA07056 ! ID: a0000103
! DE   Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08016 ! ID: a1000103
! DE   Carassius auratus kainate receptor beta subunit mRNA, complete cds.
EM_OV:CA08017 ! ID: a2000103
! DE   Carassius auratus kainate receptor alpha subunit mRNA, complete
! DE   cds.
EM_OV:CA12018 ! ID: a3000103
! DE   Carassius auratus glutamate receptor 4 (glur4) mRNA, partial cds.
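
A list file like this is easy to process outside E/GCG. The following Python sketch collects each entry specification together with its description. It rests on two assumptions read off the listing above: the header ends at the first line that finishes with "..", and everything after an exclamation mark is a comment. `read_list` is a hypothetical name.

```python
LIST_TEXT = """\
LOOKUP in: embl  of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"
 53 entries  October 27, 1995 11:05 ..
EM_OV:CA07056 ! ID: a0000103
! DE   Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08017 ! ID: a2000103
! DE   Carassius auratus kainate receptor alpha subunit mRNA, complete
! DE   cds.
"""


def read_list(text):
    """Return [(entry_spec, description), ...] from list-file text."""
    entries = []
    in_body = False
    for line in text.splitlines():
        if not in_body:
            # Skip header lines up to and including the one ending in '..'.
            in_body = line.rstrip().endswith("..")
            continue
        spec, _, comment = line.partition("!")
        spec, comment = spec.strip(), comment.strip()
        if spec:
            entries.append([spec, ""])
        elif entries and comment.startswith("DE"):
            # Join continuation lines of the description with a space.
            text_part = comment[2:].strip()
            entries[-1][1] = (entries[-1][1] + " " + text_part).strip()
    return [tuple(entry) for entry in entries]


for spec, description in read_list(LIST_TEXT):
    print(spec, "->", description)
```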

Exercise DNA Analysis - Sequence Databases 1: lookup some database sequences; get a local listfile
Search for rhodopsin sequences in EMBL, and send the sequence set to rhodopsin.list

prompt> lookup -out=rhodopsin.list

...Choose EMBL as the database ...

...Enter rhodopsin in the "All text:", "Definition:", & "Keyword:" fields,
and set the "Inter-field operator:" to OR ...

...Press <CTRL>D to continue, and accept the remaining defaults.

Send the list to the screen. Did you find the octopus rhodopsin pdrhod?

prompt> more rhodopsin.list




To copy a sequence entry from one of the E/GCG data libraries to a UNIX file, use the programme called fetch. It takes the database:entry you want as its argument. fetch responds by describing itself, and then prints the filename it has copied the database entry to.

prompt> fetch gb:hsef2
FETCH copies GCG sequences or data files from the GCG database 
into your directory or displays them on your terminal screen.

The name of the new UNIX file holding the E/GCG format sequence data is "hsef2.gb_pr". Because it is a normal UNIX file, you may use any normal UNIX commands on it. You can type it to the screen (using "more"), delete it (using "rm"), edit it (please use "seqed", NOT "pico, vi, emacs, etc."!), transfer it to your local site over the computer network, and use it as an input file to other E/GCG programs.

Exercise DNA Analysis - Sequence Databases 2: fetch a database sequence to a local file;
typedata a database sequence to the screen
Get the following sequences from GenEMBL, and display them to the screen: HSEF2, HSFAU, HSHT

prompt> fetch ge:hsef2
prompt> more hsef2.ge_pr
prompt> etc.

Get the same sequences and send them directly to the screen.

prompt> typedata ge:hsef2 | more
prompt> etc.




Long Sequences

The DNA sequence databases now contain sequences that exceed the allowable size limits for E/GCG programs. In the past these sequences were split into components of 350,000 bases. However, if a query sequence matched a region that spanned a break between two components, the alignment could be overlooked. The solution in force today with E/GCG data libraries is to split sequences longer than 350,000 bases into fragments of 110,000 bases, with a 10,000 base overlap between adjacent fragments. This overlap means that a match at a split-point still falls entirely within at least one fragment, provided the match is shorter than the overlap.
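
The splitting scheme can be made concrete with a short calculation. The sketch below is one reading of the description, not the E/GCG implementation: it assumes each fragment is 110,000 bases long including the overlap, so that successive fragments start 100,000 bases apart, and that the last fragment is simply whatever remains. Coordinates are 0-based and end-exclusive.

```python
def split_long_sequence(length, threshold=350_000, fragment=110_000,
                        overlap=10_000):
    """Return (start, end) pairs for the fragments of a sequence."""
    if length <= threshold:
        return [(0, length)]          # short enough: keep it whole
    step = fragment - overlap         # distance between fragment starts
    pieces = []
    start = 0
    while True:
        end = min(start + fragment, length)
        pieces.append((start, end))
        if end == length:
            break
        start += step
    return pieces


for start, end in split_long_sequence(400_000):
    print(start, end)
# 0 110000
# 100000 210000
# 200000 310000
# 300000 400000
```

Under these assumptions each pair of neighbouring fragments shares exactly 10,000 bases.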

This can be frustrating if you want to fetch long sequences, rather than search through data libraries! Retrieving complete long sequences is easier with specialist sequence retrieval programmes like SRS.

Please continue with Part 6 - Typical E/GCG Programmes

Updated on Thursday, 21 November, 1996
Copyright © 1995-1996 by Gary Williams, Peter Woollard, & David W. Featherston