European Molecular Biology Computing Network - Biocomputing Tutorials

Sequence Databases

Databases Available
DNA
Protein
Other

Sequence Formats
EMBL Format
GenBank Format
Accession Numbers

E/GCG Data Libraries
Database Subsections
DNA
Protein
Obtaining Sequences
Exercise 1: lookup some database sequences; get a local listfile
Exercise 2: fetch a database sequence
Long Sequences

Databases Available

The most commonly used sequence databases can be accessed from within the E/GCG packages. Databases are regularly updated where possible.

The sequence database compilers cooperate extensively; EMBL, DDBJ (DNA DataBank of Japan), and GenBank exchange new sequences daily. The vast majority of the sequences in GenBank are also in EMBL.

Nucleic Acid Sequences

EMBL (Compiled at the EBI, Europe)
GenBank (Compiled in the USA)
cDNA (HGMP-RC generated cDNAs)
EPD (Eukaryotic Promoter Database)

Peptide Sequences

SwissProt (Compiled at the EBI & Switzerland)
PIR (Protein Identification Resource)
TREMBL (Translation of EMBL coding sequences)

Other

REBASE (Restriction Enzymes)
PROSITE (Protein Motifs)
Many other data files, e.g. species specific translation tables.

Sequence Formats

Each sequence database has its own distinctive format, and all database formats are different in detail from the E/GCG sequence file format. Broadly speaking, though, ALL sequence files consist of commentary (header information), followed by sequence data. This similarity makes the inter-conversion of sequences relatively straightforward.

The DNA databases, in particular, hold identical information for each sequence but organise it differently. Compare the header information for the HSHEPSH sequence as stored in EMBL vs. GenBank.

EMBL Format

ID   HSHEPSH    standard; RNA; PRI; 2363 BP.
XX
AC   X07732; M18930;
XX
DT   16-JUL-1988 (Rel. 16, Created)
DT   22-SEP-1995 (Rel. 45, Last updated, Version 9)
XX
DE   Human hepatoma mRNA for serine protease hepsin
XX
KW   hepsin; membrane protein; serine protease; zymogen.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
OC   Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.
...

GenBank Format

LOCUS       HSHEPSH      2363 bp    RNA             PRI       22-SEP-1995
DEFINITION  Human hepatoma mRNA for serine protease hepsin.
ACCESSION   X07732 M18930
KEYWORDS    hepsin; membrane protein; serine protease; zymogen.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Primates;
            Catarrhini; Hominidae; Homo.
...
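
To make the comparison concrete, here is a minimal Python sketch that reads the two HSHEPSH headers shown above into one common record. The function names `parse_embl` and `parse_genbank` are hypothetical, the column offsets are assumptions based only on these two sample headers, and continuation lines are ignored. It is not a full parser for either format.

```python
# Sample headers copied from the EMBL and GenBank examples above (shortened).
EMBL_HEADER = """\
ID   HSHEPSH    standard; RNA; PRI; 2363 BP.
XX
DE   Human hepatoma mRNA for serine protease hepsin
XX
KW   hepsin; membrane protein; serine protease; zymogen.
"""

GENBANK_HEADER = """\
LOCUS       HSHEPSH      2363 bp    RNA             PRI       22-SEP-1995
DEFINITION  Human hepatoma mRNA for serine protease hepsin.
KEYWORDS    hepsin; membrane protein; serine protease; zymogen.
"""


def _keywords(value):
    """Split a 'a; b; c.' keyword string into a clean list."""
    return [k.strip() for k in value.rstrip(".").split(";") if k.strip()]


def parse_embl(text):
    """Collect EMBL lines by their two-letter code, then build a record."""
    fields = {}
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "XX" or not value:      # 'XX' lines are spacers
            continue
        fields.setdefault(code, []).append(value)
    name, rest = fields["ID"][0].split(None, 1)
    # rest looks like: 'standard; RNA; PRI; 2363 BP.'
    parts = [p.strip() for p in rest.rstrip(".").split(";")]
    return {
        "name": name,
        "molecule": parts[1],
        "division": parts[2],
        "length": int(parts[3].split()[0]),
        "description": " ".join(fields["DE"]).rstrip("."),
        "keywords": _keywords(" ".join(fields["KW"])),
    }


def parse_genbank(text):
    """Collect GenBank lines by their keyword (first 12 columns)."""
    fields = {}
    for line in text.splitlines():
        key, value = line[:12].strip(), line[12:].strip()
        if key:                            # continuation lines are skipped
            fields[key] = value
    locus = fields["LOCUS"].split()
    # locus looks like: ['HSHEPSH', '2363', 'bp', 'RNA', 'PRI', '22-SEP-1995']
    return {
        "name": locus[0],
        "molecule": locus[3],
        "division": locus[4],
        "length": int(locus[1]),
        "description": fields["DEFINITION"].rstrip("."),
        "keywords": _keywords(fields["KEYWORDS"]),
    }


embl_record = parse_embl(EMBL_HEADER)
genbank_record = parse_genbank(GENBANK_HEADER)
print(embl_record == genbank_record)
```

Both parsers return the same dictionary for HSHEPSH, which is the point of the comparison: the content matches and only the layout differs.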

Accession Numbers

These are the unique, and therefore absolutely reliable, identifiers assigned to sequences in the databases. Each sequence has a unique accession number, used for that sequence in all the databases containing it. An accession number is permanently associated with its sequence. On occasion, two or more sequences are merged; the new sequence is likely to be given a new accession number. All the old accession numbers are retained with the new sequence, becoming secondary accession numbers.

In the example above, the accession number for HSHEPSH is X07732. It also has a secondary accession number, M18930, probably indicating another sequence was combined with HSHEPSH.
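
As an illustration, the primary and secondary accession numbers can be pulled out of either header style with a few lines of Python. The helper name `split_accessions` is hypothetical, and the sketch assumes that the first number listed is the primary one, as in the HSHEPSH example.

```python
def split_accessions(line):
    """Return (primary, [secondary, ...]) from an EMBL 'AC' line
    or a GenBank 'ACCESSION' line."""
    # Drop the leading line tag, then treat ';' and spaces as separators.
    _, _, rest = line.strip().partition(" ")
    numbers = rest.replace(";", " ").split()
    if not numbers:
        raise ValueError("no accession numbers found in: %r" % line)
    # Assumption: the first accession listed is the primary one.
    return numbers[0], numbers[1:]


print(split_accessions("AC   X07732; M18930;"))
print(split_accessions("ACCESSION   X07732 M18930"))
```

Both calls give X07732 as the primary accession number and M18930 as the only secondary one.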

Data Libraries

E/GCG converts the commentary and sequence information available in these databases into E/GCG format, and organises it into data libraries. Thus, any sequence you obtain using E/GCG programmes will automatically be in E/GCG format.

Each sequence database has a corresponding data library, usually named after the database. For example, EMBL, SwissProt, and GenBank are the names of databases, and are also the logical names of E/GCG data libraries. The GenEMBL data library represents a fusion of EMBL with GenBank.

All these data library names have short forms to save typing: em refers to EMBL, gb refers to GenBank, ge refers to GenEMBL, etc.

To specify a particular sequence in a particular data library, you give the logical name (or short form) of the data library together with the sequence identifier, separated by a colon. "gb:humrep2" specifies the humrep2 sequence from GenBank.
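
The library:entry convention is simple enough to sketch in Python. The function `parse_spec` and the `SHORT_FORMS` table are hypothetical illustrations, not part of E/GCG, and the table lists only the three short forms named above.

```python
# Short forms mentioned in the text; the real E/GCG list is longer.
SHORT_FORMS = {"em": "EMBL", "gb": "GenBank", "ge": "GenEMBL"}


def parse_spec(spec):
    """Split 'library:entry' and expand a known short form of the library."""
    library, sep, entry = spec.partition(":")
    if not sep or not library or not entry:
        raise ValueError("expected 'library:entry', got %r" % spec)
    # Unknown names are passed through unchanged.
    return SHORT_FORMS.get(library.lower(), library), entry


print(parse_spec("gb:humrep2"))
print(parse_spec("GenEMBL:hsef2"))
```

The first call expands the short form gb to GenBank; the second leaves the full logical name as it is.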

Subsections of the Databases

DNA Databases

The EMBL and GenBank sequence databases are split into many different subsections or divisions in the E/GCG data libraries. The main purpose of this is to allow the searching of only the most relevant sequences. These divisions may contain certain taxonomic categories, individual species, or even special classes of loci. What are the advantages?

By specifying subsection(s) of a database, you can minimise your search time, and often obtain more pertinent/less noisy results.
You can refer to all the sequences in GenBank and EMBL collectively, by using the GenEMBL data library, or in separate groups, by using the smaller sets of taxonomic categories. These may be specified with their own logical names; Primate, Bacterial, etc. (See Divisions Table below)
You can also access single species categories, which tend to be indicated by the first letters of the scientific name. E.g., "em:hs*" specifies all human (Homo sapiens) entries in the EMBL database, and "gb:rn*" specifies all rat (Rattus norvegicus) entries in GenBank.
You can also specify taxonomic categories within a database. For instance, "em_ba:*" refers to EMBL bacterial sequences, and "gb_ro:*" refers to GenBank rodent sequences.
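
The wildcard specifications above behave much like shell filename patterns. The Python sketch below mimics that with the standard `fnmatch` module. It assumes case-insensitive matching and uses a small hand-made list of entry names taken from this tutorial, so it only illustrates the idea and says nothing about how E/GCG matches names internally.

```python
from fnmatch import fnmatchcase

# Entry names that appear elsewhere in this tutorial, used as a toy library.
ENTRY_NAMES = ["HSHEPSH", "HSEF2", "HSFAU", "HSHT", "CA07056", "CA08016"]


def select(spec, names):
    """Return the names matching the entry pattern of a 'library:pattern' spec."""
    _, _, pattern = spec.partition(":")
    # Assumption: matching ignores case, so compare lower-cased strings.
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]


print(select("em:hs*", ENTRY_NAMES))   # the Homo sapiens entries
print(select("em:ca*", ENTRY_NAMES))   # the Carassius auratus entries
print(select("em:*", ENTRY_NAMES))     # everything in the toy library
```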

**Divisions ("Taxonomic" Categories) of EMBL and GenBank**
Logical Name	Abbreviation	Subsection Accessed
phage:*	ph:*	Bacteriophages
Viral:*	vi:*	Viral
Bacterial:*	ba:*	Bacterial (prokaryotes)
Eukaryote:*	or:*	Eukaryote organelles
Organelle:*	or:*	Organelle sequences
Fungal:*	fun:*	Fungal (EMBL only)
Plant:*	pl:*	Plant (includes fungi in GenBank)
Invertebrate:*	in:*	Invertebrates
Human:*	hu:*	Human sequences
Rodent:*	ro:*	Rodent sequences
Primate:*	pr:*	Primate sequences
other_mammalian:*	om:*	Other Mammalian (not primate or rodent)
Other_vertebrate:*	ov:*	Other Vertebrate
sts:*	sts:*	Sequence-tagged site sequences (NEW)
est:*	est:*	Expressed sequence tags (NEW)
tags:*	tags:*	STSs and ESTs (NEW)
Structural:*	st:*	Structural RNA
Synthetic:*	sy:*	Synthetic
Unclassified:*	un:*	Unclassified
Patent:*	pat:*	Patented sequences

There are three relatively new DNA database divisions available as E/GCG data libraries: sequence-tagged sites, expressed sequence tags, and the union of these two, called simply "tags". These subsections have grown so quickly in number that if you wish to include these sequences in a database search, you must now ask for them explicitly.

**DNA Data Library Logical Names - A Quick Reminder**
Data Accessed	GenEMBL	EMBL	GenBank
Entire sequence database	GenEMBLPlus:*, geplus:*, gep:*	EMBLPlus:*, emplus:*, emp:*	GenBankPlus:*, gbplus:*, gbp:*
All sequences except `tags`	genembl:*, ge:*	embl:*, em:*	genbank:*, gb:*
Only `tags`	tags:*	em_tags:*	gb_tags:*

Protein Databases

The protein sequence databases SwissProt and PIR have not been as extensively sub-divided. The SwissProt data library may be queried for several taxonomic categories (e.g., human, chick, or mouse entries, via the logical names swissprot:*_human, swiss:*_chick, sw:*_mouse). Check the current possibilities. In PIR there are four subsections, corresponding to sequences of different "quality assurance".

**Protein Data Library Logical Names**
Data Accessed	SwissProt	PIR	TREMBL
Entire sequence database (Annotated in PIR)	swissprot:*, swiss:*, sw:*	protein:*, prot:*, pir1:*	not avail
PIR Preliminary sequences		pir2:*	
PIR Unverified seqs		pir3:*	
PIR Unencoded/untranslated seqs		pir4:*	

Obtaining sequences

To find one or more sequences from the E/GCG data libraries, use the lookup programme. It first presents you with a menu of databases to search in, and then gives the list of searchable fields in which you can specify your query. The arrow keys move between the fields in this list, and <CTRL>D starts the search. Below is an example search for goldfish mRNA sequences in EMBL.

prompt> lookup
 
LookUp identifies sequences by name, accession number, author, organism,
keyword, title, reference, feature, definition, length, or date.  The output
is a list of sequences. 
 
The LookUp program is experimental in this release--please look carefully at
your results. 
 
 LOOKUP in what sequence libraries:
 
   a) sw_release
   b) pir
   c) embl
   d) genbank
   e) em_tags
   f) gb_tags
   g) gb_new
   h) em_new
   i) sw_new
   j) epd
   k) All libraries
 
   q) quit
 
 Please choose one or more (* k *):  c

... a new screen is written ...

 Complete the query form below:
 
                 All text:
               Definition:  mRNA
                   Author:
                  Keyword:
            Sequence name:
         Accession number:
                 Organism:  Carassius auratus
                Reference:
                    Title:
                  Feature:
  On or after (dd-mmm-yy):               On or before (dd-mmm-yy):
 Shortest sequence length:                Longest sequence length:
 
     Inter-field operator:  AND             Form of output list:  Whole Entries
 
 Press <Ctrl>D to continue.
 
 Searching embl
 
 53 entries were found.
 
 Do you wish to:
 
   1) write out this list to a file
   2) preview the results
   3) refine the query
   4) choose different libraries
 
   q) quit
 
 Please choose one (* 1 *):  
 
 What should I call the output file (* lookup.list *) ?  
 
 .
 53 entries were written to "lookup.list"

prompt>

The resulting file "lookup.list" contains the set of EMBL database sequence entries, with comments describing the sequences indicated by an exclamation mark:

prompt> more lookup.list

 LOOKUP in: embl
        of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"

 53 entries    October 27, 1995 11:05 ..

EM_OV:CA07056 ! ID: a0000103
! DE Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08016 ! ID: a1000103
! DE Carassius auratus kainate receptor beta subunit mRNA, complete cds.
EM_OV:CA08017 ! ID: a2000103
! DE Carassius auratus kainate receptor alpha subunit mRNA, complete
! DE cds.
EM_OV:CA12018 ! ID: a3000103
! DE Carassius auratus glutamate receptor 4 (glur4) mRNA, partial cds.
...

Exercise 1: lookup some database sequences; get a local listfile

Search for rhodopsin sequences in EMBL, and send the sequence set to rhodopsin.list

prompt> lookup -out=rhodopsin.list

...Choose EMBL as the database ...
...Enter rhodopsin in the "All text:", "Definition:", & "Keyword:" fields, selecting OR as the "Inter-field operator:" ...
...Press <CTRL>D to continue, and accept the remaining defaults.

Send the list to the screen. Did you find the octopus rhodopsin pdrhod?

prompt> more rhodopsin.list

On-line help for lookup is available via the command

prompt> genhelp lookup

You may also check the manual web pages for complete details: lookup.

To copy a sequence entry from one of the E/GCG data libraries to a UNIX file, use the programme called fetch. It takes the database:entry you want as its argument. fetch responds by describing itself, and then prints the filename it has copied the database entry to.

prompt> fetch gb:hsef2

FETCH copies GCG sequences or data files from the GCG database into your
directory or displays them on your terminal screen.

hsef2.gb_pr

The name of the new UNIX file holding the E/GCG format sequence data is "hsef2.gb_pr". Because it is a normal UNIX file, you may use any normal UNIX commands on it. You can type it to the screen (using "more"), delete it (using "rm"), edit it (please use "seqed", NOT "pico, vi, emacs, etc."!), transfer it to your local site over the computer network, and use it as an input file to other E/GCG programs.

Exercise 2: fetch a database sequence to a local file; typedata a database sequence to the screen

Get the following sequences from GenEMBL, and display them to the screen: HSEF2, HSFAU, HSHT

prompt> fetch ge:hsef2
prompt> more hsef2.ge_pr
prompt> etc.

Get the same sequences and send them directly to the screen.

prompt> typedata ge:hsef2 | more
prompt> etc.

On-line help for fetch and typedata is available via the commands

prompt> genhelp fetch
prompt> genhelp typedata

You may also check the manual web pages for complete details: fetch.

Long Sequences

The DNA sequence databases now contain sequences that exceed the allowable size limits for E/GCG programs. In the past these sequences were split into components of 350,000 bases. However, if a query sequence matched a region of these split sequences that spanned a break, the alignment may have been overlooked. The solution in force today with E/GCG data libraries is to split sequences longer than 350,000 bases into fragments of 110,000 bases, with a 10,000 base overlap between adjacent fragments. This overlap ensures that query sequence matches at split-points will not be overlooked.

This can be frustrating if you want to fetch long sequences, rather than search through data libraries! Retrieving complete long sequences is easier with specialist sequence retrieval programmes like SRS.

Please continue with Part 6 - Typical E/GCG Programmes
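
The splitting scheme described under Long Sequences can be worked through with a short Python sketch. The function `fragment_ranges` is hypothetical, and it assumes that each fragment starts 100,000 bases after the previous one (110,000 minus the 10,000 base overlap) and that the last fragment is simply cut off at the end of the sequence. The exact boundaries used by the E/GCG data libraries may differ.

```python
def fragment_ranges(length, limit=350_000, size=110_000, overlap=10_000):
    """Return 1-based, inclusive (start, end) positions of each fragment."""
    if length <= limit:
        return [(1, length)]          # short enough: stored as one piece
    step = size - overlap             # assumed distance between fragment starts
    ranges = []
    start = 1
    while True:
        end = min(start + size - 1, length)
        ranges.append((start, end))
        if end == length:
            break
        start += step
    return ranges


for start, end in fragment_ranges(400_000):
    print(start, end)
```

Under these assumptions a 400,000 base sequence gives four fragments: 1-110000, 100001-210000, 200001-310000 and 300001-400000. Each adjacent pair shares 10,000 bases, so a match that falls across a split point lies wholly inside at least one fragment, provided it is no longer than the overlap.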

Comments? Questions? Accolades?
Please send them to David Featherston ( dwf@biobase.dk )