European Molecular Biology Computing Network - Biocomputing Tutorials

Searching Databases 1

Creating databases to search
Virtual databases via list files
Exercise 1: virtual database sub-sets with lookup, findpatterns, etc.

Actual databases via dataset
Exercise 2: actual mini-databases with dataset

Creating databases to search

It is often useful to specify a set of sequences - a "personal database" - for your particular research interests, or for a special series of analyses. A personal database can have members from different data libraries, as well as your unpublished results. Depending on your needs, a personal database can be either virtual, with the sequences still existing in the E/GCG data libraries, or actual, with the sequences stored in files in your own directories.

Virtual databases

The contents of a virtual personal database are described by list files, like the one produced by the lookup programme (see Sequence Databases). Since several E/GCG programmes that search data libraries also write list files, you can create a precisely targeted virtual personal database simply by running two or three different searches in succession, each taking the list file written by the previous search as its input.

Virtual personal databases are easy to create using various searching programs, easy to amend, and use hardly any disk-space compared to their actual counterparts. They are, however, limited in scope to sequences that are found in E/GCG data libraries; a list file can usually only have references to valid data library sequences. Nonetheless, virtual personal databases are the recommended approach!
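
Conceptually, refining a virtual database is a set operation on entry names. The following is a minimal sketch in Python, not part of E/GCG. It assumes that a list file is plain text in which the entries follow a line ending in "..", one sequence specification per line. The function names and the sample entries are made up for illustration.

```python
def read_list_entries(text):
    """Return the entry names that follow the '..' divider (assumed format)."""
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Everything up to and including the divider line is header text.
            in_body = stripped.endswith("..")
            continue
        if stripped:
            # Keep only the first token; any trailing columns are ignored.
            entries.append(stripped.split()[0])
    return entries


def refine(first, second):
    """Entries of `first` that also appear in `second`, in original order."""
    keep = set(second)
    return [name for name in first if name in keep]


# Hypothetical list files, standing in for the output of two searches.
search_a = """Result of a first search (made-up entries)
..
GB_OV:CRABLU
GB_OV:EXAMPLE1
GB_OV:EXAMPLE2
"""
search_b = """Result of a second search (made-up entries)
..
GB_OV:EXAMPLE2
GB_OV:CRABLU
"""

refined = refine(read_list_entries(search_a), read_list_entries(search_b))
print(refined)
```

In E/GCG itself you would normally get the same effect by passing the first list file to the second search programme, rather than by intersecting two lists afterwards.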

Actual databases

Actual personal databases are created with the dataset programme. These are full E/GCG data libraries, stored in your own directories and counting against your disk space. Analysing or manipulating sequences from an actual personal database will be slightly faster than from the E/GCG data libraries because the search time will be shorter. Further, you can select subsets of an actual personal database by using wildcards in the name, just as you can with GenBank or EMBL, etc.

Use an actual personal database if you have a large set of sequences that you will be processing often, which do not occur in the public databases, and which will not be changed or added to.

Virtual databases via list files

There are many programs which write out list files. Some of them are:

lookup
stringsearch
wordsearch
findpatterns -names
fasta -noalign
tfasta -noalign

To illustrate the creation and refinement of a virtual database, we will find all the mRNA sequences for goldfish, filter out those lacking a particular restriction enzyme cutting site, and view the sequences on the screen.

Exercise 1: create and refine a virtual database with lookup & findpatterns

Query the GenBank & EMBL data libraries for mRNA sequences from the goldfish. (see Sequence Databases to refresh your memory on lookup)

prompt> lookup -lib=gb,em -all=mRNA -org="Carassius auratus" -out=gofishmrna.list

(Or you may enter only lookup, and respond to all the prompts.)
"<CTRL> D" begins the search and "1" writes the list file.

Refine this set of sequences to hold only sequences containing two or more EcoRI recognition sites (GAATTC).

prompt> findpatterns @gofishmrna.list -pat=GAATTC -minc=2 -names -out=gofishmrnaecor1.list

The findpatterns programme is given the output list file from lookup as its input file, preceded by an "@" symbol to indicate that gofishmrna.list is a list file. The "-names" switch tells findpatterns to write a list file as its output.
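
The filtering step can be pictured as "keep every sequence in which the pattern occurs at least a minimum number of times". The sketch below, in Python, illustrates only that idea and is not how findpatterns is implemented. It does exact, forward-strand matching, which is enough here because GAATTC is its own reverse complement. The function names and toy sequences are hypothetical.

```python
def count_sites(sequence, pattern="GAATTC"):
    """Count non-overlapping, exact occurrences of pattern (case-insensitive)."""
    return sequence.upper().count(pattern.upper())


def filter_by_min_count(sequences, pattern="GAATTC", min_count=2):
    """Names of the sequences holding at least min_count copies of pattern."""
    return [
        name
        for name, seq in sequences.items()
        if count_sites(seq, pattern) >= min_count
    ]


# Toy sequences (made up) with zero, one and two EcoRI sites.
toy = {
    "seq_none": "ATGCATGCATGC",
    "seq_one": "ATGGAATTCATGC",
    "seq_two": "GAATTCATGCGAATTCAT",
}

selected = filter_by_min_count(toy)
print(selected)
```

Only the names that pass the filter are kept, which mirrors the way a list file records references to sequences rather than the sequences themselves.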

View the sequences.

prompt> typedata @gofishmrnaecor1.list | more
 
FETCH copies GCG sequences or data files from the GCG database 
into your directory or displays them on your terminal screen.
 
 crablu
LOCUS       CRABLU       1257 bp ss-mRNA            VRT       03-MAR-1993
DEFINITION  Carassius auratus blue cone opsin mRNA, complete cds.
ACCESSION   L11864
KEYWORDS    blue sensitive cone opsin; opsin.

 ...

Can any of these sequences be almost completely sub-cloned using only EcoRI? (Hint!)
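
One way to read this question is: does any sequence have two EcoRI sites far enough apart that the fragment between them covers most of the entry? The Python sketch below is an illustration under stated assumptions, not an E/GCG tool. It assumes EcoRI cuts after the first G of GAATTC and reports the largest EcoRI-to-EcoRI fragment as a fraction of the sequence length. The function names and the toy sequence are hypothetical.

```python
def ecori_cut_positions(sequence, site="GAATTC", offset=1):
    """Positions where the strand is cut: `offset` bases into each site."""
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)
    return cuts


def largest_fragment_fraction(sequence):
    """Largest fragment bounded by EcoRI cuts on both sides, as a fraction."""
    cuts = ecori_cut_positions(sequence)
    if len(cuts) < 2:
        # Fewer than two sites: no fragment with EcoRI ends on both sides.
        return 0.0
    lengths = [right - left for left, right in zip(cuts, cuts[1:])]
    return max(lengths) / len(sequence)


# Toy sequence (made up): an EcoRI site at each end of a 100 bp entry.
toy = "GAATTC" + "A" * 88 + "GAATTC"
fraction = largest_fragment_fraction(toy)
print(round(fraction, 2))
```

A fraction close to 1 suggests that nearly the whole entry could be excised with EcoRI alone; additional internal sites would split it into smaller fragments.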

Look!

lookup
findpatterns
stringsearch
wordsearch

prompt> genhelp lookup
prompt> genhelp findpatterns
prompt> genhelp stringsearch
prompt> genhelp wordsearch

Actual databases via dataset

Some warnings about creating actual personal databases:

They are a very good way to fill up your file space.
They are best used for a large number of private sequences that will not change, and will be searched often.
Large personal databases are easily re-created at each login, if you have access to temporary file space.

NB: It is far better to use virtual personal databases via list files - these are more flexible and use far, far less disk space!

To illustrate the creation of an actual database, we will first make a list file, edit it to hold references to ~20 sequences, and use it as an input file for dataset.

Exercise 2: create and refine a list file with lookup & findpatterns; create a personal database with dataset

Query the GenBank & EMBL data libraries for sequences having one of "jewel", "hippo", or "broom" in the header information. The commands below use "hippo".

prompt> lookup -lib=gb,em -all=hippo -out=hippo.list

If the number of entries is much greater than 20 (I found 468 with "hippo"), use findpatterns to trim the list size. (E.g., find only sequences that have exactly three EcoRI and/or XhoI sites.)

prompt> findpatterns @hippo.list -pat=GAATTC,CTCGAG -minc=3 -maxc=3 -names -out=hippo2.list

Use dataset to create a database named hippodb.

prompt> dataset @hippo2.list -out=hippodb -sn=hi

But contigcg.seq is also relevant to the hippodb database! Add this sequence.

prompt> dataset contigcg.seq -append -out=hippodb

Look at the human sequences in hippodb.

prompt> typedata hi:hs* | more
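
The wildcard in hi:hs* selects every entry in the personal database whose name starts with "hs". The Python sketch below shows this kind of name matching with the standard fnmatch module. It is an illustration only and does not reproduce E/GCG's own name resolution. The entry names are made up, and it assumes, as this step does, that names beginning with "hs" are human sequences.

```python
from fnmatch import fnmatchcase


def select_entries(names, pattern):
    """Entry names matching a shell-style wildcard, ignoring case."""
    lowered = pattern.lower()
    return [name for name in names if fnmatchcase(name.lower(), lowered)]


# Hypothetical entry names in a personal database.
names = ["HSEXAMPLE1", "HSEXAMPLE2", "MMEXAMPLE1", "CRABLU"]

human = select_entries(names, "hs*")
print(human)
```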

When through experimenting with the new personal database, delete it to conserve disk space. Check that you got ALL of it removed!

prompt> rm *hippo* ; ls -l *hippo*

Look!

dataset

prompt> genhelp dataset