European Molecular Biology Computing Network - Biocomputing Tutorials - DNA Sequence Analysis - Typical (E)GCG Programmes

Typical E/GCG Programmes


Table of Contents

The generic E/GCG programme
Mapping sequences with map
Translating sequences


The generic E/GCG programme

As you saw with the GCG programmes fetch, translate, reformat and fromstaden, most E/GCG programmes are called like UNIX commands. You type the programme name, a flag or two (optional), and an argument or two (sometimes optional) at the UNIX prompt, press <RETURN>, and follow directions. Most E/GCG "commands" expect one or more arguments specifying the names of files or database entries to act on. And many E/GCG commands accept flags (called "switches") to modify their behaviour.


prompt> programname argument1 argument2 -switch1 -switch2

This is the short description of the program that is running.
It is usually two lines long and fairly terse.

PROGRAMNAME what sequence(s) ? ge:someseq

Begin (* 1 *) ?
End (* 516 *) ?
Reverse (* No *) ?

Select one of:

A) First option
B) Second option

Please choose one (* A *): B (don't accept defaults without
knowing what you are accepting)

What should I call the output file (* someseq.pgmnm *) ?


Note that the arguments can occur before or after any switches; an argument is actually the answer to the programme's default switch "-INfile=". If the arguments are not present on the command line, then the programme will prompt for them. If switches are not present on the command line, the programme will use default values and will NOT prompt for them.
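To make the "arguments before or after switches" rule concrete, here is a minimal Python sketch. It is not how E/GCG itself parses its command line; the function name split_command_line is hypothetical, and the only assumption is that anything starting with "-" is a switch and everything else is an argument.

```python
def split_command_line(tokens):
    """Separate plain arguments from switches; position on the line is ignored."""
    arguments = []
    switches = {}
    for token in tokens:
        if token.startswith("-"):
            # "-out=hsfau2.map" gives name "out" and value "hsfau2.map";
            # a bare switch such as "-che" gets the value None.
            name, _, value = token[1:].partition("=")
            switches[name.lower()] = value if value else None
        else:
            arguments.append(token)
    return arguments, switches


before = split_command_line(["hsfau.ge_pr", "-dat=enzyme.dat", "-out=hsfau2.map"])
after = split_command_line(["-dat=enzyme.dat", "-out=hsfau2.map", "hsfau.ge_pr"])
print(before == after)   # the two orderings are equivalent
print(before)
```

Both orderings give the same argument list and the same switch table, which is the behaviour described above.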

To see what switches are available and optionally to set them, run the programme with the switch "-CHEck". You may abbreviate a switch by entering only the uppercase part of the switchname; the rest is optional.


prompt> programname -che

This is the short description of the program that is running.
It is usually two lines long and fairly terse.

Press <rtn> for more:

Syntax: % programname [-INfile=]GenEMBL:Humhb*

Required Parameters: None

Local Data Files: None

Optional Parameters:

-OUTfile=FileName copy file(s)-sequence(s) into one file
-DOCLines=6 copies only the first 6 lines of documentation.
-NOMONitor suppresses the screen monitor
-PROtein input sequence is protein

Add what to the command line ? -pro

PROGRAMNAME what sequence(s) ?

etc.
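The abbreviation rule (type at least the uppercase part of a switch name, the rest is optional) can be sketched in Python as follows. The switch names are the ones listed in the "-CHEck" output above; the functions required_part and resolve are hypothetical helpers, and the exact matching rules of the real programmes may differ in detail.

```python
# Switch names copied from the -CHEck listing above.
SWITCHES = ["INfile", "OUTfile", "DOCLines", "NOMONitor", "PROtein", "CHEck"]


def required_part(switch_name):
    """Return the leading capitals of a switch name, e.g. 'CHE' for 'CHEck'."""
    count = 0
    for ch in switch_name:
        if not ch.isupper():
            break
        count += 1
    return switch_name[:count]


def resolve(typed):
    """Return the full switch name matching what was typed, or None."""
    # Drop the leading "-" and any "=value" part, and ignore case.
    name = typed.lstrip("-").split("=", 1)[0].lower()
    for switch in SWITCHES:
        required = required_part(switch).lower()
        # The typed text must cover the capitals and must not contradict the rest.
        if name.startswith(required) and switch.lower().startswith(name):
            return switch
    return None


print(resolve("-che"))             # CHEck
print(resolve("-pro"))             # PROtein
print(resolve("-out=hsfau2.map"))  # OUTfile
print(resolve("-ch"))              # None: shorter than the uppercase part
```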


 

Input sequence specification: answering the -INfile switch

With the exception of the sequence exchange programmes and a few others, E/GCG programmes only recognise E/GCG format sequence files or entries in databases that have been converted to E/GCG data libraries.

Most E/GCG programmes work with sequence data. You can specify a single sequence either as an entry in one of the E/GCG data libraries,

prompt> fetch ge:humrep2

or as an E/GCG format sequence file in your directory.

prompt> frames humrep2.ge_pr

To specify multiple sequences you have three options: multiple entries in a data library (using a wildcard),

prompt> dataset ge:hs*

a single file holding many sequences (an ".msf" file),

prompt> pileup multseqfileA.msf

or a single file holding a list of E/GCG format sequence files (a "listfile").

prompt> assemble @listfile01

One point to note about arguments for E/GCG programmes: arguments that are database entries (that is, entries from E/GCG data libraries) may be given in upper- and/or lower-case because E/GCG itself is "case-insensitive". E/GCG programmes are run under the UNIX environment, though, and UNIX is a "case-sensitive" operating system. Therefore, if an argument is a UNIX file with one or more upper-case letters, it must be typed with its upper-case letter(s).
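The different ways of answering "-INfile" can be summarised in a small Python sketch. The function describe_spec and its category labels are hypothetical; the sketch assumes only the conventions described above: "@" introduces a list file, "library:entry" names a data library entry (case-insensitive, "*" as wildcard), and anything else is a UNIX file whose case must be preserved.

```python
def describe_spec(spec):
    """Classify a sequence specification by the conventions described above."""
    if spec.startswith("@"):
        return ("list file", spec[1:])            # file name: case preserved
    if ":" in spec:
        library, entry = spec.split(":", 1)
        kind = "library entries (wildcard)" if "*" in entry else "library entry"
        # Data library names are case-insensitive, so they can be normalised.
        return (kind, f"{library.lower()}:{entry.lower()}")
    if spec.lower().endswith(".msf"):
        return ("multiple sequence file", spec)   # file name: case preserved
    return ("sequence file", spec)                # file name: case preserved


for example in ["GE:HumRep2", "ge:hs*", "multseqfileA.msf", "@listfile01", "humrep2.ge_pr"]:
    print(example, "->", describe_spec(example))
```

Note that "GE:HumRep2" and "ge:humrep2" end up identical, while "multseqfileA.msf" keeps its capital A.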

 

Output sequence file specification: answering the -OUTfile switch

E/GCG programmes usually suggest a default name for their output file. It is best to select a name that has an extension reminding you of the programme that created the file, and this is what E/GCG attempts with the default suggestion. For example, DNAsequence23.fra could be the filename of the result from passing a nucleotide sequence through frames. Often, the output file of one programme is the input file for another; accepting E/GCG's default file extension for output files can save typing in subsequent steps.
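The default names seen in this tutorial (someseq.pgmnm, hsfau.map, hsef2.pep) all follow one pattern: the entry or file name without its library prefix and extension, plus an extension for the programme. The Python sketch below reproduces that pattern. It is inferred from the examples only; the function default_output_name is hypothetical and the real rule inside E/GCG may have more cases.

```python
import os


def default_output_name(spec, extension):
    """Guess a default output name: input stem plus a programme-specific extension."""
    entry = spec.split(":", 1)[-1]        # drop a "ge:" style library prefix
    stem, _ = os.path.splitext(os.path.basename(entry))
    return f"{stem}.{extension}"


print(default_output_name("hsfau.ge_pr", "map"))   # hsfau.map
print(default_output_name("hsef2.ge_pr", "pep"))   # hsef2.pep
print(default_output_name("ge:someseq", "pgmnm"))  # someseq.pgmnm
```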

 


Mapping sequences with map

map is a versatile program that finds restriction enzyme sites in a sequence. As with most E/GCG programmes, it accepts sequence data as its default input, and can be run with zero to many switches. These switches can modify the behaviour of map in useful ways, and we'll explore some of these modifications with the sequences fetched in the Sequences Databases Exercise 2.

In addition to the files or data library entries you specify, map accesses a file describing a vast number of commercially available restriction enzymes to determine what sites it can seek. This extra input file is normally read in from a central, hidden part of the system. We will fetch this file, too, and modify it to reflect our enzyme freezer stock, budget, and available vector sites.

Exercise DNA Analysis - Typical E/GCG Programmes 1: map a sequence
Run map with no argument or switches. Enter "hsfau.ge_pr" at the first question, and accept all of the remaining defaults. Have a look at the output.
prompt> map

Map displays both strands of a DNA sequence with restriction sites shown
above the sequence and possible protein translations shown below. 
 
 (Linear) MAP of what sequence ?  hsfau.ge_pr
 
                  Begin (* 1 *) ?  
                End (*   518 *) ?  
 
 Select the enzymes:  Type nothing or "*" to get all enzymes. Type "?"
 for help on which enzymes are available and how to select them. 
 
                                       Enzyme(* * *):  
 
 What protein translations do you want:
 
      a) frame 1   b) frame 2   c) frame 3
      d) frame 4   e) frame 5   f) frame 6
 
      t)hree forward frames   s)ix frames   o)pen frames only
 
      n)o protein translation   q)uit
 
 Please select (capitalize for 3-letter) (* t *):  
 
 What should I call the output file (* hsfau.map *) ?  

prompt> more hsfau.map

 (Linear) MAP of: hsfau  check: 2981  from: 1  to: 518
 
LOCUS       HSFAU         518 bp    RNA             PRI       23-SEP-1993
DEFINITION  H.sapiens fau mRNA.
ACCESSION   X65923
KEYWORDS    fau gene.
SOURCE      human.
  ORGANISM  Homo sapiens . . . 
 
 With 209 enzymes: * 
 
                             October 26, 1995 15:21  ..
 
                                               S
                    MH            B     C  AN  a        B          CB
               P   TbiM   B      AcT   Av  vlMAu        s          vs
               l   aonn   c      ceh   li  aawc9        m          io
               e   qIfl   c      ifa   uJ  IIoi6        F          RF
               I   IIII   I      III   II  IVIII        I          II
                      /            /    /      /                     
         TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC
       1 ---------+---------+---------+---------+---------+---------+ 60
         AAGGAGAAAGAGCTGAGGTAGAAGCGCCATCGACCCTGGCGGCAAGTCAGCGGTTATACG
 
a        F  L  F  L  D  S  I  F  A  V  A  G  T  A  V  Q  S  P  I  C   -
b         S  S  F  S  T  P  S  S  R  *  L  G  P  P  F  S  R  Q  Y  A  -
c          P  L  S  R  L  H  L  R  G  S  W  D  R  R  S  V  A  N  M  Q -

[several pages deleted]

 Enzymes that do cut:
 
   AceIII     AciI    AflII     AluI     ApaI     AscI    AvaII    BanII
     BbsI     BbvI     BccI    BcefI     BmgI     BpmI Bpu1102I    BsaJI
    BsaXI    BscGI    BsiEI  BsiHKAI     BslI    BsmFI    BsoFI Bsp1286I
     BsrI    BsrDI    BsrFI   BssHII   BstEII   Bsu36I    Cac8I    CviJI
    CviRI     DdeI     DpnI    DrdII     EaeI     EciI EcoO109I   EcoRII
     FauI     FokI    GdiII     HaeI    HaeII   HaeIII     HhaI    Hin4I
   HincII    HinfI     HphI    MaeII   MaeIII    MboII     MnlI     MscI
     MseI     MspI     MwoI     NciI   NlaIII    NlaIV     NspI     PleI
 Psp1406I     RsaI   Sau96I   Sau3AI    ScrFI    SfaNI     SphI     TaqI
     TauI     ThaI     TseI   Tsp45I  Tsp509I    TspRI Tth111II    UbaCI
 
 Enzymes that do not cut: 
 
    AatII     AccI   AflIII     AhdI     AlwI    AlwNI    ApaBI    ApaLI
     ApoI     AvaI    AvrII     BaeI    BamHI     BanI   Bce83I     BcgI
     BcgI     BclI     BfaI     BfiI     BglI    BglII     BplI   Bpu10I
     BsaI    BsaAI    BsaBI    BsaHI    BsaWI     BsbI    BseRI     BsgI
     BsmI    BsmAI    BsmBI   Bsp24I   Bsp24I    BspEI    BspGI BspLU11I
    BspMI    BsrBI    BsrGI    BssSI Bst1107I    BstXI    BstYI     CjeI
     CjeI    CjePI    CjePI     ClaI     DraI   DraIII     DrdI     DsaI
     EagI     EarI Eco47III   Eco57I    EcoNI    EcoRI    EcoRV     FseI
     FspI     HgaI   HgiEII  HindIII     HpaI     KpnI     MluI     MmeI
     MslI   MspA1I     MunI     NarI     NcoI     NdeI   NgoAIV     NheI
     NotI     NruI     NsiI     NspV     PacI Pfl1108I    PflMI    PinAI
     PmeI     PmlI    PshAI   Psp5II     PstI     PvuI    PvuII     RcaI
    RleAI    RsrII     SacI    SacII     SalI    SanDI     SapI     ScaI
    SexAI     SfcI     SfiI     SgfI    SgrAI     SmaI    SnaBI     SpeI
     SrfI Sse8387I Sse8647I     SspI     StuI     StyI     SunI     SwaI
    TaqII    TaqII     TfiI  Tth111I     VspI     XbaI     XcmI     XhoI
     XmnI

prompt>

The sequence and its complement are written in lines of 60 bases, with the names of the enzymes written vertically above their cutting sites. Where two or more enzymes cut at the same site, the additional enzyme names are displaced to the right and a slash is placed underneath them; the slash indicates that they actually cut somewhere to the left of the position shown. The requested three reading frames of protein translation are written under each 60 bases of sequence. The output ends with two lists of restriction enzymes: those that cut and those that do not.
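To see what map is doing in principle, here is a minimal Python sketch that writes the complement strand and looks for recognition sequences in the first 60 bases of hsfau. It is not map's algorithm: it reports where a recognition sequence starts rather than the exact cut position, it handles only the three palindromic sites given (so searching one strand is enough), and the small ENZYMES table is an illustrative stand-in for the enzyme data file.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# First 60 bases of hsfau, copied from the map output above.
TOP = "TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC"

# Illustrative recognition sequences only (all palindromic).
ENZYMES = {"TaqI": "TCGA", "AluI": "AGCT", "EcoRI": "GAATTC"}


def complement(seq):
    """Return the base-by-base complement, written in the same direction."""
    return seq.translate(COMPLEMENT)


def find_sites(seq, enzymes):
    """Return {enzyme: [1-based start positions of its recognition sequence]}."""
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)          # report 1-based positions
            start = seq.find(site, start + 1)    # allow overlapping matches
        hits[name] = positions
    return hits


print(TOP)
print(complement(TOP))
print(find_sites(TOP, ENZYMES))   # {'TaqI': [11], 'AluI': [30], 'EcoRI': []}
```

The second printed line matches the lower strand in the map output, and TaqI and AluI appear in the "Enzymes that do cut" list while EcoRI does not.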

This is fine, but "noisy"; we don't have all of these enzymes in stock, so many of the reported cutting sites are irrelevant. Let's trim a local copy of the file holding the restriction enzyme information, and try map again.
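Trimming the enzyme file by hand works well for a few entries. For a longer stock list the same idea can be sketched in Python. Everything about the file layout here is an assumption made for illustration: the sketch supposes that the documentation ends at the first line containing two dots (..), which must be kept, and that each data line after it starts with the enzyme name. The sample lines are a made-up stand-in, not the contents of the real enzyme.dat, and trim_enzyme_lines is a hypothetical helper.

```python
def trim_enzyme_lines(lines, in_stock):
    """Keep the header through the '..' line, then only lines for stocked enzymes."""
    kept = []
    in_header = True
    for line in lines:
        if in_header:
            kept.append(line)            # never drop header lines
            if ".." in line:
                in_header = False        # the two-dot line ends the header
            continue
        fields = line.split()
        # Keep blank lines and lines whose first word is an enzyme we stock.
        if not fields or fields[0] in in_stock:
            kept.append(line)
    return kept


# Hypothetical, simplified stand-in for a local copy of the data file.
sample = [
    "Made-up example header; NOT the real file layout",
    "Documentation ends at the line with two dots ..",
    "AluI   AGCT",
    "EcoRI  GAATTC",
    "TaqI   TCGA",
]

for line in trim_enzyme_lines(sample, {"EcoRI", "TaqI"}):
    print(line)
```

Check the actual layout of your fetched copy before relying on anything like this; editing with pico, as in the exercise, is the safe route.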

Exercise DNA Analysis - Typical E/GCG Programmes 2: edit a resource file; re-map a sequence
Get a local copy of the restriction enzymes data file.

prompt> fetch data:enzyme.dat

Edit your copy (what is its filename?) with pico (or other UNIX text editor). Remove several enzymes you (supposedly) do not use; DO NOT remove the line with the two dots (..)!
Run map again, this time with "hsfau.ge_pr" as the argument and "-dat=enzyme.dat -out=hsfau2.map" as the switches. Accept the remaining defaults again, and look at the new output file.

prompt> map hsfau.ge_pr -dat=enzyme.dat -out=hsfau2.map
prompt> more hsfau2.map

Check all the possible switches for map.

prompt> map -che

Adjust map so your output file shows only enzymes that cut the sequence two or three times.

prompt> map hsfau.ge_pr -dat=enzyme.dat -out=hsfau3.map -minc=2 -maxc=3
prompt> more hsfau3.map
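The effect of "-minc=2 -maxc=3" amounts to keeping only enzymes whose number of cuts lies in that range. A minimal Python sketch of that filter follows. The function select_by_cut_count is hypothetical, and the enzyme names and positions in the example are invented for illustration, not taken from hsfau.

```python
def select_by_cut_count(sites, min_cuts, max_cuts):
    """Keep enzymes that cut at least min_cuts and at most max_cuts times."""
    return {
        name: positions
        for name, positions in sites.items()
        if min_cuts <= len(positions) <= max_cuts
    }


# Invented cut positions, for illustration only.
example_sites = {
    "EnzA": [120],
    "EnzB": [45, 310],
    "EnzC": [12, 200, 415],
    "EnzD": [5, 60, 130, 290, 400],
    "EnzE": [],
}

print(select_by_cut_count(example_sites, 2, 3))
# {'EnzB': [45, 310], 'EnzC': [12, 200, 415]}
```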

The hsfau3.map file is a text file that tries to be a picture. Unfortunately, it shows the data at a large scale (60 bases per line), so we can't see the entire sequence and all the enzymes that cut it on one screen. For a better view of how often and where the available enzymes cut the sequence, we need a true graphical display.

Exercise DNA Analysis - Typical E/GCG Programmes 3: configure the graphics display; plot a sequence map with mapplot
Show the possible graphics display options and choose one. If you are using X-windows, choose "xcol" or "xmon". If not, choose "epsf" or "psf", and transfer these files to your local computer for printing on a postscript printer.
    [NB I: Below are the options at BioBase.]
    [NB II: "epsf" or "psf" files can be transferred by FTP, Kermit, or as EMail attachments.]
prompt> setplot

+--------------------->  displaying all of 10 option(s)  <---------------------+
|psf        postscript - sent to file: homedir:graf.ps                         |
|epsf       eps postscript - sent to file: homedir:graf.eps                    |
|hpg        hp laser with hpgl - sent to file: homedir:graf.hp                 |
|xcol       x windows colour graphics - for x-windows terminal                 |
|xmon       x windows monochr. graphics - for x-windows terminal               |
|vt340      vt340 graphics  - for a vt340 terminal                             |
|vt241      vt241 graphics  - for a vt241 terminal                             |
|tek        versaterm tektronix 4105 graphics on your terminal                 |
|dec        declaser 5100 postscript/pcl/hpgl printer at biobase               |
|qms        qms colorscript210 ps printer at biobase (14 kr./pg)               |
|                                                                              |
|                                                                              |
+------------------------------------------------------------------------------+
enter a command. choices are:
          <up-arrow> and <down-arrow> scroll the list
          <return> makes GCG use the selected device
          Q quits without doing anything

          C creates and edits a new device
          (you can't delete from the site file)
          V views the selection (use C to edit a copy)


Plot the map. Have a look!

prompt> mapplot hsfau.ge_pr -dat=enzyme.dat -minc=2 -maxc=3

This final output might show possibilities for sub-cloning most of hsfau with only one enzyme. Can you sub-clone a fragment that is only coding sequence? Which open reading frame(s) is (are) used? Where is this information shown in the original sequence file? (Hint!) Are "hsef2.ge_pr" or "hsht.ge_pr" better or worse prospects for sub-cloning with your reduced enzyme list?

 



Translating sequences

The output of map displays possible amino acid sequences for each of the three forward reading frames. If we had no information about which sections of the DNA sequence were coding sequence, we could extract the possible amino acid sequences directly from hsfau3.map with eextractpeptide.
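The three forward frames that map prints under the sequence can be reproduced with a few lines of Python. This is a sketch using the standard genetic code and the first 60 bases of hsfau from the map output earlier; it is not the code used by map or eextractpeptide, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# First 60 bases of hsfau, copied from the map output.
TOP = "TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC"


def translate_frame(seq, frame):
    """Translate one forward frame (1, 2 or 3); incomplete codons are dropped."""
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


for label, frame in zip("abc", (1, 2, 3)):
    print(label, translate_frame(TOP, frame))
# a FLFLDSIFAVAGTAVQSPIC
# b SSFSTPSSR*LGPPFSRQY
# c PLSRLHLRGSWDRRSVANM
```

These agree with lines a, b and c of the map output. Frames b and c are one residue shorter here only because this sketch stops at base 60, whereas map continues into the next line of sequence.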
Exercise DNA Analysis - Typical E/GCG Programmes 4: extract possible amino acid sequences from the output of map with eextractpeptide
Get the possible peptide sequences from hsfau3.map. View the result.

prompt> eextractpeptide hsfau3.map -out=hsfau3.pep
prompt> more hsfau3.pep

Get the possible peptides for the other two mapped DNA sequences, hsef2.ge_pr and hsht.ge_pr.

Given that we know the coding regions for these three example sequences, let's translate them properly into proteins. For quick reference, the coding regions of these three sequences follow:

coding regions for example sequences

data library entry   filename      coding sequence
ge:hsef2             hsef2.ge_pr   1 .. 2577
ge:hsfau             hsfau.ge_pr   57 .. 458
ge:hsht              hsht.ge_pr    128 .. 1420
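Before answering the Begin and End prompts of translate, it can help to check that a coding range is a whole number of codons. The Python sketch below does that arithmetic for the three ranges in the table, treating both ends as included (1-based positions, as in the prompts). The helper names are hypothetical, and the count includes the stop codon whenever the range ends on one.

```python
# Coding ranges from the table above: (begin, end), both ends included.
CODING_REGIONS = {"hsef2": (1, 2577), "hsfau": (57, 458), "hsht": (128, 1420)}

# First 60 bases of hsfau, copied from the map output earlier.
HSFAU_START = "TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC"


def codon_count(begin, end):
    """Number of codons in a 1-based inclusive range; fail if it is not whole."""
    length = end - begin + 1
    if length % 3 != 0:
        raise ValueError(f"range {begin}..{end} is {length} bases, not a multiple of 3")
    return length // 3


def subsequence(seq, begin, end):
    """Extract a 1-based inclusive range from a Python (0-based) string."""
    return seq[begin - 1:end]


for name, (begin, end) in CODING_REGIONS.items():
    print(name, codon_count(begin, end))
# hsef2 859
# hsfau 134
# hsht 431

print(subsequence(HSFAU_START, 57, 59))   # ATG
```

The last line confirms that position 57 of hsfau, the start of its coding range in the table, is the A of an ATG start codon.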

Exercise DNA Analysis - Typical E/GCG Programmes 5: translate a sequence
Call for hsef2.ge_pr to be translated into protein. Specify the end of the coding sequence, and accept the rest of the defaults.
prompt> translate hsef2.ge_pr
 
TRANSLATE translates nucleotide sequences into peptide sequences.  
 
                  Begin (* 1 *) ?  
                End (*  3075 *) ?  2577
               Reverse (* No *) ?  
 
 Range begins ATGGT and ends TGTAG.  Is this correct (* Yes *) ? 
 
 That is done, now would you like to:
 
  A) Add another exon from this sequence
  B) Add another exon from a new sequence
 
  C) Translate and then add more genes from this sequence
  D) Translate and then add more genes from a new sequence
 
  W) Translate assembly and write everything into a file
 
 Please choose one (* W *):  
 
 What should I call the output file (* hsef2.pep *) ?  

Translate the other two sequences. Compare the two ".pep" files for each example DNA sequence. Be certain you name the translate output files differently from the output files of Exercise 4!

 

Exercise DNA Analysis - Typical E/GCG Programmes 6: map a sequence; find its possible peptides; try to devise a sub-cloning strategy
Map contigcg.seq . Using either eextractpeptide or translate, find its possible peptides. Choose a large ORF, and investigate the possibility of sub-cloning it with only one or two restriction enzyme digests.

 



Please continue with Part 7 - Sequence Comparison


Comments? Questions? Accolades?
Please send them to David Featherston (dwf@biobase.dk)
Updated on Thursday, 24 October, 1996
Copyright © 1995-1996 by Gary Williams, Peter Woollard, & David W. Featherston