European Molecular Biology Computing Network - Biocomputing Tutorials

Typical E/GCG Programmes

The generic E/GCG programme
Input sequence specification
Output sequence file specification

Mapping sequences with map
Exercise 1: map a sequence
Exercise 2: edit a resource file; re-map a sequence
Exercise 3: configure the graphics display; plot a sequence map with mapplot

Translating sequences
Exercise 4: extract possible amino acid sequences from the output of map
Exercise 5: translate a sequence
Exercise 6: map a sequence; find its possible peptides; try to devise a sub-cloning strategy

The `generic` E/GCG programme

As you saw with the GCG programmes fetch, translate, reformat and fromstaden, most E/GCG programmes are called like UNIX commands. You type the programme name, a flag or two (optional), and an argument or two (sometimes optional) at the UNIX prompt, press <RETURN>, and follow directions. Most E/GCG "commands" expect one or more arguments specifying the names of files or database entries to act on. And many E/GCG commands accept flags (called "switches") to modify their behaviour.

prompt> programname argument1 argument2 -switch1 -switch2
This is the short description of the program that is running.
It is usually two lines long and fairly terse.
PROGRAMNAME what sequence(s) ? ge:someseq
Begin (* 1 *) ?
End (* 516 *) ?
Reverse (* No *) ?
Select one of:
A) First option
B) Second option
Please choose one (* A *): B
(don't accept defaults without knowing what you are accepting)
What should I call the output file (* someseq.pgmnm *) ?

Note that the arguments can occur before or after any switches; an argument is actually the answer to the programme's default switch "-INfile=". If the arguments are not present on the command line, then the programme will prompt for them. If switches are not present on the command line, the programme will use default values and will NOT prompt for them.
To see what switches are available and optionally to set them, run the programme with the switch "-CHEck". You may abbreviate a switch by entering only the uppercase part of the switchname; the rest is optional.

prompt> programname -che
This is the short description of the program that is running.
It is usually two lines long and fairly terse.
Press <rtn> for more:
Syntax: % programname [-INfile=]GenEMBL:Humhb*
Required Parameters: None
Local Data Files: None
Optional Parameters:
-OUTfile=FileName copy file(s)-sequence(s) into one file
-DOCLines=6 copies only the first 6 lines of documentation.
-NOMONitor suppresses the screen monitor
-PROtein input sequence is protein
Add what to the command line ? -pro
PROGRAMNAME what sequence(s) ?
etc.
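
The abbreviation rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above (the uppercase part of the switch name is required, the rest is optional). The function name switch_matches is hypothetical, and the real E/GCG command-line parser may well behave differently in edge cases.

```python
def switch_matches(typed: str, switch: str) -> bool:
    """Return True if `typed` is an acceptable abbreviation of `switch`.

    `switch` is written as in the -CHEck listing (e.g. "-OUTfile"):
    the leading uppercase letters are taken to be the required part.
    This is an assumed reading of the rule, not the actual E/GCG parser.
    """
    name = switch.lstrip("-")

    # Collect the leading run of uppercase letters, e.g. "OUT" in "OUTfile".
    required = ""
    for ch in name:
        if not ch.isupper():
            break
        required += ch

    # Drop the leading dash and any "=value" part from what the user typed.
    typed_name = typed.lstrip("-").split("=", 1)[0].lower()

    # The typed text must contain the whole required part and must not
    # run past, or differ from, the full switch name.
    return (typed_name.startswith(required.lower())
            and name.lower().startswith(typed_name))


if __name__ == "__main__":
    for typed in ("-che", "-CHECK", "-ch", "-chex"):
        print(typed, switch_matches(typed, "-CHEck"))
```

Comparing in lower case reflects the case-insensitive way switches are typed in the examples of this tutorial (-che, -pro).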

Input sequence specification: answering the -INfile switch
With the exception of the sequence exchange programmes and a few others, E/GCG programmes only recognise E/GCG format sequence files or entries in databases that have been converted to E/GCG data libraries.

Most E/GCG programmes work with sequence data. You can specify a single sequence either as an entry in one of the E/GCG data libraries,
fetch ge:humrep2
or as an E/GCG format sequence file in your directory.
frames humrep2.ge_pr

To specify multiple sequences you have three options: multiple entries in a data library (using a wildcard),
dataset ge:hs*
a single file holding many sequences (an ".msf" file),
pileup multseqfileA.msf
or a single file holding a list of E/GCG format sequence files (a "listfile").
assemble @listfile01

One point to note about arguments for E/GCG programmes: arguments that are database entries (that is, entries from E/GCG data libraries) may be given in upper- and/or lower-case because E/GCG itself is "case-insensitive". E/GCG programmes are run under the UNIX environment, though, and UNIX is a "case-sensitive" operating system. Therefore, if an argument is a UNIX file with one or more upper-case letters, it must be typed with its upper-case letter(s).
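
The different kinds of input specification, and the case rule, can be summarised in a small Python sketch. It is purely illustrative: the function classify_input and its category labels are hypothetical, the tests it applies (a leading "@", a ":" separating library from entry, an ".msf" extension) are inferred from the examples above, and lower-casing library entries is just one way to show that their case does not matter.

```python
from typing import Tuple


def classify_input(spec: str) -> Tuple[str, str]:
    """Guess what kind of input a command-line argument names.

    Returns (kind, normalised_spec). Rules are inferred from the tutorial's
    examples and are an assumption, not the E/GCG implementation.
    """
    if spec.startswith("@"):
        # A file listing other sequence files; UNIX file names keep their case.
        return ("list file", spec)
    if ":" in spec:
        # library:entry, e.g. ge:humrep2; case-insensitive, so normalise it.
        kind = ("data library entries (wildcard)" if "*" in spec
                else "data library entry")
        return (kind, spec.lower())
    if spec.lower().endswith(".msf"):
        # One file holding many sequences.
        return ("multiple sequence file", spec)
    # Anything else is treated as a single sequence file (case-sensitive).
    return ("sequence file", spec)


if __name__ == "__main__":
    for arg in ("ge:humrep2", "GE:HS*", "multseqfileA.msf",
                "@listfile01", "humrep2.ge_pr"):
        print(arg, "->", classify_input(arg))
```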

Output sequence file specification: answering the -OUTfile switch
E/GCG programmes usually suggest a default name for their output file. It is best to select a name that has an extension reminding you of the programme that created the file, and this is what E/GCG attempts with the default suggestion. For example, DNAsequence23.fra could be the filename of the result from passing a nucleotide sequence through frames. Often, the output file of one programme is the input file for another; accepting E/GCG's default file extension for output files can save typing in subsequent steps.
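
The naming convention can be mimicked in Python as follows. This is a sketch under stated assumptions: the function default_output_name is hypothetical, and the extension table contains only the three programme/extension pairs that appear in this tutorial (map, frames, translate).

```python
import os

# Programme -> output extension, taken from the examples in this tutorial.
EXTENSIONS = {"map": "map", "frames": "fra", "translate": "pep"}


def default_output_name(input_name: str, programme: str) -> str:
    """Build an output name from the input's stem and the programme's extension."""
    base = os.path.basename(input_name)
    # For a data library entry such as ge:hsfau, keep only the entry name.
    if ":" in base:
        base = base.split(":", 1)[1]
    stem = os.path.splitext(base)[0]
    return f"{stem}.{EXTENSIONS[programme]}"


if __name__ == "__main__":
    print(default_output_name("hsfau.ge_pr", "map"))
    print(default_output_name("hsef2.ge_pr", "translate"))
```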

Mapping sequences with map
map is a versatile program that finds restriction enzyme sites in a sequence. As with most E/GCG programmes, it accepts sequence data as its default input, and can be run with zero to many switches. These switches can modify the behaviour of map in useful ways, and we'll explore some of these modifications with the sequences fetched in the Sequences Databases Exercise 2.
In addition to the files or data library entries you specify, map accesses a file describing a vast number of commercially available restriction enzymes to determine what sites it can seek. This extra input file is normally read in from a central, hidden part of the system. We will fetch this file, too, and modify it to reflect our enzyme freezer stock, budget, and available vector sites.

Exercise 1: map a sequence
Run map with no argument or switches. Enter "hsfau.ge_pr" at the first question, and accept all of the remaining defaults. Have a look at the output.
prompt> map
Map displays both strands of a DNA sequence with restriction sites shown above the sequence and possible protein translations shown below.
(Linear) MAP of what sequence ? hsfau.ge_pr
Begin (* 1 *) ?
End (* 518 *) ?
Select the enzymes: Type nothing or "*" to get all enzymes. Type "?" for help on which enzymes are available and how to select them.
Enzyme(* * *):
What protein translations do you want:
a) frame 1 b) frame 2 c) frame 3 d) frame 4 e) frame 5 f) frame 6
t)hree forward frames s)ix frames o)pen frames only
n)o protein translation q)uit
Please select (capitalize for 3-letter) (* t *):
What should I call the output file (* hsfau.map *) ?
prompt> more hsfau.map
(Linear) MAP of: hsfau check: 2981 from: 1 to: 518
LOCUS HSFAU 518 bp RNA PRI 23-SEP-1993
DEFINITION H.sapiens fau mRNA.
ACCESSION X65923
KEYWORDS fau gene.
SOURCE human.
ORGANISM Homo sapiens
. . .
With 209 enzymes: *
October 26, 1995 15:21 ..
[enzyme names, written vertically above their cutting sites, not reproduced here]
  TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC
1 ---------+---------+---------+---------+---------+---------+ 60
  AAGGAGAAAGAGCTGAGGTAGAAGCGCCATCGACCCTGGCGGCAAGTCAGCGGTTATACG
a F L F L D S I F A V A G T A V Q S P I C -
b S S F S T P S S R * L G P P F S R Q Y A -
c P L S R L H L R G S W D R R S V A N M Q -
[several pages deleted]
Enzymes that do cut:
AceIII AciI AflII AluI ApaI AscI AvaII BanII BbsI BbvI BccI BcefI
BmgI BpmI Bpu1102I BsaJI BsaXI BscGI BsiEI BsiHKAI BslI BsmFI BsoFI
Bsp1286I BsrI BsrDI BsrFI BssHII BstEII Bsu36I Cac8I CviJI CviRI DdeI
DpnI DrdII EaeI EciI EcoO109I EcoRII FauI FokI GdiII HaeI HaeII HaeIII
HhaI Hin4I HincII HinfI HphI MaeII MaeIII MboII MnlI MscI MseI MspI
MwoI NciI NlaIII NlaIV NspI PleI Psp1406I RsaI Sau96I Sau3AI ScrFI
SfaNI SphI TaqI TauI ThaI TseI Tsp45I Tsp509I TspRI Tth111II UbaCI
Enzymes that do not cut:
AatII AccI AflIII AhdI AlwI AlwNI ApaBI ApaLI ApoI AvaI AvrII
BaeI BamHI BanI Bce83I BcgI BcgI BclI BfaI BfiI BglI BglII BplI
Bpu10I BsaI BsaAI BsaBI BsaHI BsaWI BsbI BseRI BsgI BsmI BsmAI BsmBI
Bsp24I Bsp24I BspEI BspGI BspLU11I BspMI BsrBI BsrGI BssSI Bst1107I
BstXI BstYI CjeI CjeI CjePI CjePI ClaI DraI DraIII DrdI DsaI EagI
EarI Eco47III Eco57I EcoNI EcoRI EcoRV FseI FspI HgaI HgiEII HindIII
HpaI KpnI MluI MmeI MslI MspA1I MunI NarI NcoI NdeI NgoAIV NheI NotI
NruI NsiI NspV PacI Pfl1108I PflMI PinAI PmeI PmlI PshAI Psp5II PstI
PvuI PvuII RcaI RleAI RsrII SacI SacII SalI SanDI SapI ScaI SexAI
SfcI SfiI SgfI SgrAI SmaI SnaBI SpeI SrfI Sse8387I Sse8647I SspI
StuI StyI SunI SwaI TaqII TaqII TfiI Tth111I VspI XbaI XcmI XhoI XmnI
prompt>

The sequence and its complement are written in sets of 60 bases, with the names of the enzymes written vertically above their cutting sites. Where two or more enzymes cut at the same site, the additional enzyme names are displaced to the right and a slash is placed underneath them. The slash indicates they actually cut somewhere to the left of their position. The requested three reading frames of protein translation are written under each 60 bases of sequence. The output ends with two lists of restriction enzymes: ones that do cut and ones that don't.
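
To make the idea of a restriction map concrete, here is a minimal Python sketch that looks for recognition sequences in the first 60 bases of hsfau shown above. It is not how map is implemented: it handles only three enzymes whose well-known recognition sequences (TaqI TCGA, AluI AGCT, EcoRI GAATTC) are palindromic, so searching the top strand is enough, and it reports where each recognition sequence starts rather than the exact cut position. The function names are hypothetical.

```python
# First 60 bases of hsfau, copied from the map output above.
FRAGMENT = "TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC"

# Recognition sequences of three palindromic enzymes.
SITES = {"TaqI": "TCGA", "AluI": "AGCT", "EcoRI": "GAATTC"}


def complement(seq: str) -> str:
    """Return the complementary strand, written in the same left-to-right order."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))


def find_sites(seq: str, sites: dict) -> dict:
    """Map each enzyme name to the 1-based start positions of its site."""
    seq = seq.upper()
    hits = {}
    for name, site in sites.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)          # convert to 1-based numbering
            start = seq.find(site, start + 1)    # allow overlapping matches
        hits[name] = positions
    return hits


if __name__ == "__main__":
    print(complement(FRAGMENT))
    for enzyme, positions in find_sites(FRAGMENT, SITES).items():
        print(enzyme, positions if positions else "does not cut this fragment")
```

In this fragment the sketch finds TaqI and AluI sites and no EcoRI site, which agrees with the two enzyme lists at the end of the map output (TaqI and AluI cut hsfau, EcoRI does not).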

This is fine, but "noisy"; we don't have all of these enzymes in stock, so many of the reported cutting sites are irrelevant. Let's trim a local copy of the file holding the restriction enzyme information, and try map again.

Exercise 2: edit a resource file; re-map a sequence
Get a local copy of the restriction enzymes data file.
prompt> fetch data:enzyme.dat

Edit your copy (what is its filename?) with pico (or other UNIX text editor). Remove several enzymes you (supposedly) do not use; DO NOT remove the line with the two dots (..)!
Run map again, this time with "hsfau.ge_pr" as the argument and "-dat=enzyme.dat -out=hsfau2.map" as the switches. Accept the remaining defaults again, and look at the new output file.
prompt> map hsfau.ge_pr -dat=enzyme.dat -out=hsfau2.map
prompt> more hsfau2.map

Check all the possible switches for map.
prompt> map -che

Adjust map so your output file shows only enzymes that cut the sequence two or three times.
prompt> map hsfau.ge_pr -dat=enzyme.dat -out=hsfau3.map -minc=2 -maxc=3
prompt> more hsfau3.map
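
The effect of restricting the number of cuts can be pictured with a short Python sketch. It assumes, as the exercise suggests, that -minc and -maxc set the minimum and maximum number of cuts an enzyme may make to be reported. The function filter_by_cuts is hypothetical, and the enzyme names and cut positions below are invented placeholders, not real results for hsfau.

```python
def filter_by_cuts(cuts: dict, min_cuts: int, max_cuts: int) -> dict:
    """Keep only enzymes whose number of cut sites lies in [min_cuts, max_cuts]."""
    return {name: positions
            for name, positions in cuts.items()
            if min_cuts <= len(positions) <= max_cuts}


# Invented example data: enzyme name -> list of cut positions.
example_cuts = {
    "EnzA": [11],
    "EnzB": [30, 212],
    "EnzC": [5, 90, 301],
    "EnzD": [7, 50, 120, 400],
    "EnzE": [],
}

if __name__ == "__main__":
    # Equivalent in spirit to running with -minc=2 -maxc=3.
    print(filter_by_cuts(example_cuts, 2, 3))
```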

The hsfau3.map file is a text file that tries to be a picture. Unfortunately, it shows the data at a large scale - 60 bases per line - so we can't see the entire sequence and all the enzymes that cut it on one screen. For a better view of how often and where the available enzymes cut the sequence, we need a true graphical display.

Exercise 3: configure the graphics display; plot a sequence map with mapplot
Show the possible graphics display options and choose one. If you are using X-windows, choose "xcol" or "xmon". If not, choose "epsf" or "psf", and transfer these files to your local computer for printing on a postscript printer.

[NB I: Below are the options at BioBase.]
[NB II: "epsf" or "psf" files can be transferred by FTP, Kermit, or as EMail attachments.]

prompt> setplot
+---------------------> displaying all of 10 option(s) <---------------------+
|psf   postscript - sent to file: homedir:graf.ps                            |
|epsf  eps postscript - sent to file: homedir:graf.eps                       |
|hpg   hp laser with hpgl - sent to file: homedir:graf.hp                    |
|xcol  x windows colour graphics - for x-windows terminal                    |
|xmon  x windows monochr. graphics - for x-windows terminal                  |
|vt340 vt340 graphics - for a vt340 terminal                                 |
|vt241 vt241 graphics - for a vt241 terminal                                 |
|tek   versaterm tektronix 4105 graphics on your terminal                    |
|dec   declaser 5100 postscript/pcl/hpgl printer at biobase                  |
|qms   qms colorscript210 ps printer at biobase (14 kr./pg)                  |
|                                                                            |
|                                                                            |
+----------------------------------------------------------------------------+
enter a command. choices are:
<up-arrow> and <down-arrow> scroll the list
<return> makes GCG use the selected device
Q quits without doing anything
C creates and edits a new device (you can't delete from the site file)
V views the selection (use C to edit a copy)

Plot the map. Have a look!
prompt> mapplot hsfau.ge_pr -dat=enzyme.dat -minc=2 -maxc=3

This final output might show possibilities for sub-cloning most of hsfau with only one enzyme. Can you sub-clone a fragment that is only coding sequence? Which open reading frame(s) is (are) used? Where is this information shown in the original sequence file? (Hint!) Are "hsef2.ge_pr" or "hsht.ge_pr" better or worse prospects for sub-cloning with your reduced enzyme list?

On-line help for map and mapplot is available via the commands
prompt> genhelp map
prompt> genhelp mapplot
You may also check the manual web pages for complete details: map & mapplot.

Translating sequences
The output of map displays possible amino acid sequences for each of the three forward reading frames. If we had no information about which sections of the DNA sequence were coding sequence, we could extract the possible amino acid sequences directly from hsfau3.map with eextractpeptide.
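
The three forward reading frames can be reproduced for the first 60 bases of hsfau with a short Python sketch. It uses the standard genetic code and is meant only to show what "three forward frames" means; the function translate_frame is hypothetical and is not the algorithm used by map or eextractpeptide. Unlike the map display, it drops the incomplete codon at the end of the fragment instead of continuing into the next 60 bases.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# First 60 bases of hsfau, copied from the map output earlier in this tutorial.
FRAGMENT = "TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC"


def translate_frame(seq: str, frame: int) -> str:
    """Translate forward frame 1, 2 or 3; '*' marks a stop codon."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = seq.upper()
    peptide = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(frame - 1, len(seq) - 2, 3):
        peptide.append(CODON_TABLE.get(seq[i:i + 3], "X"))  # X = unknown codon
    return "".join(peptide)


if __name__ == "__main__":
    for frame, label in ((1, "a"), (2, "b"), (3, "c")):
        print(label, translate_frame(FRAGMENT, frame))
```

The three strings printed correspond to lines a, b and c under the first 60 bases in hsfau.map. Frame c ends in M, the ATG at position 57 where the hsfau coding sequence starts.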

Exercise 4: extract possible amino acid sequences from the output of map with eextractpeptide
Get the possible peptide sequences from hsfau3.map. View the result.
prompt> eextractpeptide hsfau3.map -out=hsfau3.pep
prompt> more hsfau3.pep

Get the possible peptides for the other two mapped DNA sequences, hsef2.ge_pr & hsht.ge_pr .

Given that we know the coding regions for these three example sequences, let's translate them properly into proteins. For quick reference, the coding regions of these three sequences follow:

coding regions for example sequences
data library entry	filename	coding sequence
ge:hsef2	hsef2.ge_pr	1 .. 2577
ge:hsfau	hsfau.ge_pr	57 .. 458
ge:hsht	hsht.ge_pr	128 .. 1420

Exercise 5: translate a sequence
Call for hsef2.ge_pr to be translated into protein. Specify the end of the coding sequence, and accept the rest of the defaults.
prompt> translate hsef2.ge_pr
TRANSLATE translates nucleotide sequences into peptide sequences.
Begin (* 1 *) ?
End (* 3075 *) ? 2577
Reverse (* No *) ?
Range begins ATGGT and ends TGTAG. Is this correct (* Yes *) ?
That is done, now would you like to:
A) Add another exon from this sequence
B) Add another exon from a new sequence
C) Translate and then add more genes from this sequence
D) Translate and then add more genes from a new sequence
W) Translate assembly and write everything into a file
Please choose one (* W *):
What should I call the output file (* hsef2.pep *) ?
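
The Begin and End answers are 1-based and inclusive, and translate echoes the first and last five bases of the chosen range so that you can check it. A small Python sketch of that check follows. The functions select_range and describe_range are hypothetical, and the toy sequence is invented for illustration (it is not hsef2); it was chosen so that the selected range happens to begin and end with the same bases as in the session above.

```python
def select_range(seq: str, begin: int, end: int) -> str:
    """Return bases begin..end, counting from 1 and including both ends."""
    if not 1 <= begin <= end <= len(seq):
        raise ValueError("range must satisfy 1 <= begin <= end <= sequence length")
    return seq[begin - 1:end]


def describe_range(seq: str, begin: int, end: int) -> str:
    """Mimic the confirmation line: show the first and last five bases."""
    region = select_range(seq, begin, end)
    return f"Range begins {region[:5]} and ends {region[-5:]}."


# Invented toy sequence: two flanking bases on each side of a 15-base ORF.
TOY = "GGATGGTCAAAATGTAGCC"

if __name__ == "__main__":
    print(describe_range(TOY, 3, 17))
    # A coding range should contain a whole number of codons.
    print(len(select_range(TOY, 3, 17)) % 3 == 0)
```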

Translate the other two sequences. Compare the two " .pep " files for each example DNA sequence. Be certain you name the translate output files differently from the output files of Exercise 4!

Exercise 6: map a sequence; find its possible peptides; try to devise a sub-cloning strategy
Map contigcg.seq . Using either eextractpeptide or translate, find its possible peptides. Choose a large ORF, and investigate the possibility of sub-cloning it with only one or two restriction enzyme digests.

On-line help for translate and eextractpeptide is available via the commands
prompt> genhelp translate
prompt> egenhelp eextractpeptide
You may also check the manual web pages for complete details: translate & eextractpeptide.

Please continue with Part 7 - Sequence Comparison

Comments? Questions? Accolades?
Please send them to David Featherston ( dwf@biobase.dk )
