NB: Be careful when using these programmes; it is possible to align one sequence with any other, if you really want to. False alignments, and the research you plan using them, may have no biological significance!
We will begin with the most common use of bestfit - to find the best region of similarity between two distantly-related (but homologous) sequences.
prompt> fetch gb_in:pdrhod -out=pdrhod.gb_in
prompt> fetch gb_ro:rnops -out=rnops.gb_ro
prompt> bestfit pdrhod.gb_in rnops.gb_ro -out=rhodop.pair

 BestFit makes an optimal alignment of the best segment of similarity
 between two sequences. Optimal alignments are found by inserting gaps
 to maximize the number of matches using the local homology algorithm
 of Smith and Waterman.

 Begin (* 1 *) ?
 End (* 1675 *) ?
 Reverse (* No *) ?

 Begin (* 1 *) ?
 End (* 1493 *) ?
 Reverse (* No *) ?

 What is the gap creation penalty (* 5.00 *) ?
 What is the gap extension penalty (* 0.30 *) ?

 Aligning ..................................................
 ........................-.

 Gaps: 0
 Quality: 20.1
 Quality Ratio: 0.490
 % Similarity: 73.171
 Length: 41

prompt>
prompt> more rhodop.pair
BESTFIT of: pdrhod.ge_in  check: 8638  from: 1  to: 1675

LOCUS       PDRHOD  1675 bp  RNA  INV  12-SEP-1993
DEFINITION  Octopus mRNA for rhodopsin.
ACCESSION   X07797
NID         g9822
KEYWORDS    rhodopsin.
SOURCE      Octopus dofleini.
. . .

to: rnops.ge_ro  check: 6230  from: 1  to: 1493

LOCUS       RNOPS  1493 bp  RNA  ROD  20-DEC-1994
DEFINITION  R.norvegicus mRNA for rhodopsin.
ACCESSION   Z46957
NID         g603874
KEYWORDS    rhodopsin.
SOURCE      Norway rat.
. . .

Symbol comparison table: /usr/prog/gcg/gcgcore/data/rundata/swgapdna.cmp
CompCheck: 5234

        Gap Weight:  5.000      Average Match:  1.000
     Length Weight:  0.300   Average Mismatch: -0.900

           Quality:   20.1             Length:     41
             Ratio:  0.490               Gaps:      0
Percent Similarity: 73.171   Percent Identity: 73.171

pdrhod.ge_in x rnops.ge_ro  September 24, 1996 10:00  ..

                 .         .         .         .
     982 TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGTCTAC 1022
         | ||||||||  |  |  | ||| ||||||||||  |||||
     961 TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCATCTAC 1001

prompt>
prompt> gapshow pdrhod.ge_in rnops.ge_ro -begin1=982 -begin2=961 -end1=1022 -end2=1001
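The figures reported in rhodop.pair can be checked by hand from the aligned segment itself. The following is a minimal Python sketch, not part of GCG; the function name ungapped_stats is hypothetical. It assumes that every match scores the reported Average Match (1.0), that every mismatch scores the reported Average Mismatch (-0.9), and that the ratio is the quality divided by the segment length. Under those assumptions it reproduces the values printed by bestfit for this gap-free alignment.

```python
# Aligned segments copied from rhodop.pair (no gaps in this alignment).
SEQ1 = "TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGTCTAC"  # pdrhod 982-1022
SEQ2 = "TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCATCTAC"  # rnops  961-1001


def ungapped_stats(a, b, match=1.0, mismatch=-0.9):
    """Summarise a gap-free alignment of two equal-length segments.

    Assumption: flat match/mismatch scores taken from the 'Average Match'
    and 'Average Mismatch' lines of the output, not the full symbol table.
    """
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    mismatches = len(a) - matches
    quality = matches * match + mismatches * mismatch
    return {
        "length": len(a),
        "matches": matches,
        "quality": quality,
        "ratio": quality / len(a),
        "identity": 100.0 * matches / len(a),
    }


stats = ungapped_stats(SEQ1, SEQ2)
print(f"Length: {stats['length']}  Matches: {stats['matches']}")
print(f"Quality: {stats['quality']:.1f}  Ratio: {stats['ratio']:.3f}")
print(f"Percent Identity: {stats['identity']:.3f}")
```

With 30 matching bases out of 41, the sketch gives a quality of 20.1, a ratio of 0.490 and 73.171% identity, in agreement with the output above.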
gap is for aligning two sequences over their entire length. While it will work with distantly-related sequences (as in the example above), much of the alignment may have little to no biological significance. Instead, we will align two more closely-related rhodopsin mRNAs.
prompt> lookup -library=genbank -definition=rhodopsin -organism="Oryctolagus cuniculus" -out=rabbitrh.list
prompt> more rabbitrh.list
LOOKUP in: genbank of: "([SQ-DEF: rhodopsin*] & [SQ-ORG: oryctolagus cuniculus*])"

 1 entry  September 24, 1996 14:52  ..

gb:OCU21688 ! ID: a4f20006
! DEFINITION Oryctolagus cuniculus rhodopsin mRNA, complete cds.
prompt> fetch gb:OCU21688 -out=ocops.gb_om
prompt> gap ocops.gb_om rnops.gb_ro -out=rhodopm.pair -outfile2=ocops.gap -outfile3=rnops.gap
 Gap uses the algorithm of Needleman and Wunsch to find the alignment of
 two complete sequences that maximizes the number of matches and
 minimizes the number of gaps.

 Begin (* 1 *) ?
 End (* 1198 *) ?
 Reverse (* No *) ?

 Begin (* 1 *) ?
 End (* 1493 *) ?
 Reverse (* No *) ?

 What is the gap creation penalty (* 5.00 *) ?
 What is the gap extension penalty (* 0.30 *) ?

 Aligning ..................................................
 .........-......

 Gaps: 5
 Quality: 1004.7
 Quality Ratio: 0.839
 % Similarity: 86.880
 Length: 1502

prompt>
prompt> more rhodopm.pair
prompt> gapshow ocops.gap rnops.gap
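The difference between bestfit (best local segment, Smith and Waterman) and gap (entire length, Needleman and Wunsch) can be illustrated with a short Python sketch. This is not the GCG implementation. The function name align_score is hypothetical, the match and mismatch scores are the flat averages (1.0 and -0.9) reported above rather than the full symbol comparison table, a gap of k positions is assumed to cost the creation penalty plus k times the extension penalty, and end gaps are penalised in the global case, which may differ from what gap actually does.

```python
NEG_INF = float("-inf")


def align_score(a, b, local, match=1.0, mismatch=-0.9,
                gap_create=5.0, gap_extend=0.3):
    """Best alignment score of a and b with affine gap penalties.

    local=True  scores the best local segment (Smith-Waterman style);
    local=False scores the full-length alignment (Needleman-Wunsch style).
    """
    first = gap_create + gap_extend  # cost of the first position of a gap
    n, m = len(a), len(b)
    # M: last column pairs two bases; X: base of a against a gap;
    # Y: base of b against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        if local:
            M[i][0] = 0.0  # a local alignment may start anywhere
        else:
            X[i][0] = -(first + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        if local:
            M[0][j] = 0.0
        else:
            Y[0][j] = -(first + (j - 1) * gap_extend)

    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = max(0.0, diag) if local else diag
            X[i][j] = max(M[i - 1][j] - first,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - first)
            Y[i][j] = max(M[i][j - 1] - first,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - first)
            best = max(best, M[i][j])
    if local:
        return best
    return max(M[n][m], X[n][m], Y[n][m])


# A short shared core ("ACG") inside otherwise unrelated flanks.
a, b = "TTACGTTT", "GGACGGG"
print("local :", align_score(a, b, local=True))
print("global:", align_score(a, b, local=False))
```

The local score counts only the shared core, while the global score also pays for the unrelated flanks and the gap needed to span both sequences. This is why a full-length alignment of distantly-related sequences can contain long stretches of little significance.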
compare, together with the graphing programme dotplot, is used to show regions of similarity within a sequence or between two sequences.
In the example sequences for bestfit, the two distantly-related rhodopsin mRNAs showed a best alignment region having ~73% similarity, and a second best one with ~70%; overall, though, these two sequences have only ~43% similarity (data from gap not shown). Thus, for compare to show only the best regions of similarity for the two distantly-related sequences, we need to use a stringency of between 60% and 70% matching bases. When compare checks for the percentage of matching bases, it does so in every possible comparison register, and within a window, i.e., a certain number of bases at a time. In a window of size 10, for example, a stringency of 60% to 70% means that at least 6 to 7 of the 10 bases must match for compare to score a "hit" between the two sequences.
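The window and stringency idea can be sketched in a few lines of Python. This is only an illustration and not the compare programme itself: the function name compare_hits is hypothetical, and each hit is recorded at the start of the window, whereas the real programme may place the point elsewhere in the window.

```python
def compare_hits(a, b, window, stringency):
    """Return (i, j) for every register where at least `stringency`
    of the `window` bases starting at a[i] and b[j] are identical."""
    hits = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j))
    return hits


# The 41-base best region found by bestfit (30 of 41 bases match).
seq1 = "TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGTCTAC"
seq2 = "TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCATCTAC"

# One window covering the whole segment: 30 matches >= 28, so one hit.
print(compare_hits(seq1, seq2, window=41, stringency=28))

# A sequence compared with itself always scores along the main diagonal.
print(compare_hits("ACGTAC", "ACGTAC", window=3, stringency=3))
```

Each (i, j) pair corresponds to one dot in the plot drawn by dotplot. Raising the stringency or shrinking the window changes which registers qualify as hits.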
prompt> compare pdrhod.ge_in rnops.ge_ro
...
prompt> dotplot pdrhod.pnt
prompt> compare pdrhod.ge_in rnops.ge_ro -win=41 -stri=28 -out=rhod4128.pnt
...
prompt> dotplot rhod4128.pnt
Generally, distantly-related sequences reveal their significant homologies when the window size is large and the stringency is low. With closely-related sequences, a medium-sized window with high stringency is best. E/GCG recommends the default window size of 21 and stringency of 14 (14 of 21 matching bases, i.e., 67%) only as a starting point.
prompt> compare pdrhod.ge_in rnops.ge_ro -win=100 -stri=40 -out=rhod0040.pnt
...
prompt> dotplot rhod0040.pnt
prompt> compare ocops.ge_om rnops.ge_ro -out=rhodm2114.pnt
...
prompt> dotplot rhodm2114.pnt
prompt> compare pdrhod.ge_in pdrhod.ge_in -out=pdrhod2114.pnt
...
prompt> dotplot pdrhod2114.pnt -all
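The window and stringency pairs used in this section are easier to compare when expressed as percentages. A small Python sketch follows; the function name and the labels in the dictionary are hypothetical and serve only as illustration, while the numbers are the ones used in the commands above.

```python
def percent_stringency(window, stringency):
    """Stringency expressed as a percentage of the window size."""
    return 100.0 * stringency / window


# (window, stringency) pairs taken from the compare commands in this section.
settings = {
    "default (closely related, self-comparison)": (21, 14),
    "best region only": (41, 28),
    "large window, low stringency": (100, 40),
}

for label, (window, stringency) in settings.items():
    pct = percent_stringency(window, stringency)
    print(f"{label}: {stringency}/{window} = {pct:.1f}%")
```

The first two settings are both close to two-thirds matching bases, while the last one accepts a much lower fraction of matches across a larger window.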