Merging data sets

We now want to merge two different data sets (populations) into one. Two types of merging are available.

In this tutorial, we will merge two data sets "by order", which is the most complex case. Merging by order is used because we want to combine genetic recombination data with radiation hybrid (RH) data, and there is no simple way of relating the two kinds of distance (RH data being essentially physical). In fact, genetic and RH data complement each other nicely for ordering markers. Genetic data leads to hypermetropic ordering: sets of close markers cannot be reliably ordered because usually no recombination is observed between them. Conversely, RH data leads to myopic ordering: sets of closely spaced markers can be reliably ordered, but distant groups are sometimes difficult to order because too many breaks occurred between them. The data used here comes from a single group representing one porcine chromosome.
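The difficulty of ordering close markers with genetic data can be quantified with a short calculation. The Python sketch below assumes a backcross in which each individual contributes one informative meiosis, so that the probability of seeing no recombinant at all between two markers with recombination fraction r among n individuals is (1 - r)^n. The function name is ours, and the value n = 208 is the number of individuals reported for the backcross data set loaded later in this section.

```python
def prob_no_recombinant(r, n):
    """Probability that none of n informative meioses recombines
    between two markers whose recombination fraction is r."""
    return (1.0 - r) ** n


# n = 208 is the size of the backcross population used in this tutorial.
for r in (0.001, 0.01, 0.05):
    print(f"r = {r:<5}  P(no recombinant) = {prob_no_recombinant(r, 208):.3f}")
```

Under these assumptions, two markers separated by a recombination fraction of 0.001 show no recombinant in roughly four populations out of five, in which case the data carries no information on their relative order.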

First, let us examine each data set independently and see what the buildfw command can produce from each of them. We first reinitialize CarthaGene using the cgrestart command (to keep the markers' numerical ids low) and load the new data set:

CG> cgrestart

CG> dsload Data/bc1.cg
{1 f2 backcross 17 208 /homes/thomas/carthagene/test/Data/bc1.cg}
CG> buildfw 3.0 3.0 {} 1

BuildFW, Adding  Threshold = 3.00, Saving Threshold = 3.00.

>>> Delta = 10.84 :

Map  0 : log10-likelihood =  -114.56
-------:
 Set : Marker List ...
   1 : MS4 MS9 MS12

>>> Delta = 9.73 , Id = 17,  Locus = MS1 :

Map  0 : log10-likelihood =  -155.67
-------:
 Set : Marker List ...
   1 : MS12 MS9 MS4 MS1

>>> Delta = 10.78 , Id = 9,  Locus = MS7 :

Map  0 : log10-likelihood =  -166.66
-------:
 Set : Marker List ...
   1 : MS12 MS9 MS7 MS4 MS1

>>> Delta = 13.30 , Id = 10,  Locus = MS2 :

Map  0 : log10-likelihood =  -178.48
-------:
 Set : Marker List ...
   1 : MS12 MS9 MS7 MS4 MS2 MS1

>>> Delta = 7.08 , Id = 7,  Locus = MS16 :

Map  0 : log10-likelihood =  -203.02
-------:
 Set : Marker List ...
   1 : MS16 MS12 MS9 MS7 MS4 MS2 MS1

>>> Delta = 11.51 , Id = 5,  Locus = MS11 :

Map  0 : log10-likelihood =  -207.54
-------:
 Set : Marker List ...
   1 : MS16 MS12 MS11 MS9 MS7 MS4 MS2 MS1

>>> Delta = 8.25 , Id = 16,  Locus = MS19 :

Map  0 : log10-likelihood =  -246.26
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS12 MS11 MS9 MS7 MS4 MS2 MS1

>>> Delta = 8.23 , Id = 13,  Locus = MS15 :

Map  0 : log10-likelihood =  -250.99
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS4 MS2 MS1

>>> Delta = 7.07 , Id = 2,  Locus = MS5 :

Map  0 : log10-likelihood =  -253.69
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS5 MS4 MS2 MS1

>>> Delta = 4.07 , Id = 8,  Locus = MS8 :

Map  0 : log10-likelihood =  -262.35
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS8 MS7 MS5 MS4 MS2 MS1

BuildFW, remaining loci test :
       |                          |
       | 1   1 1   1         1 1  |     Lod2pt         Dist2pt
       | 6 7 3 4 5 2 8 9 2 1 0 7  |  Left<-M->Right Left<-M->Right | 0->N  ...
     --|--------------------------|--------------------------------|-------...
  MS13 |      0 +                 |  18.66   11.91     0.0   8.4   |   43.3...
   MS6 |              3 +         |  17.32   15.41     1.6   4.5   |   70.3...
  MS17 |2 +                       |  14.06    6.31     7.4  27.2   |    0.0...
   MS3 |                    + 2   |  35.05   47.62     5.7   0.6   |   79.2...
  MS20 |+ 1                       |    -     34.83      -    2.9   |   -   ...
The BC data set is of good quality: only a few markers could not be placed in the framework map. The "quality" of this map should naturally be checked more thoroughly using other "validating" and "improving" commands. We now perform the same analysis on the RH data:
CG> dsload Data/rh1.cg
{2 haploid RH 13 118 /homes/thomas/carthagene/test/Data/rh1.cg}
CG> buildfw 3.0 3.0 {} 1

BuildFW, Adding  Threshold = 3.00, Saving Threshold = 3.00.

>>> Delta = 5.13 :

Map  0 : log10-likelihood =   -68.70
-------:
 Set : Marker List ...
   2 : MS4 G40 G39

>>> Delta = 6.14 , Id = 18,  Locus = G36 :

Map  0 : log10-likelihood =   -77.85
-------:
 Set : Marker List ...
   2 : MS4 G36 G40 G39

>>> Delta = 3.91 , Id = 4,  Locus = MS6 :

Map  0 : log10-likelihood =  -101.52
-------:
 Set : Marker List ...
   2 : MS6 MS4 G36 G40 G39

>>> Delta = 3.65 , Id = 12,  Locus = MS9 :

Map  0 : log10-likelihood =  -125.81
-------:
 Set : Marker List ...
   2 : MS9 MS6 MS4 G36 G40 G39

BuildFW, remaining loci test :
       |              |
       | 1     1 2 1  |     Lod2pt         Dist2pt
       | 2 4 1 8 1 9  |  Left<-M->Right Left<-M->Right | 0->N    N->M | Wei...
     --|--------------|--------------------------------|--------------|----...
   MS5 |  2 +         |   7.44    3.64    54.3  90.1   |   81.4  29.4 |    ...
   MS8 |+ 0           |    -     18.48      -   16.5   |   -     16.5 |    ...
   MS7 |  + 3         |   5.44   12.79    72.3  29.5   |    0.0  57.8 |    ...
   MS3 |          + 0 |   9.84    9.69    45.2  45.9   |  205.2  23.5 |    ...
  MS15 |+           1 |    -      0.99      -  159.0   |   -    159.0 |    ...
   MS1 |3           + |   3.76     -      89.6    -    |  252.7  89.6 |    ...
   G37 |        3 +   |  15.04    7.23    24.9  59.2   |  205.2  14.0 |    ...
Since radiation hybrid data usually has very little missing information, specific ordering methods can be used to build very good maps. Several commands in CARTHAGENE can directly translate an RH data set into a travelling salesman problem (TSP) instance such that the optimal solution of this TSP instance also yields the optimal map. This translation [BDCP00] is exact when there is no missing information and remains very effective when little data is missing. CARTHAGENE automatically reevaluates all the generated maps using multipoint maximum log-likelihood. Here we try to reorder the framework map we built using the lkhn command (lk stands for the Lin-Kernighan heuristic [LK73,Hel00], a well-known algorithm for solving travelling salesman problem instances):
CG> lkhn 1 1
Best map with log10-likelihood = -125.81
TSP: optimum= 127.355000 lowerbound= 127.355000 gap= 0.000000% totaltime= 0.01

Map -1 : log10-likelihood =  -125.81
-------:
 Set : Marker List ...
   2 : MS9 MS6 MS4 G36 G40 G39
Optimum found, equal to 127355! The minimum 1-tree is a tour.
The same map, with a log-likelihood of -125.81, is found, which is a very good indication that the order is indeed optimal.
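To give an intuition of how a marker ordering problem can be read as a TSP, here is a minimal Python sketch. It is not the translation of [BDCP00] and not what lkhn does internally: it assumes no missing data and uses a simpler classical criterion, the number of obligate breaks between two markers (hybrids in which one marker is retained and the other is not), as the distance. The marker names A to D and their retention patterns are invented for illustration, and the search is a brute-force enumeration that is only usable on tiny inputs.

```python
from itertools import permutations

# Hypothetical retention patterns: one string per marker, one character per
# hybrid ('1' = marker retained, '0' = not retained). Invented data.
RETENTION = {
    "A": "110000",
    "B": "111000",
    "C": "011100",
    "D": "001110",
}


def breaks(m1, m2):
    """Obligate breaks between two markers: hybrids where retention differs."""
    return sum(a != b for a, b in zip(RETENTION[m1], RETENTION[m2]))


def path_cost(order):
    """Total number of obligate breaks along an order (adjacent pairs only)."""
    return sum(breaks(a, b) for a, b in zip(order, order[1:]))


def best_order(markers):
    """Enumerate all orders and keep one of minimum cost (tiny inputs only)."""
    return min(permutations(markers), key=path_cost)


order = best_order(sorted(RETENTION))
print(" ".join(order), "cost =", path_cost(order))
```

A marker order is a path rather than a closed tour; adding a dummy marker at distance zero from all the others turns the shortest-path problem into a standard TSP instance, which is what allows a dedicated TSP solver to be used in place of the enumeration above.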

The RH data set is less informative about order than the BC data set. However, merging the two data sets will help in ordering the markers. Because we are merging populations whose parameters are not directly related (RH and genetic distances), we will use the dsmergor command. The command takes two parameters that specify the numerical ids of the two populations to be merged. To see the numerical ids, we can use dsinfo.

CG> dsinfo

Data Sets :
----------:
ID        Data Type    markers individuals             filename constraints...
 1     f2 backcross         17         208               bc1.cg
 2       haploid RH         13         118               rh1.cg

CG> dsmergor 1 2
{3 merged by order 21 326}
CG> dsinfo

Data Sets :
----------:
ID        Data Type    markers individuals             filename constraints...
 1     f2 backcross         17         208               bc1.cg
 2       haploid RH         13         118               rh1.cg
 3  merged by order         21         326                                 ...
To better see how markers are shared between the two populations, we can use the mrkinfo command:
CG> mrkinfo
...
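The counts reported by dsinfo for the merged data set can be checked by hand. The Python sketch below is only a bookkeeping aid, not a CarthaGene command: the two marker sets are read off the buildfw transcripts above (framework markers plus the "remaining loci" of each data set), and the variable names are ours.

```python
# Marker names read off the two buildfw transcripts above
# (framework map plus "remaining loci" for each data set).
BC_MARKERS = {
    "MS1", "MS2", "MS3", "MS4", "MS5", "MS6", "MS7", "MS8", "MS9",
    "MS11", "MS12", "MS13", "MS15", "MS16", "MS17", "MS19", "MS20",
}
RH_MARKERS = {
    "MS1", "MS3", "MS4", "MS5", "MS6", "MS7", "MS8", "MS9", "MS15",
    "G36", "G37", "G39", "G40",
}

shared = BC_MARKERS & RH_MARKERS   # markers typed in both populations
merged = BC_MARKERS | RH_MARKERS   # markers of the merged data set

print("BC:", len(BC_MARKERS), "RH:", len(RH_MARKERS))
print("shared:", len(shared), "merged:", len(merged))
print("individuals:", 208 + 118)
```

The 17 and 13 markers of the two populations overlap on 9 markers, which gives the 21 markers of the merged set, and the 326 individuals are simply the sum of the two population sizes.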
We can now try to build a framework map for the hybrid genetic/RH data set:
CG> buildfw 3.0 3.0 {} 1

BuildFW, Adding  Threshold = 3.00, Saving Threshold = 3.00.

>>> Delta = 11.54 :

Map  0 : log10-likelihood =  -202.47
-------:
 Set : Marker List ...
   1 : MS7 MS3 MS1
   2 : MS7 MS3 MS1

>>> Delta = 14.11 , Id = 1,  Locus = MS4 :

Map  0 : log10-likelihood =  -232.07
-------:
 Set : Marker List ...
   1 : MS7 MS4 MS3 MS1
   2 : MS7 MS4 MS3 MS1

>>> Delta = 11.64 , Id = 14,  Locus = MS12 :

Map  0 : log10-likelihood =  -266.52
-------:
 Set : Marker List ...
   1 : MS12 MS7 MS4 MS3 MS1
   2 :      MS7 MS4 MS3 MS1

>>> Delta = 13.07 , Id = 12,  Locus = MS9 :

Map  0 : log10-likelihood =  -302.10
-------:
 Set : Marker List ...
   1 : MS12 MS9 MS7 MS4 MS3 MS1
   2 :      MS9 MS7 MS4 MS3 MS1

>>> Delta = 11.18 , Id = 2,  Locus = MS5 :

Map  0 : log10-likelihood =  -325.87
-------:
 Set : Marker List ...
   1 : MS12 MS9 MS7 MS5 MS4 MS3 MS1
   2 :      MS9 MS7 MS5 MS4 MS3 MS1

>>> Delta = 7.07 , Id = 7,  Locus = MS16 :

Map  0 : log10-likelihood =  -350.40
-------:
 Set : Marker List ...
   1 : MS16 MS12 MS9 MS7 MS5 MS4 MS3 MS1
   2 :           MS9 MS7 MS5 MS4 MS3 MS1

>>> Delta = 11.49 , Id = 5,  Locus = MS11 :

Map  0 : log10-likelihood =  -354.94
-------:
 Set : Marker List ...
   1 : MS16 MS12 MS11 MS9 MS7 MS5 MS4 MS3 MS1
   2 :                MS9 MS7 MS5 MS4 MS3 MS1

>>> Delta = 8.25 , Id = 16,  Locus = MS19 :

Map  0 : log10-likelihood =  -393.66
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS12 MS11 MS9 MS7 MS5 MS4 MS3 MS1
   2 :                     MS9 MS7 MS5 MS4 MS3 MS1

>>> Delta = 8.23 , Id = 13,  Locus = MS15 :

Map  0 : log10-likelihood =  -424.96
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS5 MS4 MS3 MS1
   2 :           MS15           MS9 MS7 MS5 MS4 MS3 MS1

>>> Delta = 6.52 , Id = 4,  Locus = MS6 :

Map  0 : log10-likelihood =  -438.73
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS6 MS5 MS4 MS3 MS1
   2 :           MS15           MS9 MS7 MS6 MS5 MS4 MS3 MS1

>>> Delta = 6.44 , Id = 21,  Locus = G40 :

Map  0 : log10-likelihood =  -454.74
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS6 MS5 MS4     MS3 MS1
   2 :           MS15           MS9 MS7 MS6 MS5 MS4 G40 MS3 MS1

>>> Delta = 5.18 , Id = 20,  Locus = G37 :

Map  0 : log10-likelihood =  -463.88
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS6 MS5 MS4         MS3 MS1
   2 :           MS15           MS9 MS7 MS6 MS5 MS4 G40 G37 MS3 MS1

>>> Delta = 6.99 , Id = 18,  Locus = G36 :

Map  0 : log10-likelihood =  -473.02
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS6 MS5 MS4             MS3 MS1
   2 :           MS15           MS9 MS7 MS6 MS5 MS4 G36 G40 G37 MS3 MS1

>>> Delta = 6.88 , Id = 19,  Locus = G39 :

Map  0 : log10-likelihood =  -493.00
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS7 MS6 MS5 MS4             MS3     MS1
   2 :           MS15           MS9 MS7 MS6 MS5 MS4 G36 G40 G37 MS3 G39 MS1

>>> Delta = 4.96 , Id = 8,  Locus = MS8 :

Map  0 : log10-likelihood =  -513.77
-------:
 Set : Marker List ...
   1 : MS19 MS16 MS15 MS12 MS11 MS9 MS8 MS7 MS6 MS5 MS4             MS3    ...
   2 :           MS15           MS9 MS8 MS7 MS6 MS5 MS4 G36 G40 G37 MS3 G39...

BuildFW, remaining loci test :
       |                                    |
       | 1   1 1   1           1 2 2 1 1 1  |     Lod2pt         Dist2pt
       | 6 7 3 4 5 2 8 9 4 2 1 8 1 0 1 9 7  |  Left<-M->Right Left<-M->Righ...
     --|------------------------------------|------------------------------...
  MS13 |      0 +                           |  18.66   11.91     0.0   8.4 ...
  MS17 |2 +                                 |  14.06    6.31     7.4  27.2 ...
   MS2 |                      2 2 2 2 + 0   |  47.62    0.00     0.6   0.0 ...
  MS20 |+ 1                                 |    -     34.83      -    2.9 ...
Thanks to the RH data, a genetic marker that could not previously be inserted (MS6) is now ordered strongly enough to be inserted. Furthermore, all the new RH markers have been inserted. The final map also shows that MS2 has been replaced by MS3. The two markers are probably close to each other and cannot both be inserted.
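The effect of the merge on the framework can be summarized by comparing the "remaining loci" tables of the backcross run and of the merged run. The Python sketch below is a plain set comparison, with both sets copied from those two tables; the variable names are ours.

```python
# Markers left out of the framework map, copied from the two
# "remaining loci test" tables above.
BC_REMAINING = {"MS13", "MS6", "MS17", "MS3", "MS20"}
MERGED_REMAINING = {"MS13", "MS17", "MS2", "MS20"}

# Left out with backcross data alone, placed once RH data is added.
newly_placed = sorted(BC_REMAINING - MERGED_REMAINING)
# Placed with backcross data alone, left out of the merged framework.
newly_left_out = sorted(MERGED_REMAINING - BC_REMAINING)

print("newly placed:  ", newly_placed)
print("newly left out:", newly_left_out)
```

MS6 and MS3 move into the framework while MS2 moves out of it, and MS13, MS17 and MS20 remain unplaced in both runs.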

As mentioned earlier, further work should include a thorough validation of this order.

Thomas Schiex 2009-10-27