This is a MapMaker compatible format. CARTHAGENE uses a dedicated boosted EM for backcross data which may be one or two orders of magnitude faster than a standard EM algorithm without any loss of precision [SCBM01]. All backcross data files must start with the following header line:
data type f2 backcross
In the case of backcross data, each locus can be either homozygous or heterozygous. These two situations are encoded using the A and H characters respectively. Loci with an unknown status are encoded as -. This encoding can be redefined using aliasing in the second header line (see the beginning of the section).
This is a MapMaker compatible format. All F2 (intercross) datasets must start with the following header line:
data type f2 intercross
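Depending on the dominance or codominance of each locus, several situations can be encoded. If we call A and B the two alleles of a heterozygous individual, the offspring can be encoded with the usual MapMaker characters: A (homozygous AA), H (heterozygous AB), B (homozygous BB), C (not AA, i.e. AB or BB, when B is dominant), D (not BB, i.e. AA or AB, when A is dominant) or - (unknown).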
These are MapMaker compatible formats. Both self and sib RIL data are handled by CARTHAGENE. Note that these data types are internally handled as pure backcross data (recombination frequencies being corrected to take into account the fact that the data represent self/sib RIL). It is therefore impossible to completely merge such data with other genetic data (both on order and recombination frequencies, see the dsmergen command, section 2.2.2). Use the dsmergor command in this case.
Depending on the RIL type (self/sib), the first header line of the format file must be respectively:
data type ri self
data type ri sib
The characters used to encode RIL data are the same as for backcross data (see section 2.1.1). This encoding can be redefined using aliasing in the second header line (see the beginning of the section).
Mating designs consisting of a series of backcrossing, selfing, random intercrossing, and/or haploid-doubling operations applied to F1 progeny of a cross between two homozygous individuals are accepted. The file header will look like this:
data type bs BBSBS
where the first bs (case-insensitive) is required and the final word on the first line is a variable sequence of letters that denote mating operations. In the example above, two backcrosses are followed by a selfing, another backcross, and a final self. The allowed codes are b, s, i, and d, and any sequence of up to eight operations is permitted.
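As a minimal sketch of the rule above, the following Python function checks a bs header line and expands the operation letters. The name parse_bs_header and the operation labels are illustrative and not part of CarthaGene. The sketch also assumes that the operation letters, like bs itself, are matched case-insensitively, as the upper-case example suggests.

```python
# Hypothetical helper, not part of CarthaGene: validates a "data type bs ..." header.
OPERATIONS = {
    "b": "backcross",
    "s": "selfing",
    "i": "random intercross",
    "d": "haploid doubling",
}
MAX_OPERATIONS = 8  # "any sequence of up to eight operations is permitted"


def parse_bs_header(line):
    """Return the list of mating operations named by a bs header line."""
    fields = line.split()
    if len(fields) != 4 or [f.lower() for f in fields[:3]] != ["data", "type", "bs"]:
        raise ValueError("expected a header of the form 'data type bs <operations>'")
    # Assumption: operation letters are case-insensitive.
    design = fields[3].lower()
    if len(design) > MAX_OPERATIONS:
        raise ValueError("at most %d operations are allowed" % MAX_OPERATIONS)
    unknown = sorted(set(design) - set(OPERATIONS))
    if unknown:
        raise ValueError("unknown operation code(s): " + ", ".join(unknown))
    return [OPERATIONS[code] for code in design]


print(parse_bs_header("data type bs BBSBS"))
```

For the example header, the result lists two backcrosses, a selfing, another backcross and a final selfing, in that order.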
If your design ends with a backcross operation, be sure that the recurrent parent is represented by the A character. You can use CarthaGene's aliasing notation to alter the character meanings, as described for the f2 backcross and other mating types.
Don't use this method to analyze a single backcross (coded as bs b) or RI design (coded, for the example of an F9, as bs ssssssss). For these, the standard CarthaGene types are handled with faster algorithms. The F2 design (coded as bs s) is handled at probably about the same speed as if it were coded f2 intercross.
CARTHAGENE can handle outbred data as long as phases are fixed (either known or set to the most probable phases). Such phase-known outbred datasets can be handled using different strategies. A first simple method (which ignores part of the information) consists in projecting the information on each parent side: this gives two backcross datasets which can be merged using either the mergen or the mergor command. The first aims at computing a consensus map (with consensus distances) while the second aims at computing a consensus order with different recombination ratios for each parent. We will not detail this strategy here, although it has the advantage of relying on our "Boosted EM" algorithm, which makes it a lot faster than the approach below. In this section, we describe a more complex encoding which does not ignore any information. All outbred datasets must start with the following header line (the same as for intercross data):
data type f2 intercross
Because the ability to handle outbred data has evolved from a classical intercross situation using MapMaker syntax, the syntax used to encode such data is currently rather clumsy. This may change one of these days but you'd better not count on it...
At one locus, consider a cross between a father and a mother, each carrying two alleles, one on each haplotype. The child receives one paternal allele and one maternal allele, so its genotype is one of 4 possible combinations. Depending on the heterozygosity of the parents, on the number of different alleles, on the dominance or codominance of the markers, and on the observations available on the child's phenotype, only a subset of these 4 possibilities is possible. For example, in the "usual" F2 intercross situation, the two parents are heterozygous and bear the same pair of alleles. In this simple case, observations on the phenotype of a child may lead to different situations:
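In order to cope with all phase-known situations, including cases where one parent is homozygous or where 3 or 4 different alleles are present, Carthagène lets the user express any subset of the 4 different possibilities. To do so, each of these 4 possibilities is associated with a number: 1 (first paternal allele with first maternal allele), 2 (first paternal allele with second maternal allele), 4 (second paternal allele with first maternal allele) and 8 (second paternal allele with second maternal allele),
and the user tells Carthagène which set of genotypes is actually possible at a given locus by summing the corresponding numbers and writing the result as a single hexadecimal digit, from 1 to f.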
The following table lists all possible codes from 1 to f and the corresponding set of possible genotypes at the locus.
Notation | Synonym | Possible genotypes (sum of base codes 1, 2, 4, 8)
1 | A | 1
2 |   | 2
3 |   | 1+2
4 |   | 4
5 |   | 1+4
6 | H | 2+4
7 | D | 1+2+4
8 | B | 8
9 |   | 1+8
a |   | 2+8
b |   | 1+2+8
c |   | 4+8
d |   | 1+4+8
e | C | 2+4+8
f | - | 1+2+4+8
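The table can be read as a 4-bit mask. The Python sketch below decodes one genotype character into the base codes (1, 2, 4, 8) it allows. The function name decode_genotype is hypothetical, and the sketch assumes that characters are case-sensitive, since the table uses upper-case A for code 1 and lower-case a for code 10.

```python
# Hypothetical helper, not part of CarthaGene: decodes one outbred genotype character.
SYNONYMS = {"A": 0x1, "H": 0x6, "D": 0x7, "B": 0x8, "C": 0xE, "-": 0xF}
HEX_CODES = "123456789abcdef"  # assumption: hexadecimal letters are lower-case only
BASE_CODES = (1, 2, 4, 8)


def decode_genotype(char):
    """Return the list of base codes allowed by a code (1 to f) or a synonym."""
    if char in SYNONYMS:
        # Synonyms are checked first: "A" means code 1, not hexadecimal 10.
        mask = SYNONYMS[char]
    elif len(char) == 1 and char in HEX_CODES:
        mask = int(char, 16)
    else:
        raise ValueError("unknown genotype character: %r" % char)
    return [code for code in BASE_CODES if mask & code]


for char in ("7", "D", "a", "A", "-"):
    print(char, decode_genotype(char))
```

Checking synonyms before hexadecimal digits matters here, because A, B, C and D would otherwise parse as valid hexadecimal values.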
Here is an example of a small outbred dataset:
data type f2 intercross
40 5 0 0
*M1 1118822228821414212414281812248128422488
*M2 1418822228821414212414281882248128422488
*M3 4418828288821414242414281881148122422488
*M4 4412288228411814211444881884248222124488
*M5 8412288224412814811484881281848822184188
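To make the layout of the example concrete, here is a minimal Python sketch that reads it. The function parse_dataset is hypothetical and is not CarthaGene's own loader. It assumes that the first two integers of the second line are the number of individuals and the number of markers, which matches the example (40 codes per marker, 5 markers), and it ignores the remaining two integers.

```python
# Hypothetical reader, not CarthaGene's loader: parses the small example dataset.
EXAMPLE = """\
data type f2 intercross
40 5 0 0
*M1 1118822228821414212414281812248128422488
*M2 1418822228821414212414281882248128422488
*M3 4418828288821414242414281881148122422488
*M4 4412288228411814211444881884248222124488
*M5 8412288224412814811484881281848822184188
"""


def parse_dataset(text):
    """Return (header, {marker name: genotype string}) for a dataset text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0]
    counts = [int(field) for field in lines[1].split()]
    # Assumption: the first two counts are individuals and markers.
    n_individuals, n_markers = counts[0], counts[1]
    markers = {}
    for line in lines[2:]:
        if not line.startswith("*"):
            raise ValueError("marker lines must start with '*': %r" % line)
        name, codes = line[1:].split(None, 1)
        codes = "".join(codes.split())  # tolerate spaces between codes
        if len(codes) != n_individuals:
            raise ValueError("marker %s has %d codes, expected %d"
                             % (name, len(codes), n_individuals))
        markers[name] = codes
    if len(markers) != n_markers:
        raise ValueError("found %d markers, expected %d" % (len(markers), n_markers))
    return header, markers


header, markers = parse_dataset(EXAMPLE)
print(header, sorted(markers))
```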
Let us see how this can be used in some practical phase-known outbred situations, according to the segregation ratio of the marker at the current locus.
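A 1:2:1 segregation ratio is the usual case in F2 intercross with codominance. In this case, both parents are typically heterozygous A|B, and A and B codominate. The usual observations on the child are phenotype A, H or B. This leads to the following encoding: 1 (or A) for a homozygous A child, 6 (or H) for a heterozygous child, and 8 (or B) for a homozygous B child.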
When missing data occurs, it is still possible to give partial information to Carthagène. For example, if the child is typed A but the other allele is not known, the child's genotype can be either AA, AB or BA. This is encoded by the character 7 = 1+2+4 (synonym D).
The character e = 14 = 8+4+2 (synonym C) encodes a situation where the child has been typed B but the other allele is not known. In this case, the child's genotype can be BB, BA or AB.
A 3:1 segregation ratio occurs when dominance appears. Imagine A dominates B: it is then impossible to distinguish between AA, AB and BA. Precisely, if the child is typed B, then the character 8 (or the synonym B of MapMaker) will be used. Otherwise the character 7 = 1+2+4 (or the synonym D of MapMaker) will be used to represent the fact that we simply know that the child's genotype is either AA, AB or BA.
Conversely, if B dominates A, the code 1 (resp. e) will be used if the child is typed A (resp. B). The respective synonyms A and C of MapMaker can also be used.
A 1:1:1:1 segregation ratio occurs when different alleles appear in the father and in the mother, for example in A|B × C|D. In this case the child is either AC, AD, BC or BD and the corresponding codes are respectively 1 (or A), 2, 4 and 8 (or B).
When some data is missing, it is still possible to give information to Carthagène. Imagine for example that the child is typed A and the second allele is unknown. Because we have 4 different alleles, we know for sure that the first allele of the child is A. Therefore, the only possible genotypes are AC (code 1) or AD (code 2) and the corresponding code is 3 = 1+2.
Imagine that instead of having 4 alleles, we have 3 alleles, as in A|B × A|C. Then the child may be AA, AC, BA or BC, i.e., there is still a 1:1:1:1 segregation ratio. If again the child is just typed A, then there is more indetermination at hand: the child may be either AA (code 1), AC (code 2) or BA (code 4). In this case, the corresponding code is 7 = 1+2+4 (or the synonym D).
Imagine the second parent is homozygous at the current locus, i.e., we cross a heterozygous father with a homozygous mother (a backcross-like situation). The child then carries either the first or the second paternal allele. If we observe the first paternal allele, it is not known which maternal haplotype the other allele comes from, i.e., the child may be genotype 1 or genotype 2. This case is encoded by 3 = 1+2, the sum of the two possible cases: the maternal allele coming from one grand-parent (code 1) or the other (code 2).
Similarly, if we observe the second paternal allele, then we know both alleles of the child but we don't know where the maternal allele comes from. The code will be c = 12 = 4+8.
If the homozygosity appears on the first parent instead of the second one and if we observe the first maternal allele, we get the code 5, the sum of 1 and 4, corresponding to the fact that the origin of the first allele is unknown. If we observe the second maternal allele, we get the code a, which in hexadecimal corresponds to 10, the sum of 8 and 2.
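The sums used throughout this section (3 = 1+2, 5 = 1+4, a = 8+2, c = 4+8) can be reproduced with a few lines of Python. This is only a sketch: encode_genotypes is a hypothetical name, not a CarthaGene function, and it takes the base codes 1, 2, 4 and 8 of the genotypes that remain possible.

```python
# Hypothetical helper, not part of CarthaGene: builds the hexadecimal code
# for a set of possible genotypes given as base codes 1, 2, 4, 8.
BASE_CODES = (1, 2, 4, 8)


def encode_genotypes(possible):
    """Return the hexadecimal character (1 to f) for the given base codes."""
    mask = 0
    for code in possible:
        if code not in BASE_CODES:
            raise ValueError("base codes must be 1, 2, 4 or 8, got %r" % (code,))
        mask |= code  # same as a sum, but a repeated code is counted once
    if mask == 0:
        raise ValueError("at least one genotype must be possible")
    return format(mask, "x")


print(encode_genotypes([1, 2]))  # homozygous second parent, first case
print(encode_genotypes([4, 8]))  # homozygous second parent, second case
print(encode_genotypes([1, 4]))  # homozygous first parent, first case
print(encode_genotypes([8, 2]))  # homozygous first parent, second case
```

A bitwise OR is used instead of a plain sum so that listing the same genotype twice does not change the code.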
Thomas Schiex 2009-10-27