SAMtools |
Suppose we have reference sequences in ref.fa, indexed by samtools faidx, and position sorted alignment files aln1.bam and aln2.bam, the following command lines call SNPs and short INDELs:
Since r865, it is possible to generate the consensus sequence with
In the command line above, samtools collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format (see below). It does not call variants.
Bcftools applies the prior and does the actual calling. It can also concatenate BCF files, index BCFs for fast random access and convert BCF to VCF. In addition, bcftools can operate on some VCFs (e.g. calling SNPs from GL-tagged VCFs), but not for all VCFs; VCF to BCF conversion is not working at the moment, either.
One should consider to apply the following parameters to mpileup in different scenarios:
The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.
VCF consists of a header section and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below). The FORMAT field and sample information are allowed to be absent. We refer to the official VCF spec for a more rigorous description of the format.
Col | Field | Description |
---|---|---|
1 | CHROM | Chromosome name |
2 | POS | 1-based position. For an indel, this is the position preceding the indel. |
3 | ID | Variant identifier. Usually the dbSNP rsID. |
4 | REF | Reference sequence at POS involved in the variant. For a SNP, it is a single base. |
5 | ALT | Comma delimited list of alternative seuqence(s). |
6 | QUAL | Phred-scaled probability of all samples being homozygous reference. |
7 | FILTER | Semicolon delimited list of filters that the variant fails to pass. |
8 | INFO | Semicolon delimited list of variant information. |
9 | FORMAT | Colon delimited list of the format of individual genotypes in the following fields. |
10+ | Sample(s) | Individual genotype information defined by FORMAT. |
BCF, or the binary variant call format, is the binary version of VCF. It keeps the same information in VCF, while much more efficient to process especially for many samples. The relationship between BCF and VCF is similar to that between BAM and SAM. The detailed format description of the BCF format can be found in bcf.tex included in the samtools source code package.
SAMtools/BCFtools may write the following tags in the INFO field in VCF/BCF.
Tag | Description | ||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I16 | 16 integers:
|
||||||||||||||||||||||||||||||||
INDEL | Indicating the variant is an INDEL. | ||||||||||||||||||||||||||||||||
DP | The number of reads covering or bridging POS. | ||||||||||||||||||||||||||||||||
DP4 | Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted. | ||||||||||||||||||||||||||||||||
PV4 | P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t); 4) tail distance bias (t) | ||||||||||||||||||||||||||||||||
FQ | Consensus quality. If positive, FQ equals the phred-scaled probability of there being two or more different alleles. If negative, FQ equals the minus phred-scaled probability of all chromosomes being identical. Notably, given one sample, FQ is positive at hets and negative at homs. | ||||||||||||||||||||||||||||||||
AF1 | EM estimate of the site allele frequency of the strongest non-reference allele. | ||||||||||||||||||||||||||||||||
CI95 | Equal-tail (Bayesian) credible interval of the site allele frequency at the 95% level. | ||||||||||||||||||||||||||||||||
PC2 | Phred-scaled probability of the alternate allele frequency of group1 samples being larger (,smaller) than of group2 samples. | ||||||||||||||||||||||||||||||||
PCHI2 | Posterior weighted chi^2 P-value between group1 and group2 samples. This P-value is conservative. | ||||||||||||||||||||||||||||||||
QCHI2 | Phred-scaled PCHI2 | ||||||||||||||||||||||||||||||||
RP | Number of permutations yeilding a smaller PCHI2 |
The following command performs association test:
To further calibrate P-value, one may perform a permutation test with the `-U' option. The `RP' INFO field will give the number of permutations which yield a smaller PCHI2 test statistics. Permutation test is very slow, due to the genotype ambiguity in sequencing data.
While calling SNPs, bcftools will print the estimated AFS to the error output at lines starting with [afs]. However, to accurately estimate AFS, we need to iterate the procedure. In most applications, we are only interested in the AFS conditional on a list of loci. Suppose we have the list of loci in file cond.txt with the first two columns in the file giving the coordinates, the procedure to estimate AFS is:
Another way to estimate AFS is to get the list of site allele frequency by
Base Alignment Quality (BAQ) is a new concept deployed in samtools-0.1.9+. It aims to provide an efficient and effective way to rule out false SNPs caused by nearby INDELs. The following shows the alignments of 6 reads by a typical read mapper in the presence of a 4bp homozygous INDEL:
coor 12345678901234 5678901234567890123456
ref aggttttataaaac----aattaagtctacagagcaacta
sample aggttttataaaacAAATaattaagtctacagagcaacta
read1 aggttttataaaac****aaAtaa
read2 ggttttataaaac****aaAtaaTt
read3 ttataaaacAAATaattaagtctaca
read4 CaaaT****aattaagtctacagagcaac
read5 aaT****aattaagtctacagagcaact
read6 T****aattaagtctacagagcaacta
where capital bases represent differences from the reference and underlined bases are the inserted bases. The alignments except for read3 are wrong because the 4bp insertion is misplaced. The mapper produces such alignments because when doing a pairwise alignment, the mapper prefers one or two mismatches over a 4bp insertion. What is hurting more is that the wrong alignments lead to recurrent mismatches, which are likely to deceive most site-independent SNP callers into calling false SNPs.
One way to avoid such an artifact is to do multi-sequence realignment, but the current implementations are very computationally demanding. SAMtools seeks another solution. It assigns each base a BAQ which is the Phred-scaled probability of the base being misaligned. BAQ is low if the base is aligned to a different reference base in a suboptimal alignment, and in this case a mismatch should contribute little to SNP calling even if the base quality is high. With BAQ, the mismatches in the example above are significantly downweighted. SAMtools will not call SNPs from that.
The BAQ strategy is invoked by default in mpileup. To make other SNP callers take advantage of BAQ, one should run
The following shows the detailed procedure on how to call SNPs/INDELs for hundreds of exomes. It only aims to provide an overview of how to handle huge data sets. Some command lines given below may not work for all systems. Advanced users may also want to modify based on their own system configurations.
In the following, the key and the most difficult part is the command line calling samtools mpileup. Once that is done, one can use 3rd party tools or write their own scsripts to achieve the rest.