Kevin's GATTACA World: Somatic Mutation Detection in Whole Genome Sequencing Data

Thursday 22 December 2011

Somatic Mutation Detection in Whole Genome Sequencing Data | MassGenomics

There's a new tool for SNP calling.
This post is useful for understanding the challenges of separating bona fide mutations in NGS data from false positives and background noise. For the wet lab, running a variety of SNP callers and validating only the calls that every caller reports gives you a higher validation rate. But having a bona fide mutation doesn't mean you have hit on the mutation that is actually associated with your disease ...

Excerpted from
http://www.massgenomics.org/2011/12/somatic-mutation-detection-in-whole-genome-sequencing-data.html

Filtering Out the Noise

No matter how good the mutation caller, there are going to be some false positives. This is because you're looking for a one-in-a-million event, a true somatic mutation. Raw SomaticSniper calls therefore undergo a series of Maq-inspired filters. Sites are retained if they meet these criteria:

Covered by at least 3 reads
Consensus quality of at least 20
Called a SNP in the tumor sample with SNP quality of at least 20
Maximum mapping quality of at least 40
No high-quality predicted indel within 10 bp
No more than 2 other SNVs called within 10 bp
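
As a rough illustration, the six retention criteria above can be written as a single predicate. This is a minimal sketch and not SomaticSniper's own code: the `SiteStats` record and all of its field names are hypothetical, and reading "within 10 bp" as a distance of 10 bp or less is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteStats:
    """Per-site summary; every field name here is hypothetical."""
    depth: int                          # reads covering the site
    consensus_quality: int
    tumor_snp_quality: int              # SNP quality of the tumor call
    max_mapping_quality: int            # best mapping quality among covering reads
    nearest_hq_indel_bp: Optional[int]  # distance to nearest high-quality indel, None if none
    other_snvs_within_10bp: int         # other SNV calls within 10 bp


def passes_basic_filters(site: SiteStats) -> bool:
    """Return True if the site meets all six retention criteria."""
    # Assumption: "within 10 bp" means a distance of 10 bp or less.
    indel_clear = site.nearest_hq_indel_bp is None or site.nearest_hq_indel_bp > 10
    return (
        site.depth >= 3
        and site.consensus_quality >= 20
        and site.tumor_snp_quality >= 20
        and site.max_mapping_quality >= 40
        and indel_clear
        and site.other_snvs_within_10bp <= 2
    )


# Made-up example sites: one clean, one with a predicted indel 4 bp away.
good = SiteStats(30, 45, 38, 60, None, 0)
near_indel = SiteStats(30, 45, 38, 60, 4, 0)
print(passes_basic_filters(good), passes_basic_filters(near_indel))
```

Keeping each criterion as its own clause makes it easy to relax a single threshold when comparing validation rates.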

Sites passing these criteria are subjected to two additional filters: a screen against germline variants from dbSNP (remove the site if it matches the position and allele of a known non-cancer dbSNP entry) and an LOH filter (remove the site if the normal is heterozygous and the tumor is homozygous for the same variant allele). Sites removed by the former are probably inherited variants under-sampled in the matched normal, while sites removed by the latter are likely due to large-scale structural changes (e.g. deletions) causing the loss of one allele. Finally, the filter-passed mutations are classified as high-confidence (HC) if the somatic score is at least 40 and the mapping quality is at least 40 (for BWA) or 70 (for Maq).
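
The dbSNP screen, the LOH filter and the high-confidence rule might be sketched as follows. The `Call` record, its field names and the `screen` helper are hypothetical, the `dbsnp` set is assumed to hold only non-cancer entries, and the text does not say which mapping-quality statistic is compared against the 40/70 cutoff, so a single per-site value is assumed here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

# Mapping-quality cutoffs for high-confidence calls, per aligner (from the text).
HC_MAPQ = {"bwa": 40, "maq": 70}


@dataclass(frozen=True)
class Call:
    """A candidate that passed the basic filters; field names are hypothetical."""
    chrom: str
    pos: int
    variant_allele: str
    normal_genotype: str  # two alleles, e.g. "AG"
    tumor_genotype: str
    somatic_score: int
    mapping_quality: int  # assumption: one summary value per site


def is_known_germline(call: Call, dbsnp: Set[Tuple[str, int, str]]) -> bool:
    # Match on both position and allele, as the dbSNP screen requires.
    return (call.chrom, call.pos, call.variant_allele) in dbsnp


def is_loh(call: Call) -> bool:
    # Normal is heterozygous and carries the variant; tumor is homozygous for it.
    normal_het = len(set(call.normal_genotype)) == 2
    return (
        normal_het
        and call.variant_allele in call.normal_genotype
        and set(call.tumor_genotype) == {call.variant_allele}
    )


def is_high_confidence(call: Call, aligner: str) -> bool:
    return call.somatic_score >= 40 and call.mapping_quality >= HC_MAPQ[aligner]


def screen(calls: Iterable[Call], dbsnp: Set[Tuple[str, int, str]],
           aligner: str) -> List[Tuple[Call, bool]]:
    """Drop dbSNP and LOH sites, then pair each survivor with its HC flag."""
    kept = [c for c in calls if not is_known_germline(c, dbsnp) and not is_loh(c)]
    return [(c, is_high_confidence(c, aligner)) for c in kept]


# Made-up example calls.
somatic = Call("1", 1000, "T", "CC", "CT", 55, 60)
germline = Call("1", 2000, "A", "GG", "GA", 60, 60)
loh = Call("2", 3000, "G", "AG", "GG", 70, 60)
dbsnp = {("1", 2000, "A")}

for call, hc in screen([somatic, germline, loh], dbsnp, "bwa"):
    print(call.chrom, call.pos, "HC" if hc else "not HC")
```

With these made-up inputs only the first call survives both screens; it is high-confidence under the BWA cutoff but would not be under the stricter Maq cutoff.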

Frequent Sources of False Positives

Even sites that pass the filters above are vulnerable to certain sequencing and alignment artifacts that produce false positive calls. A detailed study revealed (as many in the field know already) a few common sources of false positives: strand bias, homopolymer sequences, paralogous reads (deriving from a paralogous region of the genome, but mapped to the wrong region, usually with three or more substitutions), and the read position of the predicted variant. The last type of artifact is something new; it turned out that variants only seen near the “effective” 3′ end of reads (the start of soft-trimmed bases or the actual end of the read if untrimmed) were more likely to be false positives. This may be a combination of sequencing error, which is higher at the 3′ end of reads, and alignment bias favoring mismatches over gaps near the ends of reads. In any case, false positives deriving from these common causes tend to have certain properties enabling them to be identified and removed while maintaining sensitivity for true mutations.
SomaticSniper adds to the growing arsenal of tools developed by our group to address the significant challenges presented by next-generation sequencing data analysis.
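
Two of the false-positive properties described above, strand bias and proximity to the effective 3′ end, lend themselves to a simple per-site summary over the variant-supporting reads. The sketch below is only an illustration: the `VariantRead` record, the function names and both thresholds (0.9 and 0.1) are assumptions chosen for the example, not values from the post or from SomaticSniper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantRead:
    """One variant-supporting read; all field names are hypothetical."""
    is_reverse: bool
    variant_offset: int    # 0-based position of the variant base, counted 5' to 3'
    effective_length: int  # read length after soft trimming


def forward_fraction(reads: List[VariantRead]) -> float:
    """Fraction of variant-supporting reads on the forward strand."""
    return sum(not r.is_reverse for r in reads) / len(reads)


def mean_relative_distance_to_3prime(reads: List[VariantRead]) -> float:
    """Mean distance from the variant to the effective 3' end, as a fraction of read length."""
    return sum(
        (r.effective_length - 1 - r.variant_offset) / r.effective_length
        for r in reads
    ) / len(reads)


def artifact_flags(reads: List[VariantRead],
                   max_strand_fraction: float = 0.9,     # illustrative threshold
                   min_3prime_distance: float = 0.1      # illustrative threshold
                   ) -> List[str]:
    """Return the names of the artifact patterns this site shows."""
    if not reads:
        return []
    flags = []
    fwd = forward_fraction(reads)
    # Nearly all support on one strand suggests strand bias.
    if fwd > max_strand_fraction or fwd < 1 - max_strand_fraction:
        flags.append("strand_bias")
    # Support seen only near the effective 3' end suggests a read-position artifact.
    if mean_relative_distance_to_3prime(reads) < min_3prime_distance:
        flags.append("read_position")
    return flags


# Made-up reads: one suspicious site and one well-supported site.
suspicious = [VariantRead(False, off, 100) for off in (98, 99, 97, 99)]
balanced = [VariantRead(False, 50, 100), VariantRead(True, 40, 100),
            VariantRead(False, 60, 100), VariantRead(True, 30, 100)]
print(artifact_flags(suspicious), artifact_flags(balanced))
```

In practice such cutoffs would need to be tuned against validated calls so that sensitivity for true mutations is maintained.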
