Wednesday 26 September 2012

CORTEX update contains scripts for assembly of large numbers of samples with large genomes - i.e. for the 1000 Genomes project.

cortex_var is a tool for genome assembly and variation analysis from sequence data. You can use it to discover and genotype variants on single or multiple haploid or diploid samples. If you have multiple samples, you can use Cortex to look specifically for variants that distinguish one set of samples (eg phenotype=X, cases, parents, tumour) from another set of samples (eg phenotype=Y, controls, child, normal). See our Nature Genetics paper and the documentation for detailed descriptions.

The Cortex paper is now out in Nature Genetics! 

cortex_var features

  • Variant discovery by de novo assembly - no reference genome required
  • Supports multicoloured de Bruijn graphs - have multiple samples loaded into the same graph in different colours, and find variants that distinguish them.
  • Capable of calling SNPs, indels, inversions, complex variants, small haplotypes
  • Extremely accurate variant calling - see our paper for base-pair-resolution validation of entire alleles (rather than just breakpoints) of SNPs, indels and complex variants by comparison with fully sequenced (and finished) fosmids - a level of validation beyond that demanded of any other variant caller we are aware of - currently cortex_var is the most accurate variant caller for indels and complex variants.
  • Capable of aligning a reference genome to a graph and using that to call variants
  • Support for comparing cases/controls or phenotyped strains
  • Typical memory use: 1 high coverage human in under 80Gb of RAM, 1000 yeasts in under 64Gb RAM, 10 humans in under 256 Gb RAM
23rd August 2012: Bugfix release v1.0.5.11. Get it here.. The main change in this release is in the scripts/1000genomes directory, which I have not advertised previously. It contains scripts for running Cortex on large numbers (tens, hundreds) of samples with large genomes - i.e. for the 1000 Genomes project. These are to allow collaborators across the world to reliably run a consistent Cortex pipeline on human populations. However this is the first time people other than me have done this, so I expect there may be some smoothing-out of issues in the near future. You can see aPDF describing the pipeline here. I've had enough people ask me about running Cortex on lots of samples with big genomes, that I thought people would find it useful to see the process. This release is a bugfix for a script in that 1000 Genomes directory, plus fixes for a few potential bugs-in-waiting (array overflow errors) in Cortex itself.

No comments:

Post a Comment

Datanami, Woe be me