Wednesday, 30 November 2011
The Human OligoGenome Resource: a database of oligonucleotide capture probes for resequencing target regions across the human genome
Nucl. Acids Res. (2011) doi: 10.1093/nar/gkr973 First published online: November 18, 2011
Abstract: Recent exponential growth in the throughput of next-generation DNA sequencing platforms has dramatically spurred the use of accessible and scalable targeted resequencing approaches. This includes candidate region diagnostic resequencing and novel variant validation from whole genome or exome sequencing analysis. We have previously demonstrated that selective genomic circularization is a robust in-solution approach for capturing and resequencing thousands of target human genome loci such as exons and regulatory sequences. To facilitate the design and production of customized capture assays for any given region in the human genome, we developed the Human OligoGenome Resource (http://oligogenome.stanford.edu/). This online database contains over 21 million capture oligonucleotide sequences. It enables one to create customized and highly multiplexed resequencing assays of target regions across the human genome and is not restricted to coding regions. In total, this resource provides 92.1% in silico coverage of the human genome. The online server allows researchers to download a complete repository of oligonucleotide probes and design customized capture assays to target multiple regions throughout the human genome. The website has query tools for selecting and evaluating capture oligonucleotides from specified genomic regions.
Nucl. Acids Res. (2011) doi: 10.1093/nar/gkr995 First published online: November 18, 2011. http://nar.oxfordjournals.org/content/early/2011/11/17/nar.gkr995.long
Abstract: Due to ongoing advances in sequencing technologies, billions of nucleotide sequences are now produced on a daily basis. A major challenge is to visualize these data for further downstream analysis. To this end, we present GenomeView, a stand-alone genome browser specifically designed to visualize and manipulate a multitude of genomics data. GenomeView enables users to dynamically browse high volumes of aligned short-read data, with dynamic navigation and semantic zooming, from the whole genome level to the single nucleotide. At the same time, the tool enables visualization of whole genome alignments of dozens of genomes relative to a reference sequence. GenomeView is unique in its capability to interactively handle huge data sets consisting of tens of aligned genomes, thousands of annotation features and millions of mapped short reads both as viewer and editor. GenomeView is freely available as an open source software package.
Abstract | eXframe: reusable framework for storage, analysis and visualization of genomics experiments
Genome-wide experiments are routinely conducted to measure gene expression, DNA-protein interactions and epigenetic status. Structured metadata for these experiments is imperative for a complete understanding of experimental conditions, to enable consistent data processing and to allow retrieval, comparison, and integration of experimental results. Even though several repositories have been developed for genomics data, only a few provide annotation of samples and assays using controlled vocabularies. Moreover, many of them are tailored for a single type of technology or measurement and do not support the integration of multiple data types.
We have developed eXframe, a reusable web-based framework for genomics experiments that provides: 1) the ability to publish structured data compliant with accepted standards; 2) support for multiple data types including microarrays and next-generation sequencing; and 3) integrated query, analysis and visualization tools (enabled by consistent processing of the raw data and annotation of samples). It is available as open-source software. We present two case studies where this software is currently being used to build repositories of genomics experiments - one contains data from hematopoietic stem cells and another from Parkinson's disease patients.
The web-based framework eXframe offers structured annotation of experiments as well as uniform processing and storage of molecular data from microarray and next generation sequencing platforms. The framework allows users to query and integrate information across species, technologies, measurement types and experimental conditions. Our framework is reusable and freely modifiable - other groups or institutions can deploy their own custom web-based repositories based on this software. It is interoperable with the most important data formats in this domain. We hope that other groups will not only use eXframe, but also contribute their own useful modifications.
Video: Non-Laser Capture Microscopy Approach for the Microdissection of Discrete Mouse Brain Regions for Total RNA Isolation and Downstream Next-Generation Sequencing and Gene Expression Profiling
As technological platforms, approaches such as next-generation sequencing, microarray, and qRT-PCR have great promise for expanding our understanding of the breadth of molecular regulation. Newer approaches such as high-resolution RNA sequencing (RNA-Seq)1 provide new and expansive information about tissue- or state-specific expression such as relative transcript levels, alternative splicing, and microRNAs2-4. Prospects for employing the RNA-Seq method in comparative whole transcriptome profiling5 within discrete tissues or between phenotypically distinct groups of individuals afford new avenues for elucidating molecular mechanisms involved in both normal and abnormal physiological states. Recently, whole transcriptome profiling has been performed on human brain tissue, identifying gene expression differences associated with disease progression6. However, the use of next-generation sequencing has yet to be more widely integrated into mammalian studies.
Gene expression studies in mouse models have reported distinct profiles within various brain nuclei using laser capture microscopy (LCM) for sample excision7,8. While LCM affords sample collection with single-cell and discrete brain region precision, the relatively low total RNA yields from the LCM approach can be prohibitive to RNA-Seq and other profiling approaches in mouse brain tissues and may require sub-optimal sample amplification steps. Here, a protocol is presented for microdissection and total RNA extraction from discrete mouse brain regions. Set-diameter tissue corers are used to isolate 13 tissues from 750-μm serial coronal sections of an individual mouse brain. Tissue micropunch samples are immediately frozen and archived. Total RNA is obtained from the samples using magnetic bead-enabled total RNA isolation technology. Resulting RNA samples have adequate yield and quality for use in downstream expression profiling. This microdissection strategy provides a viable alternative to existing sample collection strategies for obtaining total RNA from discrete brain regions, opening possibilities for new gene expression discoveries.
Monday, 28 November 2011
Well I did a test run of 32-bit 11.10 in a virtual environment, giving it 2 GB RAM and a single core; Unity or something definitely slowed it down too much to be usable .. guess I will stick with 10.04 LTS for now till it's sorted out.
Whole transcriptome sequencing by mRNA-Seq is now used extensively to perform global gene expression, mutation, allele-specific expression and other genome-wide analyses. mRNA-Seq even opens the gate for gene expression analysis of non-sequenced genomes. mRNA-Seq offers high sensitivity, a large dynamic range and allows measurement of transcript copy numbers in a sample. Illumina's Genome Analyzer performs sequencing of a large number (> 10^7) of relatively short sequence reads (< 150 bp). The "paired end" approach, wherein a single long read is sequenced at both its ends, allows for tracking alternate splice junctions, insertions and deletions, and is useful for de novo transcriptome assembly. One of the major challenges faced by researchers is a limited amount of starting material. For example, in experiments where cells are harvested by laser micro-dissection, available starting total RNA may measure in nanograms. Preparation of mRNA-Seq libraries from such samples has been described(1, 2) but involves significant PCR amplification that may introduce bias. Other RNA-Seq library construction procedures with minimal PCR amplification have been published(3, 4) but require microgram amounts of starting total RNA. Here we describe an mRNA-Seq library preparation protocol for the Illumina Genome Analyzer II platform that avoids significant PCR amplification and requires only 10 nanograms of total RNA. While this protocol has been described previously and validated for single-end sequencing(5), where it was shown to produce directional libraries without introducing significant amplification bias, here we validate it further for use as a paired end protocol. We selectively amplify polyadenylated messenger RNAs from starting total RNA using the T7-based Eberwine linear amplification method, coined "T7LA" (T7 linear amplification). The amplified poly-A mRNAs are fragmented, reverse transcribed and adapter ligated to produce the final sequencing library.
For both single read and paired end runs, sequences are mapped to the human transcriptome(6) and normalized so that data from multiple runs can be compared. We report the gene expression measurement in units of transcripts per million (TPM), which is a superior measure to RPKM when comparing samples(7).
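The TPM normalization mentioned above is easy to sketch in code. Below is a minimal Python illustration with made-up counts and transcript lengths (not data from the paper): divide each gene's read count by its transcript length, then rescale so the per-base rates sum to one million.

```python
# Sketch of TPM (transcripts per million) from raw read counts.
# Counts and lengths below are hypothetical, purely for illustration.
def tpm(counts, lengths):
    """counts: mapped reads per gene; lengths: transcript lengths in bp."""
    rates = [c / l for c, l in zip(counts, lengths)]  # reads per base
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

counts = [100, 200, 300]      # hypothetical mapped-read counts
lengths = [1000, 2000, 3000]  # hypothetical transcript lengths (bp)
print(tpm(counts, lengths))   # all three genes get equal TPM here
```

Because every sample's TPM values sum to the same constant, TPM values are directly comparable across samples, which is the advantage over RPKM, where the per-sample sums differ.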
Saturday, 26 November 2011
From: "Heng Li"
Date: 24 Nov 2011 11:38
Thanks for all the replies. I will disable the color-space support in the 0.6.x branch, but leave non-functional source code in the files (though this is not my style). In future, I may re-evaluate the necessity of supporting color-space alignment in the 0.6.x branch. People who use bwa for color-space alignment may continue to use 0.5.10. 0.5.10 is as accurate as 0.6.x. It may be slower but just a little bit.
Thank you all,
Friday, 25 November 2011
Convey Computer's new Burrows-Wheeler Aligner (BWA) personality dramatically accelerates genome reference mapping by 15x, enabling researchers and clinicians to more rapidly and cost-effectively identify variants.
not affiliated .. interesting
Thursday, 24 November 2011
Genome-wide profiling of novel and conserved Populus microRNAs involved in pathogen stress response by deep sequencing.. [Planta. 2011] - PubMed - NCBI
MicroRNAs (miRNAs) are small RNAs, generally of 20-23 nt, that down-regulate target gene expression during development, differentiation, growth, and metabolism. In Populus, extensive studies of miRNAs involved in cold, heat, dehydration, salinity, and mechanical stresses have been performed; however, there are few reports profiling the miRNA expression patterns during pathogen stress. We obtained almost 38 million raw reads through Solexa sequencing of two libraries from Populus inoculated and uninoculated with canker disease pathogen. Sequence analyses identified 74 conserved miRNA sequences belonging to 37 miRNA families from 154 loci in the Populus genome and 27 novel miRNA sequences from 35 loci, including their complementary miRNA* strands. Intriguingly, the miRNA* of three conserved miRNAs were more abundant than their corresponding miRNAs. The overall expression levels of conserved miRNAs increased when subjected to pathogen stress, and expression levels of 33 miRNA sequences markedly changed. The expression trends determined by sequencing and by qRT-PCR were similar. Finally, nine target genes for three conserved miRNAs and 63 target genes for novel miRNAs were predicted using computational analysis, and their functions were annotated. Deep sequencing provides an opportunity to identify pathogen-regulated miRNAs in trees, which will help in understanding the regulatory mechanisms of plant defense responses during pathogen infection.
Wednesday, 23 November 2011
What do graphs look like? How do they evolve over time? How do you handle a graph with a billion nodes? Chris presents a comprehensive list of static and temporal laws, grounded in recent observations on real graphs. He then presents tools for discovering anomalies and patterns in graphs. Finally, an overview of the PEGASUS system which is designed to handle billion-node graphs using Hadoop.
2. The Art and Science of Matching Items to Users by Deepak Agarwal (Yahoo! Research)
Algorithmically matching items to users in a given context is essential for the success and profitability of large scale recommender systems like content optimization, computational advertising, search, shopping, movie recommendation, and many more. In this talk, Deepak discusses some of the key technical challenges by focusing on a concrete application – content optimization on the Yahoo! front page. He also briefly discusses response prediction techniques for serving ads on the RightMedia Ad exchange.
3. Big Data in Real Time: Processing Data Streams at LinkedIn by Jay Kreps (LinkedIn)
My colleague, Jay Kreps, discusses the state of up-and-coming stream processing technologies and how they fit in the broader data infrastructure ecosystem — from live storage systems to Hadoop. He explores problems that are amenable to real-time stream processing, solutions that change and shape the way we think about data, and challenges and lessons that we have learned while building LinkedIn’s data infrastructure. A must-see presentation.
In addition to providing compelling speakers, Open Tech Talks offer attendees a low-pressure environment in which people with shared professional interests can reconnect with people they know, as well as make new connections. For those who cannot attend, we live-stream the talks and post the entire recordings on YouTube.
Tuesday, 22 November 2011
1 teraflop of double-precision floating point performance, 50+ cores ... it would be interesting to see how GPU servers fare now that Intel is pushing this chip in HPC ... that said, I don't know anyone who uses GPUs for bioinformatics work ... do you?
Compare, contrast, decide!
NGS Field Guide – Overview
The tables presented in Glenn (2011) are split and updated in the following:
- Table 1a-c. "Grades" for common applications on various NGS instruments. Other information from the original table 1 is relatively static.
- Table 2a. Run time, Millions of reads/run, Bases/read, and Yield/run for all common commercial NGS platforms.
- Table 2b. Reagent costs/run, reagent costs/Mb, and minimum commercially available units for all common commercial NGS platforms.
- Table 3a. List purchase price for all common commercial NGS platforms, ancillary equipment, and service contracts.
- Table 3b. Computational resources required for all common commercial NGS platforms.
- Table 3c. Errors and error rates for common commercial NGS platforms.
- Table 4. Advantages and Disadvantages for all common commercial NGS platforms.
©2011 Blackwell Publishing Ltd
Monday, 21 November 2011
sudo dmidecode | less
e.g. of output
Socket Designation: Microprocessor
Type: Central Processor
Family: Pentium 4
ID: 41 0F 00 00 FF FB EB BF
Signature: Type 0, Family 15, Model 4, Stepping 1
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
DS (Debug store)
ACPI (ACPI supported)
MMX (MMX technology supported)
FXSR (Fast floating-point save and restore)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
HTT (Hyper-threading technology)
TM (Thermal monitor supported)
PBE (Pending break enabled)
Version: Not Specified
Voltage: 1.7 V
External Clock: 800 MHz
Max Speed: 4000 MHz
Current Speed: 3000 MHz
Status: Populated, Enabled
Upgrade: ZIF Socket
L1 Cache Handle: 0x0700
L2 Cache Handle: 0x0701
L3 Cache Handle: Not Provided
Sunday, 20 November 2011
Or on Galaxy, which has color-space support only on its test server ...
If bwa drops SOLiD support, the remaining open-source options off the top of my head are BFAST and Bowtie.
There's also Novoalign and BioScope (not open source).
But the drop of SOLiD support might also reflect a decrease in SOLiD data in the wild ...
Possibly due to Life Tech's push for Ion Torrent ...
What are your views?
On 20 Nov 2011 03:29, "Heng Li" wrote:
> The color-space alignment is not working in 0.6.0. Perhaps it is not so hard to make it work again, but bwa may not work well with solid reads all the time. Actually I have never evaluated this myself. 0.5.10 should work solid data.
> Any objections? Do you think it is worth keeping the color-space support in bwa?
> Bio-bwa-help mailing list
Friday, 18 November 2011
They have secured $15 million in funding led by Google Ventures, so this is a company to watch ..
Next-generation sequencing cloud computing for biologists.
Combining industry leading NGS technology with easy-to-use bioinformatics, storage, and sharing.
SOLiD™BioScope.com offers customers an alternative to buying and maintaining the expensive compute infrastructure typically required for Next Generation Sequencing (NGS) data analysis. Your NGS data and the tools necessary to analyze that data are available to you wherever you access the internet—be it in the lab or on the beach!
What are your views on the available solutions?
UCSC very good description of the BED format
Bedtools attempts to auto-detect the file formats used in each command. Thus, as long as your files conform to the format definitions for each file type, you should be okay. For example:
- BAM is zero-based, half-open. SAM is 1-based, closed.
- BED is zero-based, half-open.
- Zero-length intervals (start == end) in BED format are interpreted as insertions in the reference.
I can't confirm this, but off the top of my head I recall that
a SNP at position P is written as start = P - 1, end = P in BED format
(source: mostly the BEDTools mailing list)
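A tiny Python sketch of these coordinate conventions (the function names are my own, purely for illustration): in zero-based, half-open coordinates, a single base at 1-based position P becomes the interval [P-1, P), and interval length is a plain subtraction.

```python
# BED/BAM are zero-based, half-open; SAM/VCF positions are 1-based, closed.
def snp_to_bed(pos_1based):
    """A SNP at 1-based position P becomes the BED interval [P-1, P)."""
    return pos_1based - 1, pos_1based

def bed_length(start, end):
    """Half-open coordinates make lengths a simple subtraction."""
    return end - start

start, end = snp_to_bed(100)  # SNP at 1-based position 100
print(start, end)             # -> 99 100
print(bed_length(start, end)) # -> 1 (a single base)
```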
Life is always easier when you find the correct tool.
I also adopt the path of least resistance when trying to solve problems that are more common than I imagine.
There's always the good old linux tools for comparing SNPs called from different programs / options
grep | sed | awk | cut | diff | comm
and if you are working with NGS data, you most probably already have samtools installed on your system and you might have used bcftools
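For SNP call sets specifically, the comparison that `comm` does on sorted files can be sketched with Python sets: positions unique to caller A, unique to caller B, and shared by both. The calls below are made up for illustration.

```python
# Hypothetical (chrom, position) call sets from two variant callers.
calls_a = {("chr1", 100), ("chr1", 250), ("chr2", 17)}
calls_b = {("chr1", 100), ("chr2", 17), ("chr3", 5)}

only_a = calls_a - calls_b  # like `comm -23` on sorted files
only_b = calls_b - calls_a  # like `comm -13`
shared = calls_a & calls_b  # like `comm -12`
print(len(only_a), len(only_b), len(shared))  # -> 1 1 2
```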
Did you know that there's also an (unrelated) set of tools called vcftools?
- The vcftools binary program, generally used to analyse VCF files.
- The Vcf.pm Perl module, a general Perl API that forms the core of the utilities vcf-convert, vcf-merge, vcf-compare, vcf-isec, and others.
Examples of usage by topic
Then there's also the highly used BEDTools http://code.google.com/p/bedtools/
which I highly recommend keeping as part of your tool collection. Check out the link below.
Do watch out for this 'oversight' in vcftools as pointed out in seqanswers.
Overlap number discrepancy between VCFTools and BEDTools
Whole genome resequencing of Black Angus and Holstein cattle for SNP and CNV discovery using SOLID [BMC Genomics. 2011] - PubMed - NCBI
One of the goals of livestock genomics research is to identify the genetic differences responsible for variation in phenotypic traits, particularly those of economic importance. Characterizing the genetic variation in livestock species is an important step towards linking genes or genomic regions with phenotypes. The completion of the bovine genome sequence and recent advances in DNA sequencing technology allow for in-depth characterization of the genetic variations present in cattle. Here we describe the whole-genome resequencing of two Bos taurus bulls from distinct breeds for the purpose of identifying and annotating novel forms of genetic variation in cattle.
The genomes of a Black Angus bull and a Holstein bull were sequenced to 22-fold and 19-fold coverage, respectively, using the ABI SOLiD system. Comparisons of the sequences with the Btau4.0 reference assembly yielded 7 million single nucleotide polymorphisms (SNPs), 24% of which were identified in both animals. Of the total SNPs found in Holstein, Black Angus, and in both animals, 81%, 81%, and 75% respectively are novel. In-depth annotations of the data identified more than 16 thousand distinct non-synonymous SNPs (85% novel) between the two datasets. Alignments between the SNP-altered proteins and orthologues from numerous species indicate that many of the SNPs alter well-conserved amino acids. Several SNPs predicted to create or remove stop codons were also found. A comparison between the sequencing SNPs and genotyping results from the BovineHD high-density genotyping chip indicates a detection rate of 91% for homozygous SNPs and 81% for heterozygous SNPs. The false positive rate is estimated to be about 2% for both the Black Angus and Holstein SNP sets, based on follow-up genotyping of 422 and 427 SNPs, respectively. Comparisons of read depth between the two bulls along the reference assembly identified 790 putative copy-number variations (CNVs). Ten randomly selected CNVs, five genic and five non-genic, were successfully validated using quantitative real-time PCR. The CNVs are enriched for immune system genes and include genes that may contribute to lactation capacity. The majority of the CNVs (69%) were detected as regions with higher abundance in the Holstein bull.
Substantial genetic differences exist between the Black Angus and Holstein animals sequenced in this work and the Hereford reference sequence, and some of this variation is predicted to affect evolutionarily conserved amino acids or gene copy number. The deeply annotated SNPs and CNVs identified in this resequencing study can serve as useful genetic tools, and as candidates in searches for phenotype-altering DNA differences.
Thursday, 17 November 2011
Feature based classifiers for somatic mutation detection in tumour-normal paired sequencing data. [Bioinformatics. 2011] - PubMed - NCBI
The study of cancer genomes now routinely involves using next generation sequencing technology (NGS) to profile tumors for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge.
We present the comparison of four standard supervised machine learning algorithms for the purpose of somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine, and logistic regression) we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on this data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth 'false positive' predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts, illuminating important directions for future study.
Software called MutationSeq and datasets are available from http://compbio.bccrc.ca.
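This is not the MutationSeq code, but the k-fold cross-validation scheme used in evaluations like the one above is simple enough to sketch in plain Python: split the samples into k disjoint folds and hold each fold out exactly once as the test set.

```python
# Sketch of k-fold cross-validation index splitting (toy version;
# real evaluations typically also shuffle and stratify by class).
def kfold_indices(n, k):
    """Split sample indices 0..n-1 into k disjoint (train, test) splits."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for fold in folds:
        train = sorted(set(range(n)) - set(fold))
        splits.append((train, fold))
    return splits

for train, test in kfold_indices(n=10, k=5):
    print(len(train), len(test))  # -> 8 2, five times
```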
New version: 2.0.4.rc1 (2011-11-15). Release Candidate 1. http://snpeff.sourceforge.net/
Take a look at all the new features added
- Database download command, e.g. "java -jar snpEff.jar download GRCH37.64"
- RefSeq annotations support added.
- Rogue transcript filter: by default SnpEff filters out some suspicious transcripts from annotation databases. This should reduce false positive rates.
- Amino acid changes in HGVS style (VCF output)
- SnpSift: Added 'intIdx', looks for intervals using indexing and memory mapped I/O on the VCF file. Works really fast! Designed to extract a small number of intervals from huge VCF files.
- Optimized parsing for VCF files with a large number of samples (genotypes).
- Option to suppress summary calculation ('-noStats'), which can speed up processing considerably in some cases.
- Option '-onlyCoding' is set to 'auto' by default to reduce the number of false positives (see next).
- Option '-onlyCoding' can be assigned a value: if 'true', report only 'protein_coding' transcripts as protein-coding changes; if 'false', report all transcripts as if they were coding. Default: 'auto', i.e. if any transcripts are marked as 'protein_coding' then set it to 'true'; if no transcripts are marked as 'protein_coding' then set it to 'false'.
- Added BED output format. This is useful to annotate the output of a ChIP-Seq experiment (e.g. after performing peak calling with MACS, you want to know where the peaks hit).
Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems |Genome Biology |
The generation and analysis of high-throughput sequencing data is becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95-150 bases.
We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strand separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range.
The errors and biases we report have implications on the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low abundance polymorphisms.
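A hedged sketch of the kind of per-read quality filtering the paper discusses, assuming Sanger/Illumina 1.8+ quality encoding (ASCII offset 33); the Q30 cutoff is an arbitrary example of mine, not a value from the paper:

```python
# Drop reads whose mean Phred quality falls below a cutoff.
def mean_phred(qual_string, offset=33):
    """Decode a Sanger/Illumina-1.8 quality string (ASCII offset 33)."""
    scores = [ord(c) - offset for c in qual_string]
    return sum(scores) / len(scores)

def passes_filter(qual_string, min_mean=30):
    return mean_phred(qual_string) >= min_mean

print(passes_filter("IIIIIIII"))  # 'I' = Q40 -> True
print(passes_filter("########"))  # '#' = Q2  -> False
```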
Came across this new blog, which also highlighted a very useful RNA-seq tutorial that can be done entirely in Galaxy (it comes with sample data).
I have also previously highlighted this tutorial in
Costion, Ford et al., PLoS One
Investigators at Australia's University of Adelaide show that "plant DNA barcodes can accurately estimate species richness in poorly known floras." In a case study, the Adelaide team demonstrates the "potential of plant DNA barcodes for the rapid estimation of species richness in taxonomically poorly known areas or cryptic populations revealing a powerful new tool for rapid biodiversity assessment." Overall, the team says it shows that "although DNA barcodes fail to discriminate all species of plants, new perspectives and methods on biodiversity value and quantification may overshadow some of these shortcomings by applying barcode data in new ways."
Tuesday, 15 November 2011
You will need to download the Juniper client; the easiest way is to try to log in to the WebVPN address via a browser.
for NUS it is
Answer 'yes' to all the questions
You should be connected :)
There's another similar guide @ http://wireless.siu.edu/install-ubuntu-64.htm
Are there ways in which such gaps between scientific knowledge and public acceptance can be bridged?
There is much evidence that the framing of information facilitates its acceptance when it no longer threatens people's worldview. Hierarchical-individualist (HI) individuals are more likely to accept climate science when the proposed solution involves nuclear power than when it involves emission cuts.
Similarly, the messenger matters. HPV vaccination is more likely to be found acceptable by HI individuals if arguments in its favour are presented by someone clearly identified as hierarchical-individualistic.
Monday, 14 November 2011
RT @thinkgenome: How to apply de Bruijn graphs to genome assembly : Nature ...: A mathematical concept known as a de Bruijn graph... http://t.co/UXNfYh28
Warning! Math ahead ...
Finally! An article that explains de Bruijn graphs to biologists
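For the curious, the core idea is small enough to sketch in Python: nodes are (k-1)-mers, every k-mer in the reads contributes an edge, and assembly becomes a walk through the graph. The reads and the greedy walk below are toy examples only; real assemblers also handle branching, sequencing errors, and repeats.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: edge prefix(kmer) -> suffix(kmer)."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedy walk for a non-branching toy graph (consumes edges)."""
    seq, node = start, start
    while graph[node]:
        node = graph[node].pop(0)
        seq += node[-1]  # each step appends one base
    return seq

g = de_bruijn(["ATGGCG", "GGCGTG"], k=3)
print(walk(g, "AT"))  # -> ATGGCGTG for this toy case
```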
Saturday, 12 November 2011
Next-generation genomic technology has both greatly accelerated the pace of genome research and increased our reliance on draft genome sequences. While groups such as the Genomics Standards Consortium have made strong efforts to promote genome standards, there is still a general lack of uniformity among published draft genomes, leading to challenges for downstream comparative analyses. This lack of uniformity is a particular problem when using standard draft genomes that frequently have large numbers of low-quality sequencing tracts. Here we present a proposal for an "enhanced-quality draft" genome that identifies at least 95% of the coding sequences, thereby effectively providing a full accounting of the genic component of the genome. Enhanced-quality draft genomes are easily attainable through a combination of small- and large-insert next-generation, paired-end sequencing. We illustrate the generation of an enhanced-quality draft genome by re-sequencing the plant pathogenic bacterium Pseudomonas syringae pv. phaseolicola 1448A (Pph 1448A), which has a published, closed genome sequence of 5.93 Mbp. We use a combination of Illumina paired-end and mate-pair sequencing, and surprisingly find that de novo assemblies with 100x paired-end coverage and mate-pair sequencing with coverage as low as 2-5x are substantially better than assemblies based on higher coverage. The rapid and low-cost generation of large numbers of enhanced-quality draft genome sequences will be of particular value for microbial diagnostics and biosecurity, which rely on precise discrimination of potentially dangerous clones from closely related benign strains.