Friday, 30 December 2011

Article: PLoS ONE: myKaryoView: A Light-Weight Client for Visualization of Genomic Data

PLoS ONE: myKaryoView: A Light-Weight Client for Visualization of Genomic Data

 a web tool for visualization of genomic data specifically designed for direct-to-consumer genomic data that uses publicly available data distributed throughout the Internet. It does not require data to be held locally and it is capable of rendering any feature as long as it conforms to DAS specifications. Configuration and addition of sources to myKaryoView can be done through the interface. Here we show a proof of principle of myKaryoView's ability to display personal genomics data with 23andMe genome data sources. The tool is available at:

Vim the ubiquitous ide

Vim: revisited Read it online: My love affair with vim started with a futile effort in locating a perl ide. Then I chanced on a debate between vim and emacs I tried both but turns out vim is installed by default and emacs wasn't so I started on a journey of being used to vim and unable to use notepad forever. Right now I am playing with cream and wondering if it's worth the effort just for using crtl s for pasting clipboard items.

Thursday, 29 December 2011

multiBamCov (BAM) Counts sequence coverage for multiple position-sorted bams at specific loci defined in a BED/GFF/VCF file BedTools

multiBamCov will report a count of alignments for each BAM at each  interval

Program: multiBamCov (v2.14.2)
Summary: Counts sequence coverage for multiple bams at specific loci.

Usage:   multiBamCov [OPTIONS] -bams aln.1.bam aln.2.bam ... aln.n.bam -bed

        -bams   The bam files.

        -bed    The bed file.

        -q      Minimum mapping quality allowed. Default is 0.

        -D      Include duplicate-marked reads.  Default is to count non-duplicates only

        -F      Include failed-QC reads.  Default is to count pass-QC reads only

        -p      Only count proper pairs.  Default is to count all alignments with MAPQ
                greater than the -q argument, regardless of the BAM FLAG field.

 the sample output looks like

  chr20 0 100000 19312 16844
  chr20 100000 200000 43910 37579
  chr20 200000 300000 43245 43215
  chr20 300000 400000 41653 47556
  chr20 400000 500000 42929 43165
  chr20 500000 600000 44265 45325 

The last two columns are the count  of alignments in the first and second BAM, respectively.

FAQ - How to get average depth of coverage for targeted seq regions using BedTools

If I have a list of targeted resequencing regions and a bam file is
there any easy way to get the average depth of coverage for each region?
I could calculate it from all of the values in the histogram output from
coverageBed -hist but that seems messy. In addition is there any way to
adjust the minimum number of hits at a location for it to be included in
the fraction of bases in B that had non-zero coverage into bases in B
that had more-than-X coverage?

There is no option for directly computing the mean coverage. However, one can combine the -d option (reports the depth at each base in each interval) with the groupBy tool found in the filo package (
) to compute the mean.  For example, here's a 5 column BED of the UCSC simple repeat track:

$ head simRep.chr22.bed
22      16052163        16052199        trf     65
22      16059465        16059496        trf     53
22      16060479        16060524        trf     90
22      16060479        16060592        trf     120
22      16060517        16060609        trf     129
22      16071925        16071962        trf     74
22      16078616        16078686        trf     86
22      16079642        16079731        trf     61
22      16079644        16079696        trf     59
22      16079648        16079724        trf     52

I can compute the coverage at each base in each of these intervals as follows (column 6 is the position in the interval, column 7 is the depth):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | head
22      17825583        17825839        trf     372     1       2
22      17825583        17825839        trf     372     2       2
22      17825583        17825839        trf     372     3       2
22      17825583        17825839        trf     372     4       2
22      17825583        17825839        trf     372     5       2
22      17825583        17825839        trf     372     6       2
22      17825583        17825839        trf     372     7       3
22      17825583        17825839        trf     372     8       3
22      17825583        17825839        trf     372     9       3
22      17825583        17825839        trf     372     10      3

Now, I can compute the mean by grouping on columns 1-5 of the output (the orig. BED entry) and computing the mean on column 7 for each grouped interval (the last column in the output below is the mean):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | groupBy -g 1,2,3,4,5 -c 7 -o mean | head
22      17825583        17825839        trf     372     2.96875
22      16506905        16558419        trf     20099   7.6543852156695271205
22      16506905        16558419        trf     23903   7.6543852156695271205
22      16506941        16537880        trf     14401   7.47002165551569241586
22      16515001        16516375        trf     832     4.88427947598253275885
22      17039134        17049375        trf     5493    6.38755980861244054836
22      17039134        17049375        trf     4544    6.38755980861244054836
22      17039134        17049375        trf     5071    6.38755980861244054836
22      18087713        18088319        trf     759     5.42904290429042912791
22      19529662        19529964        trf     469     5.91721854304635730415

Note that coverageBed does not maintain the order of the intervals in the original file, so you might want to resort the output first:

$ coverageBed -abam aln.bam -b simRep.22.bed -d | sort -k1,1 -k2,2n | groupBy -g 1,2,3,4,5 -c 7 -o mean

>  In addition is there any way to
> adjust the minimum number of hits at a location for it to be included in
> the fraction of bases in B that had non-zero coverage into bases in B
> that had more-than-X coverage?

There is no way to do this except by taking the raw output of the -d or -hist and computing this yourself.


Dissecting plant genomes with the PLAZA comparative genomics platform.
Click here to read

Plant Physiol. 2011 Dec 23. [Epub ahead of print]
Dissecting plant genomes with the PLAZA comparative genomics platform.


1 VIB - Ghent University;


With the arrival of low-cost, next-generation sequencing a multitude of new plant genomes is being publicly released, providing unseen opportunities and challenges for comparative genomics studies. Here, we present PLAZA 2.5, a user-friendly online research environment to explore genomic information from different plants. This new release features updates to previous genome annotations and a substantial number of newly available plant genomes, as well as various new interactive tools and visualizations. Currently, PLAZA hosts 25 organisms covering a broad taxonomic range, including 13 eudicots, five monocots, one Lycopod, one moss, and five algae. The available data consist of structural and functional gene annotations, homologous gene families, multiple sequence alignments, phylogenetic trees, and colinear regions within and between species. A new Integrative Orthology Viewer, combining information from different orthology prediction methodologies, was developed to efficiently investigate complex orthology relationships. Cross-species expression analysis revealed that the integration of complementary data types extended the scope of complex orthology relationships, especially between more distantly related species. Finally, based on phylogenetic profiling, we propose a set of core gene families within the green plant lineage that will be instrumental to assess the gene space of draft or newly sequenced plant genomes during the assembly or annotation phase.

ART: a next-generation sequencing read simulator.

Bioinformatics. 2011 Dec 23. [Epub ahead of print]
ART: a next-generation sequencing read simulator.


Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.



ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly, and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa, and Applied Biosystems' SOLiD.ART also allows the flexibility to use customized read error model parameters and quality profiles.


Both source and binary software packages are available at

NAR RNASEQR a RNA seq pipeline written in python

RNASEQR was written in Python 2.7 and runs on 64-bit Linux systems. It employs a Burrows–Wheeler transform (BWT)-based and a hash-based indexing algorithm. Briefly, there are three sequential processing steps: the first step is to align RNA-Seq sequences to a transcriptomic reference; the second step is to detect novel exons; the third step is to identify novel splice junctions using an anchor-and-align strategy.

Top 10 Reasons to Upgrade to a 5500 Series Genetic Analysis System

Reason 10 is kinda unexpected but it's a good effort by Life Tech

Designed to minimize environmental impact—The 5500 Series Genetic Analyzers separate the truly hazardous waste from the nonhazardous waste, reducing the amount of waste classified as hazardous by as much as 38%. All bottles and strip tubes are recyclable and clearly marked with either a (2) or a (5) recycling symbol. All product documentation, including user manuals, quick start guides, and product inserts are delivered on an Apple iPad® device. (We like trees.)

Amazon S3 Announces Object Expiration

Dear Amazon S3 Customer,

Today we're excited to announce Object Expiration, a new feature to help you efficiently manage data stored in Amazon S3. Object Expiration enables you to schedule the removal of objects after a defined time period.

You can define Object Expiration rules for a set of objects in your bucket. Each expiration rule allows you to specify a prefix and an expiration period in days. The prefix field (e.g. "logs/") identifies the object(s) subject to the expiration rule, and the expiration period specifies the number of days from creation date (i.e. age) after which object(s) should be removed. You may create multiple expiration rules for different prefixes. After an Object Expiration rule is added, the rule is applied to objects with the matching prefix that already exist in the bucket as well as new objects added to the bucket. Once the objects are past their expiration date, they will be queued for deletion. You will not be charged for storage for objects on or after their expiration date. Amazon S3 doesn't charge you for using Object Expiration. You can use Object Expiration rules on objects stored in both Standard and Reduced Redundancy storage. Using Object Expiration rules to schedule periodic removal of objects eliminates the need to build processes to identify objects for deletion and submit delete requests to Amazon S3.

You can start using Object Expiration today using the AWS Management Console or the Amazon S3 API.

To use Object Expiration from the AWS Management console:

Under the Amazon S3 tab, select the bucket on which you want to apply Object Expiration rules.
Select the "Properties" action on that bucket.
Select the "Lifecycle" Tab.
Under the "Lifecycle" tab, add an Object Expiration rule by specifying a prefix (e.g. "logs/") that matches the object(s) you want to expire. Also specify the number of days from creation after which object(s) matching the prefix should be expired.
You can optionally specify a name for the rule for better organization.

For more information on using Object Expiration, please see the Object Expiration topic in the Amazon S3 Developer Guide.

The Amazon S3 Team

We hope you enjoyed receiving this message. If you wish to remove yourself from receiving future product announcements and the monthly AWS Newsletter, please update your communication preferences.

Amazon Web Services LLC is a subsidiary of, Inc. is a registered trademark of, Inc. This message produced and distributed by Amazon Web Services, LLC, 1918 8th Avenue, Seattle, WA 98101.

Wednesday, 28 December 2011

Amazon Builds World's Fastest Nonexistent Supercomputer | Wired Enterprise |

Cycle Computing setup a virtual supercomputer for an unnamed pharmaceutical giant that spans 30,000 processor cores, and it cost $1,279 an hour. Stowe —who has spent more than two decades in the supercomputing game, working with supercomputers at Carnegie Mellon University and Cornell —says there's still a need for dedicated supercomputers you install in your own data center, but things are changing.

Tuesday, 27 December 2011

PLoS Biology: The Open Knowledge Foundation: Open Data Means Better Science

The Open Data in Science working group has a common goal of achieving a world in which scientific data is open by default according to the Panton Principles, with limited exceptions. As a diverse collection of individuals, the aims, objectives, and means to achieve this are a matter of healthy debate and we encourage others to join the discussion.

In terms of our primary aim of providing tools, apps, and datasets for generating, discovering, and reusing open data, ideas are flowing continuously but require the input of the wider scientific community in identifying the problems they face in publishing, discovering, and reusing data online and requesting assistance in solving them. The working group aims to provide a community and network that can respond to these needs and a hub for access to the resulting tools, which we hope all stakeholders in scientific data will find valuable. Better science—in terms of transparency, reproducibility, increased efficiency, and ultimately a greater benefit to society—depends on open data.

Sunday, 25 December 2011

Fwd: [Oases-users] Oases 0.2.04

Merry Xmas to all!

---------- Forwarded message ----------
From: Daniel Zerbino
Date: Sat, Dec 24, 2011 at 6:42 PM
Subject: [Oases-users] Oases 0.2.04

Dear Oases users,

Oases 0.2.04 is now available on github and at

In it:
- The manual was edited for clarity
- The pipeline script was amended to include kmax in the array of
assembled k-mers (as a C programmer I assumed kmin to be included, kmax
to be excluded. Apparently the rest of the world does not count this way)
- The merge function was corrected to remove redundancies.


Oases-users mailing list

Reducing the Effects of PCR Amplification and Sequencing Artifacts on 16S rRNA-Based Studies.

PLoS One. 2011;6(12):e27310. Epub 2011 Dec 14.
Reducing the Effects of PCR Amplification and Sequencing Artifacts on 16S rRNA-Based Studies.Schloss PD, Gevers D, Westcott SL.


Department of Microbiology & Immunology, University of Michigan, Ann Arbor, Michigan, United States of America.


The advent of next generation sequencing has coincided with a growth in interest in using these approaches to better understand the role of the structure and function of the microbial communities in human, animal, and environmental health. Yet, use of next generation sequencing to perform 16S rRNA gene sequence surveys has resulted in considerable controversy surrounding the effects of sequencing errors on downstream analyses. We analyzed 2.7×10(6) reads distributed among 90 identical mock community samples, which were collections of genomic DNA from 21 different species with known 16S rRNA gene sequences; we observed an average error rate of 0.0060. To improve this error rate, we evaluated numerous methods of identifying bad sequence reads, identifying regions within reads of poor quality, and correcting base calls and were able to reduce the overall error rate to 0.0002. Implementation of the PyroNoise algorithm provided the best combination of error rate, sequence length, and number of sequences. Perhaps more problematic than sequencing errors was the presence of chimeras generated during PCR. Because we knew the true sequences within the mock community and the chimeras they could form, we identified 8% of the raw sequence reads as chimeric. After quality filtering the raw sequences and using the Uchime chimera detection program, the overall chimera rate decreased to 1%. The chimeras that could not be detected were largely responsible for the identification of spurious operational taxonomic units (OTUs) and genus-level phylotypes. The number of spurious OTUs and phylotypes increased with sequencing effort indicating that comparison of communities should be made using an equal number of sequences. Finally, we applied our improved quality-filtering pipeline to several benchmarking studies and observed that even with our stringent data curation pipeline, biases in the data generation pipeline and batch effects were observed that could potentially confound the interpretation of microbial community data.

[PubMed - in process]

COLD-PCR Enrichment of Rare Cancer Mutations prior to Targeted Amplicon Resequencing.

Clin Chem. 2011 Dec 21. [Epub ahead of print]
COLD-PCR Enrichment of Rare Cancer Mutations prior to Targeted Amplicon Resequencing.
Milbury CA, Correll M, Quackenbush J, Rubio R, Makrigiorgos GM.


Division of DNA Repair and Genome Stability, Department of Radiation Oncology;



Despite widespread interest in next-generation sequencing (NGS), the adoption of personalized clinical genomics and mutation profiling of cancer specimens is lagging, in part because of technical limitations. Tumors are genetically heterogeneous and often contain normal/stromal cells, features that lead to low-abundance somatic mutations that generate ambiguous results or reside below NGS detection limits, thus hindering the clinical sensitivity/specificity standards of mutation calling. We applied COLD-PCR (coamplification at lower denaturation temperature PCR), a PCR methodology that selectively enriches variants, to improve the detection of unknown mutations before NGS-based amplicon resequencing.


We used both COLD-PCR and conventional PCR (for comparison) to amplify serially diluted mutation-containing cell-line DNA diluted into wild-type DNA, as well as DNA from lung adenocarcinoma and colorectal cancer samples. After amplification of TP53 (tumor protein p53), KRAS (v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog), IDH1 [isocitrate dehydrogenase 1 (NADP(+)), soluble], and EGFR (epidermal growth factor receptor) gene regions, PCR products were pooled for library preparation, bar-coded, and sequenced on the Illumina HiSeq 2000.


In agreement with recent findings, sequencing errors by conventional targeted-amplicon approaches dictated a mutation-detection limit of approximately 1%-2%. Conversely, COLD-PCR amplicons enriched mutations above the error-related noise, enabling reliable identification of mutation abundances of approximately 0.04%. Sequencing depth was not a large factor in the identification of COLD-PCR-enriched mutations. For the clinical samples, several missense mutations were not called with conventional amplicons, yet they were clearly detectable with COLD-PCR amplicons. Tumor heterogeneity for the TP53 gene was apparent.


As cancer care shifts toward personalized intervention based on each patient's unique genetic abnormalities and tumor genome, we anticipate that COLD-PCR combined with NGS will elucidate the role of mutations in tumor progression, enabling NGS-based analysis of diverse clinical specimens within clinical practice.

[PubMed - as supplied by publisher]

Friday, 23 December 2011

Linkage analysis in the next-generation sequencing era. [Hum Hered. 2011] - PubMed - NCBI
Hum Hered. 2011;72(4):228-36. Epub 2011 Dec 23.
Linkage analysis in the next-generation sequencing era.


Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, Md., USA.


Linkage analysis was developed to detect excess co-segregation of the putative alleles underlying a phenotype with the alleles at a marker locus in family data. Many different variations of this analysis and corresponding study design have been developed to detect this co-segregation. Linkage studies have been shown to have high power to detect loci that have alleles (or variants) with a large effect size, i.e. alleles that make large contributions to the risk of a disease or to the variation of a quantitative trait. However, alleles with a large effect size tend to be rare in the population. In contrast, association studies are designed to have high power to detect common alleles which tend to have a small effect size for most diseases or traits. Although genome-wide association studies have been successful in detecting many new loci with common alleles of small effect for many complex traits, these common variants often do not explain a large proportion of disease risk or variation of the trait. In the past, linkage studies were successful in detecting regions of the genome that were likely to harbor rare variants with large effect for many simple Mendelian diseases and for many complex traits. However, identifying the actual sequence variant(s) responsible for these linkage signals was challenging because of difficulties in sequencing the large regions implicated by each linkage peak. Current 'next-generation' DNA sequencing techniques have made it economically feasible to sequence all exons or the whole genomes of a reasonably large number of individuals. Studies have shown that rare variants are quite common in the general population, and it is now possible to combine these new DNA sequencing methods with linkage studies to identify rare causal variants with a large effect size. A brief review of linkage methods is presented here with examples of their relevance and usefulness for the interpretation of whole-exome and whole-genome sequence data.

Picard release 1.58

Picard release 1.58
20 December 2011

- Fix computation of coverage in
PER_TARGET_COVERAGE output file.  Fix courtesy of Matt Ducar.

- Add SVN revision to version string.

- Add COMMENT argument to FastqToSam.

biotoolbox - Tools for querying and analysis of genomic data - Google Project Hosting

A collection of various perl scripts that utilize BioPerl modules for use in bioinformatics analysis. Tools are included for processing microarray data, next generation sequencing data, data file format conversion, querying datasets, and general high level analysis of datasets.

This tool box of programs heavily relies on storing genome annotation, microarray, and next generation sequencing data in bioperl databases, allowing for data retrieval relative to any annotated feature in the database.

Also included are programs for converting and importing data from UCSC gene tables and ensEMBL, as well as a variety of other formats, into a GFF3 file that can be loaded into a bioperl database.

Who is this for?

These set of tools are designed to complement the Generic Genome Browser (GBrowse). If you view your model organism's genome annotation and microarray or next generation sequencing data using GBrowse, then these tools will assist you in fully analyzing your data.

Even if you don't use GBrowse, these programs may still be useful. Please check out the list of programs to see if it meets your needs.

What programs are available? How do I use them?

This is a list of programs.

This is an example on preparing and loading data into the database.

This is a list of supported data formats. In short, it will work with GFF, BED, wig, bigWig, bigBed, and Bam data formats and annotations. Most bioinformatic data can be represented in one or more of these formats, or at the very least converted.

This is an example of how to collect data.

This is an example on working with Next Generation Sequencing data.

What are the requirements?

These are command line Perl programs designed for modern Unix-based computers. Most of the analysis programs rely heavily on BioPerl modules, so at a minimum this should be installed. Additionally, if you want to use Bam, bigWig, bigBed, or wig data files, additional modules will need to be installed. Most (all?) of the programs should fail gracefully if the required modules are not installed. Some programs are quite minimal and may run without even BioPerl installed. A utility is provided to check for any missing or out of date modules.

Why were these programs written?

They were initially written to assist me in my own laboratory research. As they were expanded in scope, I realized they could be potentially useful to others in the same predicament as me. Thus, releasing these programs for others to use.

Ambry Genetics launches Illumina MiSeq Next-Gen Seq services

The system is generating over 2 Gb data per run with a high perce over Q30. The high data yield and superior quality a conduct a wide variety of sequencing applications in multiplexed PCR amplicon sequencing, small genom and de novo sequencing, small RNA sequencing, targ resequencing and 16S metagenomics. "The addition of numerous Illumina MiSeqs adds an sequencing for our clients," said Mr. Ardy Arianpour of business development at Ambry Genetics. "Our scientists spent the last couple months validating sequencing amazing results so we can deliver and work with mu samples that fit on the MiSeq."

Thursday, 22 December 2011 MapReduce on Python


Introduction is a Python implementation of the MapReduce distributed computing framework. is:
  • Lightweight - All of the code is contained in a single Python file (currently weighing in at <13kB) that depends only on the Python Standard Library. Any computer with Python and can be a part of your cluster.
  • Fault tolerant - Workers (clients) can join and leave the cluster at any time without affecting the entire process. (Master checkpointing coming in future versions)
  • Secure - authenticates both ends of every connection, ensuring that only authorized code is executed. (TLS support coming in future versions)
  • Open source - is distributed under the MIT License, and consequently is free for all use, including commercial, personal, and academic, and can be modified and redistributed without restriction.

Somatic Mutation Detection in Whole Genome Sequencing Data | MassGenomics

There's a new tool for SNP calling.
this post is useful in understanding the challenges of reducing false positives / background noise for the bona fide mutations from NGS data. For the wet lab, trying a variety of SNP callers and doing validation on the ones that are present in all of the SNP callers gives u a higher validation rate. But having the bona fide mutation doesn't mean you have hit on the correct mutation associated with your disease ...

Excerpted from

Filtering Out the Noise

No matter how good the mutation caller, there are going to be some false positives. This is because you're looking for a one-in-a-million event, a true somatic mutation. Raw SomaticSniper calls therefore undergo a series of Maq-inspired filters. Sites are retained if they meet these criteria:
  • Covered by at least 3 reads
  • Consensus quality of at least 20
  • Called a SNP in the tumor sample with SNP quality of at least 20
  • Maximum mapping quality of at least 40
  • No high-quality predicted indel within 10 bp
  • No more than 2 other SNVs called within 10 bp
Sites passing these criteria are subjected to two additional filters: a screen against germline variants from dbSNP (remove if matches position and allele of known non-cancer dbSNP) and an LOH filter (remove if normal is heterozygous and tumor homozygous for the same variant allele). Sites removed by the former are probably inherited variants under-sampled in the matched normal, while sites removed by the latter are likely due to large-scale structural changes (e.g. deletions) causing the loss of one allele. Finally, the filter-passed mutations are classified as high-confidence (HC) if the somatic score is at least 40 and the mapping quality is at least 40 (for BWA) or 70 (for Maq).

Frequent Sources of False Positives

Even sites that pass the filters above are vulnerable to certain sequencing and alignment artifacts that produce false positive calls. A detailed study revealed (as many in the field know already) a few common sources of false positives: strand bias, homopolymer sequences, paralogous reads (deriving from a paralogous region of the genome, but mapped to the wrong region, usually three or more substitutions), and the read position of the predicted variant. The latter type of artifact is something new; it turned out that variants only seen near the “effective” 3′ end of reads (the start of soft-trimmed bases or the actual end of the read if untrimmed) were more likely to be false positives. This may be a combination of sequencing error, which is higher at the 3′ end of reads, and alignment bias favoring mismatches over gaps near the ends of reads. In any case, false positives deriving from these common causes tend to have certain properties enabling them to be identified and removed while maintaining sensitivity for true mutations.
SomaticSniper adds to the growing arsenal of tools developed by our group to address the significant challenges presented by next-generation sequencing data analysis.

Picard Metrics Definitions - USEFUL!
  1. AlignmentSummaryMetrics: High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
  2. DuplicationMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
  3. ExtractIlluminaBarcodes.BarcodeMetric: Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.
  4. GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
  5. GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
  6. HsMetrics: The set of metrics captured that are specific to a hybrid selection analysis.
  7. InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
  8. MultilevelMetrics:
  9. RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
  10. SamFileValidator.ValidationMetrics:

Twin studies in autoimmune disease: Genetics, gender and environment. [J Autoimmun. 2011] - PubMed - NCBI
J Autoimmun. 2011 Dec 14. [Epub ahead of print]
Twin studies in autoimmune disease: Genetics, gender and environment.


Institute of Liver Studies, Liver Immunopathology, King's College London School of Medicine at King's College Hospital, Denmark Hill Campus, London SE5 9RS, UK.


Twin studies are powerful tools to discriminate whether a complex disease is due to genetic or environmental factors. High concordance rates among monozygotic (MZ) twins support genetic factors being predominantly involved, whilst low rates are suggestive of environmental factors. Twin studies have often been utilised in the study of systemic and organ specific autoimmune diseases. As an example, type I diabetes mellitus has been investigated to establish that that disease is largely affected by genetic factors, compared to rheumatoid arthritis or scleroderma, which have a weaker genetic association. However, large twin studies are scarce or virtually non-existent in other autoimmune diseases which have been limited to few sets of twins and individual case reports. In addition to the study of the genetic and environmental contributions to disease, it is likely that twin studies will also provide data in regards to the clinical course of disease, as well as risk for development in related individuals. More importantly, genome-wide association studies have thus far reported genomic variants that only account for a minority of autoimmunity cases, and cannot explain disease discordance in MZ twins. Future research is therefore encouraged not only in the analysis of twins with autoimmune disease, but also in regards to epigenetic factors or rare variants that may be discovered with next-generation sequencing. This review will examine the literature surrounding twin studies in autoimmune disease including discussions of genetics and gender.

Copyright © 2011 Elsevier Ltd. All rights reserved.

Bioinformatics tools and databases for analysis of next-generation sequence data. [Brief Funct Genomics. 2011] - PubMed - NCBI
Brief Funct Genomics. 2011 Dec 19. [Epub ahead of print]
Bioinformatics tools and databases for analysis of next-generation sequence data.


Genome sequencing has been revolutionized by next-generation technologies, which can rapidly produce vast quantities of data at relatively low cost. With data production now no longer being limited, there is a huge challenge to analyse the data flood and interpret biological meaning. Bioinformatics scientists have risen to the challenge and a large number of software tools and databases have been produced and these continue to evolve with this rapidly advancing field. Here, we outline some of the tools and databases commonly used for the analysis of next-generation sequence data with comment on their utility.

[PubMed - as supplied by publisher]

Whole genome sequencing of matched primary and metastatic acral melanomas. [Genome Res. 2011] - PubMed - NCBI

Genome Res. 2011 Dec 19. [Epub ahead of print]
Whole genome sequencing of matched primary and metastatic acral melanomas.


Next generation sequencing has enabled systematic discovery of mutational spectra in cancer samples. Here, we used whole genome sequencing to characterize somatic mutations and structural variation in a primary acral melanoma and its lymph node metastasis. Our data show that the somatic mutational rates in this acral melanoma sample pair were more comparable to the rates reported in cancer genomes not associated with mutagenic exposure than in the genome of a melanoma cell line or the transcriptome of melanoma short-term cultures. Despite the perception that acral skin is sun-protected, the dominant mutational signature in these samples is compatible with damage due to ultraviolet light exposure. A nonsense mutation in ERCC5 discovered in both the primary and metastatic tumors could also have contributed to the mutational signature through accumulation of unrepaired dipyrimidine lesions. However, evidence of transcription-coupled repair was suggested by the lower mutational rate in the transcribed regions and expressed genes. The primary and the metastasis are highly similar at the level of global gene copy number alterations, loss of heterozygosity and single nucleotide variation (SNV). Furthermore, the majority of the SNVs in the primary tumor were propagated in the metastasis and one nonsynonymous coding SNV and one splice site mutation appeared to arise de novo in the metastatic lesion.

[PubMed - as supplied by publisher]

Wednesday, 21 December 2011

Hilbertvis installation in R (no admin rights required!)

Playing around with
HilbertVis: Visualization of genomic data with the Hilbert curve

HilbertVis - Bioconductor

Trying out an idea to use Hilbert Curves as a method to visually inspect WGS mapped bams for regions of low coverage or unequal coverage across samples.

It has a standalone GUI version that requires gtk+ packages that may not be avail on all systems.

The cool thing is that it can be installed locally within your user directory with a few simple commands

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> source("")
> biocLite("HilbertVis")

Using R version 2.10.0, biocinstall version 2.5.11.
Installing Bioconductor version 2.5 packages:
[1] "HilbertVis"
Please wait...
Warning in install.packages(pkgs = pkgs, repos = repos, ...) :
  argument 'lib' is missing: using '/usr/lib64/R/library'
Warning in install.packages(pkgs = pkgs, repos = repos, ...) :
  'lib = "/usr/lib64/R/library"' is not writable
Would you like to create a personal library
to install packages into?  (y/n)

Saturday, 17 December 2011

Estimation of bacterial diversity using Next Generation Sequencing of 16S rDNA: a comparison of different workflows.. [BMC Bioinformatics. 2011] - PubMed - NCBI
BMC Bioinformatics. 2011 Dec 14;12(1):473. [Epub ahead of print]
Barriuso J, Valverde JR, Mellado RP.




Next generation sequencing (NGS) enables a more comprehensive analysis of bacterial diversity from complex environmental samples. NGS data can be analysed using a variety of workflows. We test several simple and complex workflows, including frequently used as well as recently published tools, and report on their respective accuracy and efficiency under various conditions covering different sequence lengths, number of sequences and real world experimental data from rhizobacterial populations of glyphosate-tolerant maize treated or untreated with two different herbicides representative of differential diversity studies.


Alignment and distance calculations affect OTU estimations, and multiple sequence alignment exerts a major impact on the computational time needed. Generally speaking, most of the analyses produced consistent results that may be used to assess differential diversity changes, however, dataset characteristics dictate which workflow should be preferred in each case.


When estimating bacterial diversity, ESPRIT as well as the web-based workflow, RDP pyrosequencing pipeline, produced good results in all circumstances, however, its computational requirements can make method-combination workflows more attractive, depending on sequence variability, number and length.

Transformations for the Compression of FASTQ Quality Scores of Next Generation Sequencing Data.[Bioinformatics. 2011] - PubMed - NCBI

Bioinformatics. 2011 Dec 13. [Epub ahead of print]


Department of Computational Biology, Graduate School of Frontier Science, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba-ken, 277-8561, Japan.



The growth of next generation sequencing means that more effective and efficient archiving methods are needed to store the generated data for public dissemination and in anticipation of more mature analytical methods later. This paper examines methods for compressing the quality score component of the data to partly address this problem.


We compare several compression policies for quality scores, in terms of both compression effectiveness and overall efficiency. The policies employ lossy and lossless transformations with one of several coding schemes. Experiments show that both lossy and lossless transformations are useful, and that simple coding methods, which consume less computing resources, are highly competitive, especially when random access to reads is needed.Availability and Implementation: Our C++ implementation, released under the Lesser General Public License, is available for download at

Targeted sequencing library preparation by genomic DNA circularization.




For next generation DNA sequencing, we have developed a rapid and simple approach for preparing DNA libraries of targeted DNA content. Current protocols for preparing DNA for next-generation targeted sequencing are labor-intensive, require large amounts of starting material, and are prone to artifacts that result from necessary PCR amplification of sequencing libraries. Typically, sample preparation for targeted NGS is a two-step process where (1) the desired regions are selectively captured and (2) the ends of the DNA molecules are modified to render them compatible with any given NGS sequencing platform.


In this proof-of-concept study, we present an integrated approach that combines these two separate steps into one. Our method involves circularization of a specific genomic DNA molecule that directly incorporates the necessary components for conducting sequencing in a single assay and requires only one PCR amplification step. We also show that specific regions of the genome can be targeted and sequenced without any PCR amplification.


We anticipate that these rapid targeted libraries will be useful for validation of variants and may have diagnostic application.

Thursday, 15 December 2011

Metrics for Coverage uniformity and evenness of coverage

Revisiting this issue which was mostly covered in this post using genomecoveragebed . However, chanced upon Illumina's comparison on the SBL sequencing platform version 4 with their Truseq ( I wasn't aware of an alternative name for that particular platform!)  

My question is 
Do you use similar metrics to QC between your NGS samples for possible mishandling? 

What happens when you have 100 or even 1000 of these samples? Would checking the  skewness or the kurtosis of the peaks be useful?


BMC Bioinformatics | Abstract | Estimation of bacterial diversity using Next Generation Sequencing of 16S rDNA: a comparison of different workflows.

Wednesday, 14 December 2011

Efficient de novo assembly of large genomes using compressed data structures SGA (String Graph Assembler) [software]

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error correct, assemble and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs less than 200bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.

Tuesday, 13 December 2011

Chromium Blog: Bringing improved PDF support to Google Chrome

Chromium Blog: Bringing improved PDF support to Google Chrome: Millions of web users rely on PDF files every day to consume a wide variety of text and media content. To enable this, a number of plug-ins ...

sounds good but .. trying to access ejournals on my institute's network I encountered this error message.

We have implemented a temporary block on Google Chrome's access to Libraries' e-resources for the following reason: The built-in PDF plug-in API in Chrome (which reportedly improves the browser's support for PDF: results in the same PDF document being downloaded multiple times. This method of downloading PDF files currently interfers with the method used by e-resources publishers and our library proxy violation prevention system to detect systematic and massive downloading, causing both the publishers and the library proxy to mistakenly consider it as a violation. We have encountered a number of false alarms before we activated the block.

We have since explored the possible solution(s) to this new problem. We found that the workaround to resolve this would involve significant structural system change and costs to the Library. While we continue to actively source for other more cost effective solutions and alternatives, we advise Chrome users to use other web browsers to access our e-resources in the meantime.

We seek your understanding in this matter and apologize for any inconvenience caused.

This moment reminds of the adage
'If it ain't broke, don't fix it'
BUT if u ask me, publishers should have a different channel for massive download for automated literature search and data mining. I mean ppl access your journal and leaving keyword search online in a variety of places should be good for you right? What's more if it's subscription based. Somebody paid for it already. If u wish to charge more, just implement a different download channel for data miners.
I wonder if open access journals try to limit data miners as well ... hmmm something to ask around ...

Monday, 12 December 2011

CNAnorm is a Bioconductor package to estimate Copy Number Aberrations (CNA) in cancer samples.


CNAnorm is a Bioconductor package to estimate Copy Number Aberrations (CNA) in cancer samples.

It is described in the paper:

Gusnanto, A., Wood, H.M., Pawitan, Y., Rabbitts, P. and Berri, S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next generation sequence data. 2011. Bioinformatics, epub ahead of print.

CNAnorm performs ratio, GC content correction and normalization of data obtained using very low coverage (one read every 100-10,000 bp) high throughput sequencing. It performs a "discrete" normalization looking for the ploidy of the genome. It also provides tumour content if at least two ploidy states can be found.


Get the latest version of CNAnorm and its documentation from Bioconductor. Prerequisite: you need a Fortran compiler, make and DNAcopy from Bioconductor

You can also download the perl script (version 0.3.3) to convert sam/bam files to the text files required by CNAnorm. For documentation on usage, run the script without arguments


For further information on both programs, please contact Stefano Berri

Additional data files

GC content

We provide gc1000Base.txt.gz, an example file for GC content (build GRCh37/hg19) to optionally use with It provides average GC content every 1000 bp. The size of the window in the GC content file should be at least an order of magnitude smaller than the window used for CNAnorm to minimise boundary effects. If you require higher resolution, you candowload the gc5Base tables from UCSD and/or make your own. The smaller the window size in the GC content file, the larger this will be, and the longer it will take to to process it.

LS041 bam files

We provide the bam files used to produce the dataset included in CNAnorm
LS041_tumour.bam (139 MB)
LS041_control.bam (130 MB)

To produce a text file suitable as input for CNAnorm you can enter the following

perl --gc_file gc1000Base.txt.gz LS041_tumour.bam LS041_control.bam >

It will produce this file

You need samtools installed in a directory in your $PATH if your input files are bam format

Would you use SOLID for de novo assembly?

There's an interesting 'debate' over here at recommendations for de novo assembly. 
interesting tidbits of info. 
PE 5500xl transcriptome datasets are available through solidsoftwaretools ( 
(but don't be too excited, it's mapped data only)
 The data provided is the mapped colorspace output from LifeScope? v2.0 using default parameters.  This PDF contains a high level summary of the data.   

There's some discussion on whether to go for paired ends, ECC and that sort and comments on how the kit's chemistry introduces biases that work fine for RNA seq but might hurt assembly. 

my personal experience with SOLID 4 data is that the short reads coupled with high degree of 'noise' from reads that won't map anyway if u had a reference genome. This creates a big problem for assemblers especially if you have limited ram to work with. 
new assemblers that use less ram might be helpful in  resolving SOLID assemblies. SOLID is brilliant for RNA-seq quantification as they have a nice kit that allows for strand specificity. 

 454 or Sanger would undoubtedly be helpful for de novo assembly but realistically cost is always an issue. 

Your choice still boils down to a combination of factors. Combination of data from different platforms would be the best but it might not be viable till seq costs drop. 

Why Illumina reads may have uneven coverage? - SEQanswers

Check out this forum thread about cases where uneven coverage of NGS seq. (the thread is about Illumina but I think all Platforms have their own quirks)

The consensus is that GC bias and/or enrichment PCR step might be creating platform specific biases. 

Whole genome sequencing projects like in 
Complete Khoisan and Bantu genomes from southern Africa : Article : Nature where WGS data on different platforms are available have the ability to shed some light on such issues. 

So when in doubt, (and have plenty of $ ) sequence with multiple platforms. 

How much coverage / throughput for my RNA-seq?

One of the earliest questions to bug anyone planning an RNA-seq experiment has to be the throughput (how many reads do I need?)

If you are dealing with human samples, you have the benefit of extensive publications with example coverages and some papers that test the limits of detection. All of this info is nicely summarised here in experimental design considerations in RNA-Seq.

Bashir et al. have  concluded that more than 90% of the transcripts in human samples are adequately covered with just one million  sequence reads.  Wang et al. showed that 8 million reads are sufficient to reach RNA-Seq saturation for most  samples

The ENCODE consortium also has published a Guidelines for Experiments within you can read RNA Standards v1.0 (May 2011) and also RNA-seq Best Practices (2009)

Experiments whose purpose is to evaluate the similarity between the
transcriptional profiles of two polyA+ samples may require only modest depths of
sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are
mappable to the genome or known transcriptome, Experiments whose purpose is
discovery of novel transcribed elements and strong quantification of known
transcript isoforms requires more extensive sequencing. The ability to detect
reliably low copy number transcripts/isoforms depends upon the depth of
sequencing and on a sufficiently complex library.

RNA-seq blog also covers this issue in How Many Reads are Enough? Where they cited an article on RNA-seq in chicken lungs

The analysis from the current study demonstrated that 30 M (75 bp) reads is sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes.

There are also papers that showed that RNA-seq gives reproducible results when sequenced from the same RNA-seq library which means that if coverage isn't enough, it is possible to sequence more using the same library and not have it affect your results. The real issue then becomes whether  you have planned for additional sequencing with your budget.

Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by  SpliceMap. Nucleic acids research 38, 4570-4578 (2010).

Maher, C.A., Palanisamy, N., Brenner, J.C., Cao, X., Kalyana-Sundaram, S., Luo, S., Khrebtukova, I., Barrette, T.R.,  Grasso, C., Yu, J., Lonigro, R.J., Schroth, G., Kumar-Sinha, C. & Chinnaiyan, A.M. Chimeric transcript discovery by  paired-end transcriptome sequencing. Proceedings of the National Academy of Sciences of the United States of America   106, 12353-12358 (2009).

Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: detecting structural variation and estimating  transcript abundance. BMC genomics 11, 385 (2010).

Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews 10, 57-63 (2009).

Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H. (2011) Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinformatics Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade, College Station, TX, USA. 1-2 April 2011. [article

Thursday, 8 December 2011

A reference transcriptome for the cauliflower coral – Pocillopora damicornis | RNA-Seq Blog
By identifying changes in coral gene expression that are triggered by particular environmental stressors, we can begin to characterize coral stress responses at the molecular level, which should lead to the development of more powerful diagnostic tools for evaluating the health of corals in the field.
With the goal of identifying genetic variants that are more or less resilient in the face of particular stressors, a team led by researchers at  Boston & Stanford Universities performed deep mRNA sequencing of the cauliflower coral, Pocillopora damicornis, a geographically widespread Indo-Pacific species that exhibits a great diversity of colony forms and is able to thrive in habitats subject to a wide range of human impacts. They isolated RNA from colony fragments ("nubbins") exposed to four environmental stressors (heat, desiccation, peroxide, and hypo-saline conditions) or control conditions. The RNA was pooled and sequenced using the 454 platform. Description. Both the raw reads (n = 1,116,551) and the assembled contigs (n = 70,786; mean length = 836 nucleotides) were deposited in a new publicly available relational database called PocilloporaBase (
P. damicornis now joins the handful of coral species for which extensive transcriptomic data are publicly available. Through PocilloporaBase (, one can obtain assembled contigs and raw reads and query the data according to a wide assortment of attributes including taxonomic origin, PFAM motif, KEGG pathway, and GO annotation.
  • Traylor-Knowles N et al. (2011) Production of a reference transcriptome and a transcriptomic database (PocilloporaBase) for the cauliflower coral, Pocillopora damicornis. BMC Genomics 12, 585. [article]

Datanami, Woe be me