Wednesday, 29 September 2010

SeqCentral not just another company using Cloud computing for genomics

It is not every day you see a bioinformatics company featured on TechCrunch. Not by my count anyway. SeqCentral's website hasn't officially launched yet, but they did launch at TechCrunch Disrupt (check their Twitter).
SeqCentral, LLC, aims to provide and harness the high-performance computational power of the cloud as a usable, collaborative, online service to all members of the computational genomics community.

Using the cloud to do computational genomics isn't terribly new. But the collaborative aspect is. 

SeqCentral will allow scientists to compare their data to others to see if their sequencing is new or if it is “known.” The startup will bring in public data from universities, research organizations, and companies and allow you to compare your sequencing to this existing data.

And SeqCentral, which costs $99 per year for scientists, wants to help you do more than just be able to find additional data, but also aims to connect members of the genomics community, encouraging collaboration around sequencing.

Other than the cheap price, which is 'democratizing', I find the idea of asking scientists to share their new sequencing data using their platform daring!

Granted, it only speeds up the sharing process if the data is due to be in the public domain anyway (because of public research funding). I once had the crazy idea that, if clients allowed it, sequencing providers could let clients know when others are sequencing the same samples (cancer cell lines, for example), so that both Client A and Client B could save money on sequencing coverage. Ultimately, I felt that it wouldn't work out, as MTAs are such a hassle. Good luck to SeqCentral though! I am rooting for ya.

Tuesday, 28 September 2010

Commentary on Personal Genomics testing

I was reading the usual "say no to personal genetics tests" blog post.

I shall go out on a limb here and declare up front that I am ambivalent about personal genetic tests. In addition, I am wholly against unscrupulous consumerism that overpromises, for instance genetic tests for traits that cannot possibly be assessed with our current scientific knowledge, such as intelligence. For a list of good examples, you may refer to "Some of the 40 behavioral genes that are tested here." A few of the behavioural genes I believe are bona fide, but 'self detoxification' has to be a joke, I hope.

That being said, I would like to offer the flipside of the story. 
The term 'increased risk' is contentious. If you recall the Toyota recall fiasco, did you manage to get any numbers on the increased risk of driving one of the affected cars? The difference is 0.00028 percent according to this page.
Would such a small number stop you from sending your car in?
Of course not, and I am not urging you to skip the recall either.

But the point is not wholly about the increased risk, but rather your right to know the risks that you are taking. Knowledge is a double-edged sword. I recall a friend being tormented by the positive result of an early Down's Syndrome screening for her unborn child. Only the negative result from her amniotic fluid test set her mind at rest. Yet I believe no one would ask that the results of the first test be kept from a patient to prevent undue worry. And I am sure the doctor explained the test fully, but it was only natural for my friend to be worried.

I do agree with Taralyn: without a healthcare professional explaining the results of the test, the potential for abuse and fear mongering is there. However, she should expect that the consumer who chooses a genetic test is not an average consumer. He or she is likely an educated consumer, who has an idea of what the test portends or has a known family history of cancer or genetic disease and wishes to have the knowledge to better manage his/her lifestyle.

And to Taralyn, I would like to answer with a "YES" to your question of "Will I ever be able to air-mail a swab of my saliva for my genetic read-out?" Not because you should, but because you can if you so choose. The technology is here.

Tuesday, 21 September 2010

The Broad’s Approach to Genome Sequencing (part II)

Bio-IT World | Since 2001, computer scientist Toby Bloom has been the head of informatics for the production sequencing group at the Broad Institute. Her team is the one that has to cope with the “data deluge” brought about by next-generation sequencing. Kevin Davies spoke with Bloom about her team’s operation, successes and ongoing challenges.

They have to deal with 0.5-1 terabases a day! That's a lot to handle!

Whole transcriptome of a single cell using NGS

I think the holy grail of gene profiling has to be single molecule sequencing from a single cell. Imagine the number of de novo transcriptomics projects that will spring up when that becomes a reality!
Posted by Alison Leon on Aug 12, 2010 2:00:42 PM
Are you struggling with conducting gene expression analysis from limited sample amounts? Or perhaps you're trying to keep up with new developments in stem cell research. If so, you might be interested in a recent Cell Stem Cell publication discussing single-cell RNA-Seq analysis (Tang et al., Cell Stem Cell, 2010). In their research, Tang and colleagues trace the conversion of mouse embryonic stem cells from the inner cell mass (ICM) to pluripotent embryonic stem cells (ESCs), revealing molecular changes in the process. This is a follow-on to two previous papers, in which the proof of concept (Tang et al., Nature Methods, 2009) and protocols (Tang et al., Nature Protocols, 2010) for these experiments were detailed.

Using the SOLiD™ System (16-plex, 50-base reads) and whole transcriptome software tools, researchers performed whole transcriptome analysis at the single-cell level (an unprecedented resolution!) to determine gene expression levels and identify novel splice junctions. 385 genes in 74 individual cells were monitored during their transition from ICM to ESC. Validation with TaqMan® assays corroborated this method’s high sensitivity and reproducibility.

According to Tang et al., this research could form the basis for future studies involving regulation and differentiation of stem cells in adults. In addition, further knowledge about developing stem cells could lead to information about how disease tissues, including cancers, develop.

Simple Copy Number Determination with Reference Query Pyrosequencing (RQPS)

Zhenyi Liu1,3, Daniel L. Schneider1, Kerry Kornfeld1, and Raphael Kopan1,2,3
1 Department of Developmental Biology, School of Medicine, Washington University, St. Louis, Missouri 63110, USA
2 Division of Dermatology, Department of Medicine, School of Medicine, Washington University, St. Louis, Missouri 63110, USA

The accurate measurement of the copy number (CN) for an allele is often desired. We have developed a simple pyrosequencing-based method, reference query pyrosequencing (RQPS), to determine the CN of any allele in any genome by taking advantage of the fact that pyrosequencing can accurately measure the molar ratio of DNA fragments in a mixture that differ by a single nucleotide. The method involves the preparation of an RQPS probe, which contains two linked DNA fragments that match a reference allele with a known CN and a query allele with an unknown CN. 
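
To make the ratio argument concrete, here is a minimal sketch of how a copy number could fall out of two pyrosequencing readouts. The formula is my own back-of-envelope algebra, assuming the probe carries the reference and query fragments at exactly 1:1 and that pyrosequencing reports the fraction of signal coming from the probe at each locus. It is not taken from the paper, and the function and parameter names are hypothetical.

```python
def query_copy_number(probe_frac_ref, probe_frac_query, ref_copy_number):
    """Estimate the copy number of the query allele.

    probe_frac_ref / probe_frac_query: fraction (0..1) of the signal at the
    reference / query locus that comes from the probe rather than the genome.
    ref_copy_number: known copy number of the reference allele.
    Assumption: reference and query fragments are linked 1:1 in the probe.
    """
    for frac in (probe_frac_ref, probe_frac_query):
        if not 0.0 < frac < 1.0:
            raise ValueError("probe fractions must lie strictly between 0 and 1")
    # Genome-to-probe molar ratio observed at each locus
    ref_ratio = (1.0 - probe_frac_ref) / probe_frac_ref
    query_ratio = (1.0 - probe_frac_query) / probe_frac_query
    # The (unknown) amount of probe cancels because both fragments are 1:1
    return ref_copy_number * query_ratio / ref_ratio


# Probe is 1/3 of the signal at a 2-copy reference and 1/2 at the query
estimate = query_copy_number(1.0 / 3.0, 0.5, 2)
print(round(estimate, 2))
```

The point of linking the two fragments is visible in the last line of the function: the amount of probe added never has to be known.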

Monday, 20 September 2010

Pysam - python module for reading and manipulating SAM files

Pysam is a Python module for reading and manipulating SAM files. It's a lightweight wrapper of the samtools C-API.

Looks promising! But the build failed because I was missing a module on CentOS 5.4.

python build
Traceback (most recent call last):
  File "", line 9, in ?
    import os, sys, glob, shutil, hashlib
ImportError: No module named hashlib
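
hashlib was only added to the standard library in Python 2.5, and CentOS 5.x ships Python 2.4 as its system interpreter, which would explain the ImportError. Below is a minimal sketch of the usual compatibility shim, assuming all that is needed is an MD5 constructor. The name md5_new is made up for illustration, and this is not pysam's own code.

```python
try:
    # Python 2.5 and later (including Python 3)
    import hashlib
    md5_new = hashlib.md5
except ImportError:
    # Python 2.4 and earlier: fall back to the old standalone md5 module
    import md5
    md5_new = md5.new

# .encode() keeps the snippet valid on both old and new interpreters
digest = md5_new("ACGT".encode("ascii")).hexdigest()
print(digest)
```

A shim like this only helps if you control the failing import; for someone else's build script, running it under a newer Python installed alongside the system one is probably the simpler route.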

To be continued....

Friday, 17 September 2010

The Broad’s Approach to Genome Sequencing (part I)

In Part I of a two-part exclusive interview, Nicol and Nusbaum share some of the keys to the Broad’s sequencing success with Bio-IT World chief editor Kevin Davies. 

NUSBAUM: The cost of storage is coming down very slowly [compared to sequencing costs]. It’s not very hard to foresee a time when storage is half the [total] cost [of sequencing].
NICOL: Or we store it as DNA – and resequence it!
NUSBAUM: It’s been a couple of years since we saved the primary [raw image] data. It is cheaper to redo the sequence and pull it out of the freezer. There are 5,000 tubes in a freezer. Storing a tube isn’t very expensive. Storing 1 Terabyte of data that comes out of that tube costs half as much as the freezer! People [like Ewan Birney at EBI] are working on very elaborate algorithms for storing data, because you can’t compress bases any more than nature already has. The new paradigm is, the bases are here, only indicate the places where the bases are different . . . In 2-3 years, you’ll wonder about even storing the bases. And forget about quality scores.
The cost of DNA sequencing might not matter in a few years. People are saying they’ll be able to sequence the human genome for $100 or less. That’s lovely, but it still could cost you $2,500 to store the data, so the cost of storage ultimately becomes the limiting factor, not the cost of sequencing. We can quibble about the dollars and cents, but you can’t argue about the trends at all.
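
The "only indicate the places where the bases are different" idea can be sketched in a few lines. This is a toy illustration that assumes the sample and the reference have the same length and differ only by substitutions (no indels). The function names are hypothetical, and this is not the scheme used at the Broad or EBI.

```python
def encode_against_reference(reference, sample):
    """Return the (position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("toy encoder only handles equal-length sequences")
    return [
        (pos, base)
        for pos, (ref_base, base) in enumerate(zip(reference, sample))
        if ref_base != base
    ]


def decode_from_reference(reference, diffs):
    """Rebuild the sample by applying the stored substitutions to the reference."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = encode_against_reference(reference, sample)
print(diffs)  # two substitutions stored instead of ten bases
assert decode_from_reference(reference, diffs) == sample
```

The saving grows with how similar the sample is to the reference, which is why the approach is attractive for resequenced human genomes.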

My comments:
It seems terribly weird that DNA is more 'efficient' than HDD (or archival tapes) for storing data. Granted, cost-wise it might be cheaper to store DNA, but they neglected to factor in the time required to find that tube and process it, the cost of implementing a LIMS so that you CAN find that tube, and the man-hours needed to sequence the DNA again.
Looking forward to Part II of the interview!
What are your thoughts?

Wednesday, 15 September 2010

Myrna-calculate differential gene expression on Elastic MapReduce or local Hadoop

The software, termed “Myrna”, was funded in part by Amazon Web Services (in addition to the Bloomberg School of Public Health and the National Institutes of Health) and, not surprisingly, makes use of compute resources from Amazon. In order to test Myrna, researchers rented time and storage resources from AWS and were able to realize solid performance and cost savings. According to the study's authors, “Myrna calculated differential expression from 1.1 billion RNA sequences reads in less than two hours at a cost of about $66.”

Myrna is a cloud computing tool for calculating differential gene expression in large RNA-seq datasets. Myrna uses Bowtie for short read alignment and R/Bioconductor for interval calculations, normalization, and statistical testing. These tools are combined in an automatic, parallel pipeline that runs in the cloud (Elastic MapReduce in this case), on a local Hadoop cluster, or on a single computer, exploiting multiple computers and CPUs wherever possible.
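
To get a feel for why this kind of counting parallelizes so well, here is a toy map/reduce sketch in plain Python. It is not Myrna's code: the gene intervals, alignment positions, and function names are all invented for illustration, and a real pipeline adds alignment, normalization, and statistical testing on top.

```python
from collections import defaultdict

# Hypothetical gene intervals: name -> (chromosome, start, end), half-open
GENES = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 300, 450),
}


def map_alignment(chrom, pos):
    """Map step: emit (gene, 1) for every gene interval containing the read."""
    for gene, (g_chrom, start, end) in GENES.items():
        if chrom == g_chrom and start <= pos < end:
            yield gene, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per gene."""
    counts = defaultdict(int)
    for gene, n in pairs:
        counts[gene] += n
    return dict(counts)


# Made-up aligned read positions; each one can be mapped independently
alignments = [("chr1", 120), ("chr1", 150), ("chr1", 310), ("chr2", 120), ("chr1", 250)]
pairs = [pair for chrom, pos in alignments for pair in map_alignment(chrom, pos)]
counts = reduce_counts(pairs)
print(counts)
```

Because every read is handled independently in the map step, adding machines scales the work almost linearly. At the quoted figures, $66 for 1.1 billion reads works out to about 6 cents per million reads.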

also see

Cloud computing method greatly increases gene analysis

Life Tech Single Molecule Sequencer can seq up to 250 Kbases!

Just saw this article in GenomeWeb.
250 kilobases is really amazing! The current size limit is about 500 kilobases, though.
As a product, if they can do it cheaply, it's a godsend for de novo transcriptomics (but it might be overkill now that I see what it can do).

Primed DNA templates tethered to a slide are illuminated with a 405-nanometer laser, and as polymerases synthesize DNA at a rate of between two and 10 nucleotides per second, signals are recorded inside the evanescent field of a total internal reflection fluorescence microscope.
Life Tech's technology faces the same challenge as Pacific Biosciences' in that the laser inactivates the polymerase after a certain time — in Life Tech's case after synthesizing about a kilobase of DNA. But unlike PacBio, Life Tech does not immobilize the polymerase, so it can wash off "dead" enzyme and replace it with a new batch of polymerase that picks up where the old one left off..... Beechem, who said he was not at liberty to discuss how the DNA is made to lie down on the slide, showed that this "top-down" sequencing approach has allowed him and his colleagues to analyze DNA as long as 250 kilobases with about 30 polymerases generating sequence information at the same time. The result is a large number of paired-end reads on the same DNA molecule.....
He stressed, however, that the method is currently not close to becoming a product.
While the size limit is currently about 500 kilobases of DNA, the ultimate goal, he said, is to analyze entire chromosomes this way. Researchers could, for example, study DNA from a tumor and its normal control at the same time, or focus in on specific regions of a chromosome....

NHGRI funds development of third generation DNA sequencing technologies

More than $18 million in grants to spur the development of a third generation of DNA sequencing technologies was announced today by the National Human Genome Research Institute (NHGRI). The new technologies will sequence a person's DNA quickly and cost-effectively so it routinely can be used by biomedical researchers and health care workers to improve the prevention, diagnosis and treatment of human disease.
"NHGRI and its grantees have made significant progress toward the goal of developing DNA sequencing technologies to sequence a human genome for $1,000 or less," said Eric D. Green, M.D, Ph.D., director of NHGRI, one of the National Institutes of Health. "However, we must continue to support and encourage innovative approaches that hold the most promise for advancing our knowledge of human health and disease." 

$1,000 Genome Grants
NHGRI's Revolutionary Genome Sequencing Technologies grants have as their goal the development of breakthrough technologies that will enable a human-sized genome to be sequenced for $1,000 or less. Grant recipients and their approximate funding are:
Adam Abate, Ph.D., GnuBIO Inc., New Haven, Conn.
$240,000 (1 year)
Microfluidic DNA Sequencing

Jeremy S. Edwards, Ph.D., University of New Mexico Health Sciences Center, Albuquerque
$2.7 million (3 years)
Polony Sequencing and the $1000 Genome

Javier A. Farinas, Ph.D., Caerus Molecular Diagnostics Inc., Los Altos, Calif.
$500,000 (2 years)
Millikan Sequencing by Label-Free Detection of Nucleotide Incorporation

M. Reza Ghadiri, Ph.D., Scripps Research Institute, La Jolla, Calif.
$5.1 million (4 years)
Single-Molecule DNA Sequencing with Engineered Nanopores

Steven J. Gordon, Ph.D., Intelligent Bio-Systems Inc., Waltham, Mass.
$2.6 million (2 years)
Ordered Arrays for Advanced Sequencing Systems

Xiaohua Huang, Ph.D., University of California San Diego
$800,000 (2 years)
Direct Real-Time Single Molecule DNA Sequencing

Stuart Lindsay, Ph.D., Arizona State University, Tempe
$860,000 (3 years)
Tunnel Junction for Reading All Four DNA Bases with High Discrimination

Amit Meller, Ph.D., Boston University
$4.1 million (4 years)
Single Molecule Sequencing by Nanopore-Induced Photon Emission

Murugappan Muthukumar, Ph.D., University of Massachusetts, Amherst
$800,000 (3 years)
Modeling Macromolecular Transport for Sequencing Technologies

Dean Toste, Ph.D., University of California, Berkeley
$430,000 (2 years)
Base-Selective Heavy Atom Labels for Electron Microscopy-Based DNA Sequencing

To read the grant abstracts and for more details about the NHGRI genome technology program, go to:

Should I remove duplicates in my NGS data?

Interesting posts here about PCR duplicates in NGS data

links compiled by malachig

Redundant reads are removed from ChIP-seq, what about RNA-seq
Duplicate reads removal

Threshold for duplicate removal

Heavy read stacking on the solexa platform
Should identical reads be reduced to a single count?

Removing duplicate reads from multigig .csfasta

Source of duplicate reads and possible ways of reducing

Wednesday, 8 September 2010

NIH Awards $42M for Human Microbiome Project

excerpted from 
NEW YORK (GenomeWeb News) – The National Institutes of Health today announced it has awarded about $42 million in new funds in connection with the Human Microbiome Project.
The new funding seeks to expand the scope of eight demonstration projects to link changes in the human microbiome to health and disease, and to support the development of new technologies for the identification and characterization of microbial communities in the human microbiome.
The list of award winners can be found here.
The $157 million, five-year Human Microbiome Project was launched in 2008 and a year later, 15 one-year disease demonstration projects were funded to study the microbiomes of healthy volunteers and those with specific diseases at body sites thought to have a microbiome association.
In a statement, Anthony Fauci, director of the National Institute of Allergy and Infectious Diseases and co-chair of the Human Microbiome Project's Implementation Group, said the additional funding announced today is for those studies "that hold the most promise for improving our understanding of how human health and disease are influenced by the human microbiome."
Criteria used to evaluate which initial projects would be expanded included the potential of each study to achieve the goals of the disease demonstration project; clinical relevance; and scientific merit, NIH said.

Monday, 6 September 2010

Evaluation of next generation sequencing platforms for population targeted sequencing studies

I came across this paper earlier but didn't have time to blog much about it.
Papers that compare the sequencing platforms are getting rarer as the hype for NGS dies down and people become more interested in the next-next-gen seq machines (usually termed single molecule seq).
Targeted reseq is a popular use of NGS, as the price of human whole genome reseq is still not within reach for most. (See Exome sequencing: the sweet spot before whole genomes.)

There are inherent biases that people should be aware of before they jump right into it.

1) The NGS technologies generate a large amount of sequence but, for the platforms that produce short-sequence reads, greater than half of this sequence is not usable.
  • On average, 55% of the Illumina GA reads pass quality filters, of which approximately 77% align to the reference sequence.
  • For ABI SOLiD, approximately 35% of the reads pass quality filters, and subsequently 96% of the filtered reads align to the reference sequence.
  • In contrast to the platforms generating short-read lengths, approximately 95% of the Roche 454 reads uniquely align to the target sequence.
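
A quick sanity check of the "greater than half is not usable" claim, using only the percentages quoted above. It assumes the usable fraction is simply the pass-filter rate multiplied by the alignment rate. Roche 454 is left out because only its alignment rate is given here.

```python
# (fraction passing quality filters, fraction of those that align)
SHORT_READ_PLATFORMS = {
    "Illumina GA": (0.55, 0.77),
    "ABI SOLiD": (0.35, 0.96),
}


def usable_fraction(pass_filter, aligned):
    """Fraction of all raw reads that both pass filters and align."""
    return pass_filter * aligned


for name, (pass_filter, aligned) in SHORT_READ_PLATFORMS.items():
    frac = usable_fraction(pass_filter, aligned)
    print("%s: %.0f%% of raw reads usable" % (name, 100 * frac))
```

That comes to roughly 42% for Illumina GA and 34% for ABI SOLiD, so well over half of the raw short reads are indeed lost.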

Admittedly, these numbers have changed now that Illumina has longer read lengths (the paper tested 36 bp vs 35 bp reads).

2) For PCR-based targeted sequencing, they observed that the mapped sequences corresponding to the 50 bp at the ends and the overlapping intervals of the amplicons have extremely high coverage.
  • These regions, representing about 2.3% (approximately 6 kb) of the targeted intervals, account for up to 56% of the sequenced base pairs for Illumina GA technology.
  • For the ABI SOLiD platform an amplicon end depletion protocol was employed to remove the overrepresented amplicon ends; this was partially successful and resulted in the ends accounting for up to 11% of the sequenced base pairs.  
  • For the Roche 454 technology, overrepresentation of amplicon ends versus internal bases is substantially less, with the ends composing only 5% of the total sequenced bases; this is likely due to library preparation process differences between Roche 454 and the short-read length platforms.
The overrepresentation of amplicon end sequences is not only wasteful for the sequencing yield but also decreases the expected average coverage depth across the targeted intervals. Therefore, to accurately assess the consequences of sequence coverage on data quality, we removed the 50 bp at the ends of the amplicons from subsequent analyses. 
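
To put numbers on how lopsided this is, here is a rough calculation from the figures quoted above. It assumes the 2.3% share of the target applies to all three platforms (the excerpt only states it once), and the function names are mine, not the paper's.

```python
END_SHARE_OF_TARGET = 0.023  # amplicon ends as a fraction of the targeted intervals

# Fraction of sequenced bases that fall on amplicon ends, per platform
END_SHARE_OF_BASES = {
    "Illumina GA": 0.56,
    "ABI SOLiD": 0.11,
    "Roche 454": 0.05,
}


def end_enrichment(share_of_bases, share_of_target=END_SHARE_OF_TARGET):
    """How many times deeper the ends are covered than under uniform coverage."""
    return share_of_bases / share_of_target


def interior_depth_ratio(share_of_bases, share_of_target=END_SHARE_OF_TARGET):
    """Depth left for the non-end bases, relative to uniform coverage."""
    return (1.0 - share_of_bases) / (1.0 - share_of_target)


for name, share in END_SHARE_OF_BASES.items():
    print("%s: ends %.1fx enriched, interior at %.0f%% of uniform depth"
          % (name, end_enrichment(share), 100 * interior_depth_ratio(share)))
```

For Illumina GA that is roughly a 24-fold pile-up on the ends, leaving the interior at under half of the depth you would expect from the raw yield.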

I am not sure if this has changed since.

Note: Will update thoughts when I have more time.

Other Interesting papers
WGS vs exome seq
Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations.
Identification by whole-genome resequencing of gene defect responsible for severe hypercholesterolemia.

Exome sequencing: the sweet spot before whole genomes.
Whole human exome capture for high-throughput sequencing.
Screening the human exome: a comparison of whole genome and whole transcriptome sequencing.
Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing. 

Family-based analysis and exome seq
Molecular basis of a linkage peak: exome sequencing and family-based analysis identify a rare genetic variant in the ADIPOQ gene in the IRAS Family Study.

Friday, 3 September 2010

SeqAlign, an FPGA-based accelerator for the two common DNA Sequence alignment algorithms: Smith-Waterman (for local alignments) and Needleman-Wunsch (for global alignments).

Using an FPGA for seq align is a new concept to me: making a board do the alignment of reads!
Now most computers only have one processor, so we can’t really take advantage of this. Fortunately, wonderful little devices called Field-Programmable Gate Arrays exist (see my Non-Von1 for another fun example), that let you make your own computer circuitry. Imagine if we could build an array of processors, connected in a line, where each processor holds one “nucleotide” from the top sequence. We could then “feed” in the sequence from one side, and calculate the entire matrix just like in the picture above! This is called a “systolic array“, and is an awesome way to compute a matrix!
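
The reason a line of processors can fill the matrix is that every cell on the same anti-diagonal depends only on cells from the two previous anti-diagonals. Here is a minimal software sketch of that ordering for Smith-Waterman. It only simulates the wavefront sequentially, the scoring values are arbitrary assumptions, and it is not the SeqAlign project's implementation.

```python
def smith_waterman_wavefront(top, side, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix one anti-diagonal at a time.

    In a systolic array all cells with the same i + j would be computed in
    the same clock step; here the inner loop visits them one by one.
    """
    rows, cols = len(side), len(top)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    for d in range(2, rows + cols + 1):          # d = i + j, one wavefront step
        for i in range(max(1, d - cols), min(rows, d - 1) + 1):
            j = d - i
            score = match if side[i - 1] == top[j - 1] else mismatch
            H[i][j] = max(
                0,                        # local alignment can restart anywhere
                H[i - 1][j - 1] + score,  # diagonal: match or mismatch
                H[i - 1][j] + gap,        # gap in the top sequence
                H[i][j - 1] + gap,        # gap in the side sequence
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_wavefront("TTACGTTT", "ACG"))
```

With one processor per column, the whole matrix takes rows + cols - 1 steps instead of rows * cols, which is where the speed-up comes from.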

This is the dev kit, which I found on Amazon (I had no idea Amazon sold such stuff).

It does have some limitations in its current guise:
The USB interface to this board does indeed suck. It takes less than 5 microseconds to compute the whole matrix, but initiating a driver call from windows to send a byte to the device takes a whopping 370 microseconds!! Oy! Fortunately, there are tricks you can play! Thanks to the use of FIFOs, buffering and the creative application of backpressuring, I was able to sustain about 1.8 GCUPS on this bad boy! 

Check out: The SeqAlign project


Datanami, Woe be me