Thursday, 29 July 2010

Palmapper and Oqtans

Oqtans looks like an interesting tool for analysing RNA-seq data, built on the Galaxy framework. Unfortunately the server is down. But it did point me to another mapping tool that I am curious to try out: PALMapper. What piqued my interest is this passage in the abstract:
"align around 7 million reads per hour on a single AMD CPU core (similar speed as TopHat [3]). Our study for C. elegans furthermore shows that PALMapper predicts introns with very high sensitivity (72%) and specificity (82%) when using the annotation as ground truth. PALMapper is considerably more accurate than TopHat (47% and 81%, respectively)."

Wednesday, 28 July 2010

Sort 1 Terabyte of data in 60 seconds?

UC San Diego computer scientists did just that, according to HPCwire.

"SAN DIEGO, July 27 -- Computer scientists from the University of California, San Diego broke "the terabyte barrier" -- and set a new world record -- when they sorted more than one terabyte of data (1,000 gigabytes or 1 million megabytes) in just 60 seconds on a computing cluster at Calit2. During this 2010 "Sort Benchmark" competition -- the "World Cup of data sorting" -- the computer scientists from the UC San Diego Jacobs School of Engineering also tied a world record for fastest data sorting rate. They sorted one trillion data records in 172 minutes – and did so using just a quarter of the computing resources of the other record holder....."

Gosh and I felt I was having trouble sorting mere gigabytes of data ...

"Sorting is also an interesting proxy for a whole bunch of other data processing problems. Generally, sorting is a great way to measure how fast you can read a lot of data off a set of disks, do some basic processing on it, shuffle it around a network and write it to another set of disks," explained Rasmussen. "Sorting puts a lot of stress on the entire input/output subsystem, from the hard drives and the networking hardware to the operating system and application software."

I would love to find out how to efficiently sort data sets that are bigger than the available RAM, and to see how those techniques can be applied to bioinformatics problems.
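For what it's worth, the textbook answer to the bigger-than-RAM question is an external merge sort: sort chunks that do fit in memory, write each one to disk as a sorted "run", then stream a k-way merge over the runs. Below is a minimal sketch of that idea, not what the UCSD team actually ran. The function name `external_sort` and the `max_lines_in_memory` parameter are my own, and it assumes one record per line, sorted as plain text.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(sorted_lines, tmp_dir):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(sorted_lines)
    return path


def external_sort(in_path, out_path, max_lines_in_memory=100_000):
    """Sort the lines of in_path into out_path, holding one chunk in RAM at a time."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        run_paths = []
        with open(in_path) as source:
            while True:
                # Phase 1: read a chunk that fits in memory and sort it.
                chunk = list(islice(source, max_lines_in_memory))
                if not chunk:
                    break
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"  # keep records separated after merging
                chunk.sort()
                run_paths.append(_write_run(chunk, tmp_dir))

        # Phase 2: k-way merge; heapq.merge reads the runs lazily,
        # so only one line per run is held in memory.
        runs = [open(path) for path in run_paths]
        try:
            with open(out_path, "w") as sink:
                sink.writelines(heapq.merge(*runs))
        finally:
            for run in runs:
                run.close()
```

For a tab-separated file of alignments or variants you would pass a `key=` function to both `sort` and `heapq.merge` so that records are ordered by, say, chromosome and position rather than as raw text.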

Do check out
They have various metrics for sorting most of which I had never thought of before!
I love the category of PennySort in particular!

Metric: Amount of data that can be sorted for a penny's worth of system time.
Originally defined in AlphaSort paper.

There's also JouleSort

Metric: Amount of energy required to sort either 10^8, 10^9 or 10^10 records (10 GB, 100 GB or 1 TB).
Originally defined in JouleSort paper.

Wednesday, 21 July 2010

Google Chrome in CentOS? You will have to wait

Rant warning:
Another crippling experience of working with CentOS.

I still can't install Chrome on CentOS 5.4, even though Google has an official Linux port, because CentOS ships an outdated package (lsb).

Others are having the same issues.

Wednesday, 14 July 2010

Shiny new tool to index NGS reads: G-SQZ

This is a long-overdue tool for those trying to do non-typical analysis with their reads.
Finally you can index and compress your NGS reads.

Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe (
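To get a feel for what "Huffman coding-based" means here, this is a toy sketch of Huffman coding applied to reads. It is not G-SQZ's actual file format or code: treating each (base, quality character) pair as one symbol is my assumption for illustration, the function names are made up, and the selective-access index mentioned in the abstract is not covered at all.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(symbols):
    """Build a Huffman code {symbol: bitstring} from observed symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; prefix their codes with 0 and 1.
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode(bases, quals, code):
    """Encode one read as a bitstring, one codeword per (base, quality) pair."""
    return "".join(code[pair] for pair in zip(bases, quals))


def decode(bits, code):
    """Decode a bitstring back into (base, quality) pairs."""
    inverse = {bits_: sym for sym, bits_ in code.items()}
    pairs, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free code: first match is the symbol
            pairs.append(inverse[current])
            current = ""
    return pairs
```

Frequent pairs (typically a base with a high quality score) get short codewords and rare pairs get long ones, which is where the saving over a fixed-width encoding comes from.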

Read the discussion thread on SEQanswers for more tips and benchmarks.

I am not affiliated with the authors, btw.

The nuts and bolts behind ABI's SAET

I really do not like using tools when I have no idea what they are trying to do.
ABI's SOLiD™ Accuracy Enhancer Tool (SAET) is one example: its documentation was extremely brief and said little beyond what the tool promised to do:

  • The SOLiD™ Accuracy Enhancer Tool (SAET) uses raw data generated by SOLiD™ Analyzer to correct miscalls within reads prior to mapping or contig assembly.  
  • Use of SAET, on various datasets of whole or sub-genomes of < 200 Mbp in size and of varying complexities, readlengths, and sequence coverages, has demonstrated improvements in mapping, SNP calling, and de novo assembly results.
  • For denovo applications, the tool reduces miscall rate substantially

I recently attended an ABI talk where someone finally explained it with a nice diagram. It is akin to SoftGenetics' condensation tool (the comparison is mine). Basically, it groups reads by similarity and, where it finds a mismatch that is not supported by high-quality reads, it corrects the low-quality read to reach a 'consensus'. I see it as a batch correction of sequencing errors, which one can typically do by eye for small regions. This correction isn't without its flaws, and I now understand why it isn't implemented on the instrument and is instead presented as a user choice. My rough experience with this tool is that it increases mapping by ~10%; how this 10% would affect your results is debatable.
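To make the idea concrete, here is a toy sketch of that kind of consensus correction as I understood it from the talk. It is not SAET's actual algorithm. It assumes the reads have already been grouped and aligned column by column, it uses base letters for readability even though SOLiD data is in color space, and the function name `correct_group` and the `min_qual` threshold are invented for the example.

```python
from collections import defaultdict


def correct_group(reads, min_qual=20):
    """Correct low-quality mismatches within one group of aligned reads.

    reads: list of (sequence, qualities) with equal lengths, where
    qualities is a list of integer quality scores, one per base.
    Returns the list of corrected sequences.
    """
    length = len(reads[0][0])
    corrected = [list(seq) for seq, _ in reads]
    for pos in range(length):
        # Only high-quality calls get a vote at this column.
        support = defaultdict(int)
        for seq, quals in reads:
            if quals[pos] >= min_qual:
                support[seq[pos]] += 1
        if not support:
            continue  # no confident evidence, leave the column alone
        consensus, votes = max(support.items(), key=lambda kv: kv[1])
        if sum(1 for v in support.values() if v == votes) > 1:
            continue  # tie between high-quality calls, do not guess
        for i, (seq, quals) in enumerate(reads):
            # Rewrite only low-quality calls that disagree with the consensus;
            # high-quality mismatches may be real variants and are kept.
            if quals[pos] < min_qual and seq[pos] != consensus:
                corrected[i][pos] = consensus
    return ["".join(bases) for bases in corrected]
```

The flaw is easy to see even in the toy version: a real low-frequency variant that happens to sit on low-quality calls gets "corrected" away, which is presumably why this is left as a user choice.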

Datanami, Woe be me