Sunday, 30 December 2012

Period of ineptitude now: INSIDE THE BOX: Solutions for Data Sharing in Life Sciences - Bio-IT World

Coming from a wet-lab background, I am amused that there's a consensus that current journal publications do not enable the sharing of data in an easily accessible manner.
At the dawn of molecular biology, when cloning a single gene warranted a publication in JBC, sharing was, well, rare. :-)
Apart from a handful of exceptions, I think few labs will readily share their cloned transcripts, cell lines or antibodies.
Part of the reason is of course logistics, which may include lengthy MTA discussions with the university, or customs problems when shipping between countries.
There's of course the selfish (gene) hypothesis: if you have ongoing research, it's natural not to want another lab to get a head start over your postgrad student, who has been slogging to get his/her publication out.

But I think that it's great that we are all moving towards an era of open science, both in publishing in readily accessible journals and openly sharing data and looking for open collaborations.

It's quite awkward to have competition in publicly funded science, as it seems to suggest we have run out of interesting questions to ask of the world.
If someone sees a potential in my data that I overlooked, I will be most happy if something comes out of that same data. Because sharing is caring :-)

Tuesday, 25 December 2012

Exome array analysis identifies new loci and low-frequency variants influencing insulin processing and secretion.

 2012 Dec 23. doi: 10.1038/ng.2507. [Epub ahead of print]

Exome array analysis identifies new loci and low-frequency variants influencing insulin processing and secretion.


Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.


Insulin secretion has a crucial role in glucose homeostasis, and failure to secrete sufficient insulin is a hallmark of type 2 diabetes. Genome-wide association studies (GWAS) have identified loci contributing to insulin processing and secretion; however, a substantial fraction of the genetic contribution remains undefined. To examine low-frequency (minor allele frequency (MAF) 0.5-5%) and rare (MAF < 0.5%) nonsynonymous variants, we analyzed exome array data in 8,229 nondiabetic Finnish males using the Illumina HumanExome Beadchip. We identified low-frequency coding variants associated with fasting proinsulin concentrations at the SGSM2 and MADD GWAS loci and three new genes with low-frequency variants associated with fasting proinsulin or insulinogenic index: TBC1D30, KANK1 and PAM. We also show that the interpretation of single-variant and gene-based tests needs to consider the effects of noncoding SNPs both nearby and megabases away. This study demonstrates that exome array genotyping is a valuable approach to identify low-frequency variants that contribute to complex traits.

Sunday, 23 December 2012

Top Scientific Discoveries of 2012 | Wired Science |

Seems odd that rare variants are listed there as a top discovery.

New AWS high storage instance

The High Storage Eight Extra Large (hs1.8xlarge) instances are a great fit for applications that require high storage depth and high sequential I/O performance. Each instance includes 117 GiB of RAM, 16 virtual cores (providing 35 ECU of compute performance), and 48 TB of instance storage across 24 hard disk drives capable of delivering up to 2.4 GB per second of I/O performance.
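A rough way to sanity-check sequential write throughput on any instance is a plain dd run (the file name and size here are arbitrary; this is a crude single-stream number, not a proper benchmark of a 24-drive array):

```shell
# Write 64 MiB sequentially; conv=fdatasync forces the data to disk
# so the reported rate is not just page-cache speed (GNU dd, Linux).
dd if=/dev/zero of=ddtest.bin bs=1M count=64 conv=fdatasync 2> dd.log
cat dd.log   # dd prints its bytes-copied and throughput summary on stderr
```

Scale the count up well past RAM size for a less cache-flattered figure.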

Genome Biology | Abstract | Ray Meta: scalable de novo metagenome assembly and profiling

Abstract (provisional)
Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights on specific environments. Ray Meta is open source and available at

Saturday, 22 December 2012

Article: Gattaca Alert? Or Should We Welcome the New Age of Eugenics?

Sent via Flipboard

The Evolution of Cavities – Phenomena: The Loom

Interesting article on S. mutans, which causes cavities.
I wonder what the 148 genes unique to the human microbiome are that were picked up.

I doubt that removing S. mutans would be an entirely good idea though; it would create a void for another species to exploit, and who knows what diseases worse than cavities might prevail.

Tuesday, 18 December 2012

Bowtie 2 2.0.4 released: Fixed issue whereby --un, --al, --un-conc and --al-conc options would incorrectly suppress SAM output.

---------- Forwarded message ----------
From: Langmead, Ben <>
Date: Tue, Dec 18, 2012 at 10:29 AM

Please make the switch from 2.0.3, as the first issue listed below is major.

Bowtie 2 version 2.0.4  - December 17, 2012
   * Fixed issue whereby --un, --al, --un-conc and --al-conc options would
     incorrectly suppress SAM output.
   * Fixed minor command-line parsing issue in wrapper script.
   * Fixed issue on Windows where wrapper script would fail to find
     bowtie2-align.exe binary.
   * Updated some of the index-building scripts and documentation.
   * Updated author's contact info in usage message.



Ben Langmead
Department of Computer Science
Johns Hopkins University
3400 North Charles St
Baltimore, MD 21218-2682

Monday, 17 December 2012

Fwd: [Bowtie-bio-announce] Bowtie 0.12.9 released

From: Langmead, Ben
Date: Monday, December 17, 2012

Bowtie version 0.12.9 - December 16, 2012
   * Fixed a bug whereby read names would not be truncated at first
     whitespace character in unmapped or maxed-out SAM records.
   * Fixed errors and warnings when compiling with clang++.
   * Fixed most errors and warnings when compiling with recent versions
     of g++, though you may need to add EXTRA_FLAGS=-Wno-enum-compare
     to avoid all warnings.



Ben Langmead
Department of Computer Science
Johns Hopkins University
3400 North Charles St
Baltimore, MD 21218-2682

Bowtie-bio-announce mailing list

Sent from Gmail Mobile

Thursday, 13 December 2012

Effects of OTU Clustering and PCR Artifacts on M... [Microb Ecol. 2012] - PubMed - NCBI

Next-generation sequencing has increased the coverage of microbial diversity surveys by orders of magnitude, but differentiating artifacts from rare environmental sequences remains a challenge. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) organizes sequence data into groups of 97 % identity, helping to reduce data volumes and avoid analyzing sequencing artifacts by grouping them with real sequences. Here, we analyze sequence abundance distributions across environmental samples and show that 16S rRNA sequences of >99 % identity can represent functionally distinct microorganisms, rendering OTU clustering problematic when the goal is an accurate analysis of organism distribution. Strict postsequencing quality control (QC) filters eliminated the most prevalent artifacts without clustering. Further experiments proved that DNA polymerase errors in polymerase chain reaction (PCR) generate a significant number of substitution errors, most of which pass QC filters. Based on our findings, we recommend minimizing the number of PCR cycles in DNA library preparation and applying strict postsequencing QC filters to reduce the most prevalent artifacts while maintaining a high level of accuracy in diversity estimates. We further recommend correlating rare and abundant sequences across environmental samples, rather than clustering into OTUs, to identify remaining sequence artifacts without losing the resolution afforded by high-throughput sequencing
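The last recommendation (correlating sequence abundances across samples instead of clustering into OTUs) can be sketched with a tiny Pearson correlation over per-sample counts. The numbers below are hypothetical and this is only an illustration of the idea, not the authors' pipeline:

```shell
# Per-sample abundance counts for two sequences (columns: seqA seqB).
# A high correlation across samples suggests seqB tracks seqA, as a
# sequencing artifact of an abundant sequence would.
cat > counts.txt <<'EOF'
10 12
50 55
3 2
80 77
20 25
EOF
awk '{ n++; sx+=$1; sy+=$2; sxx+=$1*$1; syy+=$2*$2; sxy+=$1*$2 }
     END { r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx)*(n*syy - sy*sy));
           printf "r = %.3f\n", r }' counts.txt
# → r = 0.994
```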

Wednesday, 12 December 2012

Article: Cross-biome metagenomic analyses of soil microbial communities and their functional attributes

Cross-biome metagenomic analyses of soil microbial communities and their functional attributes

Sent via Flipboard

Sent from my phone

Article: AJHG - Improved Heritability Estimation from Genome-wide SNPs

AJHG - Improved Heritability Estimation from Genome-wide SNPs

Estimation of narrow-sense heritability, h2, from genome-wide SNPs genotyped in unrelated individuals has recently attracted interest and offers several advantages over traditional pedigree-based methods. With the use of this approach, it has been estimated that over half the heritability of human height can be attributed to the ∼300,000 SNPs on a genome-wide genotyping array. In comparison, only 5%–10% can be explained by SNPs reaching genome-wide significance. We investigated via simulation the validity of several key assumptions underpinning the mixed-model analysis used in SNP-based h2 estimation. Although we found that the method is reasonably robust to violations of four key assumptions, it can be highly sensitive to uneven linkage disequilibrium (LD) between SNPs: contributions to h2 are overestimated from causal variants in regions of high LD and are underestimated in regions of low LD. The overall direction of the bias can be up or down depending on the genetic architecture of the trait, but it can be substantial in realistic scenarios. We propose a modified kinship matrix in which SNPs are weighted according to local LD. We show that this correction greatly reduces the bias and increases the precision of h2 estimates. We demonstrate the impact of our method on the first seven diseases studied by the Wellcome Trust Case Control Consortium. Our LD adjustment revises downward the h2 estimate for immune-related diseases, as expected because of high LD in the major-histocompatibility region, but increases it for some nonimmune diseases. To calculate our revised kinship matrix, we developed LDAK, software for computing LD-adjusted kinships.

Sent via Flipboard

Sent from my phone

Article: A high-performance computing toolset for relatedness and principal component analysis of SNP data

A high-performance computing toolset for relatedness and principal component analysis of SNP data

Summary: Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ~8–50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30–300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the 'Gene-Environment Association Studies' consortium studies.

Availability and implementation:gdsfmt and SNPRelate are available from R CRAN (, including a vignette. A tutorial can be found at

Sent via Flipboard

Sent from my phone

Article: Tools for mapping high-throughput sequencing data

Tools for mapping high-throughput sequencing data

Motivation: A ubiquitous and fundamental step in high-throughput sequencing analysis is the alignment (mapping) of the generated reads to a reference sequence. To accomplish this task, numerous software tools have been proposed. Determining the mappers that are most suitable for a specific application is not trivial.

Results: This survey focuses on classifying mappers through a wide number of characteristics. The goal is to allow practitioners to compare the mappers more easily and find those that are most suitable for their specific problem.

Availability: A regularly updated compendium of mappers can be found at

Sent via Flipboard

Sent from my phone

Article: RetroSeq: Transposable element discovery from Illumina paired-end sequencing data

RetroSeq: Transposable element discovery from Illumina paired-end sequencing data

Sent via Flipboard

Sent from my phone

U.K. Unveils Plan to Sequence Whole Genomes of 100,000 Patients - ScienceInsider

100 million pounds for 100,000 WGS samples. 

Hmm, how much goes into the bioinformatics and IT infrastructure?

I suddenly recall British Telecom's foray into a "Siri" for medical genomics, and how everything ties in together now.

With genetic technology advancing quickly, the prime minister of the United Kingdom announced today an ambitious plan to fully sequence the genomes of 100,000 Britons with cancer and rare diseases. Although many countries are touting their efforts to decode their citizens' DNA in the name of treating and curing disease, the new project is unusual because it will decode entire genomes, not just parts of them.

Prime Minister David Cameron said in a statement that the government's National Health Service (NHS) has earmarked £100 million, or about $160 million, to the effort. The money is part of £600 million ($965 million) announced last week for research in the coming years. The sequencing is expected to take 3 to 5 years.

Sent from my phone

Tuesday, 11 December 2012

Fwd: [bedtools-discuss] pybedtools version 0.6.2

---------- Forwarded message ----------
From: "Ryan Dale"

> Hi all,
> On the heels of BEDTools 2.17 release, I've just released the corresponding pybedtools v0.6.2.
> pybedtools v0.6.2 wraps the tools new in BEDTools 2.17, and adds lots of bugfixes and features of its own.
> The complete documentation is at, and the specific changes and new features in this version are listed at
> As always, comments, suggestions, and bug reports are all welcome.
> -ryan

Nature editors - "a cross between rock-star goddess and Darth Vader": The Nature of the Knight Bus | Story Collider Magazine

" I may look like a standard soccer mom now, kids, but back in the day people regarded me as a cross between a rock-star goddess and Darth Vader. That's right, I was indeed an editor for Nature for nearly seven years, handling papers in genetics and genomics. There is a generally accepted hierarchy in science journals, and Nature is always at or near the top. The team of biology editors at Nature can only publish between about 8 and 10 percent of the submissions they receive, so most of your job as a manuscript editor is not publication of work but instead dealing out rejection. "

my hat's off to Michael Eisen

odd way to promote Fedora

Hmm, I wouldn't have done a Facebook ad with these exact words to describe Fedora:

 Fedora is a Linux OS, a collection of software to run on your computer.
Join us today.
Like · 1,299 people like this.

even a random quote like
"Fedora has [...] released an amazingly rock-solid operating system." 
− Jack Wallen,

would have enticed me to click Like if I didn't know Linux.

Monday, 10 December 2012

Gabe Rudy "GATK is a Research Tool. Clinics Beware." | Our 2 SNPs…(R)

Gabe points out in great detail a bug he found in GATK's variant caller, which has been widely regarded as a reliable SNP caller.

I think in general the 'unreliable' nature of next gen seq data has researchers often seeking multiple sources of confirmation for variants before moving to publication. 

Though I am frankly surprised that GATK turned up an error, as Gabe points out it might be common to find heisenbugs in software,

and it's a poignant reminder that DTC genetic testing needs more work to avoid mistakes like these, which might be detrimental to personalised medicine.

"But my scary homozygous insertion (row 2) shows 153 reference bases and no reads supporting the insertion. Yet it was still called a homozygous variant!
I promptly sent an email off to 23andMe's exome team letting them know about what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe's bioinformatician behind the report face-to-face a bit at this year's ASHG conference, and it sounds like it was most likely a bug in the tool's multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples.
Since there were 8,242 other InDels that match this pattern, I am most likely not looking at random noise but real "leaked" variants from other members of the 23andMe Exome Pilot Program. (Edit: After some analysis with a fixed version of GATK, Eoghan from 23andMe found that these genotypes where not leaked from other samples but completely synthetic.)" 

Benevolent Dictator for Life for Python joins Dropbox

Guido has parted "as best friends" from Google to join Dropbox. 
Looking forward to seeing nicer python APIs that might be able to integrate with Linux CLI / NGS pipelines. 

Hmm, imagine storing your seq on Dropbox and enabling access to AWS and/or Galaxy and/or 23andMe analysis...

This would feel like I actually OWN my genomic seq.


Slides for ASHG 2012 1000 Genomes Tutorial Wednesday 7th November 7-9:30pm | 1000 Genomes


The tutorial was held at the San Francisco Marriott Marquis from 7:00 pm to 9:30 pm on Wednesday 7th November.

The 1000 Genomes Project has released the sequence data and an integrated set of variants, genotypes, and haplotypes for the 1092 samples in the phase 1 set, and the sequence data for the phase 2 set. This tutorial describes the data sets, how to access them, and how to use them.

The topics being covered are

1.  (15 min talk, 3 min questions)  Description of the 1000 Genomes data – Mark DePristo [slides]
2.  (15 min talk, 3 min questions)  How to access the data – Laura Clarke [slides]
3.  (15 min talk, 3 min questions)  Structural variants  -- Ryan Mills   [slides]
4.  (15 min talk, 3 min questions)  Population genetic and admixture analyses – Eimear Kenny [slides]
5.  (15 min talk, 3 min questions)  Functional analyses – Ekta Khurana [slides]
6.  (15 min talk, 3 min questions)  How to use the data in disease studies  -- Stephan Ripke
7.  (12 min)   Q&A

A poster was also presented on Wednesday 7th. A copy of the poster is also available on the ftp site

Wednesday, 5 December 2012

bash function to inspect col data files

Found this neat gem!

 #usage: i filetopreview
 #inspect function credit:
 i() {
      (head -n 5; tail -n 5) < "$1" | column -t
 }
 #calls the function
 i "$1"
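The function can be tried on throwaway data (it is redefined here so the snippet is self-contained; the file name and contents are arbitrary):

```shell
# Preview the first and last 5 rows of a columnar file, aligned.
i() { (head -n 5; tail -n 5) < "$1" | column -t; }

# Build a 12-line two-column table and preview it.
for n in $(seq 1 12); do printf "row%s %s\n" "$n" $((n*n)); done > demo.txt
i demo.txt   # prints rows 1-5 and 8-12, columns aligned
```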

Mounting Amazon S3 buckets in GenomeSpace

Neat feature!

New Feature: Mounting Amazon AWS S3 Buckets

The GenomeSpace Data Manager was originally built to save the files you upload to GenomeSpace in an Amazon Simple Storage System (S3) bucket that is managed by GenomeSpace itself. However you can add additional Amazon S3 buckets to GenomeSpace that you or a third party has set up to make the file contents available to your GenomeSpace and your GenomeSpace tools. For buckets that are publicly accessible, you only need to tell GenomeSpace the name of the bucket to mount it.  However, for private buckets, or those with limited non-public accessibility, the process is more complex, requiring you to set up a sub-account and the minimal permissions in Amazon to share the bucket with GenomeSpace.  Once a bucket has been mounted in GenomeSpace, you can share it with other GenomeSpace users using the standard GenomeSpace sharing dialogs.
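For the private-bucket case, the sub-account needs at least list and get permissions on the bucket. A minimal read-only policy might look like the sketch below; the bucket name, account ID and user name are all placeholders, and the exact permissions GenomeSpace requires should be taken from its documentation:

```shell
# Write a minimal read-only S3 bucket policy for a sub-account.
# "123456789012", "genomespace" and "my-genomics-bucket" are hypothetical.
cat > genomespace-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:user/genomespace"},
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-genomics-bucket"
    },
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:user/genomespace"},
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-genomics-bucket/*"
    }
  ]
}
EOF
```

Note that `ListBucket` applies to the bucket ARN while `GetObject` applies to the objects (`/*`); mixing the two up is a common reason mounted buckets appear empty.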

For details on how to mount a bucket into your GenomeSpace, follow the steps in the documentation.

Tuesday, 4 December 2012

The non-human primate reference transcript... [Nucleic Acids Res. 2012] - PubMed - NCBI

RNA-based next-generation sequencing (RNA-Seq) provides a tremendous amount of new information regarding gene and transcript structure, expression and regulation. This is particularly true for non-coding RNAs where whole transcriptome analyses have revealed that much of the genome is transcribed and that many non-coding transcripts have widespread functionality. However, uniform resources for raw, cleaned and processed RNA-Seq data are sparse for most organisms and this is especially true for non-human primates (NHPs). Here, we describe a large-scale RNA-Seq data and analysis infrastructure, the NHP reference transcriptome resource (; it presently hosts data from 12 species of primates, to be expanded to 15 species/subspecies spanning great apes, old world monkeys, new world monkeys and prosimians. Data are collected for each species using pools of RNA from comparable tissues. We provide data access in advance of its deposition at NCBI, as well as browsable tracks of alignments against the human genome using the UCSC genome browser. This resource will continue to host additional RNA-Seq data, alignments and assemblies as they are generated over the coming years and provide a key resource for the annotation of NHP genomes as well as informing primate studies on evolution, reproduction, infection, immunity and pharmacology.

NCBI Remap aka 'liftover' tool

NCBI Remap is a tool that allows users to project annotation data from one coordinate system to another. This remapping (sometimes called 'liftover') uses genomic alignments to project features from one sequence to the other. For each feature on the source sequence, we perform a base-by-base analysis in order to project the feature through the alignment to the new sequence.
We support three variations of Remap. Assembly-Assembly allows the remapping of features from one assembly to another. RefSeqGene allows for the remapping of features from assembly sequences to RefSeqGene sequences (including transcript and protein sequences annotated on the RefSeqGene) or from RefSeqGene sequences to an assembly. Alt loci remap allows for the mapping of features between the Primary assembly and the alternate loci and Patches available for GRC assemblies.
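NCBI's projection is alignment-based, but the general flavor of coordinate remapping can be sketched with a toy fixed-offset "chain" over BED-like features. File names, intervals and offsets below are hypothetical; this is an illustration of the idea, not NCBI's algorithm:

```shell
# Toy liftover: shift features inside an aligned interval by a fixed
# offset from a one-line chain table (source_start source_end offset).
cat > chain.txt <<'EOF'
1000 5000 250
EOF
cat > features.bed <<'EOF'
chr1 1200 1300 geneA
chr1 4000 4100 geneB
EOF
awk 'NR==FNR {s=$1; e=$2; off=$3; next}
     $2>=s && $3<=e {print $1, $2+off, $3+off, $4}' chain.txt features.bed
```

Features falling outside the aligned interval are simply dropped here, which mirrors how features that cannot be projected through an alignment are reported as unmapped.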

What's new

With the November 2012 update, we added the following features:
  • Alt locus remap: remap features between the primary assembly and the alternate loci/patches in GRC assemblies.
  • Clinical Remap: When you run this we will now make a call to the variation reporter and insert the results into Clinical Remap.
  • Added support for upload of compressed files. Currently GZip (.gz) and BZip2 (.bz) files are supported.
  • Improved HGVS nomenclature.

you can access the tool here

Saturday, 1 December 2012

Ray tutorial in wikibooks

Date: 1 December, 2012 7:46:55 AM GMT+08:00

Subject: [Denovoassembler-users] Ray tutorial in wikibooks


A tutorial on using Ray was added in wikibooks [1].

Thank you.



Friday, 30 November 2012

GalaxyUpdates/2012_12 - Galaxy Wiki

Enis Afgan, Brad Chapman and James Taylor, "CloudMan as a platform for tool, data, and analysis distribution." BMC Bioinformatics 2012, 13:315

Jeremy Goecks, Nate Coraor, The Galaxy Team, Anton Nekrutenko & James Taylor, "NGS analyses by visualization with Trackster." Nature Biotechnology 30, 1036–1039 (2012)

Samantha Baldwin, Roopashree Revanna, Susan Thomson, et al., "A Toolkit for bulk PCR-based marker design from next-generation sequence data: application for development of a framework linkage map in bulb onion (Allium cepa L.)," BMC Genomics, Vol. 13, No. 1. (2012), 637

Jeremy C. Morgan, Robert W. Chapman, Paul E. Anderson, "A next generation sequence processing and analysis platform with integrated cloud-storage and high performance computing resources," Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine.

Bo Liu, Borja Sotomayor, Ravi Madduri, Kyle Chard, "Deploying Bioinformatics Workflows on Clouds with Galaxy and Globus Provision." Third International Workshop on Data Intensive Computing in the Clouds (DataCloud 2012)
These papers were among 37 papers added to the Galaxy CiteULike group since the last Galaxy Update.

Article: Gattaca Alert? Or Should We Welcome the New Age of Eugenics?

Sent via Flipboard

Bleh, eugenics. I think Nature has its own way of bringing balance back. I have seen a lot of doomsday predictions that a species with so few living members will soon breed itself into extinction, since it no longer represents a viable population. Hence that argument against eugenics (that it creates a homogeneous population that isn't resilient to change) seems void.

On the flip side, I think attempts at making faster, stronger, smarter and better-looking humans are also doomed to failure. At least what I observe is that people who are doing well seem to prefer to have fewer kids.

If anything, you would have thought that thousands of years of "selective breeding" would have brought us closer to being perfect as a species.

Random late-night thoughts.

Thursday, 29 November 2012

Article: Dell releases powerful, well-supported Linux Ultrabook

I want this over a Macbook!

In our recent ZaReason UltraLap 430 review, Ars alum Ryan Paul lamented that even though putting Linux on laptops is easier today than ever, it's still not perfect. Some things (particularly components like trackpads and Wi-Fi chips) take some fiddling to get working. Major OEMs aren't yet puttin...

Sent via Flipboard

Article: The Dyslexia Candidate Locus on 2p12 Is Associated with General Cognitive Ability and White Matter Structure

Open Access

Research Article

Thomas S. Scerri1¤, Fahimeh Darki2, Dianne F. Newbury1, Andrew J. O. Whitehouse3, Myriam Peyrard-Janvid4, Hans Matsson4, Qi W. Ang5, Craig E. Pennell5, Susan Ring6, John Stein7, Andrew P. Morris1, Anthony P. Monaco1, Juha Kere4,8,9, Joel B. Talcott10, Torkel Kling...

Sent via Flipboard

Article: Hack could let browsers use cloud to carry out big attacks on the cheap

Scientists have devised a browser-based exploit that allows them to carry out large-scale computations on cloud-based services for free, a hack they warn could be used to wage powerful online attacks cheaply and anonymously.

The method, described in a research paper scheduled to be presented at...

Sent via Flipboard

Wednesday, 28 November 2012

Detecting Rare Variant Effects Using Extreme... [Genet Epidemiol. 2012] - PubMed - NCBI

 2012 Nov 26. doi: 10.1002/gepi.21699. [Epub ahead of print]

Detecting Rare Variant Effects Using Extreme Phenotype Sampling in Sequencing Association Studies.


Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts.


In the increasing number of sequencing studies aimed at identifying rare variants associated with complex traits, the power of the test can be improved by guided sampling procedures. We confirm both analytically and numerically that sampling individuals with extreme phenotypes can enrich the presence of causal rare variants and can therefore lead to an increase in power compared to random sampling. Although application of traditional rare variant association tests to these extreme phenotype samples requires dichotomizing the continuous phenotypes before analysis, the dichotomization procedure can decrease the power by reducing the information in the phenotypes. To avoid this, we propose a novel statistical method based on the optimal Sequence Kernel Association Test that allows us to test for rare variant effects using continuous phenotypes in the analysis of extreme phenotype samples. The increase in power of this method is demonstrated through simulation of a wide range of scenarios as well as in the triglyceride data of the Dallas Heart Study.

Sent from my phone

Tuesday, 27 November 2012

I want to seq every martian for $1000

This is a pretty funny poke at the Ion Torrent (& Proton) vs MiSeq debate!

Other memorable quotes

"Now you're telling me Martians with long stretches of repeats can't be sequenced?"

I wonder if there will be a video retort from the other camp.

Monday, 26 November 2012

One for all, all for one « Wellcome Trust Sanger Institute Blog

Did you know?

One key reason for this discrepancy is that HIV-1 is one of the most genetically diverse viruses known. The HIV-1 diversity within just one infected person at any one time is as great as the diversity of influenza viruses worldwide in an entire year. For example, there are as many as four genetic groups of HIV-1, nine subtypes and 55 circulating recombinant forms, or forms that have swapped their genetic material. This extensive genetic diversity has limited our ability to rapidly and cost-effectively sequence HIV-1 genomes from different populations and geographical regions.

I also didn't know that the diversity of human influenza viruses numbers ~90,000. I must admit that I had always assumed that microbial genomes are easy to work with, but the diversity within a species itself sounds like a mind-boggling task!

New Galaxy CloudMan Release

From: Enis Afgan
Date: 26 November 2012 11:16
Subject: [galaxy-user] New Galaxy CloudMan Release
To: Galaxy-user <galaxy-user>

We just released an update to CloudMan. CloudMan offers an easy way to get a personal and completely functional instance of Galaxy in the cloud in just a few minutes, without any manual configuration.

This update brings a large number of updates and new features, the most prominent ones being:
- Support for Eucalyptus cloud middleware. Thanks to Alex Richter. Also, CloudMan can now run on the HPcloud in basic mode (note that there is no public image available on the HPcloud at the moment and one would thus need to be built by you).
- Added a new file system management interface on the CloudMan Admin page, allowing control and providing insight into each available file system
- Added quite a few new user data options. See the UserData page for details. Thanks to John Chilton.
- Galaxy can now be run in multi-process mode. Thanks to John Chilton.
- Added Galaxy Reports app as a CloudMan service. Thanks to John Chilton.
- Introduced a new format for cluster configuration persistence, allowing more flexibility in how services are maintained
- Added a new file system service for instance's transient storage, allowing it to be used across the cluster over NFS. The file system is available at /mnt/transient_nfs just know that any data stored there will not be preserved after a cluster is terminated.
- Support for Ubuntu 12.10
- Worker instances are now also SGE submit hosts

This update comes as a result of 175 code changesets; for a complete list of changes, see the commit messages

Any new cluster will automatically start using this version of CloudMan. Existing clusters will be given an option to do an automatic update once the main interface page is refreshed.

Let us know what you think,


FAQ: How much sequencing is needed for ... [soil metagenomics]?

Titus (Living in an Ivory Basement) blogs about the about of sequencing coverage required for various NGS applications and explains how he gets the number (via an this the spreadsheet ) 

Fig 2 which shows the coverage required (in non log Y scale) is really his main point of how crazy a lot of coverage that will make Illumina a very happy company if everyone started doing soil metagenomes

I must say frankly I am surprised that marine samples or even human gut samples have such a vast difference in species diversity compared to soil. 

Definitely thought provoking when you think about the search for extra terrestrial life perhaps it's easier  assume we will find an alien bacteria first than anything else .. (X-files will be terribly less exciting though if it's all about sending space soil probes to retrieve alien bacteria that eats your flesh)

Haha, sorry, random thought on a Monday...

Wednesday, 21 November 2012

A unified method for detecting secondar - PubMed Mobile

Next-generation sequencing has made possible the detection of rare variant (RV) associations with quantitative traits (QT). Due to high sequencing cost, many studies can only sequence a modest number of selected samples with extreme QT. Therefore association testing in individual studies can be underpowered. Besides the primary trait, many clinically important secondary traits are often measured. It is highly beneficial if multiple studies can be jointly analyzed for detecting associations with commonly measured traits. However, analyzing secondary traits in selected samples can be biased if sample ascertainment is not properly modeled. Some methods exist for analyzing secondary traits in selected samples, where some burden tests can be implemented. However p-values can only be evaluated analytically via asymptotic approximations, which may not be accurate. Additionally, potentially more powerful sequence kernel association tests, variable selection-based methods, and burden tests that require permutations cannot be incorporated. To overcome these limitations, we developed a unified method for analyzing secondary trait associations with RVs (STAR) in selected samples, incorporating all RV tests. Statistical significance can be evaluated either through permutations or analytically. STAR makes it possible to apply more powerful RV tests to analyze secondary trait associations. It also enables jointly analyzing multiple cohorts ascertained under different study designs, which greatly boosts power. The performance of STAR and commonly used RV association tests were comprehensively evaluated using simulation studies. STAR was also implemented to analyze a dataset from the SardiNIA project where samples with extreme low-density lipoprotein levels were sequenced. A significant association between LDLR and systolic blood pressure was identified, which is supported by pharmacogenetic studies. 
In summary, for sequencing studies, STAR is an important tool for detecting secondary-trait RV associations.

Sent from my phone

Tuesday, 20 November 2012

Amazon Glacier stores data for as little as $0.01 per gigabyte per month

Cheap cloud storage! 

Amazon Web Services

Dear Amazon Web Services Customer,

We are pleased to introduce a new storage option for Amazon S3 that enables you to utilize Amazon Glacier's extremely low-cost storage service for data archival. Amazon Glacier stores data for as little as $0.01 per gigabyte per month, and is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable. With the new Amazon Glacier storage option for Amazon S3, you can define rules to automatically archive sets of Amazon S3 objects to Amazon Glacier for even lower cost storage.

To store Amazon S3 objects using the Amazon Glacier storage option, you define archival rules for a set of objects in your Amazon S3 bucket, specifying a prefix and a time period. The prefix (e.g. "logs/") identifies the object(s) subject to the rule, and the time period specifies either the number of days from object creation date (e.g. 180 days) or the specified date after which the object(s) should be archived (e.g. June 1st 2013). Going forward, any Amazon S3 standard or Reduced Redundancy Storage objects past the specified time period and having names beginning with the specified prefix are then archived to Amazon Glacier. To restore Amazon S3 data stored using the Amazon Glacier option, you first initiate a restore job using the Amazon S3 API or the Amazon S3 Management Console. Restore jobs typically complete in 3 to 5 hours. Once the job is complete, you can access your data through an Amazon S3 GET request.

You can easily configure rules to archive your Amazon S3 objects to the new Amazon Glacier storage option by opening the Amazon S3 Management Console and following these simple steps:

  1. Select the Amazon S3 bucket containing the objects that you wish to archive to Amazon Glacier.
  2. Click on "Properties." Under the "Lifecycle" tab, click "Add rule."
  3. Enter an object prefix in the "Object prefix:" input box. This rule is now applicable to all objects with names that start with the specified prefix.
  4. Choose whether you want to archive your objects based on the age of a given object or based on a specified date. Click the "Add Transition" button and specify the age or date value. Click the "Save" button.
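For those who prefer scripting to console clicks, the same kind of rule can be expressed programmatically. A minimal sketch using boto3 — the bucket name, rule ID, and prefix below are placeholders, and actually applying the rule requires AWS credentials:

```python
# Sketch: a lifecycle rule that archives objects under a prefix to Glacier
# after 180 days. Bucket name, rule ID, and prefix are placeholders.
lifecycle_rule = {
    "ID": "archive-logs-to-glacier",
    "Filter": {"Prefix": "logs/"},
    "Status": "Enabled",
    "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
}

def apply_rule(bucket_name, rule):
    """Apply the rule to a bucket (requires boto3 and AWS credentials)."""
    import boto3
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={"Rules": [rule]},
    )

# apply_rule("my-example-bucket", lifecycle_rule)
```

Getting the data back later goes through an S3 restore request first, since (as the announcement notes) restore jobs typically take a few hours before a GET will succeed.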

The Amazon Glacier storage option for Amazon S3 is currently available in the US-Standard, US-West (N. California), US-West (Oregon), EU-West (Ireland), and Asia Pacific (Japan) Regions. You can learn more by visiting the Amazon S3 Developer Guide or joining our Dec 12 webinar.

The Amazon S3 Team



9.2% of Singapore's males had a childhood dream to be a scientist



For males, 11.4 per cent of those surveyed in Singapore wanted to be engineers; 9.2 per cent wanted to be scientists; and 8.5 per cent, airplane or helicopter pilots. Following close behind are those in the health professions – doctors/nurses/paramedics came in at 6.3 per cent. Police officers made the cut-off at 5.5 per cent.

A total of 8,000 professionals all over the world took part in the survey.

Globally, males wanted to be engineers (10.9 per cent), pilots (10 per cent), scientists (7.7 per cent), doctors/nurses/paramedics (5.3 per cent) and astronauts (4 per cent). Apart from the last entry, Singaporeans, it seems, reach for similar stars. 

Singapore-based females seem to be the nurturing sort, with 'teacher' leading the surveyed pack at 14.8 per cent. Doctors/nurses/paramedics follow with 13 per cent; lawyers at 8.7 per cent; journalist/novelist at 4.3 per cent; and fashion designer and stylist at 4.3 per cent. 

Internationally, females aspired to be teachers (10.7 per cent); doctors/nurses/paramedics (9.5 per cent); journalists/novelists (6.8 per cent); veterinarians (5.4 per cent); and lawyers (5.2 per cent). 

At the more aspirational end, 'Superhero', 'Prince/Princess' and 'Ninja' clocked in at 1.3 per cent, 0.5 per cent and 0.3 per cent respectively.

Article: VirusSeq: Software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue

VirusSeq: Software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue

Sent via Flipboard

Sent from my phone

Sunday, 18 November 2012

uBiome -- Sequencing Your Microbiome for $59!

This should be fun! Love the simplicity of "Science press here"
I have signed up as an early adopter and am looking forward to receiving my sample kit ;)

Join me in helping make it happen for uBiome -- Sequencing Your Microbiome on @indiegogo

uBiome -- Sequencing Your Microbiome
By joining uBiome, you can explore your own microbiome and participate in the exciting scientific discovery to unlock this mystery. In order for it to work, we need lots of samples and a little bit of money. By pulling together as a group, we can do cutting-edge biomedical research at a fraction of the normal cost. It's called citizen science.

Friday, 16 November 2012

Article: PLOS Genetics: Lessons from Model Organisms: Phenotypic Robustness and Missing Heritability in Complex Disease

Fascinating read. 

PLOS Genetics: Lessons from Model Organisms: Phenotypic Robustness and Missing Heritability in Complex Disease

Sent via Flipboard

Sent from my phone

Article: 7 Python Libraries you should know about

7 Python Libraries you should know about

Sent via Flipboard

Sent from my phone

Life Optimizations and the Temptation to optimize

Down with the flu, so I'm kinda doing the next best thing to working: reading random articles on the net.

Anyway, I chanced upon this article on the Gliffy blog about optimizing for programmer time, which led me to a recorded talk by Jonathan Blow on programming. It's an interesting talk, general enough to offer casual programmers useful nuggets of info, and the anecdote about asset loading in Doom (the video game) was like a blast from the past, omg...

One thing I took away from the talk: the "industry average" for programmers is to generate ~3,250 lines of code per year.

Optimization is not ALWAYS bad.

It is good when you are optimizing things that actually matter.

 It is only bad when you are optimizing for the wrong thing.

Hence, when you optimize for speed and space, don't forget to also optimize for "years of my life per program implementation (life)".

I am so guilty of this: fixating on how to 'efficiently' max out the nodes/cores of our shared cluster, I forget that I could actually use the time 'wasted' to continue another side project, catch up on emails, or even take that coffee/toilet break I have been postponing.
Trying to see my time as a limited resource into which I need to fit everything within the 3,250 lines of code I can generate in a year should be an interesting paradigm shift.

Jonathan makes the rather daring claim that "almost all applied CS research papers are bad" and "this isn't fooling anyone any more ... " 

The reason, he says, is that these papers:

  • propose adding a lot of complexity for a very marginal benefit
  • don't work in all cases (limited inputs, robustness)
  • are "supported" by bogus numbers and unfair comparisons

Hmm sounds like some papers I have seen lately ... 

His list of criteria for someone he would like to hire is something that I feel should exist in all job descriptions:

  • gets things done quickly
  • gets things done robustly
  • makes things simple
  • finishes what he writes (for real)
  • broad knowledge of advanced ideas and techniques (but only uses them when genuinely helpful)

Datanami, Woe be me