Sunday, 30 December 2012

Period of ineptitude now : INSIDE THE BOX: Solutions for Data Sharing in Life Sciences - Bio-IT World

Coming from a wet lab background, I am amused that there's a consensus that current journal publications do not enable the sharing of data in an easily accessible manner.
At the dawn of molecular biology, when cloning a single gene warranted a publication in JBC, sharing was, well, rare. :-)
Apart from a handful, I think few labs will readily share their cloned transcripts, cell lines or antibodies.
Part of the reason is of course logistics, which may include lengthy MTA discussions with the university or problems with customs in different countries.
There's of course the selfish (gene) hypothesis. If you have ongoing research, it's natural not to want to give another lab a head start in catching up with your postgrad student, who has been slogging to get his/her publication out.

But I think that it's great that we are all moving towards an era of open science, both in publishing in readily accessible journals and openly sharing data and looking for open collaborations.

It's quite awkward to have competition for publicly funded science, as it seems to suggest we have run out of interesting questions to ask of the world.
If someone sees a potential in my data that I overlooked, I will be most happy if something comes out of that same data. Because sharing is caring :-)

Tuesday, 25 December 2012

Exome array analysis identifies new loci and low-frequency variants influencing insulin processing and secretion.

 2012 Dec 23. doi: 10.1038/ng.2507. [Epub ahead of print]



Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.


Insulin secretion has a crucial role in glucose homeostasis, and failure to secrete sufficient insulin is a hallmark of type 2 diabetes. Genome-wide association studies (GWAS) have identified loci contributing to insulin processing and secretion; however, a substantial fraction of the genetic contribution remains undefined. To examine low-frequency (minor allele frequency (MAF) 0.5-5%) and rare (MAF < 0.5%) nonsynonymous variants, we analyzed exome array data in 8,229 nondiabetic Finnish males using the Illumina HumanExome Beadchip. We identified low-frequency coding variants associated with fasting proinsulin concentrations at the SGSM2 and MADD GWAS loci and three new genes with low-frequency variants associated with fasting proinsulin or insulinogenic index: TBC1D30, KANK1 and PAM. We also show that the interpretation of single-variant and gene-based tests needs to consider the effects of noncoding SNPs both nearby and megabases away. This study demonstrates that exome array genotyping is a valuable approach to identify low-frequency variants that contribute to complex traits.
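The MAF bins quoted in the abstract (rare: MAF < 0.5%; low-frequency: MAF 0.5-5%) can be sketched as a small filter. This is only an illustration: the two-column input (variant ID, MAF as a fraction), the function name `classify_maf`, and the "common" label for anything above 5% are my assumptions, not something the paper specifies.

```shell
# classify_maf: label each variant by minor allele frequency.
# Hypothetical input: "<variant_id> <maf>" per line, MAF given as a fraction.
classify_maf() {
    awk '{
        maf = $2 + 0
        if (maf < 0.005)       class = "rare"            # MAF < 0.5%
        else if (maf <= 0.05)  class = "low-frequency"   # MAF 0.5-5%
        else                   class = "common"          # assumed label for > 5%
        print $1, $2, class
    }' "$@"
}

# Made-up example variants, read from stdin.
printf 'var1 0.002\nvar2 0.03\nvar3 0.21\n' | classify_maf
```

The function reads from stdin or from any files passed as arguments, so it can sit at the end of a pipeline.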

Sunday, 23 December 2012

Top Scientific Discoveries of 2012 | Wired Science |

Seems odd that rare variants are listed there as a top discovery.

New AWS high storage instance

The High Storage Eight Extra Large (hs1.8xlarge) instances are a great fit for applications that require high storage depth and high sequential I/O performance. Each instance includes 117 GiB of RAM, 16 virtual cores (providing 35 ECU of compute performance), and 48 TB of instance storage across 24 hard disk drives capable of delivering up to 2.4 GB per second of I/O performance.
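A quick back-of-the-envelope on those numbers, as a rough sketch only. It assumes decimal units (1 TB = 1000 GB, 1 GB = 1000 MB) and that capacity and throughput are spread evenly across the 24 drives, neither of which the announcement states.

```shell
drives=24
capacity_gb=48000      # 48 TB, assuming 1 TB = 1000 GB
throughput_mb_s=2400   # 2.4 GB/s, assuming 1 GB = 1000 MB

# Even split across drives (an assumption).
per_drive_gb=$(( capacity_gb / drives ))
per_drive_mb_s=$(( throughput_mb_s / drives ))

# Time to read the whole instance store once at the peak sequential rate.
full_scan_s=$(( capacity_gb * 1000 / throughput_mb_s ))

echo "per-drive capacity:   ${per_drive_gb} GB"
echo "per-drive throughput: ${per_drive_mb_s} MB/s"
echo "full sequential scan: ${full_scan_s} s (roughly $(( full_scan_s / 60 )) min)"
```

So even at the quoted peak rate, one full pass over the instance storage takes several hours.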

Genome Biology | Abstract | Ray Meta: scalable de novo metagenome assembly and profiling

Abstract (provisional)
Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights on specific environments. Ray Meta is open source and available at

Saturday, 22 December 2012

Article: Gattaca Alert? Or Should We Welcome the New Age of Eugenics?


The Evolution of Cavities – Phenomena: The Loom

Interesting article on S. mutans which causes cavities.
I wonder which 148 genes unique to the human microbiome were picked up.

I doubt that removing S. mutans would be an entirely good idea though. It would create a void for another species to exploit, and who knows what diseases worse than cavities would prevail.

Tuesday, 18 December 2012

Bowtie 2 2.0.4 released: Fixed issue whereby --un, --al, --un-conc and --al-conc options would incorrectly suppress SAM output.

---------- Forwarded message ----------
From: Langmead, Ben
Date: Tue, Dec 18, 2012 at 10:29 AM

Please make the switch from 2.0.3, as the first issue listed below is major.

Bowtie 2 version 2.0.4  - December 17, 2012
   * Fixed issue whereby --un, --al, --un-conc and --al-conc options would
     incorrectly suppress SAM output.
   * Fixed minor command-line parsing issue in wrapper script.
   * Fixed issue on Windows where wrapper script would fail to find
     bowtie2-align.exe binary.
   * Updated some of the index-building scripts and documentation.
   * Updated author's contact info in usage message.



Ben Langmead
Department of Computer Science
Johns Hopkins University
3400 North Charles St
Baltimore, MD 21218-2682

Monday, 17 December 2012

Fwd: [Bowtie-bio-announce] Bowtie 0.12.9 released

From: Langmead, Ben
Date: Monday, December 17, 2012

Bowtie version 0.12.9 - December 16, 2012
   * Fixed a bug whereby read names would not be truncated at first
     whitespace character in unmapped or maxed-out SAM records.
   * Fixed errors and warnings when compiling with clang++.
   * Fixed most errors and warnings when compiling with recent versions
     of g++, though you may need to add EXTRA_FLAGS=-Wno-enum-compare
     to avoid all warnings.



Ben Langmead
Department of Computer Science
Johns Hopkins University
3400 North Charles St
Baltimore, MD 21218-2682


Thursday, 13 December 2012

Effects of OTU Clustering and PCR Artifacts on M... [Microb Ecol. 2012] - PubMed - NCBI

Next-generation sequencing has increased the coverage of microbial diversity surveys by orders of magnitude, but differentiating artifacts from rare environmental sequences remains a challenge. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) organizes sequence data into groups of 97 % identity, helping to reduce data volumes and avoid analyzing sequencing artifacts by grouping them with real sequences. Here, we analyze sequence abundance distributions across environmental samples and show that 16S rRNA sequences of >99 % identity can represent functionally distinct microorganisms, rendering OTU clustering problematic when the goal is an accurate analysis of organism distribution. Strict postsequencing quality control (QC) filters eliminated the most prevalent artifacts without clustering. Further experiments proved that DNA polymerase errors in polymerase chain reaction (PCR) generate a significant number of substitution errors, most of which pass QC filters. Based on our findings, we recommend minimizing the number of PCR cycles in DNA library preparation and applying strict postsequencing QC filters to reduce the most prevalent artifacts while maintaining a high level of accuracy in diversity estimates. We further recommend correlating rare and abundant sequences across environmental samples, rather than clustering into OTUs, to identify remaining sequence artifacts without losing the resolution afforded by high-throughput sequencing

Wednesday, 12 December 2012

Article: Cross-biome metagenomic analyses of soil microbial communities and their functional attributes

Cross-biome metagenomic analyses of soil microbial communities and their functional attributes


Article: AJHG - Improved Heritability Estimation from Genome-wide SNPs

AJHG - Improved Heritability Estimation from Genome-wide SNPs

Estimation of narrow-sense heritability, h2, from genome-wide SNPs genotyped in unrelated individuals has recently attracted interest and offers several advantages over traditional pedigree-based methods. With the use of this approach, it has been estimated that over half the heritability of human height can be attributed to the ∼300,000 SNPs on a genome-wide genotyping array. In comparison, only 5%–10% can be explained by SNPs reaching genome-wide significance. We investigated via simulation the validity of several key assumptions underpinning the mixed-model analysis used in SNP-based h2 estimation. Although we found that the method is reasonably robust to violations of four key assumptions, it can be highly sensitive to uneven linkage disequilibrium (LD) between SNPs: contributions to h2 are overestimated from causal variants in regions of high LD and are underestimated in regions of low LD. The overall direction of the bias can be up or down depending on the genetic architecture of the trait, but it can be substantial in realistic scenarios. We propose a modified kinship matrix in which SNPs are weighted according to local LD. We show that this correction greatly reduces the bias and increases the precision of h2 estimates. We demonstrate the impact of our method on the first seven diseases studied by the Wellcome Trust Case Control Consortium. Our LD adjustment revises downward the h2 estimate for immune-related diseases, as expected because of high LD in the major-histocompatibility region, but increases it for some nonimmune diseases. To calculate our revised kinship matrix, we developed LDAK, software for computing LD-adjusted kinships.


Article: A high-performance computing toolset for relatedness and principal component analysis of SNP data

A high-performance computing toolset for relatedness and principal component analysis of SNP data

Summary: Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ~8–50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30–300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the 'Gene-Environment Association Studies' consortium studies.

Availability and implementation: gdsfmt and SNPRelate are available from R CRAN, including a vignette. A tutorial can be found at


Article: Tools for mapping high-throughput sequencing data

Tools for mapping high-throughput sequencing data

Motivation: A ubiquitous and fundamental step in high-throughput sequencing analysis is the alignment (mapping) of the generated reads to a reference sequence. To accomplish this task, numerous software tools have been proposed. Determining the mappers that are most suitable for a specific application is not trivial.

Results: This survey focuses on classifying mappers through a wide number of characteristics. The goal is to allow practitioners to compare the mappers more easily and find those that are most suitable for their specific problem.

Availability: A regularly updated compendium of mappers can be found at


Article: RetroSeq: Transposable element discovery from Illumina paired-end sequencing data

RetroSeq: Transposable element discovery from Illumina paired-end sequencing data


U.K. Unveils Plan to Sequence Whole Genomes of 100,000 Patients - ScienceInsider

100 million pounds for 100,000 WGS samples. 

Hmm, how much of that goes into the bioinformatics and IT infrastructure?

I suddenly recall British Telecom's foray into a "Siri" for medical genomics, and how everything ties together now.

With genetic technology advancing quickly, the prime minister of the United Kingdom announced today an ambitious plan to fully sequence the genomes of 100,000 Britons with cancer and rare diseases. Although many countries are touting their efforts to decode their citizens' DNA in the name of treating and curing disease, the new project is unusual because it will decode entire genomes, not just parts of them.

Prime Minister David Cameron said in a statement that the government's National Health Service (NHS) has earmarked £100 million, or about $160 million, to the effort. The money is part of £600 million ($965 million) announced last week for research in the coming years. The sequencing is expected to take 3 to 5 years.


Tuesday, 11 December 2012

Fwd: [bedtools-discuss] pybedtools version 0.6.2

---------- Forwarded message ----------
From: "Ryan Dale"

> Hi all,
> On the heels of BEDTools 2.17 release, I've just released the corresponding pybedtools v0.6.2.
> pybedtools v0.6.2 wraps the tools new in BEDTools 2.17, and adds lots of bugfixes and features of its own.
> The complete documentation is at, and the specific changes and new features in this version are listed at
> As always, comments, suggestions, and bug reports are all welcome.
> -ryan

Nature editors - "a cross between rock star goddess and Darth Vader": The Nature of the Knight Bus | Story Collider Magazine

" I may look like a standard soccer mom now, kids, but back in the day people regarded me as a cross between a rock-star goddess and Darth Vader. That's right, I was indeed an editor for Nature for nearly seven years, handling papers in genetics and genomics. There is a generally accepted hierarchy in science journals, and Nature is always at or near the top. The team of biology editors at Nature can only publish between about 8 and 10 percent of the submissions they receive, so most of your job as a manuscript editor is not publication of work but instead dealing out rejection. "

my hat's off to Michael Eisen

odd way to promote Fedora

Hmm, I wouldn't have done a Facebook ad with these exact words to describe Fedora

 Fedora is a Linux OS, a collection of software to run on your computer.
Join us today.
Like · 1,299 people like this.

even a random quote like
"Fedora has [...] released an amazingly rock-solid operating system." 
− Jack Wallen,

would have enticed me to click like if I didn't know Linux

Monday, 10 December 2012

Gabe Rudy "GATK is a Research Tool. Clinics Beware." | Our 2 SNPs…(R)

Gabe points out in great detail a bug he found in GATK's variant caller, which has been widely regarded as a reliable SNP caller.

I think in general the 'unreliable' nature of next-gen seq data has researchers often seeking multiple sources of confirmation for variants before moving to publication.

Though I am frankly surprised that GATK turned up an error, as Gabe points out it might be common to find Heisenbugs in software.

It's also a poignant reminder that DTC genetic testing needs more work to avoid mistakes like these, which might be detrimental to personalised medicine.

"But my scary homozygous insertion (row 2) shows 153 reference bases and no reads supporting the insertion. Yet it was still called a homozygous variant!
I promptly sent an email off to 23andMe's exome team letting them know about what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe's bioinformatician behind the report face-to-face a bit at this year's ASHG conference, and it sounds like it was most likely a bug in the tool's multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples.
Since there were 8,242 other InDels that match this pattern, I am most likely not looking at random noise but real "leaked" variants from other members of the 23andMe Exome Pilot Program. (Edit: After some analysis with a fixed version of GATK, Eoghan from 23andMe found that these genotypes were not leaked from other samples but completely synthetic.)"

Benevolent_Dictator_for_Life for Python joins Dropbox

Guido has parted "as best friends" from Google to join Dropbox. 
Looking forward to seeing nicer Python APIs that might be able to integrate with Linux CLI / NGS pipelines.

Hmm, imagine storing your seq on Dropbox and enabling access for AWS and/or Galaxy and/or 23andMe analysis...

This would feel like I actually OWN my genomic seq


Slides for ASHG 2012 1000 Genomes Tutorial Wednesday 7th November 7-9:30pm | 1000 Genomes


The tutorial was held at the San Francisco Marriott Marquis from 7pm to 9:00pm on Wednesday 7th November.

The 1000 Genomes Project has released the sequence data and an integrated set of variants, genotypes, and haplotypes for the 1092 samples in the phase 1 set, and the sequence data for the phase 2 set. This tutorial describes the data sets, how to access them, and how to use them.

The topics being covered are

1. (15 min talk, 3 min questions) Description of the 1000 Genomes data – Mark DePristo [slides]
2. (15 min talk, 3 min questions) How to access the data – Laura Clarke [slides]
3. (15 min talk, 3 min questions) Structural variants – Ryan Mills [slides]
4. (15 min talk, 3 min questions) Population genetic and admixture analyses – Eimear Kenny [slides]
5. (15 min talk, 3 min questions) Functional analyses – Ekta Khurana [slides]
6. (15 min talk, 3 min questions) How to use the data in disease studies – Stephan Ripke
7. (12 min) Q&A

A poster was also presented on Wednesday 7th. A copy of the poster is also available on the ftp site

Wednesday, 5 December 2012

bash function to inspect col data files

Found this neat gem!

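 # usage: pass the file to preview as the first argument
 #inspect function credit :
 i() {
      # show the first 5 and last 5 lines of the file as aligned columns
      (head -n 5; tail -n 5) < "$1" | column -t
 }
 # call the function on the script's first argument
 i "$1"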
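
The same idea as a standalone function, as a rough sketch rather than the credited original. The name `preview`, the optional line-count argument and the `...` separator are my own additions. It reads the file twice instead of sharing one file descriptor between `head` and `tail`, so it does not depend on how `head` leaves the read offset, and it falls back to plain output if `column` is not installed.

```shell
# preview FILE [N]: print the first and last N lines (default 5) as aligned columns.
preview() {
    file=$1
    n=${2:-5}
    {
        head -n "$n" "$file"
        echo "..."
        tail -n "$n" "$file"
    } | if command -v column >/dev/null 2>&1; then column -t; else cat; fi
}

# Demo on a throwaway two-column file.
tmp=$(mktemp)
seq 1 12 | awk '{ print $1, $1 * 10 }' > "$tmp"
preview "$tmp" 2
rm -f "$tmp"
```

Note that for files shorter than 2N lines the head and tail portions overlap, and that this version needs a real file rather than stdin.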

Mounting Amazon S3 buckets in GenomeSpace

Neat feature!

New Feature: Mounting Amazon AWS S3 Buckets

The GenomeSpace Data Manager was originally built to save the files you upload to GenomeSpace in an Amazon Simple Storage Service (S3) bucket that is managed by GenomeSpace itself. However, you can add additional Amazon S3 buckets to GenomeSpace that you or a third party has set up to make the file contents available to your GenomeSpace and your GenomeSpace tools. For buckets that are publicly accessible, you only need to tell GenomeSpace the name of the bucket to mount it. However, for private buckets, or those with limited non-public accessibility, the process is more complex, requiring you to set up a sub-account and the minimal permissions in Amazon to share the bucket with GenomeSpace. Once a bucket has been mounted in GenomeSpace, you can share it with other GenomeSpace users using the standard GenomeSpace sharing dialogs.

For details on how to mount a bucket into your GenomeSpace, follow the steps in the documentation.

Tuesday, 4 December 2012

The non-human primate reference transcript... [Nucleic Acids Res. 2012] - PubMed - NCBI

RNA-based next-generation sequencing (RNA-Seq) provides a tremendous amount of new information regarding gene and transcript structure, expression and regulation. This is particularly true for non-coding RNAs where whole transcriptome analyses have revealed that much of the genome is transcribed and that many non-coding transcripts have widespread functionality. However, uniform resources for raw, cleaned and processed RNA-Seq data are sparse for most organisms and this is especially true for non-human primates (NHPs). Here, we describe a large-scale RNA-Seq data and analysis infrastructure, the NHP reference transcriptome resource; it presently hosts data from 12 species of primates, to be expanded to 15 species/subspecies spanning great apes, old world monkeys, new world monkeys and prosimians. Data are collected for each species using pools of RNA from comparable tissues. We provide data access in advance of its deposition at NCBI, as well as browsable tracks of alignments against the human genome using the UCSC genome browser. This resource will continue to host additional RNA-Seq data, alignments and assemblies as they are generated over the coming years and provide a key resource for the annotation of NHP genomes as well as informing primate studies on evolution, reproduction, infection, immunity and pharmacology.

NCBI Remap aka 'liftover' tool

NCBI Remap is a tool that allows users to project annotation data from one coordinate system to another. This remapping (sometimes called 'liftover') uses genomic alignments to project features from one sequence to the other. For each feature on the source sequence, we perform a base-by-base analysis in order to project the feature through the alignment to the new sequence.
We support three variations of Remap. Assembly-Assembly allows the remapping of features from one assembly to another. RefSeqGene allows for the remapping of features from assembly sequences to RefSeqGene sequences (including transcript and protein sequences annotated on the RefSeqGene) or from RefSeqGene sequences to an assembly. Alt loci remap allows for the mapping of features between the Primary assembly and the alternate loci and Patches available for GRC assemblies.
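
To make the idea of projecting a position "through the alignment" concrete, here is a deliberately simplified sketch. It is not NCBI's algorithm: it assumes ungapped, same-strand alignment blocks, and the three-column block file (source start, source end, destination start) and the function name `remap_pos` are hypothetical.

```shell
# remap_pos POS BLOCKS_FILE
# BLOCKS_FILE (hypothetical format): "<src_start> <src_end> <dst_start>" per line,
# one ungapped, same-strand alignment block per line, coordinates inclusive.
remap_pos() {
    awk -v pos="$1" '
        # Inside a block: keep the same offset from the block start.
        pos >= $1 && pos <= $2 { print $3 + (pos - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    ' "$2"
}

# Made-up blocks: source 100-199 maps to 1100-1199, source 300-399 to 1250-1349.
blocks=$(mktemp)
printf '100 199 1100\n300 399 1250\n' > "$blocks"
remap_pos 150 "$blocks"   # falls inside the first block
remap_pos 250 "$blocks"   # falls in the gap between blocks
rm -f "$blocks"
```

A position that falls between blocks has no counterpart in the destination, which is why real remapping tools can report features as unmapped or only partially mapped.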

What's new

With the November 2012 update, we added the following features:
  • Alt locus remap: remap features between the primary assembly and the alternate loci/patches in GRC assemblies.
  • Clinical Remap: When you run this we will now make a call to the variation reporter and insert the results into Clinical Remap.
  • Added support for upload of compressed files. Currently GZip (.gz) and BZip2 (.bz) files are supported.
  • Improved HGVS nomenclature.

you can access the tool here

Saturday, 1 December 2012

Ray tutorial in wikibooks

Date: 1 December, 2012 7:46:55 AM GMT+08:00

Subject: [Denovoassembler-users] Ray tutorial in wikibooks


A tutorial on using Ray was added in wikibooks [1].

Thank you.



Datanami, Woe be me