Saturday, 31 March 2012
Friday, 30 March 2012
Professor Jian Wang, President of BGI, said, "In the past, it took a year to conduct a project on the genomics association study of 500 human samples, but now with 'Tianhe', 3 hours is enough. We believe this will broaden the applications of Tianhe-1A in life science and greatly accelerate the development of frontier of science and technology."
Wednesday, 28 March 2012
BMC Bioinformatics | Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II.
BMC Bioinformatics | Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer
Tuesday, 27 March 2012
[Velvet-users] A strategy for scaling genome and transcriptome contig assembly (digital normalization)
---------- Forwarded message ----------
From: "C. Titus Brown"
Date: Mar 25, 2012 12:48 PM
Subject: [Velvet-users] A strategy for scaling genome and transcriptome contig assembly (digital normalization)
last week I posted a preprint of a paper discussing a strategy for coverage normalization, data reduction, and error elimination:
we call this strategy 'digital normalization' and it can yield good to spectacular reductions in data size and memory usage for assembly. In the paper we test it with Velvet, Oases, and Trinity on a variety of data sets.
On the paper site,
I just posted a tutorial for running it on microbial genomes prior to Velvet assembly, and on the Trinity paper's yeast mRNAseq data set prior to Oases or Trinity assembly.
The tutorial uses an Amazon EC2 instance for reproducibility, but with a bit of hopefully obvious tweaking the commands should work on any Linux system. Note, you'll need about 15 GB of RAM for the yeast Oases & Trinity assemblies.
Let me know if you have any questions (but please ask just on the relevant mailing list -- I'm sending this to the velvet, oases, and trinity lists).
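To make the idea in the email above a bit more concrete, here is a minimal sketch of digital normalization as I understand it from the preprint: estimate each read's coverage from the median abundance of its k-mers among the reads kept so far, and discard the read once that estimate reaches a cutoff. The function names, the exact dictionary-based counting, and the default values of `k` and `cutoff` are my own assumptions for illustration. This is not the authors' implementation, which is built for much larger data sets.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is still below cutoff.

    Illustrative only: k and cutoff are assumed defaults, and exact counting
    in a Counter is practical for toy data, not for real sequencing runs.
    """
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read too short to estimate coverage
        # The median k-mer abundance so far serves as the coverage estimate.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only retained reads update the counts
    return kept
```

Because reads from already well-covered regions are dropped before assembly, the assembler sees less data, which is where the reductions in data size and memory described in the email would come from.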
Saturday, 24 March 2012
Flow cytometric chromosome sorting in plants: The next generation.
ARIEL and AMELIA: Testing for an Accumulation of Rare Variants Using Next-Generation Sequencing Data.
Thursday, 22 March 2012
Multi-threaded BAM compression and sorting
From: Heng Li
Date: Thu, Mar 22, 2012 at 11:32 AM
Subject: [Samtools-help] Multi-threaded BAM compression and sorting
To convert a coordinate-sorted BAM back to FASTQ, the recommended way is to sort the BAM by name and then convert the name-sorted BAM to FASTQ. This is important because some mappers such as BWA assume the input is random. They may have some trouble if we directly convert a coordinate-sorted BAM with Picard's bam2fastq. As for novosort, it does not sort by name. As I need to do BAM=>FASTQ for some huge BAMs, I added multi-threading to "sort", "merge" and "view".
This is not a full parallelization in that not all the steps are parallelized. Thus the efficiency does not scale linearly with the number of threads. It is not recommended to use more than 8 threads. With 4 threads, the time spent on sorting is reduced to 40% according to limited testing. It may save you half a day if you have a huge BAM.
All my changes are naive and simple. It is possible to speed up sorting and compression further, but so far as I can see, this needs quite a lot of code restructuring and development time. For coordinate sorting, novosort scales much better with the number of threads (though I do not know if multi-threaded novosort is free to use beyond 15 days). Nils' multi-threaded bgzip should also do better on compression. These are sophisticated implementations. Mine is not.
The multi-threaded sort/merge/view is available at the "mt" branch:
The samtools/bgzf APIs stay the same except a few new functions to enable threading. In addition to multithreading, there are a few other improvements to sorting (some are based on Nils'):
1) @HD-SO tag is properly set (finally).
2) The compression level can be changed on the command line (-l).
3) Coordinate sorting considers strand as part of the key.
4) Improved alpha-numeric comparison between query names. The previous version was slower and did not work when there is a large integer.
5) Supporting "K/M/G" with option "-m". The maximum memory is estimated a little more accurately.
6) I kept claiming samtools sort was stable (i.e. the relative order of two records having the same coordinate is retained), but this was not true. The new sort is truly stable. This also means that under the same compression level, sort always produces exactly the same output. For end users, stable sorting is largely irrelevant. This just makes me feel more comfortable, "in theory".
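Points 3, 4 and 6 in the list above are easier to picture with a toy example. The sketch below is only an illustration in Python, not the samtools code: the `Record` tuple, both key functions, and the choice to put the forward strand before the reverse strand are assumptions of mine. It shows an alpha-numeric comparison of query names that still works with very large integers, and a coordinate sort that uses strand as part of the key while keeping tied records in their input order.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for an alignment record; not a real BAM structure.
Record = namedtuple("Record", "name ref pos reverse")


def name_key(name):
    """Split a query name into text and digit runs so 'read10' sorts after 'read9'."""
    parts = re.split(r"([0-9]+)", name)
    # Odd positions hold the digit runs captured by the group. Tagging each
    # part keeps integers and strings from being compared with each other.
    return [(0, int(p)) if i % 2 else (1, p) for i, p in enumerate(parts)]


def coordinate_key(rec):
    """Reference, then position, then strand (forward first, by assumption)."""
    return (rec.ref, rec.pos, rec.reverse)


def sort_by_name(records):
    return sorted(records, key=lambda rec: name_key(rec.name))


def sort_by_coordinate(records):
    # Python's sorted() is stable, so records with equal keys keep input order.
    return sorted(records, key=coordinate_key)
```

Stability is also what makes the output reproducible: with a stable sort, the same input always yields the same record order, whatever order ties happen to be visited in.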
contributed by iceman (see comments)
relevant links to an alternative implementation:
CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing
CloVR is a VM that executes on a desktop (or laptop) computer, providing the ability to run analysis pipelines on local resources (Figure 1). CloVR is invoked using one of two supported VM players, VMware and VirtualBox, at least one of which is freely available on all major desktop platforms: Windows, Unix/Linux, and Mac OS. On a local computer, CloVR utilizes local disk storage and compute resources, as supported by the VM player, including multi-core CPUs if available. To access data stored on the local computer, users can copy files into a "shared folder" that is accessible on both the VM and the local desktop and uses available hard drive space on the computer. Once inside the shared folder, CloVR can read this data for processing. Similarly, CloVR writes output data to this shared folder, making the pipeline output available on the desktop. This shared folder feature is supported by both VMware and VirtualBox.
Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community (BMC Bioinformatics)
you would have to first download the data
Downloading the data for the tutorial
We assume you have a working copy of PLINK/SEQ already installed. The data for this tutorial are in two archives you need to download:
- pseq-tut1.tar.gz [ 1.1M ] : VCFs and a few auxiliary data files
- resources-hg18-0.02.tar.gz [ 1.1G ] : a (relatively large) bundle of resource databases (RefSeq genes, dbSNP variants, hg18 sequence)
UPDATE: a better page to describe the resources can be found here
although only hg19 is available
Wednesday, 21 March 2012
DeconSeq @ SourceForge.net: automatically detect and efficiently remove sequence contaminations from genomic and metagenomic datasets.
Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants presents a necessary step for all metagenomic projects.
DeconSeq is distributed under the GNU Public License (GPL). All of its source code is freely available to both academic and commercial users. The latest version can be downloaded at the SourceForge download page.
Web version
The interactive web interface of DeconSeq can be used to automatically detect and efficiently remove sequence contaminations from genomic and metagenomic datasets.
- Computer connected to the Internet
- Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, ...)
- FASTA file with sequence data
- FASTQ file (as alternative format to trim sequence and quality data)
Upload data to the DeconSeq web version
To upload a new dataset in FASTA or FASTQ format to DeconSeq, follow these steps:
1. Go to http://deconseq.sourceforge.net
2. Click on "Use DeconSeq" in the top menu on the right (the latest DeconSeq web version should load)
3. Select your FASTA or FASTQ file
4. Select the retain and remove (optional) database(s)
5. Click "Submit"
Tuesday, 20 March 2012
rsync (http://samba.org/rsync/) has a couple of issues with mirroring large (> 100K) directory trees.
rsync's memory usage is directly proportional to the number of files in a tree. Large directories take a large amount of RAM.
rsync can recover from previous failures, but always determines the files to transfer up-front. If the connection fails before that determination can be made, no forward progress in the mirror can occur.
The solution? Chop up the workload by using perl to recurse the directory tree, building smallish lists of files to transfer with rsync. Most of the time these small lists of files transfer over fine, but if they fail, this script can look for that specific failure and retry that set a couple times before giving up.
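The original script is written in perl and I have not reproduced it here. As a rough sketch of the same idea in Python: walk the tree lazily, cut the file list into small batches, and hand each batch to rsync with `--files-from`, retrying a few times on failure. The function names, batch size, and retry count are made up for illustration. Unlike the script described above, this version retries on any non-zero rsync exit status rather than looking for one specific failure, and it assumes file names contain no newlines.

```python
import os
import subprocess
import tempfile


def batches(root, size=1000):
    """Yield lists of at most `size` file paths, relative to root."""
    batch = []
    # os.walk is lazy, so the whole tree is never held in memory at once.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            batch.append(os.path.relpath(os.path.join(dirpath, name), root))
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch


def sync_batch(root, dest, batch, retries=3):
    """Transfer one batch with rsync --files-from; return True on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False) as fh:
        fh.write("\n".join(batch) + "\n")
        list_path = fh.name
    try:
        for _attempt in range(retries):
            cmd = ["rsync", "-a", "--files-from=" + list_path, root, dest]
            if subprocess.call(cmd) == 0:
                return True
        return False
    finally:
        os.remove(list_path)


def mirror(root, dest, size=1000):
    """Mirror root to dest batch by batch; return the batches that failed."""
    return [b for b in batches(root, size) if not sync_batch(root, dest, b)]
```

Since each rsync call only sees a short list, its memory use stays small, and a dropped connection costs one batch instead of the whole up-front file scan.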
Monday, 19 March 2012
The software automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web. Since extensive libraries are provided with the package, it is also a platform that allows scientists to develop and release software in the true open source spirit. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.
For more information see:
Saturday, 17 March 2012
Broad's Heng Li Wins 2012 Benjamin Franklin Award - Bio-IT World
Using Excel for Bioinformatics Data: Five Issues, Five Solutions
Nice tips !
I had been wondering about the "Mistaken SYLK Files" issue for the longest time! I forget how I eventually solved it. Excel on Mac OS also appears to have a different default behavior. On Windows I used to be able to open .csv files and a dialog box would ask me which field separator the file uses (:, ;, |, tab, space).
Now I just run everything through sed to change the tabs to commas so that it works.
sed $'s/\t/,/g' file > file.csv
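One catch with a plain substitution is that a field which already contains a comma ends up split into two columns. A small sketch using Python's standard `csv` module avoids that by quoting such fields. The function name is my own, and this is only one way to do it.

```python
import csv


def tsv_to_csv(src, dst):
    """Copy tab-separated rows from src to dst as comma-separated rows."""
    reader = csv.reader(src, delimiter="\t")
    # The default quoting wraps any field that contains a comma in quotes.
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Usage with real files (file names are placeholders):
#   with open("file", newline="") as src, open("file.csv", "w", newline="") as dst:
#       tsv_to_csv(src, dst)
```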
Thursday, 15 March 2012
Wednesday, 14 March 2012
Abstract: Due to its cost effectiveness, next generation sequencing of pools of individuals (Pool-Seq) is becoming a popular strategy for characterizing variation in population samples. Since Pool-Seq provides genome-wide SNP frequency data, it is possible to use them for demographic inference and/or the identification of selective sweeps. Here, we introduce a statistical method that is designed to detect selective sweeps from pooled data by accounting for statistical challenges associated with Pool-Seq, namely sequencing errors and random sampling among chromosomes. This allows for an efficient use of the information: all base calls are included in the analysis, but the higher credibility of regions with higher coverage and base calls with better quality scores is accounted for.
The Pistoia Alliance Sequence Squeeze Competition
The first 40 entries received will each receive a US$20 Amazon Web Services voucher.
(Only one voucher per person.)
The volume of next-generation sequencing data is a big problem. Data volumes are growing rapidly as sequencing technology improves. Individual runs are providing many more reads than before, and decreasing run times mean that more data can today be generated by a single machine in one day than a single machine could have produced in the whole of 2005.
Storing millions of reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate. There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
The Pistoia Alliance, in the interests of promoting pre-competitive collaboration, is putting forward a prize fund of US$15,000 to the best novel open-source NGS compression algorithm submitted before the closing date of 15 March 2012.
Follow us on Twitter - @SeqSqueeze
Find these keys and edit them (note that you can enter \033 by pressing the Esc key):
End - send string to shell: \033[4~
Home - send string to shell: \033[1~
Page down - send string to shell: \033[6~
Page up - send string to shell: \033[5~
Shift page down - scroll to next page in buffer
Shift page up - scroll to previous page in buffer
Easyfig enables the creation of linear comparison figures showing BLAST matches between multiple genomic loci or prokaryote genomes. Easyfig has an easy-to-use graphical user interface and is able to launch BLAST searches interactively.
WITHDRAWN: Evaluation of next-generation sequencing software in mapping and assembly.