Wednesday, 28 September 2011
[bedtools-discuss] pybedtools: a flexible Python library for manipulating genomic datasets and annotations
Tuesday, 27 September 2011
If you didn't already know, the Torrent Server uses SGE. For those of us more used to PBS or LSF, the guides below might help you walk through the commands if you need to, say, create your own reference index or run tmap on the TS.
Array jobs for clusters running SGE
http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
Converting between PBS and SGE scripts
http://wiki.ibest.uidaho.edu/index.php/Tutorials:_SGE_PBS_Converting
The above URL has a fantastic conversion table that lists options that are new to me as well!
For a more concise, quick and dirty guide see here
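To make the array-job idea concrete, here is a minimal SGE submission script sketch (the script and chunk file names are hypothetical):

```shell
#!/bin/bash
# Minimal SGE array job sketch -- submit with: qsub -t 1-10 run_array.sh
# SGE runs one task per index in the range and exports it as $SGE_TASK_ID.
#$ -S /bin/bash
#$ -cwd
#$ -N my_array_job

# Each task picks its own (hypothetical) input chunk by index.
TASK="${SGE_TASK_ID:-1}"
INPUT="chunk_${TASK}.fastq"
echo "Task ${TASK} processing ${INPUT}"
```

In Torque/PBS the rough equivalent is `qsub -t` with `$PBS_ARRAYID`; the conversion table above covers the rest.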
Monday, 26 September 2011
Early MiSeq Users Say Data Quality Matches HiSeq; Cite Speed as Advantage for Range of Applications | In Sequence | Sequencing | GenomeWeb
Sequencing quality nearly matches that of the HiSeq; it costs more per base, though it produces data faster.
I don't have any first-hand accounts of MiSeq data, even though a friend has asked if I know anyone who might be able to serve as a consultant for a sequencing provider company in China using MiSeq. I also wonder whether BGI has early access to MiSeq as well.
According to Nusbaum, the instrument is "pretty easy" to use, runs fast, and provides high-quality data, although at a greater cost per base than the HiSeq. It has been running according to Illumina's specifications, he said, and so far, there have been no serious problems with the machine.
According to Illumina, MiSeq produces more than 120 megabases of data with 35-base reads in four hours, and more than 1 gigabase of data with paired 150-base reads in 27 hours, including amplification and sequencing, and the number of unpaired reads exceeds 3.4 million.
The base accuracy of the data "is similar to what we see for the HiSeq," Nusbaum said. Toward the ends of the reads, the quality is even slightly higher than for HiSeq, probably because the sample spends less time on the machine.
Initially, the Broad plans to use the platform for "any kind of urgent project where turnaround time trumps cost of the data," Nusbaum said. This includes, for example, R&D projects, because "you get your answer in a day rather than in a week and a half."
In addition, projects that "fit nicely onto a small platform" will be run on MiSeq in the future at the Broad; these could include, for example, viral and microbial sequencing projects.
For Ion Torrent vs MiSeq, read this in-depth independent analysis
For Ion Torrent vs 454
Thursday, 22 September 2011
1. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA.
PLoS Genet. 2011 Sep;7(9):e1002280. Epub 2011 Sep 15.
PMID: 21935354 [PubMed - in process]

2. ChIP-Seq: technical considerations for obtaining high-quality data.
Kidder BL, Hu G, Zhao K.
Nat Immunol. 2011 Sep 20;12(10):918-22. doi: 10.1038/ni.2117.
PMID: 21934668 [PubMed - in process]

3. Next-Generation Sequencing Reveals HIV-1-Mediated Suppression of T Cell Activation and RNA Processing and Regulation of Noncoding RNA Expression in a CD4+ T Cell Line.
Chang ST, Sova P, Peng X, Weiss J, Law GL, Palermo RE, Katze MG.
MBio. 2011 Sep 20;2(5). pii: e00134-11. doi: 10.1128/mBio.00134-11. Print 2011.
PMID: 21933919 [PubMed - in process]
Wednesday, 21 September 2011
Summary: Here, we present ContEst, a tool for estimating the level of cross-individual contamination in next-generation sequencing data. We demonstrate the accuracy of ContEst across a range of contamination levels, sources and read depths using sequencing data mixed in silico at known concentrations. We applied our tool to published cancer sequencing datasets and report their estimated contamination levels.
Availability and Implementation: ContEst is a GATK module, and distributed under a BSD style license at http://www.broadinstitute.org/cancer/cga/contest
Supplementary information: Supplementary data are available at Bioinformatics online.
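The general intuition behind contamination estimation can be sketched in a few lines. This is a deliberately simplified toy, not ContEst's actual likelihood model: at sites where the sample is known to be homozygous, reads carrying the other allele in excess of the sequencing error rate point to cross-individual contamination.

```python
# Toy sketch of the intuition only (NOT ContEst's model): average the
# excess other-allele fraction, beyond the error rate, over homozygous sites.
def estimate_contamination(site_counts, error_rate=0.01):
    """site_counts: list of (other_allele_reads, total_reads) tuples
    observed at sites where the individual is known homozygous."""
    fracs = [max(0.0, other / total - error_rate)
             for other, total in site_counts if total > 0]
    return sum(fracs) / len(fracs) if fracs else 0.0

# e.g. 5% other-allele reads at two sites, minus 1% error -> ~4% estimate
estimate_contamination([(5, 100), (50, 1000)])
```

ContEst itself does this properly with a maximum-likelihood model over population allele frequencies; the sketch only shows why homozygous sites are informative.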
In 2000, Hanahan and Weinberg published a landmark article in which they described the "hallmarks of cancer" – six biological capabilities acquired during the multi-step development of human tumors. It went on to become the most-cited Cell article of all time. In a follow-up article this year, the authors revisit their conceptual framework for cancer biology, incorporating the remarkable progress in cancer research that was made over the last decade.
The authors conclude that their six hallmarks – sustained proliferative signaling, evading growth suppression, resisting cell death, replicative immortality, induction of angiogenesis, and invasion/metastasis – continue to provide a useful conceptual framework for understanding the biology of cancer. Further, they present two new hallmarks – reprogramming of energy metabolism and evasion of immune destruction – that have emerged as critical capabilities of cancer cells.
In coming years, thousands of tumors will be characterized by ever-more high-throughput technologies, such as massively parallel sequencing. Collecting the data is no longer the obstacle; instead, the true challenges lie in analysis and interpretation. Hanahan and Weinberg humbly describe their hallmarks as "organizing principles" for thinking about why cancer cells do what they do. Conceivably, fitting new catalogues of genetic alterations to this model of acquired capabilities will help us better understand the relationship between genotype (genetic susceptibility and somatic mutation) and phenotype (tumor development, growth, and metastasis).
Hanahan D, & Weinberg RA (2011). Hallmarks of cancer: the next generation. Cell, 144 (5), 646-74 PMID: 21376230
Tuesday, 20 September 2011
Sept. 19, 2011, 10:29 a.m. EDT
Partek(R) Wins Prestigious Illumina(R) iDEA Award
Partek software and algorithms show promising ability to substantially improve the scientific utility of next generation sequencing data
ST. LOUIS, Sep 19, 2011 (BUSINESS WIRE) -- Partek Incorporated, a global leader in bioinformatics software, announced today their receipt of the Most Creative Algorithm award, Commercial category, in the Illumina Data Excellence Award (iDEA) challenge for innovation in genomic data visualization and algorithmic analysis.
According to the judges, Partek received the award for its comprehensive start-to-finish data analysis tool set--Partek(R) Flow(TM), Partek(R) Genomics Suite(TM), and Partek(R) Pathway(TM)--as well as a number of useful novel algorithms, the most revolutionary being Partek's Gene-Specific Model. The model works on the assertion that no single statistical test optimally fits all genes, because each gene may have a different distribution and be influenced by different biological factors. The Gene-Specific Model therefore evaluates many models and distributions for each gene and selects the model that best fits that gene individually. This yields two important advantages: first, more statistical power and more reliable findings as a result of a better model fit; and second, more information about which genes are influenced by which biological factors. This allows researchers to ascertain exactly how genes are affected by specific factors, in turn yielding a more statistically accurate analysis.
Tom Downey, President of Partek Incorporated had this to say, "People have been debating what is the proper distribution and statistical test for next generation sequencing data for years. We've pointed out the real elephant in the room on this debate, which is that there is not one single distribution and single statistical test that fits all genes or transcripts. For example, some genes are gender-specific, and others are not. Some genes follow a Poisson distribution and others do not."
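A per-gene model selection scheme of this general flavour can be sketched as follows. This is a toy illustration under my own assumptions, not Partek's implementation: fit each candidate distribution to a gene's counts and keep the one with the lowest AIC.

```python
# Toy per-gene model selection (NOT Partek's Gene-Specific Model): score
# each candidate distribution by AIC = 2k - 2*log-likelihood; lowest wins.
import math

def poisson_loglik(counts):
    lam = sum(counts) / len(counts)          # MLE of the Poisson rate
    if lam == 0:
        return 0.0
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)

def geometric_loglik(counts):
    p = 1.0 / (1.0 + sum(counts) / len(counts))  # MLE on support {0,1,2,...}
    return sum(math.log(p) + c * math.log(1.0 - p) for c in counts)

def best_model(counts):
    candidates = {"poisson": (poisson_loglik, 1),
                  "geometric": (geometric_loglik, 1)}
    aic = {name: 2 * k - 2 * loglik(counts)
           for name, (loglik, k) in candidates.items()}
    return min(aic, key=aic.get)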
To learn more about Partek's award winning data analysis, register at www.partek.com to view the webinar.
Partek Incorporated ( www.partek.com ) develops and globally markets quality software for life sciences research. Its flagship product, Partek(R) Genomics Suite(TM), provides innovative solutions for integrated genomics. Partek Genomics Suite is unique in supporting all major microarray and next-generation sequencing platforms. Workflows offer streamlined analysis for: Gene Expression, miRNA Expression, Exon, Copy Number, Allele-Specific Copy Number, LOH, Association, Trio analysis, Tiling, ChIP-Seq, RNA-Seq, DNA-Seq, DNA Methylation and qPCR. Since 1993, Partek, headquartered in St. Louis, Missouri USA, has been turning data into discovery(R).
Partek and Genomics Suite are trademarks of Partek Incorporated. The names of other companies or products mentioned herein may be the trademarks of their respective owners.
SOURCE: Partek Incorporated
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
Results: Here, we introduce SEED—an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.
Supplementary information: Supplementary data are available at Bioinformatics online
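The spaced seed trick at the heart of this kind of clustering can be illustrated with a toy sketch (this is not SEED's actual block spaced seed implementation; the pattern below is made up): positions marked '0' in the seed pattern are ignored when hashing, so reads that differ only at ignored positions land in the same bucket and become cluster candidates without any all-vs-all comparison.

```python
# Toy spaced-seed bucketing (NOT SEED's implementation): '0' positions in
# the pattern are ignored when hashing, so reads differing only there
# collide into the same candidate cluster.
from collections import defaultdict

SEED_PATTERN = "1101101101"  # hypothetical pattern; '0' = ignored position

def spaced_key(read, pattern=SEED_PATTERN):
    return "".join(base for base, bit in zip(read, pattern) if bit == "1")

def bucket(reads):
    buckets = defaultdict(list)
    for r in reads:
        buckets[spaced_key(r)].append(r)
    return buckets

# the first two reads differ only at position 2, a '0' in the pattern,
# so they share a bucket; the third read buckets alone
b = bucket(["ACGTACGTAC", "ACTTACGTAC", "TTTTTTTTTT"])
```

Real tools use several such patterns so that a mismatch at any position is masked by at least one seed.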
online gamers playing on Foldit have deciphered a protein folding puzzle that had bamboozled scientists and automated computers
Ever heard of protein folding? Neither had we. But the online game Foldit lets you do just that, so you can have fun while you contribute to the progress of science. Yes, the science that actual, professional scientists spend time working with!
The Sydney Morning Herald points us to an interesting article published in the Nature Structural & Molecular Biology journal [PDF], which describes how online gamers playing on Foldit have deciphered a puzzle that had bamboozled scientists and automated computers working on the problem for over a decade.
They figured out the protein structure of a monomeric protease enzyme, which is "a cutting agent in the complex molecular tailoring of retroviruses, a family that includes HIV". The understanding of this structure is an important step towards discovering the causes of many diseases related to this enzyme and coming up with treatments for them.
Almost three years ago we announced results of the first ever "petasort" (sorting a petabyte-worth of 100-byte records, following the Sort Benchmark rules). It completed in just over six hours on 4000 computers. Recently we repeated the experiment using 8000 computers. The execution time was 33 minutes, an order of magnitude improvement.
Our sorting code is based on MapReduce, which is a key framework for running multiple processes simultaneously at Google. Thousands of applications, supporting most services offered by Google, have been expressed in MapReduce. While not many MapReduce applications operate at a petabyte scale, some do. Their scale is likely to continue growing quickly. The need to help such applications scale motivated us to experiment with data sets larger than one petabyte. In particular, sorting a ten petabyte input set took 6 hours and 27 minutes to complete on 8000 computers. We are not aware of any other sorting experiment successfully completed at this scale.
We are excited by these results. While internal improvements to the MapReduce framework contributed significantly, a large part of the credit goes to numerous advances in Google's hardware, cluster management system, and storage stack.
What would it take to scale MapReduce by further orders of magnitude and make processing of such large data sets efficient and easy? One way to find out is to join Google's systems infrastructure team. If you have a passion for distributed computing, are an expert or plan to become one, and feel excited about the challenges of exascale then definitely consider applying for a software engineering position with Google.
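The shape of such a distributed sort can be sketched in-process (a toy under my own assumptions, not Google's MapReduce code): range-partition the records by key, sort each partition independently, then concatenate the partitions in order.

```python
# Toy, single-process sketch of a range-partitioned distributed sort:
# the shuffle step routes each record to a partition by key range, every
# partition is sorted independently (one "reducer" each), and the sorted
# partitions concatenate into a globally sorted output.
def partitioned_sort(records, boundaries):
    # boundaries: ascending cut points defining len(boundaries)+1 key ranges
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        idx = sum(1 for b in boundaries if r >= b)  # which range r falls in
        partitions[idx].append(r)
    out = []
    for part in partitions:            # partitions are already in key order
        out.extend(sorted(part))       # local sort, as each reducer would do
    return out
```

At petabyte scale the hard parts are elsewhere (sampling good boundaries, moving data between thousands of machines, tolerating failures), but the partition-sort-concatenate structure is the same.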
- Construction of and alignment to an ethnicity-specific major allele reference sequence yielded improved alignment and more accurate genotyping, especially at disease-associated loci.
- Mendelian inheritance state analysis in the family structure enabled identification and removal of >90% of variants arising from sequencing errors.
- Per-trio phasing, inheritance state of adjacent variants, and population-level linkage disequilibrium data were integrated to provide long-range phased haplotypes.
- By fine-mapping recombination events to sub-kilobase resolution, the authors were able to perform sequence-based human lymphocyte antigen (HLA) typing.
- A curated database of genotype-phenotype correlations made it possible to construct comprehensive genetic risk profiles, including multigenic risk of inherited thrombophilia, common disease susceptibility, and pharmacogenomics.
BGI has them :(
ucdavis has one :(
Titus Brown recommends 512 GB or even 1 TB (shudder)
Jerm makes a case for owning one here http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html
Nick Loman is already doing market research on buying one,
More importantly, the seqanswers wiki suggests that you should own one for de novo assembly ;)
Do you own one? how often does it get used?
More MPI-based or memory-efficient de Bruijn assemblers are being pushed out now ... is throwing more RAM at the problem really still necessary?
Hmmm, I don't have access to one, but my limited experience with a 256 GB RAM machine for de novo assembly of a fish transcriptome didn't give me the contigs I wanted (it ran out of memory midway :( ).
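For intuition on why assembly eats RAM, here is a toy de Bruijn graph builder (the reads are hypothetical; real assemblers use far more compact hashed or succinct structures): every distinct k-mer contributes graph nodes, so memory grows with k-mer diversity, and sequencing errors inflate that diversity considerably.

```python
# Toy de Bruijn graph builder (hypothetical reads): each k-mer adds an edge
# between its (k-1)-mer prefix and suffix, so memory scales with the number
# of distinct k-mers -- which sequencing errors inflate considerably.
from collections import defaultdict

def de_bruijn(reads, k=4):
    graph = defaultdict(set)  # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# overlapping reads share k-mers, so they merge into one small graph
g = de_bruijn(["ACGTACGG", "CGTACGGA"], k=4)
```

A single erroneous base in a read creates up to k novel k-mers, which is why error correction before graph construction saves so much memory.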
Monday, 19 September 2011
Low cost short read sequencing technology has revolutionised genomics, though it is only just becoming practical for the high quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort teams were asked to assemble a simulated Illumina HiSeq dataset of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
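Since the assessments above lean on contiguity, a quick reminder of how the standard N50 metric is computed (a minimal sketch):

```python
# N50: the length L such that contigs of length >= L contain at least half
# of the total assembly -- a standard (if imperfect) contiguity metric.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# for contigs of 100, 50, 40 and 10 bp, the single 100 bp contig already
# covers half of the 200 bp total, so N50 = 100
```

N50 rewards long contigs regardless of correctness, which is exactly why haplotype-aware and structure-aware assessments like the Assemblathon's are needed alongside it.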
excerpted from Introduction
As the field of sequencing has changed so has the field of sequence assembly; for a recent review see Miller et al. (2010). In brief, using Sanger sequencing, contigs were initially built using overlap or string graphs (Myers 2005) (or data structures closely related to them), in tools such as Phrap (http://www.phrap.org/), GigAssembler (Kent and Haussler, 2001), Celera (Myers et al. 2000) (Venter et al. 2001), ARACHNE (Batzoglou et al. 2002), and Phusion (Mullikin and Ning 2003), which were used for numerous high quality assemblies such as human (Lander et al. 2001) and mouse (Mouse Genome Sequencing Consortium et al. 2002). However, these programs were not generally efficient enough to handle the volume of sequences produced by the new generation of short read sequencers.
While some maintained the overlap graph approach, e.g. Edena (Hernandez et al. 2008) and Newbler (http://www.454.com/), others used word look-up tables to greedily extend reads, e.g. SSAKE (Warren et al. 2007), SHARCGS (Dohm et al. 2007), VCAKE (Jeck et al. 2007) and OligoZip (http://linux1.softberry.com/berry.phtml?topic=OligoZip). These word look-up tables were then extended into de Bruijn graphs to allow for global analyses (Pevzner et al. 2001), e.g. Euler (Chaisson and Pevzner 2008), AllPaths (Butler et al. 2008) and Velvet (Zerbino and Birney 2008). As projects grew in scale further engineering was required to fit large whole genome datasets into memory ((ABySS (Simpson et al. 2009), Meraculous (in submission)), (SOAPdenovo (Li et al. 2010), Cortex (in submission)). Now, as improvements in sequencer technology are extending the length of "short reads", the overlap graph approach is being revisited, albeit with optimized programming techniques, e.g. SGA (Simpson and Durbin 2010), as are greedy contig extension approaches.
In general, most sequence assembly programs are multi stage pipelines, dealing with correcting measurement errors within the reads, constructing contigs, resolving repeats (i.e. disambiguating false positive alignments between reads) and scaffolding contigs in separate phases. Since a number of solutions are available for each task, several projects have been initiated to explore the parameter space of the assembly problem, in particular in the context of short read sequencing ((Phillippy et al. 2008), (Hubis et al. 2011), (Alkan et al. 2011), (Narzisi and Mishra 2011), (Zhang et al. 2011) and (Lin et al. 2011)).
Saturday, 17 September 2011
High-throughput sequencing confers a deep view of seasonal community dynamics in pelagic marine environments
Gilbert et al. (2011, 2010) show that even in bacterial communities, there are definite seasonal patterns and peaks in community diversity. Figuring out what causes these patterns is sometimes surprisingly easy – it looks like shifting day length accounts for 65% of the changes in bacterial diversity (I'm sure the authors' jaws dropped when they saw this result…). Even more ridiculous (in a good way), the specific bacterial assemblage—the 'fingerprint' of species present in the community—could predict the month with 100% accuracy. And no surprise, only 2% of the 100 most abundant taxa they observed could be identified down to species level. (Previously undiscovered diversity is so old hat these days. But still cool).
Sent via TweetDeck (www.tweetdeck.com)
Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, & Field D (2011a). Defining seasonal marine microbial community dynamics. The ISME journal PMID: 21850055
Gilbert, J., Field, D., Swift, P., Thomas, S., Cummings, D., Temperton, B., Weynberg, K., Huse, S., Hughes, M., Joint, I., Somerfield, P., & Mühling, M. (2010). The Taxonomic and Functional Diversity of Microbes at a Temperate Coastal Site: A ‘Multi-Omic’ Study of Seasonal and Diel Temporal Variation PLoS ONE, 5 (11) DOI: 10.1371/journal.pone.0015545
Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, & Naeem S (2006). Annually reoccurring bacterial communities are predictable from ocean conditions. Proceedings of the National Academy of Sciences of the United States of America, 103 (35), 13104-9 PMID: 16938845
reposted the summary links here for convenience.
Tutorial covering RNA-seq analysis (tool under "NGS: RNA Analysis")
FAQ to help with troubleshooting (if needed):
For visualization, an update that allows the use of a user-specified
fasta reference genome is coming out very soon. For now, you can view
annotation by creating a custom genome build, but the actual reference
will not be included. Use "Visualization -> New Track Browser" and
follow the instructions for "Is the build not listed here? Add a Custom Build".
Help for using the tool is available here:
Currently, RNA-seq analysis for SOLiD data is available only on Galaxy test server:
Please note that there are quotas associated with the test server:
[Credit : Jennifer Jackson ]
Another helpful resource (non-Galaxy related though) is
http://seqanswers.com/wiki/How-to/RNASeq_analysis written by Matthew Young
and the discussion on this wiki @ seqanswers
As well as this review paper in Genome Biology RNA-seq Review
Stephen mentions this tutorial as well in this blog
His post and the discussion thread is here.
(kevin: waiting for the next common question to come: is there Ion Torrent support on Galaxy?)
1. Next-generation human genetics.
ABSTRACT: The field of human genetics is being reshaped by exome and genome sequencing. Several lessons are evident from observing the rapid development of this area over the past 2 years, and these may be instructive with respect to what we should expect from 'next-generation human genetics' in the next few years.
Genome Biol. 2011 Sep 14;12(9):408. [Epub ahead of print]
PMID: 21920048 [PubMed - as supplied by publisher]

2. Next-generation diagnostics for inherited skin disorders.
Lai-Cheong JE, McGrath JA.
J Invest Dermatol. 2011 Oct;131(10):1971-3. doi: 10.1038/jid.2011.253.
PMID: 21918571 [PubMed - in process] Free Article
Abstract: Identifying genes and mutations in the monogenic inherited skin diseases is a challenging task. Discoveries are cherished, but often gene-hunting efforts have gone unrewarded because technology has failed to keep pace with investigators' enthusiasm and clinical resources. But times are changing. The recent arrival of next-generation sequencing has transformed what can now be achieved.

3. Whole cancer genome sequencing by next-generation methods.
Ross JS, Cronin M.
Abstract: Traditional approaches to sequence analysis are widely used to guide therapy for patients with lung and colorectal cancer and for patients with melanoma, sarcomas (eg, gastrointestinal stromal tumor), and subtypes of leukemia and lymphoma. The next-generation sequencing (NGS) approach holds a number of potential advantages over traditional methods, including the ability to fully sequence large numbers of genes (hundreds to thousands) in a single test and simultaneously detect deletions, insertions, copy number alterations, translocations, and exome-wide base substitutions (including known "hot-spot mutations") in all known cancer-related genes. Adoption of clinical NGS testing will place significant demands on laboratory infrastructure and will require extensive computational expertise and a deep knowledge of cancer medicine and biology to generate truly useful "clinically actionable" reports. It is anticipated that continuing advances in NGS technology will lower the overall cost, speed the turnaround time, increase the breadth of genome sequencing, detect epigenetic markers and other important genomic parameters, and become applicable to smaller and smaller specimens, including circulating tumor cells and circulating free DNA in plasma.
Am J Clin Pathol. 2011 Oct;136(4):527-39.
PMID: 21917674 [PubMed - in process]

4. A novel application of pattern recognition for accurate SNP and indel discovery from high-throughput data: Targeted resequencing of the glucocorticoid receptor co-chaperone FKBP5 in a Caucasian population.
Pelleymounter LL, Moon I, Johnson JA, Laederach A, Halvorsen M, Eckloff B, Abo R, Rossetti S.
Mol Genet Metab. 2011 Aug 24. [Epub ahead of print]
Abstract: The detection of single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology presents applications of both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51 (FKBP5) (6p21.31) for our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a 160 kb region encompassing FKBP5: 613 SNPs and 57 indels, including a 3.3 kb deletion, were discovered. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and specificity of 98%, indicating near perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5 h to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression. Finally we show how our discovery of rare variants may change current conceptions of evolution at this locus.
PMID: 21917492 [PubMed - as supplied by publisher]
Essentially they promise to store all of your hdd content in encrypted format in the cloud.
Nothing new? Well they are only going to charge you USD$10 / month for it.
How are they going to achieve that?
The company has proprietary data de-duplication algorithms that can reduce most users' file storage footprint to 25 GB of data each (assuming we share similar files, like mp3s and the like).
Hmmm, imagine the potential for storing NGS data in the cloud for cheap! (Well, we won't exactly be bankrupting them if most people are storing human genome sequences, which will be very, very similar, right?)
The company is aggressive about data de-duplication, and furthermore, most users have less than 25GB of data. With cheap bandwidth and cheap storage, it works. The 8-person company has raised $1.3 million and counts Andreessen Horowitz and the CrunchFund as its backers.
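Content-addressed de-duplication, the likely core of such a scheme (a toy sketch under my own assumptions, not Bitcasa's actual algorithm), stores each unique chunk once, keyed by its hash:

```python
# Toy content-addressed store (NOT Bitcasa's algorithm): files are split
# into fixed-size chunks and each unique chunk is stored once under its hash.
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def put(self, data, chunk_size=4):
        """Store data; return the list of chunk keys that reconstructs it."""
        keys = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(key, chunk)  # duplicates cost nothing
            keys.append(key)
        return keys

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())
```

Two 8-byte "files" sharing a 4-byte chunk occupy only 12 bytes of actual storage; scale that up and a million near-identical human genomes become very cheap to hold.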
Interested? Sign up for a trial using this url to help push me up to the front of the queue for the beta :)
How to Make Your Hard Drive Infinite - Technology Review
RT @phylogenomics Video "UCLA: 12 file sharing myths in two minutes" mostly makes me think about how openness makes life so much easier http://t.co/wkUUlmrP
Friday, 16 September 2011
PGM does seem to be the most promising platform with room to grow
I am curious, though, how many more wells they can squeeze into the chip size without having to upgrade the machine or doing 'dual core' tricks to double throughput.
But as I understand it, they cannot load all of the wells with beads, as the software actually uses the empty wells as a noise filter at the processing stage.
They have been getting in-house throughput of:
50.3 Mbp (~600k reads) on the 314 Chip
330 Mbp on the 316 Chip
The longest read they officially have without errors is 341 bp (though I guess it's a matter of chance whether the sequence matches the 'samba' random cycle, which is how one can achieve longer reads).
One can also do miRNA sequencing with 5 ng of miRNA, although the number of reads might be a tad limiting depending on the transcriptome complexity of your organism.
Would be interesting to see what numbers are coming out from Broad and BGI though. Please post in comments if you have them.
Will update if i remember more stuff.
What is interesting is that they have been pushing the throughput envelope but they are more careful about pushing new protocols without extensive testing.
I like the direction they are going ahead with releasing public data and allowing fair comparisons and I hope that other vendors take up the same direction.
I do understand why they wish to keep all the discussions (uncensored) within their Ion Community to make it a vibrant, supportive community. I don't really like, though, that they made the Torrent Users section accessible only to someone with a PGM serial number.
This makes life hard for labs sequencing with providers or core labs.
The IT leaders of 1000 Genomes Project describe how they must "distressingly often resort to shipping hard disks around to transfer data between centers, rather than use the internet, or even via Aspera which is faster than ftp [file transfer protocol]." The issue is so dire that BGI has established an open access journal, Giga Science, to deal with the problem of data dissemination and organization.
Thursday, 15 September 2011
Full-length transcriptome assembly from RNA-Seq data without a reference genome.
Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts, USA.
Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
Wednesday, 14 September 2011
Apologies! After digging in the Ion Community a little more, I think this is the updated link for V1.4 TS
But the manually created reference index doesn't appear in the final dropdown menu when I try to do realignment (it does appear in the reference tab)
I don't really understand the line "As of release 1.1.0, only the "tmap-f1" index_type is supported.", as the index I created had tmap-f2 in its info.txt.
In any case, if you don't mind fiddling with the web browser, and you're met with 'file deleted' or the job started but you still don't have your index, you can restart the job server:
sudo /etc/init.d/ionJobServer restart
Adapted from the original doc here
Building a New Genome Index

$ cd /results/referenceLibrary/tmap-f2/
$ build_genome_index.pl --fasta A_flavithermus.fasta -s A_flavithermus -l "Anoxybacillus flavithermus WK1 chromosome complete genome"
Copying A_flavithermus.fasta to A_flavithermus/A_flavithermus.fasta...
...copy complete
Making tmap index...
...tmap index complete
Making samtools index...
...samtools index complete

$ ls -1 /results/referenceLibrary/tmap-f2/A_flavithermus/
A_flavithermus.fasta
A_flavithermus.fasta.fai
A_flavithermus.fasta.md5
A_flavithermus.fasta.tmap.anno
A_flavithermus.fasta.tmap.bwt
A_flavithermus.fasta.tmap.pac
A_flavithermus.fasta.tmap.rbwt
A_flavithermus.fasta.tmap.rpac
A_flavithermus.fasta.tmap.rsa
A_flavithermus.fasta.tmap.sa
A_flavithermus.info.txt
samtools.log
tmap.log
$ sudo updateref
List of library
-> ampl_valid
-> vibrio_fisch
-> e_coli_k12
-> e_coli_dh10b
-> rhodopalu
$ sudo updateref -p /mnt/PGM_Data/PGM_config