Kevin's GATTACA World: 2018

Saturday 29 September 2018

Koala Genome assembled on AWS

Excerpted from AWS blog

Five years ago, a research team led by Dr. Rebecca Johnson (Director of the Australian Museum Research Institute) set out to learn more about koala populations, genetics, and diseases. The koala is a biologically unique animal with a limited appetite, so maintaining a population that is both healthy and genetically diverse is a key element of any conservation plan. In addition to characterizing the genetic diversity of koala populations, the team wanted to strengthen Australia’s ability to lead large-scale genome sequencing projects.

Inside the Koala Genome

Last month the team published their results in Nature Genetics. Their paper (Adaptation and Conservation Insights from the Koala Genome) identifies the genomic basis for the koala’s unique biology.

This work was performed on AWS. The research team used CfnCluster to create multiple clusters, each with 500 to 1000 vCPUs and each running Falcon from Pacific Biosciences. All in all, the team used 3 million EC2 core hours, most of them on EC2 Spot Instances.

Tuesday 11 September 2018

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters

https://academic.oup.com/bioinformatics/article/30/23/3402/207237

Justin Chu, Sara Sadeghi, Anthony Raymond, Shaun D. Jackman, Ka Ming Nip, Richard Mar, Hamid Mohamadi, Yaron S. Butterfield, A. Gordon Robertson, Inanç Birol

Bioinformatics, Volume 30, Issue 23, 1 December 2014, Pages 3402–3404, https://doi.org/10.1093/bioinformatics/btu558

Published: 20 August 2014

Abstract

Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements.

Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools
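
To make the idea in the abstract concrete, here is a minimal Python sketch of Bloom filter-based sequence screening. It is not the BioBloom Tools implementation: the class and function names, the SHA-256 double-hashing scheme, the filter size, and the k-mer length are all assumptions chosen for illustration, and it ignores reverse complements. It only shows the general technique: load every k-mer of a reference into a bit array, then score a read by the fraction of its k-mers that the filter reports as present.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit values
        # taken from one SHA-256 digest (an illustrative choice, not BioBloom's).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(reference, k, num_bits=1 << 16, num_hashes=3):
    """Insert all k-mers of the reference sequence into a new filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for kmer in kmers(reference, k):
        bf.add(kmer)
    return bf


def hit_fraction(read, bf, k):
    """Fraction of the read's k-mers that the filter reports as present."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for kmer in kmers(read, k) if kmer in bf)
    return hits / total


if __name__ == "__main__":
    # Made-up toy sequences, not real genomic data.
    reference = "ACGTTGCAAGGCTTAACCGGATATCGCGTAGCTAGGATCC"
    bf = build_filter(reference, k=8)
    print(hit_fraction(reference[5:30], bf, k=8))  # read taken from the reference
    print(hit_fraction("T" * 24, bf, k=8))         # read unrelated to the reference
```

A read would then be assigned to the host species when its hit fraction exceeds some threshold. Because a Bloom filter can return false positives but never false negatives, that threshold and the filter size are the knobs for controlling false positives, which is the trade-off the abstract refers to.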

Tuesday 20 March 2018

JD: Sr. Software DevOps Engineer at Guardant Health

https://jobs.smartrecruiters.com/GuardantHealth/743999667525776-sr-software-devops-engineer

Gotta love this line:

“We wanted flying cars and instead we got 140 characters” is a much-repeated complaint about Silicon Valley. But with all due respect to flying cars, we believe that our mission is even more critical.

Notable skills in the JD to pursue:
Ansible / Chef
Docker

This paragraph sounds exactly like what I face on a daily basis:

Your troubleshooting skills are excellent, and you enjoy a good daily challenge in supporting rapid growth and a diverse set of end user needs. You have the ability to maintain day to day support while running various key projects that move the business forward by automating and creating new tools that facilitate management of the environment.

Friday 23 February 2018

Exploring the 1000 Genomes dataset with Hail on Amazon EMR and Amazon Athena

Blog post from Roy Hasson

https://aws.amazon.com/blogs/big-data/genomic-analysis-with-hail-on-amazon-emr-and-amazon-athena/?nc1=b_rp

Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.

For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
