Wednesday 28 September 2011

[bedtools-discuss] pybedtools: a flexible Python library for manipulating genomic datasets and annotations

Hi all -

Version 0.5 of pybedtools has been released.  pybedtools is a Python interface to BEDTools.  In addition to wrapping all the BEDTools programs (including the latest multiBamCov, tagBam, and nucBed programs) and making them accessible from within Python, it extends BEDTools by allowing feature-by-feature manipulation of BED/GFF/GTF/BAM/SAM/VCF files.

pybedtools provides a lot more than that.  As a brief example, here's the complete code that identifies genes that are less than 5 kb from intergenic SNPs, given a file of genes and a file of SNPs:

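from pybedtools import BedTool
snps = BedTool('snps.bed.gz')
genes = BedTool('hg19.gff')
# SNPs that do not fall within any gene
intergenic_snps = (snps - genes)
# for each gene, the closest intergenic SNP; d=True appends the distance as the last field
nearby = genes.closest(intergenic_snps, d=True, stream=True)
for gene in nearby:
    # closestBed may report -1 when no SNP is found on the gene's chromosome, so require distance >= 0
    if 0 <= int(gene[-1]) < 5000:
        print(gene.name)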

Note the (snps - genes) line, which does a subtractBed call, and the feature-level access to the results of closest(), which wraps BEDTools' closestBed program.  How this compares to using Bash and the BEDTools programs alone is left as an exercise for the reader, or you can just check http://packages.python.org/pybedtools/sh-comparison.html
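
For anyone who wants to see what those three steps compute without BEDTools installed, here is a rough plain-Python sketch of the same logic on toy data. It is only an illustration, not how BEDTools or pybedtools implement anything: the function names, the tuple layout, and the example coordinates are all made up for this sketch, and the distance here is a simple gap in bases, which may differ slightly from the convention closestBed uses.

```python
# Plain-Python sketch of the logic only. Intervals are (chrom, start, end)
# with 0-based, half-open coordinates as in BED; genes carry a name as a
# fourth element. All names and data below are hypothetical.

def overlaps(a, b):
    """True if a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def gap(a, b):
    """Bases separating a and b: 0 if they overlap or touch, None if chromosomes differ."""
    if a[0] != b[0]:
        return None
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def genes_near_intergenic_snps(genes, snps, max_dist=5000):
    """Names of genes whose closest intergenic SNP is less than max_dist away."""
    # Step 1: keep only SNPs that overlap no gene (the "snps - genes" step).
    intergenic = [s for s in snps if not any(overlaps(s, g) for g in genes)]
    names = []
    for gene in genes:
        # Step 2: distances to intergenic SNPs on the same chromosome.
        dists = [d for d in (gap(gene, s) for s in intergenic) if d is not None]
        # Step 3: apply the cutoff to the closest one; genes with no SNP on
        # their chromosome are skipped.
        if dists and min(dists) < max_dist:
            names.append(gene[3])
    return names


genes = [
    ("chr1", 1000, 2000, "geneA"),
    ("chr1", 50000, 60000, "geneB"),
    ("chr2", 100, 500, "geneC"),
]
snps = [
    ("chr1", 1500, 1501),    # inside geneA, so not intergenic
    ("chr1", 4000, 4001),    # 2000 bases downstream of geneA
    ("chr1", 90000, 90001),  # far from every gene
]

near = genes_near_intergenic_snps(genes, snps)
print(near)  # ['geneA']
```

This brute-force version compares every gene with every SNP, which is fine for a toy example but far too slow for real datasets; that is exactly the work the BEDTools programs do for you.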

You can get a brief overview of pybedtools in:

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations
Ryan K. Dale, Brent S. Pedersen, and Aaron R. Quinlan
Bioinformatics (2011) first published online September 23, 2011
doi:10.1093/bioinformatics/btr539

You can get more details, including installation instructions, in the documentation at http://packages.python.org/pybedtools/

The latest source can always be found on GitHub: https://github.com/daler/pybedtools

Comments, bug reports, bug fixes, and suggestions are always welcome -- either through the GitHub interface or via email.

happy intersecting, 

-ryan
