Wednesday 4 July 2012

Sort::Rank - search.cpan.org


I have a problem where I need the non-unique rank of SNP IDs based on a score. I thought this would be a terribly popular thing to do, but so far most of the options that turned up are Excel-based, and this was the only Perl module that seems able to do it.
Sort::Rank - Sort arrays by some score and organise into ranks. http://search.cpan.org/~andya/Sort-Rank-v0.0.2/lib/Sort/Rank.pm
If I needed a unique rank, then I could have done it this way:
sort -n -k 3 snp_score.csv | nl > ranksorted-snp-score.csv
I could conceivably parse that file again and edit the ranking based on the uniqueness of the score value (a rough sketch of that second pass follows the example below).
Does anyone have a better / faster way to sort 2 million records?
The input file looks like this:
rs  bla 0.8200E+02
rs  bla 1.9200E+02
rs  bla 1.7200E+02
rs  bla 1.8200E+02   
rs  bla 1.8200E+02
I want to get something like
 1  rs  bla 0.8200E+02
 2  rs  bla 1.7200E+02
 3  rs  bla 1.8200E+02
 3  rs  bla 1.8200E+02
 4  rs  bla 1.9200E+02
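
That second pass might look something like this (an untested sketch: it assumes the nl output from the command above, so the score ends up in column 4 after the added line number, and the output file name is just a placeholder; using -g rather than -n on the original sort would also be safer if the exponents of the E-notation scores ever differ):

awk '{
    if ($4 != prev) { rank++; prev = $4 }    # score changed, so move to the next rank
    print rank, $2, $3, $4                   # swap the unique line number for the shared rank
}' ranksorted-snp-score.csv > denseranked-snp-score.csv

Since awk only keeps the previous score and the current rank in memory, this streams happily through 2 million rows.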

1 comment:

  1. Interesting.

    I know R can do this, but I don't know how fast or memory-efficient it would be on this many rows.

    Off the top of my head, while not at a computer, I'd benchmark "sort | awk", where awk would increment the rank each time the score changed.

    Also, try using --stable to disable the last-resort comparison, as this might speed up the sort if you have many identical scores.

    GNU sort can be parallelised, but then I don't think you can take advantage of running awk in parallel via a pipe, since the output of the sort is held back until the sort is finished.
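
    Something like this is what I have in mind (completely untested; --parallel needs a fairly recent GNU coreutils, -g instead of -n is there to cope with the E-notation scores, and the output name is arbitrary):

    sort --parallel=4 --stable -g -k3,3 snp_score.csv |
        awk '{
            if ($3 != prev) { rank++; prev = $3 }   # rank only moves on when the score changes
            print rank, $0
        }' > dense-ranked.csv

    The awk stage is trivial, so the sort will dominate the runtime either way.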

    I might have a dabble today if I get the chance!

    Nathan

