Friday, January 30, 2015

Perl: Ordering in Perl



Abstract: How to sort a data set in Perl using arrays and hashes.


Sorting is one of the most common operations in data analysis. In Perl, a data set can be stored in an array or a hash. Here are some examples of how to sort an array. By default, the elements are compared as strings, so they come out in alphabetical order.
@array = sort @array; # the default: ascending string order
@array = sort {$a cmp $b} @array; # the same as above
@array = sort {$b cmp $a} @array; # the reverse: descending string order

If you want the elements sorted by numeric value instead, the code looks like this:
@array = sort {$a <=> $b} @array; # increasing order
@array = sort {$b <=> $a} @array; # decreasing order
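
The difference between the two comparison operators shows up as soon as the numbers have different numbers of digits. Here is a minimal sketch with made-up values; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Made-up sample values, chosen so that string and numeric order differ.
my @numbers = (10, 9, 100, 25);

# String comparison looks at the characters from left to right,
# so "10" and "100" sort before "25", which sorts before "9".
my @as_strings = sort { $a cmp $b } @numbers;

# Numeric comparison uses the values themselves.
my @as_numbers = sort { $a <=> $b } @numbers;

print "cmp: @as_strings\n"; # cmp: 10 100 25 9
print "<=>: @as_numbers\n"; # <=>: 9 10 25 100
```

So `cmp` is the right choice for names and identifiers, and `<=>` for counts and lengths.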

An array is adequate for simple data sets, but a hash can handle more complicated data. As an example, take a data set in which the first column is the GI number and the second column is the length of the DNA sequence.
GI:389886562 2536
GI:215277009 15009
GI:301173067 138
...................

One method is to first read all the data into a hash, and then sort the hash by its keys or by its values, depending on your requirements. Like this:
#! /usr/bin/perl -w
use strict;
use warnings;

# The original does not say where $file comes from; taking it from the command line is an assumption
my $file = shift @ARGV or die "Usage: $0 <data file>\n";
my %hash; # initialize a hash
open(my $IN, "<", $file) or die "Cannot open $file: $!"; # open the data file in read mode
while (<$IN>) {
    chomp($_);
    my ($GI, $len) = split("\t", $_); # columns are tab-separated
    $hash{$GI} = $len;
}
close($IN);

foreach my $key (sort keys %hash) { # order by GI no.
    print "$key\t", "$hash{$key}\n";
}
foreach my $key (sort { $hash{$a} <=> $hash{$b} } keys %hash) { # order by length of sequences
    print "$key\t", "$hash{$key}\n";
}
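
The same idea can be tried without a data file. The sketch below types in the three sample records shown above and sorts them with the longest sequence first; the tie-break on the GI number and the name `%length_of` are additions for illustration, not part of the script above.

```perl
use strict;
use warnings;

# The three sample records shown above, typed in directly
# instead of being read from a file.
my %length_of = (
    'GI:389886562' => 2536,
    'GI:215277009' => 15009,
    'GI:301173067' => 138,
);

# Longest sequence first; if two lengths are equal, fall back to the GI no.
my @by_length_desc = sort {
    $length_of{$b} <=> $length_of{$a}
        or
    $a cmp $b
} keys %length_of;

print "$_\t$length_of{$_}\n" for @by_length_desc;
```

A hash has no inherent order, so the sorted list of keys is what carries the order; the hash itself is only used for lookups.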

However, we may face more complicated data sets, in which each record has many fields. For example, here is a result from high-throughput sequencing. The first column stores the read count, the second the read sequence, the third the sequence length, and the fourth the frequency in biological samples. The task is to sort these records first by read count (the first column), then by sequence length (the third column), and finally by the fourth column.
12334 ATGTCGTGACGT 12 5
337 ATGTCGTGACGTATGTCGTGACGT 24 3
190 ATGTCGTGAC 10 8
...................

In theory, Microsoft Excel can do this job, but the number of records and the length of each record are limited, and the work would be tedious. This is where the power of programming comes in. Arrays and hashes combined can handle everything, even data sets this complicated.
#! /usr/bin/perl -w
use strict;
use warnings;

# The original does not say where $file comes from; taking it from the command line is an assumption
my $file = shift @ARGV or die "Usage: $0 <data file>\n";

#import data: one hash per record, collected in an array
my @array;
open(my $IN, "<", $file) or die "Cannot open $file: $!";
while (<$IN>) {
    chomp($_);
    my ($read_counts, $seq, $len, $sample_num) = split("\t", $_);
    my %hash = (
        read_counts => $read_counts,
        seq         => $seq,
        seq_len     => $len,
        sample_num  => $sample_num,
    );
    push(@array, \%hash); # %hash is a new variable on every iteration, so each reference is distinct
}
close($IN);

#order them: decreasing read counts, then decreasing length, then decreasing sample number
foreach my $pointer (
    sort {
        $b->{read_counts} <=> $a->{read_counts}
            or $b->{seq_len} <=> $a->{seq_len}
            or $b->{sample_num} <=> $a->{sample_num}
    } @array
) {
    print "$pointer->{read_counts}\t", "$pointer->{seq}\t",
        "$pointer->{seq_len}\t", "$pointer->{sample_num}\n";
}
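
When the comparison grows this long, it may read better as a named subroutine. The following is a sketch, not the script above: it types in the three sample records, adds one made-up record that ties on read count so that the second sort key is actually used, and the name `by_count_then_length` is hypothetical.

```perl
use strict;
use warnings;

# Sample records from above, plus one made-up record (the last one)
# that ties on read count so the second sort key is exercised.
my @records = (
    { read_counts => 337,   seq => 'ATGTCGTGACGTATGTCGTGACGT', seq_len => 24, sample_num => 3 },
    { read_counts => 12334, seq => 'ATGTCGTGACGT',             seq_len => 12, sample_num => 5 },
    { read_counts => 190,   seq => 'ATGTCGTGAC',               seq_len => 10, sample_num => 8 },
    { read_counts => 337,   seq => 'ATGTCGTGACGTATG',          seq_len => 15, sample_num => 3 },
);

# Named comparison routine: each "||" is reached only when the
# previous comparison returned 0, i.e. the records tie on that key.
sub by_count_then_length {
    return $b->{read_counts} <=> $a->{read_counts}
        || $b->{seq_len}     <=> $a->{seq_len}
        || $b->{sample_num}  <=> $a->{sample_num};
}

my @sorted = sort by_count_then_length @records;

# Print the four fields of each record, tab-separated.
print join("\t", @{$_}{qw(read_counts seq seq_len sample_num)}), "\n" for @sorted;
```

The two records with 337 reads come out with the 24-base sequence before the 15-base one, because the comparison only moves on to `seq_len` when the read counts are equal.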

Please note that this method works well for thousands or millions of records. For billions of records you may want to consider another approach (such as an SQL database), because Perl data structures use a lot of memory and sorting becomes slow at that scale.


Writing date: 2013.05.24
