Abstract:
How to sort a data set using arrays and hashes.
Sorting a data set is a very common task in data analysis. In Perl, a data set can be stored in an array or a hash. Here are examples of how to sort an array. By default, the elements are sorted as strings, in alphabetical order.
my @sorted = sort @array;               # the default: ascending string order
@sorted = sort { $a cmp $b } @array;    # the same as above
@sorted = sort { $b cmp $a } @array;    # reverse (descending) string order
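As a small illustration of the string sorts above, here is a sketch on a made-up list of gene names (the names are sample data only). One detail worth knowing is that `cmp` compares character codes, so uppercase letters sort before lowercase ones; comparing `lc` copies is one common way to ignore case.

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only.
my @names = ("recA", "BRCA1", "actB", "TP53");

my @ascending   = sort @names;                        # default string order
my @descending  = sort { $b cmp $a } @names;          # reverse string order
my @ignore_case = sort { lc($a) cmp lc($b) } @names;  # case-insensitive order

print "ascending:   @ascending\n";     # BRCA1 TP53 actB recA
print "descending:  @descending\n";    # recA actB TP53 BRCA1
print "ignore case: @ignore_case\n";   # actB BRCA1 recA TP53
```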
To sort the elements by numeric value instead, the code would look like this:
my @increasing = sort { $a <=> $b } @array;    # increasing order
my @decreasing = sort { $b <=> $a } @array;    # decreasing order
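The difference between the string sort and the numeric sort is easy to miss, so here is a minimal sketch that runs both on the same list. The numbers are sample values chosen for illustration.

```perl
use strict;
use warnings;

# Sample numbers, for illustration only.
my @lengths = (2536, 15009, 138);

my @as_strings = sort @lengths;                 # compares the values as text
my @increasing = sort { $a <=> $b } @lengths;   # numeric, smallest first
my @decreasing = sort { $b <=> $a } @lengths;   # numeric, largest first

print "as strings: @as_strings\n";   # 138 15009 2536
print "increasing: @increasing\n";   # 138 2536 15009
print "decreasing: @decreasing\n";   # 15009 2536 138
```

With the default sort, 15009 lands before 2536 because the string "15009" starts with "1". This is why `<=>` is needed for numbers.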
An array is adequate for simple data sets, but a hash can handle more complicated data. In the example data below, the first column is the GI number and the second column is the length of the DNA sequence.
GI:389886562 2536
GI:215277009 15009
GI:301173067 138
...................
One method is to first read all the data into a hash, and then sort the hash by its keys or by its values, depending on your requirements. Like this:
#!/usr/bin/perl -w
use strict;
use warnings;

# Assumption: the data file path is passed as the first command-line argument.
my $file = shift @ARGV or die "Usage: $0 data_file\n";

my %hash;    # GI number => sequence length
open my $IN, "<", $file or die "Cannot open $file: $!";    # open the data file in read mode
while (<$IN>) {
    chomp;
    my ($GI, $len) = split /\t/;
    $hash{$GI} = $len;
}
close $IN;

# order by GI number (string order)
foreach my $key (sort keys %hash) {
    print "$key\t$hash{$key}\n";
}

# order by sequence length (increasing numeric order)
foreach my $key (sort { $hash{$a} <=> $hash{$b} } keys %hash) {
    print "$key\t$hash{$key}\n";
}
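When several records share the same length, their relative order in a sort by value is not defined by the comparison. The sketch below shows one possible way to handle that, assuming you want the longest sequences first and ties broken by GI number. It uses the three sample records from the text plus one hypothetical record (`GI:100000001`) added only to create a tie.

```perl
use strict;
use warnings;

# The three sample records from the text, plus one hypothetical record
# (GI:100000001) added only to create a tie on length.
my %length_of = (
    "GI:389886562" => 2536,
    "GI:215277009" => 15009,
    "GI:301173067" => 138,
    "GI:100000001" => 138,
);

# Longest first; records with equal length fall back to GI string order.
my @by_length = sort {
    $length_of{$b} <=> $length_of{$a}
        or $a cmp $b
} keys %length_of;

print "$_\t$length_of{$_}\n" for @by_length;
```

The `or` works because `<=>` returns 0 for equal values, so the second comparison is only evaluated when the first one finds a tie.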
However, we may face more complicated data sets, in which each record has several fields. For example, here is a result from high-throughput sequencing. The first column stores the read count, the second the read sequence, the third the sequence length, and the fourth the frequency in biological samples. The task is to sort these records first by read count (the first column), then by sequence length (the third column), and finally by the fourth column.
12334 ATGTCGTGACGT 12 5
337 ATGTCGTGACGTATGTCGTGACGT 24 3
190 ATGTCGTGAC 10 8
...................
Microsoft Office Excel could do this job in theory, but the number of records and the length of each record are limited, and the work would be tedious. This is where programming shows its power. An array combined with hashes can take care of everything, even data sets as complicated as this one.
#!/usr/bin/perl -w
use strict;
use warnings;

# Assumption: the data file path is passed as the first command-line argument.
my $file = shift @ARGV or die "Usage: $0 data_file\n";

# import data
my @array;
open my $IN, "<", $file or die "Cannot open $file: $!";
while (<$IN>) {
    chomp;
    my ($read_counts, $seq, $len, $sample_num) = split /\t/;
    my %hash = (
        read_counts => $read_counts,
        seq         => $seq,
        seq_len     => $len,
        sample_num  => $sample_num,
    );
    push @array, \%hash;    # store a reference to this record's hash
}
close $IN;

# order them: read counts first, then sequence length, then sample number
# (all in decreasing order)
foreach my $pointer (
    sort {
        $b->{read_counts} <=> $a->{read_counts}
            or $b->{seq_len} <=> $a->{seq_len}
            or $b->{sample_num} <=> $a->{sample_num}
    } @array
) {
    print "$pointer->{read_counts}\t$pointer->{seq}\t$pointer->{seq_len}\t$pointer->{sample_num}\n";
}
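To see the tie-breaking keys at work without a data file, here is a self-contained sketch of the same multi-key sort. The first three records are the sample rows from the text; the last two are hypothetical rows added only so that the second and third keys matter. The named comparison routine `by_count_len_sample` is an illustrative choice, not part of the script above.

```perl
use strict;
use warnings;

my @records = (
    { read_counts => 12334, seq => "ATGTCGTGACGT",             seq_len => 12, sample_num => 5 },
    { read_counts => 337,   seq => "ATGTCGTGACGTATGTCGTGACGT", seq_len => 24, sample_num => 3 },
    { read_counts => 190,   seq => "ATGTCGTGAC",               seq_len => 10, sample_num => 8 },
    { read_counts => 337,   seq => "ATGTCGTGACGT",             seq_len => 12, sample_num => 9 },  # hypothetical
    { read_counts => 337,   seq => "ATGTCGTGACGA",             seq_len => 12, sample_num => 2 },  # hypothetical
);

# Named comparison routine: sort() sets $a and $b before each call.
# All three keys are compared in decreasing order.
sub by_count_len_sample {
    return $b->{read_counts} <=> $a->{read_counts}
        || $b->{seq_len}     <=> $a->{seq_len}
        || $b->{sample_num}  <=> $a->{sample_num};
}

my @sorted = sort by_count_len_sample @records;

# Print each record as one tab-separated line, using a hash slice.
print join("\t", @{$_}{qw(read_counts seq seq_len sample_num)}), "\n" for @sorted;
```

The three records with 337 reads come out with the 24-base sequence first, then the two 12-base sequences ordered by sample number (9 before 2). Giving the comparison a name keeps the `sort` call short and lets the same ordering be reused elsewhere.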
Please note that this method works well for thousands or millions of records. With billions of records you should consider another approach, such as an SQL database: Perl data structures use a lot of memory, and sorting that many records in memory is slow.
Writing date: 2013.05.24