Wednesday, March 30, 2016

Perl: Storage of huge hash (1)


Abstract: Storing a huge hash on disk saves RAM and can speed up hash lookups.

By default, Perl holds an entire hash in RAM. An empty hash occupies 120 bytes, and memory use grows with the number of keys and with structure such as nested hashes. For storage and even for lookups, a hash with thousands or millions of keys may be acceptable. However, loading a file of dozens or hundreds of GB into one huge hash will almost always exhaust memory. Instead of an ordinary hash, tying the hash to a DBM file is the better option because:
1. The DBM file tied to the hash is stored on the hard disk, so it can hold very large data sets.
2. Indexing or searching through the DBM file can be faster overall, even when there is enough RAM to build an ordinary huge hash, because the file only has to be built once.
3. A hash tied to a DBM file can be shared with other threads, because the DBM file itself can be accessed by them. Note that the DBM file should be set to read-only in that case.
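
The points above can be sketched with DB_File alone, the Berkeley DB binding that MLDBM wraps in the code later in this post. This is only a minimal sketch, not the script used for the measurements: it assumes DB_File is installed, and the file name demo.db, the temporary directory, and the sample keys are made up for illustration. It writes a few pairs through a tied hash, unties it, and then reopens the same file read-only, as a second reader would.

```perl
use strict;
use warnings;
use DB_File;                 # Berkeley DB binding; assumed to be installed
use Fcntl;                   # O_CREAT, O_RDWR, O_RDONLY
use File::Temp qw(tempdir);

# Hypothetical location for the demo file.
my $dir      = tempdir(CLEANUP => 1);
my $dbm_file = "$dir/demo.db";

# Writer: each assignment is stored in the DBM file instead of a Perl hash in RAM.
my %writer;
tie(%writer, 'DB_File', $dbm_file, O_CREAT|O_RDWR, 0644, $DB_HASH)
    or die "Cannot create $dbm_file: $!";
$writer{'virusA-12'} = 'PEPTIDE1';    # made-up key/value pairs
$writer{'virusB-7'}  = 'PEPTIDE2';
untie %writer;                        # flush and close the file

# Reader: open the existing file read-only and look keys up directly.
my %reader;
tie(%reader, 'DB_File', $dbm_file, O_RDONLY, 0644, $DB_HASH)
    or die "Cannot open $dbm_file: $!";
my $found   = $reader{'virusA-12'};
my $missing = exists $reader{'virusC-1'} ? 1 : 0;
print "virusA-12 => $found\n";
print "virusC-1 present: $missing\n";
untie %reader;
```

Opening with O_RDONLY is one way to keep readers from modifying the file; it does not replace proper locking if another process is still writing.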

Example 1: a text file of ~98 MB with ~0.8 million lines. Loaded into an ordinary hash with close to a million key/value pairs, it used about 191 MB of RAM. When the same file was loaded into a DBM file tied to a hash, Devel::Size reported 130 MB for the hash obtained from the tied DBM file, and the DBM file itself occupied ~42 MB on disk. Once the DBM file has been produced, later runs skip the loading step and simply reopen it, which saves time.
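
The build-once, reopen-later pattern can be tried on a small scale before running it on a large file. The sketch below is an illustration under assumptions, not the measurement reported above: it uses DB_File directly, synthetic key-N/value-N records, and a made-up file name bench.db. It times the initial build, then times reopening the finished file and doing one lookup, and reads the on-disk size with `-s`.

```perl
use strict;
use warnings;
use DB_File;                 # assumed to be installed
use Fcntl;
use File::Temp qw(tempdir);
use Time::HiRes qw(time);    # sub-second timestamps

my $dir      = tempdir(CLEANUP => 1);
my $dbm_file = "$dir/bench.db";       # hypothetical file name
my $records  = 10_000;                # synthetic data, far smaller than the post's file

# First run: build the DBM file from scratch.
my $t0 = time;
my %build;
tie(%build, 'DB_File', $dbm_file, O_CREAT|O_RDWR, 0644, $DB_HASH)
    or die "Cannot create $dbm_file: $!";
$build{"key-$_"} = "value-$_" for 1 .. $records;
untie %build;
my $build_seconds = time - $t0;

# Later runs: the file already exists, so only tie it and look a key up.
$t0 = time;
my %reuse;
tie(%reuse, 'DB_File', $dbm_file, O_RDONLY, 0644, $DB_HASH)
    or die "Cannot open $dbm_file: $!";
my $sample = $reuse{'key-5000'};
untie %reuse;
my $reuse_seconds = time - $t0;

my $disk_bytes = -s $dbm_file;        # size of the DBM file on disk
printf "build: %.4f s, reuse: %.4f s, file: %d bytes\n",
    $build_seconds, $reuse_seconds, $disk_bytes;
```

The absolute numbers depend on the machine and on disk caching, so they are only useful for comparing the two steps against each other.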

Here is the Perl code:
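use strict;
use warnings;
use MLDBM qw(DB_File);   # tie interface that stores serialized values in a DB_File database
use Fcntl;               # O_CREAT, O_RDWR
use Devel::Size qw(size total_size);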


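# Load the CSV file into an ordinary in-memory hash.
sub flatfile_to_hash{
 my($infile)=@_;
 
 my $n=0;
 my %hash;
 open my($IN), "<", $infile or die "can't open $infile: $!";
 while(<$IN>){
  chomp($_);
  my($key1, $key2, $key3, $key4, $value)=split(',', $_);
  # note: the whole line is stored here, not only $value as in flatfile_to_DBM_dict
  $hash{$key2.'-'.$key4}=$_;
  $n++;
 }
 close $IN;
 print "$n\n";
 return(\%hash);
}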
# Build the DBM file once, then return a hash tied to it.
sub flatfile_to_DBM_dict{
 my($infile)=@_;
 my $DBM_file=$infile.'.db';
 my $n=0;
 my %hash;
 # build the DBM file only if it does not exist yet
 unless (-f $DBM_file){
  print "Generate $DBM_file\n";
  tie(%hash, 'MLDBM', $DBM_file, O_CREAT|O_RDWR, 0666) or die "can't open tie to DBM file: $!";
  open my($IN), "<", $infile or die "can't open $infile: $!";
  while(<$IN>){
   chomp($_);
   my($key1, $key2, $key3, $key4, $value)=split(',', $_);
   $hash{$key2.'-'.$key4}=$value;
   $n++;
   #if ($n % 10000 ==0) {print "$n\n"};
  }
  close $IN;
  untie %hash;
  print "$n\n";
 }
 # reopen the existing DBM file; the returned hash stays tied to it
 tie(%hash, 'MLDBM', $DBM_file, O_RDWR, 0666) or die "Can't initialize MLDBM file: $!\n";
 return(\%hash);
}

# an empty hash
my %hash;
my $size=Devel::Size::total_size(\%hash);
printf( "Size of an empty hash: %s bytes\n",  $size);

print "store a hash into RAM\n";
my $pointer=flatfile_to_hash('/home/yuan/phip/ref_seq/virus_dependent_peptides.csv');
%hash=%$pointer;
my $num=keys %hash;
$size=int(Devel::Size::total_size(\%hash)/1024/1024);
printf( "Size of a hash with %s keys: %s MB\n", $num, $size);


print "store a hash into DBM file\n";
$pointer=flatfile_to_DBM_dict('/home/yuan/phip/ref_seq/virus_dependent_peptides.csv');
# note: this assignment copies every pair out of the tied hash into the ordinary %hash in RAM
%hash=%$pointer;
$num=keys %hash;
$size=int(Devel::Size::total_size(\%hash)/1024/1024);
printf( "Size of a hash with %s keys: %s MB\n", $num, $size);
#foreach my $a(keys %hash){
# print "$a\t$hash{$a}\n";
#}
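
The commented-out loop above calls `keys`, which builds the full list of keys in memory before the loop starts, and `%hash=%$pointer` copies the tied hash into an ordinary one. For a tied hash that is too large for RAM, `each` on the reference fetches one pair at a time instead. The following is a minimal sketch of that idea, assuming DB_File is installed; the file name iterate.db and the key-N/value-N records are synthetic.

```perl
use strict;
use warnings;
use DB_File;                 # assumed to be installed
use Fcntl;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $dbm_file = "$dir/iterate.db";     # hypothetical file name

my %dbm;
tie(%dbm, 'DB_File', $dbm_file, O_CREAT|O_RDWR, 0644, $DB_HASH)
    or die "Cannot create $dbm_file: $!";
$dbm{"key-$_"} = "value-$_" for 1 .. 1000;   # synthetic records

# Work through a reference to the tied hash; the data is not copied into a second hash.
my $tied_ref = \%dbm;

my $count = 0;
my $chars = 0;
while (my ($key, $value) = each %$tied_ref) {
    $count++;                     # one pair is fetched from the file per iteration
    $chars += length $value;
}
print "visited $count pairs, $chars value characters\n";
untie %dbm;
```

The pairs come back in the storage order of the DBM file, not in insertion or sorted order, so any code that needs sorted output still has to collect and sort the keys.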

