Abstract: Storing a huge hash on disk saves RAM and can speed up lookups. By default, Perl slurps a hash entirely into RAM. An empty hash occupies about 120 bytes, and memory usage grows with the number of keys and with nested structures such as a hash of hashes. A hash with thousands or even a few million keys may be acceptable, but loading a file of dozens or hundreds of GB into one huge hash will overflow memory. Instead of an ordinary hash, tying the hash to a DBM file is a better choice because: 1. the tied hash is stored on the hard disk, and a DBM file can hold huge data sets; 2. indexing and searching through a DBM file can be faster even when RAM is sufficient to build an ordinary huge hash; 3. a hash tied to a DBM file can be shared by other threads, since the DBM file itself can be opened by multiple readers. Note that in that case the DBM file should be opened read-only. Example 1: a text file of ~98 MB with ~0.8 million lines. Loaded into an ordinary hash with about a million key/value pairs, it used about 191 MB of RAM; loaded through a hash tied to a DBM file, the tied hash reported about 130 MB, while the DBM file occupied only ~42 MB of disk space. Once the DBM file has been generated, the time to rebuild the hash is saved on subsequent runs.
Here is the Perl code:

use strict;
use warnings;
use MLDBM qw(DB_File);
use Fcntl;
use Devel::Size qw(size total_size);

# Load a flat CSV file into an ordinary in-memory hash.
# Note: this version stores the whole line as the value.
sub flatfile_to_hash{
    my($infile)=@_;
    my $n=0;
    my %hash;
    open my($IN), "<", $infile or die "can't open $infile: $!";
    while(<$IN>){
        chomp($_);
        my($key1, $key2, $key3, $key4, $value)=split(',', $_);
        $hash{$key2.'-'.$key4}=$_;
        $n++;
    }
    close $IN;
    print "$n\n";
    return(\%hash);
}

# Load the same file into a hash tied to a DBM file on disk.
# The DBM file is built only once; later runs reuse it.
sub flatfile_to_DBM_dict{
    my($infile)=@_;
    my $DBM_file=$infile.'.db';
    my $n=0;
    my %hash;
    unless (-f $DBM_file){
        print "Generate $DBM_file\n";
        tie(%hash, 'MLDBM', $DBM_file, O_CREAT|O_RDWR, 0666)
            or die "can't open tie to DBM file: $!";
        open my($IN), "<", $infile or die "can't open $infile: $!";
        while(<$IN>){
            chomp($_);
            my($key1, $key2, $key3, $key4, $value)=split(',', $_);
            $hash{$key2.'-'.$key4}=$value;
            $n++;
            #if ($n % 10000 == 0) {print "$n\n"};
        }
        close $IN;
        untie %hash;
        print "$n\n";
    }
    # Re-tie the existing DBM file for reading and writing.
    tie(%hash, 'MLDBM', $DBM_file, O_RDWR, 0666)
        or die "Can't initialize MLDBM file: $!\n";
    return(\%hash);
}

# an empty hash
my %hash;
my $size=Devel::Size::total_size(\%hash);
printf("Size of an empty hash: %s bytes\n", $size);

print "store a hash into RAM\n";
my $pointer=flatfile_to_hash('/home/yuan/phip/ref_seq/virus_dependent_peptides.csv');
%hash=%$pointer;
my $num=keys %hash;
$size=int(Devel::Size::total_size(\%hash)/1024/1024);
printf("Size of a hash with %s keys: %s MB\n", $num, $size);

print "store a hash into DBM file\n";
$pointer=flatfile_to_DBM_dict('/home/yuan/phip/ref_seq/virus_dependent_peptides.csv');
%hash=%$pointer;
$num=keys %hash;
$size=int(Devel::Size::total_size(\%hash)/1024/1024);
printf("Size of a hash with %s keys: %s MB\n", $num, $size);

#foreach my $a(keys %hash){
#    print "$a\t$hash{$a}\n";
#}
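Point 3 of the abstract (sharing a tied hash read-only) can be sketched with a minimal example. SDBM_File is used here only because it ships with core Perl; the post's MLDBM + DB_File combination follows the same tie/untie interface. The file name and keys below are hypothetical, chosen to mimic the `$key2.'-'.$key4` scheme above.

```perl
#!/usr/bin/perl
# Sketch: one process (or thread) creates the DBM file once;
# afterwards any number of readers can tie it read-only.
use strict;
use warnings;
use SDBM_File;
use Fcntl;

my $dbm = 'demo_readonly';   # hypothetical DBM file name for this sketch

# Writer: build the DBM file, then release it.
my %write;
tie(%write, 'SDBM_File', $dbm, O_CREAT|O_RDWR, 0666)
    or die "can't tie for writing: $!";
$write{'pep1-siteA'} = 'some,value';
untie %write;

# Reader: tie the same file read-only; several readers can do this at once.
my %read;
tie(%read, 'SDBM_File', $dbm, O_RDONLY, 0444)
    or die "can't tie read-only: $!";
my $got = $read{'pep1-siteA'};
print "$got\n";
untie %read;
```

Because each reader opens the file with O_RDONLY, no reader can corrupt the data, which is what makes sharing across threads or processes safe without locking.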
Wednesday, March 30, 2016
Perl: Storage of huge hash (1)