Thursday, December 17, 2015

Perl: Split string



Abstract: Split string using different patterns.


1. The function split()
The basic function for splitting strings in Perl is split(). In the example below the separator is a single space, and the result is stored in an array.
my $s='ab d ge h';
my @arr=split / /, $s;
print join("\t", @arr), "\n";
The output:
ab d ge h

Other coding patterns are available, depending on requirements. The returned list can be assigned to scalars:
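$s='ab d, ge h';
#avoid the names $a and $b, which sort() uses internally
my($first, $second)=split /,/, $s;
#$second keeps the space that followed the comma
printf("%s, %s\n", $first, $second);
The output:
ab d,  ge h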

Or limit the number of parts with a third argument:
#assign to scalars and limit to two parts
$s='ab d ge h';
my($head, $rest)=split / /, $s, 2;
printf("%s, %s\n", $head, $rest);
The output:
ab, d ge h
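
The limit argument also controls what happens to empty fields at the end of the string. The sketch below is an illustration only; the sample string and variable names are made up for the purpose. By default split() drops trailing empty fields, a negative limit keeps them, and a positive limit leaves the remainder in the last field.

```perl
use strict;
use warnings;

my $line = 'a,b,c,,,';

# Default: trailing empty fields are dropped.
my @default = split /,/, $line;

# Negative limit: keep every field, including trailing empty ones.
my @all = split /,/, $line, -1;

# Positive limit: at most that many fields; the last one holds the remainder.
my @two = split /,/, $line, 2;

print scalar(@default), "\n";   # 3
print scalar(@all), "\n";       # 6
print join('|', @two), "\n";    # a|b,c,,,
```

A negative limit is useful for delimited files where the number of columns must stay constant even when the last columns are empty.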

2. Regular expressions combined with split()
Split a string on one or more whitespace characters:
#take a list slice: field 0 (PID) and field 8 (%CPU)
my $s = "1695 root 20 0 450148 135584 53660 S 4.3 0.4 96:29.74 Xorg";
my ($PID, $CPU) = (split /\s+/, $s)[0, 8];
printf("%s##%s\n", $PID, $CPU);
The output:
1695##4.3
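
One caveat with /\s+/ is leading whitespace: it produces an empty first field and shifts every index by one. The special pattern ' ' (a single space in quotes) skips leading whitespace before splitting. The sketch below assumes a hypothetical input line that starts with spaces, as can happen when a short PID is right-aligned.

```perl
use strict;
use warnings;

# Hypothetical input line with leading spaces (assumption for illustration).
my $line = "  895 root 20 0 450148 135584 53660 S 4.3 0.4 96:29.74 Xorg";

my @by_regex = split /\s+/, $line;  # empty field at index 0
my @by_space = split ' ', $line;    # leading whitespace is skipped

printf "regex: [%s] [%s]\n", $by_regex[0], $by_regex[8];   # regex: [] [S]
printf "space: [%s] [%s]\n", $by_space[0], $by_space[8];   # space: [895] [4.3]
```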

Split a string on any of several separator characters:
my $s = 'fname=Foo&lname=Bar&email=foo@bar.com';
my @words = split /[=&]/, $s;
print join('_', @words), "\n";
The output:
fname_Foo_lname_Bar_email_foo@bar.com
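
Because the fields alternate between names and values, the same split can fill a hash directly. This is a minimal sketch that assumes every pair contains exactly one '=' and a non-empty value; it does no URL decoding, so it is not a substitute for a proper query-string parser. The names $query and %param are illustrative.

```perl
use strict;
use warnings;

my $query = 'fname=Foo&lname=Bar&email=foo@bar.com';

# split returns name, value, name, value, ... which is the list shape
# a hash assignment expects.
my %param = split /[=&]/, $query;

print "$_ => $param{$_}\n" for sort keys %param;
# email => foo@bar.com
# fname => Foo
# lname => Bar
```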

Split a string and keep the separator characters:
my $s='a-d-ge-h';
$s=~s/-/-@#%/g; #append a marker after each hyphen
@arr= split /@#%/, $s; #split on the marker, so the hyphens are kept
print join(', ', @arr), "\n";
The output:
a-, d-, ge-, h
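
The same result can be reached without a temporary marker. The sketch below shows two alternatives: a lookbehind assertion, which splits at the empty position right after each hyphen, and a capture group, which returns each separator as its own element. The array names are illustrative.

```perl
use strict;
use warnings;

my $s = 'a-d-ge-h';

# Split at the empty position after each hyphen; the hyphen stays attached.
my @attached = split /(?<=-)/, $s;

# A capturing group makes split return the separator as a separate element.
my @separate = split /(-)/, $s;

print join(', ', @attached), "\n";   # a-, d-, ge-, h
print join(', ', @separate), "\n";   # a, -, d, -, ge, -, h
```

The lookbehind version avoids the risk that the marker text already occurs in the input.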

3. Handle DNA or protein sequences
Split a DNA sequence into nucleotides:
$s='ATTGTGCGCGGATGCAAACTCTAATC';
@arr=split //, $s;
print join("\t", @arr), "\n";
The output:
A T T G T G C G C G G A T G C A A A C T C T A A T C

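Split a DNA sequence into codons of 3 characters each, the first step in translating it into an amino acid sequence. Because the pattern is captured in parentheses, split() returns the matched codons themselves, with empty strings between them: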
@arr=split /(...)/, $s;
@arr=grep {/.+/} @arr; #then filter the empty entries
print '3 chr:', join(',', @arr), "\n";
The output:
3 chr:ATT,GTG,CGC,GGA,TGC,AAA,CTC,TAA,TC
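
An alternative is a global match in list context, which returns the codons directly with no empty entries to filter. The sketch below also carries the translation one step further with a lookup table. The table is deliberately partial: it lists only the codons that occur in this example (standard genetic code), and the fallback letter 'X' for unlisted codons is a choice made for this illustration.

```perl
use strict;
use warnings;

my $dna = 'ATTGTGCGCGGATGCAAACTCTAATC';

# Collect complete codons only; the trailing 'TC' is incomplete and is ignored.
my @codons = $dna =~ /(...)/g;

# Partial codon table covering only this example; '*' marks a stop codon.
my %aa = (
    ATT => 'I', GTG => 'V', CGC => 'R', GGA => 'G',
    TGC => 'C', AAA => 'K', CTC => 'L', TAA => '*',
);

# Codons missing from the table become 'X' so the gap is visible.
my $protein = join '', map { exists $aa{$_} ? $aa{$_} : 'X' } @codons;

print join(',', @codons), "\n";   # ATT,GTG,CGC,GGA,TGC,AAA,CTC,TAA
print "$protein\n";               # IVRGCKL*
```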

Walk along a sequence with a sliding window and generate overlapping stretches of fixed length:
$s='MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNT';
my $L=22;
my $step=2;
my @arr;
#start at 0 and stop at the last position that still yields a full window
for(my $i=0; $i<=length($s)-$L; $i+=$step){
    my $sub=substr($s, $i, $L);
    push(@arr, $sub);
    printf("%s%s\n", "--"x($i), $sub);
}
The output:
MRSLLFVVGAWVAALVTNLTPD
----SLLFVVGAWVAALVTNLTPDAA
--------LFVVGAWVAALVTNLTPDAALA
------------VVGAWVAALVTNLTPDAALASG
----------------GAWVAALVTNLTPDAALASGTT
--------------------WVAALVTNLTPDAALASGTTTT
------------------------AALVTNLTPDAALASGTTTTAA
----------------------------LVTNLTPDAALASGTTTTAAAG
--------------------------------TNLTPDAALASGTTTTAAAGNT
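
The loop can be wrapped in a small subroutine so that the window length and step become parameters. This is a sketch only; the name windows() and its argument order are hypothetical, not part of any module.

```perl
use strict;
use warnings;

# Return every full-length window of $len characters, moving $step at a time.
sub windows {
    my ($seq, $len, $step) = @_;
    my @out;
    for (my $i = 0; $i <= length($seq) - $len; $i += $step) {
        push @out, substr($seq, $i, $len);
    }
    return @out;
}

my @w = windows('MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNT', 22, 2);
print scalar(@w), "\n";   # 9
print "$w[0]\n";          # MRSLLFVVGAWVAALVTNLTPD
print "$w[-1]\n";         # TNLTPDAALASGTTTTAAAGNT
```

Stopping at length($seq) - $len guarantees that every returned stretch has exactly $len characters, whatever the step size.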














