Abstract:
Splitting strings in Perl using different patterns.
1. The function split()
The basic function for splitting strings in Perl is split(). In the example below the separator is a single space, and the returned list is stored in an array.
my $s='ab d ge h';
my @arr=split / /, $s;
print join("\t", @arr), "\n";
The output:
ab d ge h
There are other usage patterns, depending on requirements. The returned list can be assigned directly to scalars:
$s='ab d, ge h';
# $left/$right are used instead of $a/$b, which Perl reserves for sort()
my($left, $right)=split /,/, $s;
printf("%s, %s\n", $left, $right);
The output:
ab d,  ge h
Or limit the number of parts returned:
# assign to scalars and limit to two parts
$s='ab d ge h';
# $head/$tail are used instead of $a/$b, which Perl reserves for sort()
my($head, $tail)=split / /, $s, 2;
printf("%s, %s\n", $head, $tail);
The output:
ab, d ge h
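The limit argument also controls what happens to empty fields at the end of a string. By default split() drops trailing empty fields, while a negative limit keeps them. The following is a minimal sketch of that behaviour; the record string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical comma-separated record whose last two fields are empty.
my $record = 'a,b,,';

# Default behaviour: trailing empty fields are dropped.
my @default = split /,/, $record;

# Negative limit: every field is kept, including trailing empty ones.
my @all = split /,/, $record, -1;

printf "default: %d fields, limit -1: %d fields\n",
    scalar(@default), scalar(@all);
```

This should print "default: 2 fields, limit -1: 4 fields", which matters when the number of columns in a record must stay fixed.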
2. Regular expressions combined with split()
Split a string on one or more whitespace characters:
# take a list slice of the split result: field 0 (PID) and field 8 (CPU)
my $s = "1695 root 20 0 450148 135584 53660 S 4.3 0.4 96:29.74 Xorg";
my ($PID, $CPU) = (split /\s+/, $s)[0, 8];
printf("%s##%s\n", $PID, $CPU);
The output:
1695##4.3
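One caveat with /\s+/ is leading whitespace: if the line starts with spaces, the first element returned is an empty string and every index shifts by one. The special pattern ' ' (a single space in quotes) skips leading whitespace before splitting. A small sketch, assuming a made-up input line with two leading spaces:

```perl
use strict;
use warnings;

# Hypothetical line that starts with whitespace (e.g. a right-aligned PID).
my $line = "  995 root 20 0";

my @by_regex = split /\s+/, $line;  # first element is an empty string
my @by_space = split ' ', $line;    # leading whitespace is skipped

print scalar(@by_regex), " vs ", scalar(@by_space), "\n";
print "first field with ' ': $by_space[0]\n";
```

With this input the first split returns 5 elements and the second returns 4, so slices such as [0, 8] only point at the intended columns in the second form.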
Split a string on any of several delimiter characters:
my $s = 'fname=Foo&lname=Bar&email=foo@bar.com';
my @words = split /[=&]/, $s;   # split on either '=' or '&'
print join('_', @words), "\n";
The output:
fname_Foo_lname_Bar_email_foo@bar.com
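Because the list above alternates between keys and values, it can also be assigned to a hash. This is only a sketch: it assumes every key has a non-empty value and that no value contains '=' or '&'. The names %field and $query are chosen here for illustration.

```perl
use strict;
use warnings;

my $query = 'fname=Foo&lname=Bar&email=foo@bar.com';

# The flat list (key, value, key, value, ...) becomes key/value pairs.
my %field = split /[=&]/, $query;

for my $key (sort keys %field) {
    print "$key => $field{$key}\n";
}
```

For real query strings, where values may be empty or URL-encoded, a dedicated parsing module is a safer choice than a bare split().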
Split a string and keep the delimiter characters:
my $s='a-d-ge-h';
# append a marker after every '-' so that the dash itself survives the split
$s=~s/-/-\@#%/g;
@arr=split /\@#%/, $s;
print join(', ', @arr), "\n";
The output:
a-, d-, ge-, h
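The same result can probably be reached without modifying the string first. Two alternatives are sketched here as one possible approach, not as the method used above: a zero-width lookbehind splits right after each dash, and a capturing group makes split() return the delimiters as separate elements.

```perl
use strict;
use warnings;

my $s = 'a-d-ge-h';

# Split at the position just after each '-'; the dash stays on the left part.
my @kept_left = split /(?<=-)/, $s;

# A capturing group returns each delimiter as its own list element.
my @separate = split /(-)/, $s;

print join(', ', @kept_left), "\n";
print join(', ', @separate), "\n";
```

The first line printed is "a-, d-, ge-, h" and the second is "a, -, d, -, ge, -, h".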
3. Handle DNA or protein sequences
Split a DNA sequence into nucleotides:
$s='ATTGTGCGCGGATGCAAACTCTAATC';
@arr=split //, $s;   # an empty pattern splits between every character
print join("\t", @arr), "\n";
The output:
A T T G T G C G C G G A T G C A A A C T C T A A T C
Split a DNA sequence into codons (3 characters per piece), the first step in translating it into an amino acid sequence:
@arr=split /(...)/, $s;   # the captured 3-character separators are returned too
@arr=grep {/.+/} @arr;    # then filter the empty entries
print '3 chr:', join(',', @arr), "\n";
The output:
3 chr:ATT,GTG,CGC,GGA,TGC,AAA,CTC,TAA,TC
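An alternative that avoids the empty entries altogether is a global match in list context, which returns every chunk directly. This is a sketch of a different technique from the split() call above; the final grep, which keeps only complete codons, is an added assumption about what a later translation step would need.

```perl
use strict;
use warnings;

my $dna = 'ATTGTGCGCGGATGCAAACTCTAATC';

# Match up to three characters at a time; /g in list context returns all matches.
my @chunks = $dna =~ /.{1,3}/g;

# Keep only complete codons (drops the trailing 'TC' here).
my @codons = grep { length($_) == 3 } @chunks;

print join(',', @chunks), "\n";
print join(',', @codons), "\n";
```

The first line printed matches the list shown above, and the second omits the incomplete final piece.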
Walk along a sequence and generate overlapping stretches (a sliding window):
$s='MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNT';
my $L=22;      # length of each stretch
my $step=2;    # offset between consecutive stretches
my @arr;
for(my $i=0; $i<length($s)-$L+$step; $i+=$step){
    my $sub=substr($s, $i, $L);
    push(@arr, $sub);
    printf("%s%s\n", "--"x($i), $sub);
}
The output:
MRSLLFVVGAWVAALVTNLTPD
----SLLFVVGAWVAALVTNLTPDAA
--------LFVVGAWVAALVTNLTPDAALA
------------VVGAWVAALVTNLTPDAALASG
----------------GAWVAALVTNLTPDAALASGTT
--------------------WVAALVTNLTPDAALASGTTTT
------------------------AALVTNLTPDAALASGTTTTAA
----------------------------LVTNLTPDAALASGTTTTAAAG
--------------------------------TNLTPDAALASGTTTTAAAGNT
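The loop above can be wrapped in a small reusable subroutine. The sketch below is one possible version, not the code used above: the name windows() is hypothetical, and the loop condition is written so that only full-length stretches are returned, even when the sequence length does not line up with the step.

```perl
use strict;
use warnings;

# Hypothetical helper: return every full-length window of $size characters,
# advancing $step characters at a time.
sub windows {
    my ($seq, $size, $step) = @_;
    my @out;
    for (my $i = 0; $i + $size <= length($seq); $i += $step) {
        push @out, substr($seq, $i, $size);
    }
    return @out;
}

my @w = windows('MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNT', 22, 2);
print scalar(@w), " windows, last: $w[-1]\n";
```

For the sequence used above this yields the same 9 stretches, ending with TNLTPDAALASGTTTTAAAGNT.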