Abstract:
This post introduces common string manipulation functions in R.
Create a string
> str='abcd_345'
> str
[1] "abcd_345"
Statistics of a string
The function nchar() returns the length of a string. However, the returned values might be weird if the variable is not of character mode, because the value is first coerced to a string.
Note: The function length() returns the number of elements of a vector.
> nchar(str) # length of a string
[1] 8
> length(str) # number of elements, different from nchar()
[1] 1
> nchar("")
[1] 0
> nchar(NA) # logical NA is coerced to the string "NA"
[1] 2
> nchar(TRUE) # logical mode, TRUE or FALSE
[1] 4
> nchar(Inf) # Inf/-Inf
[1] 3
> nchar(-Inf)
[1] 4
To avoid this, the R package ‘stringr’ provides an alternative function, str_length(), which is equivalent to nchar() except that it returns NA for missing values.
> library(stringr)
> str_length(str)
[1] 8
> str_length(NA) # returns NA
[1] NA
> str_length(Inf)
[1] 3
> str_length(T)
[1] 4
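If loading ‘stringr’ is not an option, similar behaviour can be approximated in base R. The following is only a minimal sketch: safe_nchar() is a hypothetical helper name used for illustration, not a function from base R or ‘stringr’.

```r
# safe_nchar() is a hypothetical helper, not part of base R or stringr.
# It counts characters like nchar(), but reports NA for missing values
# instead of the length of the string "NA".
safe_nchar <- function(x) {
  n <- nchar(as.character(x))   # coerce explicitly, then count characters
  n[is.na(x)] <- NA_integer_    # a missing input gives a missing length
  n
}

safe_nchar("abcd_345")
safe_nchar(NA)
safe_nchar(c("ab", NA, ""))
```

Non-missing values such as TRUE or Inf are still coerced to strings, as with nchar().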
Combination of strings
The function paste() is a very popular function for string manipulation.
> paste('abc', 'ABC') # the default separator is a space
[1] "abc ABC"
> paste('abc', 'ABC', sep=',') # separated by a comma
[1] "abc,ABC"
> paste('abc', 'ABC', sep='') # no separator
[1] "abcABC"
> paste('abc', 'ABC', 'efg', sep='-') # any number of strings can be pasted
[1] "abc-ABC-efg"
> paste('abc', 'ABC', 200, sep='-') # all arguments are converted to character
[1] "abc-ABC-200"
> paste('TK', 1:5, sep="_") # vectorized paste, returns a vector
[1] "TK_1" "TK_2" "TK_3" "TK_4" "TK_5"
> paste('TK', 1:5, collapse="-", sep="_") # collapse joins the result into a single string
[1] "TK_1-TK_2-TK_3-TK_4-TK_5"
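Two related base R idioms are worth noting next to the sep and collapse examples above. The sketch below uses paste0(), which is base R shorthand for paste(..., sep=""), and shows collapse used on a single vector; the variable names are only illustrative.

```r
# paste0() is base R shorthand for paste(..., sep = "")
ids <- paste0("TK", 1:3)
ids

# collapse alone joins the elements of one vector into a single string
joined <- paste(ids, collapse = ",")
joined

# shorter vectors are recycled to the length of the longest one
labels <- paste(c("a", "b"), 1:4, sep = "_")
labels
```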
> x=list(a='aa', b='bb', c='cc') # paste() also works on lists
> y=list(d=1, e=4, f=6)
> paste(x,y, sep="-")
[1] "aa-1" "bb-4" "cc-6"
Splitting and trimming of strings
String splitting with strsplit() is the reverse operation of paste().
> strsplit("Howareyou", split="") # an empty split pattern splits into single characters
[[1]]
[1] "H" "o" "w" "a" "r" "e" "y" "o" "u"
> strsplit("Howareyou", split="-") # the pattern does not occur, so nothing is split
[[1]]
[1] "Howareyou"
> strsplit("How2are6you", split="[0-9]") # split at any digit
[[1]]
[1] "How" "are" "you"
> strsplit("How2are6you", perl=TRUE, split="2|6") # alternation (OR) operator
[[1]]
[1] "How" "are" "you"
> strtrim('asdfdasf', 6) # keep the first 6 characters
[1] "asdfda"
> strtrim('asdfdasf', 10) # a width longer than the string returns it unchanged
[1] "asdfdasf"
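Because strsplit() always returns a list, the result usually has to be unpacked before further use. The following sketch shows this, together with the fixed=TRUE argument for splitting at a literal character; the variable names are only illustrative.

```r
# strsplit() is vectorized: one list element per input string
parts <- strsplit(c("a-b", "c-d-e"), split = "-")
parts[[2]]        # pieces of the second string
unlist(parts)     # flatten everything into one character vector

# split is a regular expression by default, so "." would match any
# character; fixed = TRUE treats it as a literal dot
dot_parts <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]
dot_parts
```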
Search of strings
The functions grep()/grepl() are used to test whether a pattern occurs, and the functions regexpr()/gregexpr() return the positions of the match in the string. regexpr() returns only the first match, whereas gregexpr() returns all matches. Here, ‘g’ stands for ‘global’.
> grepl('a4', 'sadfa4') # test whether 'a4' occurs in the string
[1] TRUE
> grep('a4', 'sadfa4') # index of the matching element
[1] 1
> regexpr('a4', 'sadfa4adsfa43') # position of the first match in the string
[1] 5
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
> regexpr('abc', 'sadfa4adsfa43') # no match found, returns -1
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
> gregexpr('a', 'sadfa4adsfa43') # returns all matched positions
[[1]]
[1] 2 5 7 11
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
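The match objects above carry extra attributes, which is often more than needed. As a minimal sketch, the positions can be reduced to a plain integer vector, and the base R function regmatches() can return the matched text itself; the pattern 'a[0-9]' and the variable names are chosen only for illustration.

```r
x <- "sadfa4adsfa43"

# match positions as a plain integer vector, without the attributes
m <- gregexpr("a[0-9]", x)
pos <- as.integer(m[[1]])
pos

# regmatches() returns the matched text for each match
hits <- regmatches(x, m)[[1]]
hits

# grep()/grepl() work element-wise on a vector of strings
v <- c("sadfa4", "xyz", "a4")
grepl("a4", v)                # one TRUE/FALSE per element
grep("a4", v)                 # indices of the matching elements
grep("a4", v, value = TRUE)   # the matching elements themselves
```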
Replacement and extraction of strings
As above, the paired functions sub() and gsub() are used for replacement: sub() replaces only the first match, and gsub() replaces all matches.
> sub(pattern='a', replacement='-', 'sadfa4adsfa43')
[1] "s-dfa4adsfa43"
> gsub(pattern='a', replacement='-', 'sadfa4adsfa43')
[1] "s-df-4-dsf-43"
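Since the pattern is a regular expression, the replacement can also reuse parts of the match. The sketch below illustrates backreferences and the fixed=TRUE argument; the patterns are chosen only as examples.

```r
x <- "sadfa4adsfa43"

# parentheses capture groups; \\1 and \\2 refer to them in the replacement
swapped <- gsub("(a)([0-9])", "\\2\\1", x)
swapped

# fixed = TRUE treats the pattern as literal text, not a regular expression
literal <- gsub(".", "-", "a.b.c", fixed = TRUE)
literal
```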
The functions substr() and substring() are used to extract substrings.
> substr('abcdef', start=1, stop=3) # both the start and stop positions are required
[1] "abc"
> substr('abcdef', 1, 3) # both styles are equivalent
[1] "abc"
> substring('abcdef', first=3) # from the third character to the end
[1] "cdef"
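Both functions have further uses that follow from the same arguments. As a small sketch, substring() accepts vectors of positions, and substr() can appear on the left-hand side of an assignment to replace part of a string; the variable names are only illustrative.

```r
s <- "abcdef"

# substring() accepts vectors of positions, here one character at a time
chars <- substring(s, 1:nchar(s), 1:nchar(s))
chars

# substr() on the left-hand side replaces characters 2 to 3 in place
substr(s, 2, 3) <- "ZZ"
s
```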
Manipulations of biological strings
Here, we introduce how to operate on biological strings.
> DNA='ATGCTtgtAT' # DNA sequence
> tolower(DNA)
[1] "atgcttgtat"
> toupper(DNA)
[1] "ATGCTTGTAT"
> RNA='AGUGuGA' # convert RNA to DNA
> sub(pattern='U', replacement="T", RNA) # replaces only the first match
[1] "AGTGuGA"
> gsub(pattern="U", replacement="T", RNA, ignore.case=TRUE) # replaces all matches, ignoring case
[1] "AGTGTGA"
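Note that gsub() with ignore.case=TRUE turns the lower-case 'u' into an upper-case 'T'. If the case of each base should be kept, one possible alternative is the base R function chartr(), which translates characters one-to-one. The sketch below also counts the bases of a sequence with strsplit() and table(); the variable names are only illustrative.

```r
# chartr() maps U to T and u to t, keeping the case of each base
dna_from_rna <- chartr("Uu", "Tt", "AGUGuGA")
dna_from_rna

# count how often each base occurs, ignoring case
bases <- strsplit(toupper("ATGCTtgtAT"), split = "")[[1]]
counts <- table(bases)
counts
```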
> DNA='GTCTGTAGTCTGTTGTTTTTTA' # split DNA into codons (triplets coding for amino acids)
> substring(DNA, seq(1, nchar(DNA)-2, by=3), seq(3, nchar(DNA), by=3))
[1] "GTC" "TGT" "AGT" "CTG" "TTG" "TTT" "TTT"
Writing date: 2015.01.20