Thursday, January 29, 2015

R: Character string manipulation



Abstract: This post introduces common string-manipulation functions in R.


Create a string
> str='abcd_345'
> str
[1] "abcd_345"

Statistics of a string
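The function nchar() returns the number of characters in a string. However, the result may look weird when the argument is not of character mode, because the value is counted as if it were its text form (for example, TRUE is counted as "TRUE").
Note: The function length() returns the number of elements of a vector, not the number of characters.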
> nchar(str) # length of a string
[1] 8
> length(str) # different from nchar()
[1] 1
> nchar("")
[1] 0
> nchar(NA) # NA is a missing value (not NULL); it is counted as the string "NA"
[1] 2
> nchar(T) # logical mode: TRUE is counted as the string "TRUE"
[1] 4
> nchar(Inf) # Inf/-Inf
[1] 3
> nchar(-Inf)
[1] 4

For this reason, the R package ‘stringr’ provides the alternative function str_length(), which works like nchar() but returns NA for missing values.
> library(stringr)
> str_length(str)
[1] 8
> str_length(NA) #return NA
[1] NA
> str_length(Inf)
[1] 3
> str_length(T)
[1] 4
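
If stringr is not installed, similar NA handling can be sketched in base R. The helper name safe_nchar below is hypothetical (it is not part of base R or stringr); it simply overwrites the count with NA wherever the input is missing.

```r
# safe_nchar() is a hypothetical helper, not a base R or stringr function.
safe_nchar <- function(x) {
  n <- nchar(as.character(x))  # count characters after explicit coercion
  n[is.na(x)] <- NA_integer_   # report missing input as NA instead of 2
  n
}

safe_nchar(c("abcd_345", NA, ""))  # 8 NA 0
```

Values such as Inf and TRUE are still counted by their text form, as in the transcripts above.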

Combination of strings
The function paste() is one of the most commonly used functions for string manipulation.
> paste('abc', 'ABC') # the default separator is a space
[1] "abc ABC"
> paste('abc', 'ABC', sep=',') # separated by ,
[1] "abc,ABC"
> paste('abc', 'ABC', sep='') # no separator
[1] "abcABC"
> paste('abc', 'ABC', 'efg', sep='-') # any number of strings can be joined
[1] "abc-ABC-efg"
> paste('abc', 'ABC', 200, sep='-') # non-character arguments are converted to character
[1] "abc-ABC-200"

> paste('TK', 1:5, sep="_") # paste is vectorized and returns a vector
[1] "TK_1" "TK_2" "TK_3" "TK_4" "TK_5"
> paste('TK', 1:5, collapse="-", sep="_") # collapse joins the results into a single string
[1] "TK_1-TK_2-TK_3-TK_4-TK_5"

> x=list(a='aa', b='bb', c='cc') # paste() also works element-wise on lists
> y=list(d=1, e=4, f=6)
> paste(x,y, sep="-")
[1] "aa-1" "bb-4" "cc-6"
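
Two common variants of the calls above, shown as a small sketch: paste0() is shorthand for paste(..., sep=""), and collapse used on its own joins the elements of a single vector. The variable names ids and letters_vec are made up for illustration.

```r
ids <- paste0("TK_", 1:3)            # same as paste("TK_", 1:3, sep = "")
ids                                  # "TK_1" "TK_2" "TK_3"

letters_vec <- c("a", "b", "c")
paste(letters_vec, collapse = "")    # "abc"
paste(letters_vec, collapse = ", ")  # "a, b, c"
```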

Splitting and trimming of strings
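The function strsplit() splits a string and is the reverse operation of paste(). It returns a list with one element per input string.
> strsplit("Howareyou", split="") # an empty split string yields single characters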
[[1]]
[1] "H" "o" "w" "a" "r" "e" "y" "o" "u"

> strsplit("Howareyou", split="-") # the separator does not occur, so the string is returned unsplit
[[1]]
[1] "Howareyou"

> strsplit("How2are6you", split="[0-9]") # split on any digit
[[1]]
[1] "How" "are" "you"
> strsplit("How2are6you", perl=T, split="2|6") # alternation: split on 2 or 6
[[1]]
[1] "How" "are" "you"
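
Because strsplit() returns a list, the pieces usually have to be pulled out with [[1]] before further use. The sketch below shows that step, a round trip back through paste(), and the result for several input strings; the variable names are illustrative only.

```r
parts <- strsplit("How2are6you", split = "[0-9]")[[1]]  # first list element
rejoined <- paste(parts, collapse = " ")
rejoined                                                # "How are you"

# With several input strings, the list has one element per string.
pieces <- strsplit(c("a-b", "c-d-e"), split = "-")
lengths_per_string <- sapply(pieces, length)
lengths_per_string                                      # 2 3
```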

> strtrim('asdfdasf', 6) # keep the first 6 characters
[1] "asdfda"
> strtrim('asdfdasf', 10)
[1] "asdfdasf"

Search of strings
The functions grep() and grepl() test whether a pattern occurs: grepl() returns TRUE or FALSE for each element, while grep() returns the indices of the matching elements. The functions regexpr() and gregexpr() return the positions of the matches within the string. regexpr() returns only the first match, whereas gregexpr() returns all matches; the ‘g’ stands for ‘global’.
> grepl('a4', 'sadfa4') # test whether ‘a4’ occurs in the string
[1] TRUE
> grep('a4', 'sadfa4')
[1] 1
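
The difference between grep() and grepl() is easier to see on a vector with several elements. A minimal sketch, with a made-up input vector x:

```r
x <- c("sadfa4", "xyz", "a4b")
grepl("a4", x)               # TRUE FALSE TRUE
grep("a4", x)                # 1 3  (indices of the matching elements)
grep("a4", x, value = TRUE)  # "sadfa4" "a4b"
```

So the 1 returned above for a single string is the index of the matching element, not a position inside the string.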

> regexpr('a4', 'sadfa4adsfa43')
[1] 5 ## position in the string
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
> regexpr('abc', 'sadfa4adsfa43') # no match: returns -1
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
> gregexpr('a', 'sadfa4adsfa43') # return all matched positions
[[1]]
[1] 2 5 7 11
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
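
The match positions come with attributes attached, which is often more than is needed. As a sketch, as.integer() keeps only the positions, and the base function regmatches() turns a gregexpr() result into the matched text itself:

```r
x <- "sadfa4adsfa43"
pos <- as.integer(gregexpr("a", x)[[1]])  # drop attributes, keep positions
pos                                       # 2 5 7 11

digits <- regmatches(x, gregexpr("[0-9]+", x))[[1]]
digits                                    # "4" "43"
```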

Replacement and extraction of strings
As above, the paired functions sub() and gsub() are used for replacement: sub() replaces only the first match, and gsub() replaces all matches.
> sub(pattern='a', replacement='-', 'sadfa4adsfa43')
[1] "s-dfa4adsfa43"
> gsub(pattern='a', replacement='-', 'sadfa4adsfa43')
[1] "s-df-4-dsf-43"
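
One point worth a small example: the pattern in sub() and gsub() is a regular expression, so characters such as "." have a special meaning. The sketch below compares the default behaviour with fixed=TRUE (literal matching) and with an escaped dot; the variable names are illustrative.

```r
any_char <- gsub(".", "-", "a.b.c")                # "-----" (. matches any character)
literal  <- gsub(".", "-", "a.b.c", fixed = TRUE)  # "a-b-c" (literal dot)
escaped  <- gsub("\\.", "-", "a.b.c")              # "a-b-c" (escaped dot)
```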

The functions substr() and substring() are used to extract substrings.
> substr('abcdef', start=1, stop=3) # both the start and stop positions must be given
[1] "abc"
> substr('abcdef', 1, 3) # positional arguments give the same result
[1] "abc"

> substring('abcdef', first=3) # from the third character to the end
[1] "cdef"
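
Two further uses, sketched with the same example string: substring() accepts vectors of start and end positions, and substr() also has a replacement form that overwrites part of a string.

```r
s <- "abcdef"
windows <- substring(s, 1:4, 3:6)  # sliding windows of width 3
windows                            # "abc" "bcd" "cde" "def"

substr(s, 1, 3) <- "XYZ"           # replacement form modifies s
s                                  # "XYZdef"
```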


Manipulations of biological strings
This section shows how to handle biological sequences such as DNA and RNA.
> DNA='ATGCTtgtAT' # DNA sequences
> tolower(DNA)
[1] "atgcttgtat"
> toupper(DNA)
[1] "ATGCTTGTAT"

> RNA='AGUGuGA' # convert RNA to DNA
> sub(pattern='U', replacement="T", RNA) # replace the first match only
[1] "AGTGuGA"
> gsub(pattern="U", replacement="T", RNA, ignore.case=T) # replace all matches, ignoring case
[1] "AGTGTGA"

> DNA='GTCTGTAGTCTGTTGTTTTTTA' # split DNA into codons (triplets); the trailing 'A' is dropped
> substring(DNA, seq(1, nchar(DNA)-2, by=3), seq(3,nchar(DNA), by=3))
[1] "GTC" "TGT" "AGT" "CTG" "TTG" "TTT" "TTT"
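
The functions covered in this post can be combined for another routine task, the reverse complement of a DNA sequence. This is only a sketch for a single sequence: reverse_complement is a hypothetical helper name, and chartr() is a base R function that translates characters one for one.

```r
# reverse_complement() is a hypothetical helper written for this example.
reverse_complement <- function(dna) {
  dna <- toupper(dna)                                 # normalize case
  comp <- chartr("ATGC", "TACG", dna)                 # complement each base
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # reverse the order
}

reverse_complement("ATGCTtgtAT")  # "ATACAAGCAT"
```

Characters other than A, T, G and C are left unchanged by chartr().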

Writing date: 2015.01.20
