Tiezheng Yuan Ph.D.: R: Character string-manipulations

Abstract: This post introduces the most common string-manipulation functions in R.

Create a string

> str='abcd_345'

> str

[1] "abcd_345"

Statistics of a string

The function nchar() returns the length of a string. However, the returned value might look weird if the mode of the variable is not character.

Note: The function length() returns the number of elements of a vector, not the number of characters.

> nchar(str) # length of a string

[1] 8

> length(str) # different from nchar()

[1] 1

> nchar("")

[1] 0

> nchar(NA) # logical NA (a missing value, not NULL); counted as 2

[1] 2

> nchar(T) # logical mode, TRUE or FALSE

[1] 4

> nchar(Inf) # Inf/-Inf

[1] 3

> nchar(-Inf)

[1] 4
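
Most of these results follow from coercion: nchar() accepts any vector that can be coerced to character and counts the characters of the coerced value. The lines below are a minimal sketch of that behaviour. The variable names are illustrative only, and the NA_character_ result assumes R 3.3.0 or later.

```r
# nchar() coerces non-character input to character before counting
flags <- c(TRUE, FALSE)
as.character(flags)           # "TRUE" "FALSE"
flag_lens <- nchar(flags)     # 4 5

# A missing string (NA_character_) is not the same as the logical constant NA;
# assumption: R >= 3.3.0, where its length is reported as NA
na_len <- nchar(NA_character_)

# nchar() is vectorised and returns one length per element
words <- c("abcd_345", "", "R")
lens <- nchar(words)          # 8 0 1
```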

Because of these surprises, the R package ‘stringr’ provides the alternative function str_length(). It works like nchar() but returns NA for a missing value.

> library(stringr)

> str_length(str)

[1] 8

> str_length(NA) # returns NA

[1] NA

> str_length(Inf)

[1] 3

> str_length(T)

[1] 4
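
If loading stringr is not an option, similar NA handling can be sketched in base R. The helper name safe_nchar below is hypothetical and not part of any package; it simply overwrites the length of every missing element with NA.

```r
# Hypothetical helper: like nchar(), but always NA for missing values
safe_nchar <- function(x) {
  x <- as.character(x)          # logical NA becomes NA_character_
  out <- nchar(x)
  out[is.na(x)] <- NA_integer_  # force NA regardless of R version
  out
}

res <- safe_nchar(c("abcd_345", NA, ""))   # 8 NA 0
```

The explicit overwrite avoids relying on the default NA behaviour of nchar(), which has changed between R versions.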

Combination of strings

The function paste() is the most popular function for combining strings.

> paste('abc', 'ABC') # the default separator is a space

[1] "abc ABC"

> paste('abc', 'ABC', sep=',') # separated by a comma

[1] "abc,ABC"

> paste('abc', 'ABC', sep='') # no separator

[1] "abcABC"

> paste('abc', 'ABC', 'efg', sep='-') # any number of strings can be combined

[1] "abc-ABC-efg"

> paste('abc', 'ABC', 200, sep='-') # non-character arguments are converted to character

[1] "abc-ABC-200"

> paste('TK', 1:5, sep="_") # vectorised paste, returns a vector

[1] "TK_1" "TK_2" "TK_3" "TK_4" "TK_5"

> paste('TK', 1:5, collapse="-", sep="_") # collapse joins the results into a single string

[1] "TK_1-TK_2-TK_3-TK_4-TK_5"

> x=list(a='aa', b='bb', c='cc') # lists are pasted element by element

> y=list(d=1, e=4, f=6)

> paste(x,y, sep="-")

[1] "aa-1" "bb-4" "cc-6"
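
Two variations on the calls above are worth knowing: paste0(), which is paste() with an empty separator, and collapse used on its own. The lines below are a short sketch with illustrative variable names.

```r
# paste0() is shorthand for paste(..., sep = "")
ids <- paste0("TK", 1:3)                          # "TK1" "TK2" "TK3"

# collapse alone joins the elements of one vector into a single string
joined <- paste(c("a", "b", "c"), collapse = "+") # "a+b+c"

# shorter vectors are recycled to the length of the longest one
recycled <- paste(c("x", "y"), 1:4, sep = "")     # "x1" "y2" "x3" "y4"
```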

Splitting and trimming of strings

String splitting with strsplit() is the reverse operation of paste(). The function strtrim() truncates a string to a given width.

> strsplit("Howareyou", split="") # an empty split breaks the string into single characters

[[1]]

[1] "H" "o" "w" "a" "r" "e" "y" "o" "u"

> strsplit("Howareyou", split="-") # separator not found: the string is returned whole

[[1]]

[1] "Howareyou"

> strsplit("How2are6you", split="[0-9]") # split at any digit

[[1]]

[1] "How" "are" "you"

> strsplit("How2are6you", perl=TRUE, split="2|6") # regex alternation: split at 2 or 6

[[1]]

[1] "How" "are" "you"

> strtrim('asdfdasf', 6) # keep the first 6 characters

[1] "asdfda"

> strtrim('asdfdasf', 10)

[1] "asdfdasf"
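
Because strsplit() returns a list and treats split as a regular expression, two details often trip people up: extracting the pieces and splitting at a literal character such as a dot. The following is a minimal sketch with illustrative variable names.

```r
# strsplit() returns a list with one element per input string
pieces <- strsplit(c("x-y", "p-q-r"), split = "-")
n_inputs <- length(pieces)    # 2
second <- pieces[[2]]         # "p" "q" "r"

# split is a regular expression by default; fixed = TRUE treats it literally,
# which matters for characters such as "."
parts <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]   # "a" "b" "c"

# paste() with collapse reverses the split
rejoined <- paste(parts, collapse = ".")                      # "a.b.c"
```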

Search of strings

The functions grep() and grepl() test whether a pattern occurs, while regexpr() and gregexpr() return the position of the match within the string. regexpr() returns only the first match, and gregexpr() returns all matches. Here, ‘g’ stands for ‘global’.

> grepl('a4', 'sadfa4') # test whether 'a4' occurs in the string

[1] TRUE

> grep('a4', 'sadfa4') # index of the matching element in the input vector

[1] 1

> regexpr('a4', 'sadfa4adsfa43')

[1] 5 ## position in the string

attr(,"match.length")

[1] 2

attr(,"useBytes")

[1] TRUE

> regexpr('abc', 'sadfa4adsfa43') # no match: returns -1

[1] -1

attr(,"match.length")

[1] -1

attr(,"useBytes")

[1] TRUE

> gregexpr('a', 'sadfa4adsfa43') # return all matched positions

[[1]]

[1] 2 5 7 11

attr(,"match.length")

[1] 1 1 1 1

attr(,"useBytes")

[1] TRUE
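
The search functions are more useful on vectors, and the matched text itself can be recovered with regmatches(). The lines below are a brief sketch; the vector ids and the other variable names are made up for illustration.

```r
ids <- c("sadfa4", "xyz", "a4b")

found <- grepl("a4", ids)                 # TRUE FALSE TRUE
where <- grep("a4", ids)                  # 1 3 (indices of matching elements)
which_ids <- grep("a4", ids, value = TRUE) # "sadfa4" "a4b"

# Strip the attributes from gregexpr() to get plain positions
s <- "sadfa4adsfa43"
pos <- as.integer(gregexpr("a", s)[[1]])  # 2 5 7 11

# regmatches() extracts the matched substrings themselves
hits <- regmatches(s, gregexpr("a[0-9]", s))[[1]]   # "a4" "a4"
```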

Replacement and extraction of strings

As with the search functions above, the paired functions sub() and gsub() are used for replacement: sub() replaces only the first match, and gsub() replaces all matches.

> sub(pattern='a', replacement='-', 'sadfa4adsfa43')

[1] "s-dfa4adsfa43"

> gsub(pattern='a', replacement='-', 'sadfa4adsfa43')

[1] "s-df-4-dsf-43"
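
The pattern in sub() and gsub() is a regular expression, so character classes and backreferences also work, and fixed=TRUE switches to literal matching. This is a small sketch using the same example string.

```r
s <- "sadfa4adsfa43"

# Remove every digit
no_digits <- gsub("[0-9]", "", s)                 # "sadfaadsfa"

# Backreferences \\1 and \\2 refer to the parenthesised groups;
# here each "a" followed by a digit is swapped with that digit
swapped <- gsub("(a)([0-9])", "\\2\\1", s)        # "sadf4aadsf4a3"

# fixed = TRUE treats the pattern literally, so "." is just a dot
dashed <- gsub(".", "-", "a.b.c", fixed = TRUE)   # "a-b-c"
```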

The functions substr() and substring() are used to extract substrings.

> substr('abcdef', start=1, stop=3) # both start and stop positions are required

[1] "abc"

> substr('abcdef', 1, 3) # positional arguments give the same result

[1] "abc"

> substring('abcdef', first=3) # from the third character to the end

[1] "cdef"
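
Both functions have two further uses: substring() is vectorised over its positions, and substr() can appear on the left of an assignment to overwrite part of a string. The lines below sketch both, with illustrative variable names.

```r
# substring() recycles over vectors of start and end positions
windows <- substring("abcdef", 1:3, 3:5)   # "abc" "bcd" "cde"
chars   <- substring("abcdef", 1:6, 1:6)   # "a" "b" "c" "d" "e" "f"

# substr() on the left-hand side replaces characters in place
s <- "abcdef"
substr(s, 1, 3) <- "XYZ"                   # s is now "XYZdef"
```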

Manipulations of biological strings

Here we show how to manipulate biological sequences stored as strings.

> DNA='ATGCTtgtAT' # DNA sequences

> tolower(DNA)

[1] "atgcttgtat"

> toupper(DNA)

[1] "ATGCTTGTAT"

> RNA='AGUGuGA' # convert RNA to DNA

> sub(pattern='U', replacement="T", RNA) # replace only the first match

[1] "AGTGuGA"

> gsub(pattern="U", replacement="T", RNA, ignore.case=TRUE) # replace all matches, ignoring case

[1] "AGTGTGA"

> DNA='GTCTGTAGTCTGTTGTTTTTTA' # split DNA into codons (triplets)

> substring(DNA, seq(1, nchar(DNA)-2, by=3), seq(3,nchar(DNA), by=3))

[1] "GTC" "TGT" "AGT" "CTG" "TTG" "TTT" "TTT"
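
The codon split and the RNA-to-DNA conversion can be wrapped up for reuse. The sketch below makes two assumptions: the function name split_codons is hypothetical, and trailing bases that do not fill a complete triplet are dropped, as in the example above. chartr() is used as a case-preserving alternative to gsub().

```r
# Hypothetical helper: split a sequence into complete triplets,
# dropping any incomplete codon at the end
split_codons <- function(dna) {
  n <- nchar(dna)
  if (n < 3) return(character(0))   # nothing to split
  starts <- seq(1, n - 2, by = 3)
  substring(dna, starts, starts + 2)
}

codons <- split_codons("GTCTGTAGTCTGTTGTTTTTTA")
# "GTC" "TGT" "AGT" "CTG" "TTG" "TTT" "TTT"

# chartr() translates character by character and keeps the original case
dna_from_rna <- chartr("Uu", "Tt", "AGUGuGA")   # "AGTGtGA"
```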

Writing date: 2015.01.20

Tiezheng Yuan Ph.D.

Thursday, January 29, 2015
