Wednesday, May 13, 2015

Data wrangling (2): One-on-one or One-on-many


Abstract: conversion of a data frame between ‘One-on-one’ and ‘One-on-many’


Relationships such as gene vs. GO terms or gene vs. pathways are many-to-many: one gene is annotated by many GO terms, and one GO term involves many genes. Converting between the 'one-on-one' layout (one gene and one GO term per row) and the 'one-on-many' layout (one gene per row with all of its GO terms, or one GO term per row with all of its genes) is common in GO enrichment analysis. Here, I show how to use the function melt.list() from the R package reshape and the function ddply() from the package plyr to perform such conversions.

First, we get an example of gene~GO term mappings as a list.
> #list by one gene vs. multiple GO terms
> library(annotate)
> library(hgu95av2.db)
> library(GO.db)

> #three genes
> genes<-c("738_at","40840_at","1004_at")
> geneGO2.list<-lapply(genes, FUN=function(x){ names(get(x, hgu95av2GO)) })
> names(geneGO2.list)<-genes
> geneGO2.list
$`738_at`
[1] "GO:0006144" "GO:0006195" "GO:0016310" "GO:0016311" "GO:0017144" "GO:0044281" "GO:0055086"
[8] "GO:0008219" "GO:0016311" "GO:0016311" "GO:0005829" "GO:0050146" "GO:0005515" "GO:0000166"
[15] "GO:0046872" "GO:0008253"

$`40840_at`
[1] "GO:0002931" "GO:0010849" "GO:0032780" "GO:0046902" "GO:0071243" "GO:0071277" "GO:0090200"
[8] "GO:0090324" "GO:2000276" "GO:0043066" "GO:0090201" "GO:0070301" "GO:1902445" "GO:2001243"
[15] "GO:0000413" "GO:0006457" "GO:0008637" "GO:0010939" "GO:0070266" "GO:0005753" "GO:0016020"
[22] "GO:0005759" "GO:0005515" "GO:0003755" "GO:0016018"

$`1004_at`
[1] "GO:0006928" "GO:0032467" "GO:0006935" "GO:0006955" "GO:0007186" "GO:0048535" "GO:0042113"
[8] "GO:0070098" "GO:0005886" "GO:0005887" "GO:0009897" "GO:0004930" "GO:0005515" "GO:0016494"

Then, we convert the list to a data frame.
> #data frame by one gene vs. one GO term from the list type
> library(reshape)
> geneGO<-melt.list(geneGO2.list)
> colnames(geneGO)<-c('GO', 'gene')
> head(geneGO)
GO gene
1 GO:0006144 738_at
2 GO:0006195 738_at
3 GO:0016310 738_at
4 GO:0016311 738_at
5 GO:0017144 738_at
6 GO:0044281 738_at
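For readers who prefer not to load reshape, a similar long data frame can probably be obtained with stack() from base R. This is a sketch on a hypothetical toy list (toy.list is made up for illustration), not the method used in this post.

```r
# Toy named list: one gene vs. many GO terms (hypothetical values)
toy.list <- list(
  geneA = c("GO:0000001", "GO:0000002"),
  geneB = c("GO:0000002", "GO:0000003", "GO:0000004")
)

# stack() returns one row per element, with columns 'values' and 'ind'
toy.long <- stack(toy.list)
colnames(toy.long) <- c("GO", "gene")

# 'ind' is a factor; convert both columns to character for later string work
toy.long$GO <- as.character(toy.long$GO)
toy.long$gene <- as.character(toy.long$gene)
```

As with melt.list(), the list names end up in the second column, so the same renaming to 'GO' and 'gene' applies.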


Finally, I show how to convert the one-on-one data frame into a one-on-many data frame, and how to reverse the process.
> #data frame by one gene vs. multiple GO terms
> library(plyr)
> geneGO2<-ddply(geneGO, c('gene'), summarize, GO=paste(GO, collapse=',') )
> head(geneGO2)
gene
1 1004_at
2 40840_at
3 738_at
GO
1 GO:0006928,GO:0032467,GO:0006935,GO:0006955,GO:0007186,GO:0048535,GO:0042113,GO:0070098,GO:0005886,GO:0005887,GO:0009897,GO:0004930,GO:0005515,GO:0016494
2 GO:0002931,GO:0010849,GO:0032780,GO:0046902,GO:0071243,GO:0071277,GO:0090200,GO:0090324,GO:2000276,GO:0043066,GO:0090201,GO:0070301,GO:1902445,GO:2001243,GO:0000413,GO:0006457,GO:0008637,GO:0010939,GO:0070266,GO:0005753,GO:0016020,GO:0005759,GO:0005515,GO:0003755,GO:0016018
3 GO:0006144,GO:0006195,GO:0016310,GO:0016311,GO:0017144,GO:0044281,GO:0055086,GO:0008219,GO:0016311,GO:0016311,GO:0005829,GO:0050146,GO:0005515,GO:0000166,GO:0046872,GO:0008253
> #reversed conversion
> geneGO.rev<-ddply(geneGO2, c('gene'), summarize, GO=strsplit(GO, split=',')[[1]] )
> head(geneGO.rev)
gene GO
1 1004_at GO:0006928
2 1004_at GO:0032467
3 1004_at GO:0006935
4 1004_at GO:0006955
5 1004_at GO:0007186
6 1004_at GO:0048535
>
> #data frame by one GO term vs. multiple genes
> GOgene2<-ddply(geneGO, c('GO'), summarize, gene=paste(gene, collapse=',') )
> head(GOgene2)
GO gene
1 GO:0000166 738_at
2 GO:0005515 738_at,40840_at,1004_at
3 GO:0005829 738_at
4 GO:0006144 738_at
5 GO:0006195 738_at
6 GO:0008219 738_at
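The same three conversions can also be sketched in base R without plyr, using aggregate() to collapse and strsplit() to expand. The data frame toy.long below is a hypothetical example made up for illustration; the code is an alternative under that assumption, not a replacement for the ddply() calls above.

```r
# Toy one-on-one data frame (hypothetical values)
toy.long <- data.frame(
  gene = c("geneA", "geneA", "geneB", "geneB", "geneB"),
  GO   = c("GO:0000001", "GO:0000002", "GO:0000002", "GO:0000003", "GO:0000004"),
  stringsAsFactors = FALSE
)

# One gene vs. many GO terms: paste the GO terms of each gene together
gene2go <- aggregate(GO ~ gene, data = toy.long, FUN = paste, collapse = ",")

# One GO term vs. many genes: paste the genes of each GO term together
go2gene <- aggregate(gene ~ GO, data = toy.long, FUN = paste, collapse = ",")

# Reverse: split each comma-separated string and repeat the gene accordingly
go.split <- strsplit(gene2go$GO, split = ",", fixed = TRUE)
toy.rev <- data.frame(
  gene = rep(gene2go$gene, lengths(go.split)),
  GO   = unlist(go.split),
  stringsAsFactors = FALSE
)
```

Here toy.rev should contain the same gene/GO pairs as toy.long, which is a quick way to check that the round trip loses nothing.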


Writing date: 20150510

