Wednesday, May 13, 2015

Cluster analysis (I): Visualization of principal component analysis



Abstract: Principal component analysis (PCA) reduces the dimensionality of a dataset, typically to 2 or 3 dimensions, so that it can be plotted. The loadings of each component also indicate how much each variable contributes to it. Here, I show how to compute and visualize PCA in R.


Here is the example data. The first 4 columns are continuous variables. The last column (Species) is a categorical variable.
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Check that the variables are approximately normally distributed. If they are skewed, apply a log transformation first.
# log2-transform the four continuous variables
> log.iris <- log2(iris[, 1:4])
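
One rough way to check the distributions is sketched below. The Shapiro-Wilk test (shapiro.test() in 'stats') is my own choice for illustration and is not part of the original workflow; hist() or qqnorm() on each column would serve as well. The names raw, logged, p.raw and p.log are made up for this example.

```r
# Shapiro-Wilk p-values for each variable, before and after the log2 transform.
# A small p-value suggests a departure from normality.
data(iris)
raw    <- iris[, 1:4]
logged <- log2(raw)

p.raw <- sapply(raw,    function(x) shapiro.test(x)$p.value)
p.log <- sapply(logged, function(x) shapiro.test(x)$p.value)

# one row per version of the data, one column per variable
round(rbind(raw = p.raw, log2 = p.log), 4)
```

Because iris pools three species, a single variable can be multimodal, so a formal test may reject normality either way. Treat the p-values as a guide, not a gate.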

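Base R provides PCA through the functions prcomp() and princomp() in the ‘stats’ package, which is loaded by default.
# PCA on the log-transformed data, with each variable scaled to unit variance
> pca1 <- prcomp(log.iris, scale. = TRUE)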
>print(pca1)
Standard deviations:
[1] 1.65413553 0.19960067 0.19513527 0.07763056

Rotation:
                     PC1           PC2        PC3         PC4
Sepal.Length  0.10090019 -0.0008537483 -0.4891583  0.86633858
Sepal.Width  -0.05759298  0.5745110809 -0.7140592 -0.39590340
Petal.Length  0.50527032 -0.6870939247 -0.4269180 -0.30057416
Petal.Width   0.85510473  0.4447900940  0.2618865  0.04871476
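
The standard deviations can be turned into the proportion of variance explained by each component, which helps decide how many components to keep. A minimal sketch, rebuilding the PCA so it runs on its own (var.explained and cum.explained are names made up here):

```r
# Proportion of variance explained by each principal component.
data(iris)
log.iris <- log2(iris[, 1:4])
pca1 <- prcomp(log.iris, scale. = TRUE)

# the variance of each component is its squared standard deviation
var.explained <- pca1$sdev^2 / sum(pca1$sdev^2)
cum.explained <- cumsum(var.explained)

round(rbind(proportion = var.explained, cumulative = cum.explained), 3)
```

summary(pca1) reports the same proportions in its "importance" table.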

#graphing
>biplot(pca1)



# scree plot: variance of each principal component
>plot(pca1)
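
plot() on a prcomp object shows the variance of each component, not of each original variable. To see how much each variable contributes to a component, one option is to square the loadings. Each column of the rotation matrix is a unit vector, so the squared loadings in a column sum to 1. This is a sketch of that idea, not something the original post computes, and contrib is a name made up here.

```r
# Percentage contribution of each variable to each principal component.
data(iris)
log.iris <- log2(iris[, 1:4])
pca1 <- prcomp(log.iris, scale. = TRUE)

# squared loadings, expressed in percent; every column sums to 100
contrib <- 100 * pca1$rotation^2

# contributions to the first two components
round(contrib[, 1:2], 1)
```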


Besides the common function biplot(), there are other options for plotting PCA results.
For example, the scores can be plotted with the ggplot2 package:
# load ggplot2
library(ggplot2)
# create data frame with scores
pca.scores = as.data.frame(pca1$x)

# plot of observations
ggplot(data = pca.scores, aes(x = PC1, y = PC2, label = rownames(pca.scores))) +
  geom_hline(yintercept = 0, colour = "gray50") +
  geom_vline(xintercept = 0, colour = "gray50") +
  geom_text(colour = "blue", alpha = 0.8, size = 4) +
  ggtitle("PCA plots")
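
Since the data come with a Species column, a natural variation is to colour the observations by species instead of labelling them with row names. This is a sketch, assuming ggplot2 is installed. The Species column added to the scores and the object name p are my additions.

```r
# PCA scores plotted as points, coloured by species.
library(ggplot2)

data(iris)
log.iris <- log2(iris[, 1:4])
pca1 <- prcomp(log.iris, scale. = TRUE)

# scores plus the grouping variable (rows are in the same order as iris)
pca.scores <- as.data.frame(pca1$x)
pca.scores$Species <- iris$Species

p <- ggplot(data = pca.scores, aes(x = PC1, y = PC2, colour = Species)) +
  geom_hline(yintercept = 0, colour = "gray50") +
  geom_vline(xintercept = 0, colour = "gray50") +
  geom_point(alpha = 0.8, size = 2) +
  ggtitle("PCA scores by species")
print(p)
```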



Another option is the PCA() function in the R package ‘FactoMineR’.
> library(FactoMineR)
> pca1<-PCA(log.iris)
> print(pca1)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 150 individuals, described by 4 variables
*The results are available in the following objects:

   name               description
1  "$eig"             "eigenvalues"
2  "$var"             "results for the variables"
3  "$var$coord"       "coord. for the variables"
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"
6  "$var$contrib"     "contributions of the variables"
7  "$ind"             "results for the individuals"
8  "$ind$coord"       "coord. for the individuals"
9  "$ind$cos2"        "cos2 for the individuals"
10 "$ind$contrib"     "contributions of the individuals"
11 "$call"            "summary statistics"
12 "$call$centre"     "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w"      "weights for the individuals"
15 "$call$col.w"      "weights for the variables"
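
The objects listed above are ordinary list elements, so they can be read directly. A short sketch, assuming FactoMineR is installed. It passes graph = FALSE to suppress the default plots, and the names pca.fm, eig and contrib are made up for this example.

```r
# Reading eigenvalues and variable contributions from a FactoMineR PCA result.
library(FactoMineR)

data(iris)
log.iris <- log2(iris[, 1:4])
pca.fm <- PCA(log.iris, graph = FALSE)

# eigenvalue, percentage of variance and cumulative percentage per component
eig <- pca.fm$eig
print(round(eig, 3))

# percentage contribution of each variable to each dimension
contrib <- pca.fm$var$contrib
print(round(contrib, 1))
```

PCA() scales the variables to unit variance by default (scale.unit = TRUE), so the eigenvalues here should sum to the number of variables.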

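# add Species as a supplementary qualitative variable (column 5)
log.iris <- cbind.data.frame(log2(iris[, 1:4]), Species = iris[, "Species"])
pca2 <- PCA(log.iris, quali.sup = 5, graph = FALSE)
# coordinates of the individuals together with their species, for the ellipses
concat <- cbind.data.frame(log.iris[, 5], pca2$ind$coord)
ellipse.coord <- coord.ellipse(concat, bary = TRUE)
# plot the individuals coloured by species, with one ellipse per species
plot.PCA(pca2, habillage = 5, ellipse = ellipse.coord, cex = 0.5,
         label = 'ind.sup', lwd = 3, xlim = c(-2, 2), ylim = c(-2, 2))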



writing date: 20150430

