Cluster analysis (I): Visualization of principal component analysis
Abstract: Principal component analysis (PCA) reduces the
dimensionality of a dataset, typically to 2 or 3 dimensions that can
be plotted. The loadings of the components also suggest which
variables matter most. Here, I present how to compute and visualize
PCA in R.
Here is the example
data, the built-in iris dataset. The first 4 columns are continuous
variables. The last column, Species, is a categorical variable.
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Make sure that
the values of each variable are approximately normally distributed;
if they are not, apply a logarithmic transformation first.
# log2-transform the four continuous variables
> log.iris <- log2(iris[, 1:4])
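One way to check whether the log transform helps is to compare a normality test before and after it. The sketch below uses shapiro.test() from the 'stats' package; the choice of the Shapiro-Wilk test and the names raw.vars, log.vars, raw.p and log.p are assumptions for illustration, not part of the original analysis.

```r
# Shapiro-Wilk p-value for each continuous variable, before and after log2.
# Small p-values indicate a departure from normality.
data(iris)
raw.vars <- iris[, 1:4]
log.vars <- log2(raw.vars)

raw.p <- sapply(raw.vars, function(v) shapiro.test(v)$p.value)
log.p <- sapply(log.vars, function(v) shapiro.test(v)$p.value)

# side-by-side comparison, one row per variable
print(data.frame(raw = raw.p, log2 = log.p))

# histograms give a quick visual check as well
op <- par(mfrow = c(2, 4))
for (n in names(raw.vars)) hist(raw.vars[[n]], main = n, xlab = "raw")
for (n in names(log.vars)) hist(log.vars[[n]], main = n, xlab = "log2")
par(op)
```

In the pooled iris data the petal measurements are bimodal because the species differ, so neither version has to look normal; running the same check within each species may be more informative.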
Base R provides PCA
through the functions prcomp() and princomp() in the ‘stats’ package,
which is loaded by default.
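# PCA on the centered, log-transformed variables;
# add scale. = TRUE to also standardize them to unit variance
> pca1 <- prcomp(log.iris)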
>print(pca1)
Standard deviations:
[1] 1.65413553 0.19960067 0.19513527 0.07763056
Rotation:
PC1 PC2 PC3 PC4
Sepal.Length 0.10090019 -0.0008537483 -0.4891583 0.86633858
Sepal.Width -0.05759298 0.5745110809 -0.7140592 -0.39590340
Petal.Length 0.50527032 -0.6870939247 -0.4269180 -0.30057416
Petal.Width 0.85510473 0.4447900940 0.2618865 0.04871476
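The Rotation matrix holds the loadings: each column gives the weights that combine the centered variables into one principal component. As a minimal sketch, the scores stored in $x can be rebuilt by hand from the loadings. The names log.vars, fit, prepared and manual.scores are ours, and the model is refitted here (without scaling, which is an assumption) so that the block runs on its own.

```r
data(iris)
log.vars <- log2(iris[, 1:4])
fit <- prcomp(log.vars)

# center (and scale, if the fit was scaled) the data exactly as prcomp() did;
# fit$scale is FALSE when no scaling was applied
prepared <- scale(log.vars, center = fit$center, scale = fit$scale)

# project the prepared data onto the loadings
manual.scores <- prepared %*% fit$rotation

# compare the hand-made scores with the ones prcomp() stores
print(all.equal(unname(manual.scores), unname(fit$x)))
```

The same lines work for a scaled fit, because scale() then receives the standard deviations kept in fit$scale.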
# biplot: observations and variable loadings on PC1 and PC2
> biplot(pca1)
# scree plot: variance of each principal component
> plot(pca1)
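The bar heights drawn by plot() for a prcomp object are the component variances, that is, the squared standard deviations. A short sketch of turning them into proportions of variance explained follows. The names fit, comp.var, var.explained and cum.explained are ours, and the model is refitted here without scaling (an assumption) so the block is self-contained.

```r
data(iris)
fit <- prcomp(log2(iris[, 1:4]))

# variance of each component = squared standard deviation
comp.var <- fit$sdev^2

# share of the total variance carried by each component, and the running total
var.explained <- comp.var / sum(comp.var)
cum.explained <- cumsum(var.explained)
print(round(rbind(proportion = var.explained, cumulative = cum.explained), 3))

# summary() reports the same quantities
print(summary(fit)$importance)
```

A common rule of thumb is to keep as many components as are needed to reach a chosen cumulative share; the threshold itself is a judgment call.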
Besides the common
function biplot(), there are other options for plotting PCA results.
For example, the
scores can be plotted with the ggplot2 package:
# load ggplot2
library(ggplot2)
# create data frame with scores
pca.scores = as.data.frame(pca1$x)
# plot of observations
ggplot(data = pca.scores, aes(x = PC1, y = PC2, label =
rownames(pca.scores))) +
geom_hline(yintercept = 0, colour = "gray50") +
geom_vline(xintercept = 0, colour = "gray50") +
geom_text(colour = "blue", alpha = 0.8, size = 4) +
ggtitle("PCA plots")
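Because the rows of the score matrix are in the same order as the rows of iris, the species labels can be attached to the scores and used for colour. This is a sketch under that assumption; the names fit, scores.df and p are ours, and the model is refitted here (without scaling) so the block runs on its own.

```r
library(ggplot2)

data(iris)
fit <- prcomp(log2(iris[, 1:4]))

# scores plus the species label of each observation (row order is preserved)
scores.df <- data.frame(fit$x, Species = iris$Species)

# points coloured by species instead of row-name labels
p <- ggplot(scores.df, aes(x = PC1, y = PC2, colour = Species)) +
  geom_hline(yintercept = 0, colour = "gray50") +
  geom_vline(xintercept = 0, colour = "gray50") +
  geom_point(alpha = 0.8, size = 2) +
  ggtitle("PCA scores by species")
print(p)
```

Points are used here rather than text labels because 150 overlapping row names are hard to read; geom_text() can be swapped back in if the labels matter.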
Another way to run
PCA is the PCA() function in the R package ‘FactoMineR’.
> library(FactoMineR)
> pca1<-PCA(log.iris)
> print(pca1)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 150 individuals, described by 4 variables
*The results are available in the following objects:

   name                 description
1  "$eig"               "eigenvalues"
2  "$var"               "results for the variables"
3  "$var$coord"         "coord. for the variables"
4  "$var$cor"           "correlations variables - dimensions"
5  "$var$cos2"          "cos2 for the variables"
6  "$var$contrib"       "contributions of the variables"
7  "$ind"               "results for the individuals"
8  "$ind$coord"         "coord. for the individuals"
9  "$ind$cos2"          "cos2 for the individuals"
10 "$ind$contrib"       "contributions of the individuals"
11 "$call"              "summary statistics"
12 "$call$centre"       "mean of the variables"
13 "$call$ecart.type"   "standard error of the variables"
14 "$call$row.w"        "weights for the individuals"
15 "$call$col.w"        "weights for the variables"
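The objects listed in this printout are ordinary list elements, so they can be read with $. As a hedged sketch, the lines below pull out the eigenvalues, the variable contributions and the individual coordinates. The name res is ours, graph = FALSE is used only to suppress the automatic plots, and the model is refitted so the block runs on its own. Note that PCA() standardizes the variables by default (scale.unit = TRUE).

```r
library(FactoMineR)

data(iris)
res <- PCA(log2(iris[, 1:4]), graph = FALSE)

# eigenvalue, percentage of variance and cumulative percentage per dimension
print(res$eig)

# contribution (in %) of each variable to each dimension
print(round(res$var$contrib, 2))

# coordinates of the first few individuals on the dimensions
print(head(res$ind$coord))
```

The contributions in each column add up to 100, which makes them a convenient way to see which variables drive a given dimension.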
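# add the species column back as a supplementary categorical variable
log.iris <- cbind.data.frame(log2(iris[, 1:4]), Species = iris[, "Species"])
# column 5 (Species) is not used to build the components
pca2 <- PCA(log.iris, quali.sup = 5, graph = FALSE)
# species labels plus individual coordinates, as coord.ellipse() expects
concat <- cbind.data.frame(log.iris[, 5], pca2$ind$coord)
# confidence ellipses around the barycentre of each species
ellipse.coord <- coord.ellipse(concat, bary = TRUE)
# individuals coloured by species, with the ellipses drawn on top
plot.PCA(pca2, habillage = 5, ellipse = ellipse.coord, cex = 0.5,
         label = "ind.sup", lwd = 3, xlim = c(-2, 2), ylim = c(-2, 2))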
Writing date: 20150430