Tiezheng Yuan Ph.D.: Cluster analysis (I): Visualization of principal component analysis

Cluster analysis (I): Visualization of principal component analysis

Abstract: Principal component analysis (PCA) reduces the dimensionality of a dataset, typically to 2 or 3 dimensions that can be plotted. It can also indicate which variables contribute most to the variation in the data. Here, I show how to compute and visualize PCA in R.

Here is the example data, the built-in iris dataset. The first 4 columns are continuous variables; the last column, Species, is a categorical variable.

> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

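Check that the values of each variable are approximately normally distributed. If they are strongly skewed, apply a logarithmic transformation first.

# log2-transform the four continuous variables
> log.iris <- log2(iris[, 1:4])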
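One way to make this check concrete is sketched below. It compares Shapiro-Wilk p-values for each variable before and after the log2 transformation. The use of shapiro.test() is an assumption of this sketch, not part of the original workflow; histograms or Q-Q plots would serve the same purpose. The names raw, logged, p.raw and p.log are introduced here for illustration.

```r
# Sketch: compare normality of the raw and log2-transformed variables.
# shapiro.test() comes from the default 'stats' package; choosing it is an assumption.
data(iris)
raw <- iris[, 1:4]
logged <- log2(raw)

# Shapiro-Wilk p-value per column (a small p-value suggests non-normality)
p.raw <- sapply(raw, function(v) shapiro.test(v)$p.value)
p.log <- sapply(logged, function(v) shapiro.test(v)$p.value)

# one row per version of the data, one column per variable
print(signif(rbind(raw = p.raw, log2 = p.log), 3))
```

A transformation is only worth keeping if it actually brings the variables closer to a symmetric distribution, so it is reasonable to compare the two rows before deciding.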

Base R provides two functions for PCA, prcomp() and princomp(), both in the default ‘stats’ package.

> pca1 <- prcomp(log.iris, scale. = TRUE)
> print(pca1)

Standard deviations:
[1] 1.65413553 0.19960067 0.19513527 0.07763056

Rotation:
                     PC1           PC2        PC3         PC4
Sepal.Length  0.10090019 -0.0008537483 -0.4891583  0.86633858
Sepal.Width  -0.05759298  0.5745110809 -0.7140592 -0.39590340
Petal.Length  0.50527032 -0.6870939247 -0.4269180 -0.30057416
Petal.Width   0.85510473  0.4447900940  0.2618865  0.04871476

# biplot of the observations and the variable loadings
> biplot(pca1)

# scree plot: variance captured by each principal component
> plot(pca1)
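The variances drawn in the scree plot can also be read as proportions. The sketch below recomputes the PCA so that it runs on its own; the names pca, var.comp and var.explained are introduced here for illustration. With scale. = TRUE the component variances add up to the number of variables (4 here), which gives a quick check on whether a printout came from a scaled fit. The standard deviations printed above square to a total of about 2.82, so that printout appears to come from an unscaled fit.

```r
# Sketch: proportion of variance explained by each principal component.
data(iris)
log.iris <- log2(iris[, 1:4])
pca <- prcomp(log.iris, scale. = TRUE)

# variance of each component is the squared standard deviation
var.comp <- pca$sdev^2
var.explained <- var.comp / sum(var.comp)

print(round(var.explained, 3))          # proportion per component
print(round(cumsum(var.explained), 3))  # cumulative proportion
print(sum(var.comp))                    # total variance of the scaled data

# summary() reports the same proportions in one table
print(summary(pca))
```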

Besides the common biplot() function, there are other options for plotting PCA results.

For example, the scores can be plotted with the ggplot2 package:

# load ggplot2
library(ggplot2)

# create data frame with scores
pca.scores = as.data.frame(pca1$x)

# plot of observations
ggplot(data = pca.scores, aes(x = PC1, y = PC2, label = rownames(pca.scores))) +
  geom_hline(yintercept = 0, colour = "gray50") +
  geom_vline(xintercept = 0, colour = "gray50") +
  geom_text(colour = "blue", alpha = 0.8, size = 4) +
  ggtitle("PCA plots")
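Row labels say little about group structure, so a variant that colours the points by species may be easier to read. This is a minimal sketch, assuming the categorical column is attached to the scores; the names pca, scores and p are introduced here for illustration, and the PCA is recomputed so the block runs on its own.

```r
# Sketch: PCA scores coloured by the categorical variable.
library(ggplot2)

data(iris)
pca <- prcomp(log2(iris[, 1:4]), scale. = TRUE)

# scores plus the species label of each observation
scores <- data.frame(pca$x, Species = iris$Species)

p <- ggplot(scores, aes(x = PC1, y = PC2, colour = Species)) +
  geom_hline(yintercept = 0, colour = "gray50") +
  geom_vline(xintercept = 0, colour = "gray50") +
  geom_point(alpha = 0.8, size = 2) +
  ggtitle("PCA scores by species")
print(p)
```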

Another way to run PCA is with the R package ‘FactoMineR’.

> library(FactoMineR)
> pca1 <- PCA(log.iris)
> print(pca1)

**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 150 individuals, described by 4 variables
*The results are available in the following objects:

   name               description
1  "$eig"             "eigenvalues"
2  "$var"             "results for the variables"
3  "$var$coord"       "coord. for the variables"
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"
6  "$var$contrib"     "contributions of the variables"
7  "$ind"             "results for the individuals"
8  "$ind$coord"       "coord. for the individuals"
9  "$ind$cos2"        "cos2 for the individuals"
10 "$ind$contrib"     "contributions of the individuals"
11 "$call"            "summary statistics"
12 "$call$centre"     "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w"      "weights for the individuals"
15 "$call$col.w"      "weights for the variables"
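The objects listed above can be inspected directly. The short sketch below assumes the default scale.unit = TRUE, so the variables are standardized; res is a name introduced here for illustration. $eig holds the eigenvalue and percentage of variance for each dimension, and $var$contrib holds each variable's contribution (in percent) to each dimension, which is one way to read the importance of the variables.

```r
# Sketch: reading eigenvalues and variable contributions from a FactoMineR PCA.
library(FactoMineR)

data(iris)
res <- PCA(log2(iris[, 1:4]), graph = FALSE)

# eigenvalue, percentage of variance and cumulative percentage per dimension
print(round(res$eig, 3))

# contribution (%) of each variable to each dimension; each column sums to 100
print(round(res$var$contrib, 2))
```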

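FactoMineR can also draw a confidence ellipse around each species, treating Species as a supplementary qualitative variable:

# append Species as column 5
log.iris <- cbind.data.frame(log2(iris[, 1:4]), Species = iris[, 'Species'])
# PCA with Species as a supplementary qualitative variable
pca2 <- PCA(log.iris, quali.sup = 5, graph = FALSE)
# species labels alongside the coordinates of the individuals
concat <- cbind.data.frame(log.iris[, 5], pca2$ind$coord)
# confidence ellipses around the barycentre of each species
ellipse.coord <- coord.ellipse(concat, bary = TRUE)
# individuals coloured by species, with the ellipses overlaid
plot.PCA(pca2, habillage = 5, ellipse = ellipse.coord, cex = 0.5,
         label = 'ind.sup', lwd = 3, xlim = c(-2, 2), ylim = c(-2, 2))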

writing date: 20150430

Tiezheng Yuan Ph.D.

Wednesday, May 13, 2015
