Thursday, February 12, 2015

R: Analysis of variance (ANOVA)



Abstract:The process of ANOVA analysis addressed linear associations between variables, ANOVA modeling, formula optimization, and multiple testing.


Analysis of variance (ANOVA) is a widely used for statistically analyzing the effect of vairables on your experiment. ANOVA tends to be used when you have discrete levels of variables and linear modeling is used when the variables are continuous.

General process

For the data set cigarette smokers, the response Variable ‘smokers’ (the continuous variable) the smokers at the percentage of the population. The discrete factors are family income (5 levels) and age (3 levels) known as the dependent variables.
> ciga.smokers<-data.frame(smokers=c(38,42,14,41,41,16,36,39,18,32,36,15,28,33,17),
+ income=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5), age=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3) )
> head(ciga.smokers)
smokers income age
1 38 1 1
2 42 1 2
3 14 1 3
4 41 2 1
5 41 2 2
6 16 2 3

ANOVA is based on the linear modeling of the data across variables. Therefore, we should make surely linear correlations between the response variable ‘smokers’ and the two factors ‘income’ and ‘age’, respectively.
#present data by groups
library(ggplot2)
#plotting
ggplot(ciga.smokers, aes(income, smokers, colour=income)) +
facet_wrap(~age) +
geom_point()+
labs(title='Smokers vs. income/age', x='Levels of the incomes of smokers',
y='Smokers at the perentage of population' )

Figure 1. Linear correlation of smoker percentage on income and age

The next is to implement ANOVA analysis. The formula in the function aov() allows a simple pattern, namely y~x, which is known as one-way ANOVA. Here is the R code and ANOVA table.
> aov1<-aov(smokers~age, data=ciga.smokers) #one-way ANOVA
> summary(aov1) #display ANOVA result
Df Sum Sq Mean Sq F value Pr(>F)
age 1 902.5 902.5 16.61 0.00131 **
Residuals 13 706.4 54.3
---
Signif. Codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> aov2<-aov(smokers~income, data=ciga.smokers)
> summary(aov2)
Df Sum Sq Mean Sq F value Pr(>F)
income 4 92.9 23.23 0.153 0.957
Residuals 10 1516.0 151.60
For the factor age, F-statistics is 16.61 with a p-value of 0.00131. We can reject the null hypothesis of equal means for all age groups. But we cannot reject the null of hypothesis of equal means for all income groups. Therefore, the factor is not income (aov2) but the age (aov1) for ANOVA analysis.

The ANOVA F-test answers the question whether there are significant differences in the K population means. It does not tell us how different these groups. When H0 is rejected in ANOVA, multiple testing is required to determine what is driving the difference in means. Tukey’s Honest Significance Difference post-hoc test (TukeyHSD test) creats a set of confidence intervals on the difference between means with family-wise probability of coverage. Age group 3 was significantly different from Age group 1 (p value 0.0000103) and 2 (p value 0.0000020), and the difference between group 1 and group 2 was not significant (p value 0.3967031) as showed in Figure 2.
> TK.aov1<-TukeyHSD(aov1, conf.level=0.95)
> TK.aov1
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = smokers ~ age, data = ciga.smokers)

$age
diff lwr upr p adj
2-1 3.2 -3.128332 9.528332 0.3967031
3-1 -19.0 -25.328332 -12.671668 0.0000103
3-2 -22.2 -28.528332 -15.871668 0.0000020



Figure 2. Multiple comparison by the TukeyHSD test

Multiple factors
We continue ANOVA of cigar.smokers. Thought age is one factor, there is 706.4 as the square of residue left involved in the ANOVA model. Therefore, we should consider if one more factor can be involved the ANOVA model. ANOVA allows a complex situation in a logical manner, namely y~x1+x2, which is known as two way ANOVA. After the two factor were involved, the square of residue (75.9) decreased compared with only one factor (age) involved. The result indicates that two factors can improve the ANOVA model of this data set.
> #two-way ANOVA
> # age and income are indepent
> aov3<-aov(smokers~age+income, data=ciga.smokers)
> summary(aov3)
Df Sum Sq Mean Sq F value Pr(>F)
age 2 1440.1 720.1 75.93 6.27e-06 ***
income 4 92.9 23.2 2.45 0.131
Residuals 8 75.9 9.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TK.aov3<-TukeyHSD(aov3, conf.level=0.95)
> TK.aov3
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = smokers ~ age + income, data = ciga.smokers)

$age
diff lwr upr p adj
2-1 3.2 -2.365296 8.765296 0.2836958
3-1 -19.0 -24.565296 -13.434704 0.0000267
3-2 -22.2 -27.765296 -16.634704 0.0000083

$income
diff lwr upr p adj
2-1 1.3333333 -7.353300 10.019967 0.9814883
3-1 -0.3333333 -9.019967 8.353300 0.9999136
4-1 -3.6666667 -12.353300 5.019967 0.6122103
5-1 -5.3333333 -14.019967 3.353300 0.2978092
3-2 -1.6666667 -10.353300 7.019967 0.9591767
4-2 -5.0000000 -13.686634 3.686634 0.3497164
5-2 -6.6666667 -15.353300 2.019967 0.1495106
4-3 -3.3333333 -12.019967 5.353300 0.6848951
5-3 -5.0000000 -13.686634 3.686634 0.3497164
5-4 -1.6666667 -10.353300 7.019967 0.9591767

Figure 3. ANOVA with two factors
However, the above formula is based on the hypothesis that the two factors are independent. It is possible that the two factors are interactions.
> ## From Venables and Ripley (2002) p.165.
> head(npk)
block N P K yield
1 1 0 1 1 49.5
2 1 1 1 0 62.8
3 1 0 0 0 46.8
4 1 1 0 1 57.0
5 2 1 0 0 59.8
6 2 1 1 1 58.5
> npk.aov <- aov(yield ~ block+N+K+P, data=npk)
> summary(npk.aov)
Df Sum Sq Mean Sq F value Pr(>F)
block 5 343.3 68.66 4.288 0.01272 *
N 1 189.3 189.28 11.821 0.00366 **
K 1 95.2 95.20 5.946 0.02767 *
P 1 8.4 8.40 0.525 0.47999
Residuals 15 240.2 16.01
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> coefficients(npk.aov)
(Intercept) block2 block3 block4 block5 block6 N1 K1
53.800000 3.425000 6.750000 -3.900000 -3.500000 2.325000 5.616667 -3.983333
P1
-1.183333

Here is N, P, and K are interactions in the formula.
> npk.aov <- aov(yield ~ block+K*N*P, data=npk)
> summary(npk.aov)
Df Sum Sq Mean Sq F value Pr(>F)
block 5 343.3 68.66 4.447 0.01594 *
K 1 95.2 95.20 6.166 0.02880 *
N 1 189.3 189.28 12.259 0.00437 **
P 1 8.4 8.40 0.544 0.47490
K:N 1 33.1 33.14 2.146 0.16865
K:P 1 0.5 0.48 0.031 0.86275
N:P 1 21.3 21.28 1.378 0.26317
Residuals 12 185.3 15.44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> coefficients(npk.aov)
(Intercept) block2 block3 block4 block5 block6 K1 N1
51.8250000 3.4250000 6.7500000 -3.9000000 -3.5000000 2.3250000 -1.9166667 9.8500000
P1 K1:N1 K1:P1 N1:P1
0.4166667 -4.7000000 0.5666667 -3.7666667


Reference
Hoaglin, D., Mosteller, F., and Tukey, J. (1991). Fundamentals of Exploratory Analysis of Variance.



Writing date: 2014.12.12, 2015.02.10

No comments:

Post a Comment