Abstract:The
process of ANOVA analysis addressed linear associations between
variables, ANOVA modeling, formula optimization, and multiple
testing.
Analysis of variance
(ANOVA) is a widely used for statistically analyzing the effect of
vairables on your experiment. ANOVA tends to be used when you have
discrete levels of variables and linear modeling is used when the
variables are continuous.
General process
For the data set
cigarette smokers, the response Variable ‘smokers’ (the
continuous variable) the smokers at the percentage of the population.
The discrete factors are family income (5 levels) and age (3 levels)
known as the dependent variables.
>
ciga.smokers<-data.frame(smokers=c(38,42,14,41,41,16,36,39,18,32,36,15,28,33,17),
+ income=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
age=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3) )
> head(ciga.smokers)
smokers income age
1 38 1 1
2 42 1 2
3 14 1 3
4 41 2 1
5 41 2 2
6 16 2 3
ANOVA is based on
the linear modeling of the data across variables. Therefore, we
should make surely linear correlations between the response variable
‘smokers’ and the two factors ‘income’ and ‘age’,
respectively.
#present data by groups
library(ggplot2)
#plotting
ggplot(ciga.smokers, aes(income, smokers, colour=income)) +
facet_wrap(~age) +
geom_point()+
labs(title='Smokers vs. income/age', x='Levels of the incomes of
smokers',
y='Smokers at the perentage of population' )
Figure 1. Linear correlation of smoker percentage on income and age
The next is to
implement ANOVA analysis. The formula in the function aov() allows a
simple pattern, namely y~x, which is known as one-way ANOVA. Here is
the R code and ANOVA table.
> aov1<-aov(smokers~age, data=ciga.smokers) #one-way
ANOVA
> summary(aov1)
#display ANOVA result
Df Sum Sq Mean Sq F value Pr(>F)
age 1 902.5 902.5 16.61 0.00131
**
Residuals 13 706.4 54.3
---
Signif. Codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
0.1 ‘ ’ 1
> aov2<-aov(smokers~income, data=ciga.smokers)
> summary(aov2)
Df Sum Sq Mean Sq F value Pr(>F)
income 4 92.9 23.23 0.153 0.957
Residuals 10 1516.0 151.60
For the factor age,
F-statistics is 16.61 with a p-value of 0.00131. We can reject the
null hypothesis of equal means for all age groups. But we cannot
reject the null of hypothesis of equal means for all income groups.
Therefore, the factor is not income (aov2) but the age (aov1) for
ANOVA analysis.
The ANOVA F-test
answers the question whether there are significant differences in the
K population means. It does not tell us how different these groups.
When H0 is rejected in ANOVA, multiple testing is required to
determine what is driving the difference in means. Tukey’s Honest
Significance Difference post-hoc test (TukeyHSD test) creats a set of
confidence intervals on the difference between means with family-wise
probability of coverage. Age group 3 was significantly different from
Age group 1 (p value 0.0000103) and 2 (p value 0.0000020), and the
difference between group 1 and group 2 was not significant (p value
0.3967031) as showed in Figure 2.
> TK.aov1<-TukeyHSD(aov1, conf.level=0.95)
> TK.aov1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = smokers ~ age, data = ciga.smokers)
$age
diff lwr upr p adj
2-1 3.2 -3.128332 9.528332 0.3967031
3-1 -19.0 -25.328332 -12.671668 0.0000103
3-2 -22.2 -28.528332 -15.871668 0.0000020
Figure
2. Multiple comparison by the TukeyHSD test
Multiple factors
We continue ANOVA of
cigar.smokers. Thought age is one factor, there is 706.4 as the
square of residue left involved in the ANOVA model. Therefore, we
should consider if one more factor can be involved the ANOVA model.
ANOVA allows a complex situation in a logical manner, namely y~x1+x2,
which is known as two way ANOVA. After the two factor were involved,
the square of residue (75.9) decreased compared with only one factor
(age) involved. The result indicates that two factors can improve the
ANOVA model of this data set.
> #two-way ANOVA
> # age and income are indepent
> aov3<-aov(smokers~age+income, data=ciga.smokers)
> summary(aov3)
Df Sum Sq Mean Sq F value Pr(>F)
age 2 1440.1 720.1 75.93
6.27e-06 ***
income 4 92.9 23.2 2.45
0.131
Residuals 8 75.9 9.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
0.1 ‘ ’ 1
> TK.aov3<-TukeyHSD(aov3, conf.level=0.95)
> TK.aov3
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = smokers ~ age + income, data = ciga.smokers)
$age
diff lwr upr p adj
2-1 3.2 -2.365296 8.765296 0.2836958
3-1 -19.0 -24.565296 -13.434704 0.0000267
3-2 -22.2 -27.765296 -16.634704 0.0000083
$income
diff lwr upr p adj
2-1 1.3333333 -7.353300 10.019967 0.9814883
3-1 -0.3333333 -9.019967 8.353300 0.9999136
4-1 -3.6666667 -12.353300 5.019967 0.6122103
5-1 -5.3333333 -14.019967 3.353300 0.2978092
3-2 -1.6666667 -10.353300 7.019967 0.9591767
4-2 -5.0000000 -13.686634 3.686634 0.3497164
5-2 -6.6666667 -15.353300 2.019967 0.1495106
4-3 -3.3333333 -12.019967 5.353300 0.6848951
5-3 -5.0000000 -13.686634 3.686634 0.3497164
5-4 -1.6666667 -10.353300 7.019967 0.9591767
Figure 3. ANOVA with two factors
However, the above
formula is based on the hypothesis that the two factors are
independent. It is possible that the two factors are interactions.
> ## From Venables and Ripley (2002) p.165.
> head(npk)
block N P K yield
1 1 0 1 1 49.5
2 1 1 1 0 62.8
3 1 0 0 0 46.8
4 1 1 0 1 57.0
5 2 1 0 0 59.8
6 2 1 1 1 58.5
> npk.aov <- aov(yield ~ block+N+K+P, data=npk)
> summary(npk.aov)
Df Sum Sq Mean Sq F value Pr(>F)
block 5 343.3 68.66 4.288 0.01272 *
N 1 189.3 189.28 11.821 0.00366 **
K 1 95.2 95.20 5.946 0.02767 *
P 1 8.4 8.40 0.525 0.47999
Residuals 15 240.2 16.01
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
0.1 ‘ ’ 1
> coefficients(npk.aov)
(Intercept) block2 block3 block4 block5
block6 N1 K1
53.800000 3.425000 6.750000 -3.900000 -3.500000
2.325000 5.616667 -3.983333
P1
-1.183333
Here is N, P, and K
are interactions in the formula.
> npk.aov <- aov(yield ~ block+K*N*P, data=npk)
> summary(npk.aov)
Df Sum Sq Mean Sq F value Pr(>F)
block 5 343.3 68.66 4.447 0.01594 *
K 1 95.2 95.20 6.166 0.02880
*
N 1 189.3 189.28 12.259 0.00437 **
P 1 8.4 8.40 0.544
0.47490
K:N 1 33.1 33.14 2.146 0.16865
K:P 1 0.5 0.48 0.031
0.86275
N:P 1 21.3 21.28 1.378 0.26317
Residuals 12 185.3 15.44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
0.1 ‘ ’ 1
> coefficients(npk.aov)
(Intercept) block2 block3 block4
block5 block6 K1 N1
51.8250000 3.4250000 6.7500000 -3.9000000 -3.5000000
2.3250000 -1.9166667 9.8500000
P1 K1:N1 K1:P1 N1:P1
0.4166667 -4.7000000 0.5666667 -3.7666667
Reference
Hoaglin, D.,
Mosteller, F., and Tukey, J. (1991). Fundamentals of Exploratory
Analysis of Variance.
Writing date:
2014.12.12, 2015.02.10
No comments:
Post a Comment