Abstract:
How to check the normality of a data set, and related concerns.
A normality test determines whether a data set is well modeled by a
normal distribution. It should be run before applying parametric
significance tests. Graphical methods such as histograms and Q-Q plots
are the most commonly used ways to check whether the data are
consistent with the null hypothesis of a normal distribution.
Histogram
Check whether the shape is symmetric or asymmetric. The left panel
shows the distribution of the data set called population. The right
panel shows the distribution of random numbers drawn from a normal
distribution with the same mean and SD.
> population <- state.x77[, 'Population']
> par(mfrow=c(1,2), mar=c(6,4,2,1), oma=c(4,3,2,1))  # two panels side by side
> hist(population, prob=TRUE)  # left panel: observed data
> lines(density(population), col='red')  # overlay the density estimate
> random.norm <- rnorm(1000, mean=mean(population), sd=sd(population))
> hist(random.norm, prob=TRUE)  # right panel: simulated normal data
> lines(density(random.norm), col='red')
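The symmetry that the histogram shows can also be summarized as a single number. The following is a minimal sketch, not part of the original analysis: `skewness` is a hypothetical helper written here for illustration, using the moment-based definition (close to 0 for a symmetric sample, positive for a long right tail).

```r
population <- state.x77[, "Population"]

# Hypothetical helper: moment-based sample skewness.
# Third central moment divided by the 1.5 power of the second central moment.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

print(skewness(population))        # raw data
print(skewness(log2(population)))  # after a log2 transform
```

A value far from 0 for the raw data would agree with an asymmetric histogram.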
Quantile-Quantile plot (Q-Q plot)
If most points fall along the reference line, the data are usually
close to normal. A logarithmic transformation can improve normality.
> qqnorm(population); qqline(population, col=2)
> qqnorm(log2(population))  # after the logarithm
> qqline(log2(population), col=2)
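How closely the points follow a straight line can be put into a number by correlating the sample quantiles with the theoretical normal quantiles. This is only a rough sketch: `qq_cor` is a hypothetical helper name, and the correlation is an informal summary rather than a formal test.

```r
population <- state.x77[, "Population"]

# Hypothetical helper: correlation between theoretical and sample quantiles.
# qqnorm(plot.it = FALSE) returns the Q-Q coordinates without drawing anything.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

print(qq_cor(population))        # raw data
print(qq_cor(log2(population)))  # after a log2 transform
```

A value closer to 1 means the points lie closer to a straight line.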
The Shapiro-Wilk test is one method for testing univariate normality.
Its null hypothesis is that the data are normally distributed. A small
p value means you reject the null: the data are sufficiently
inconsistent with a normal distribution. A large p value only means
that no departure from normality was detected. It is not good to rely
on the Shapiro-Wilk test alone when deciding whether a statistical
procedure that assumes normality is appropriate.
> shapiro.test(population)  # not like a normal distribution
Shapiro-Wilk normality test
data: population
W = 0.77, p-value = 1.906e-07
> shapiro.test(log2(population))  # like a normal distribution after the logarithm
Shapiro-Wilk normality test
data: log2(population)
W = 0.9748, p-value = 0.3585
> shapiro.test(rnorm(1000, mean=mean(population), sd=sd(population)))
Shapiro-Wilk normality test
data: rnorm(1000, mean = mean(population), sd = sd(population))
W = 0.9989, p-value = 0.8189
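When several variables or transformations need checking, the test result can be collected into a table instead of being read from the printed output. A minimal sketch, assuming a 5% significance level: `sw_summary` and its `alpha` argument are hypothetical names introduced here, and `shapiro.test()` itself does not take a threshold.

```r
population <- state.x77[, "Population"]

# Hypothetical helper: run the Shapiro-Wilk test and keep W, the p value,
# and whether normality is rejected at the chosen level (alpha is an assumption).
sw_summary <- function(x, alpha = 0.05) {
  res <- shapiro.test(x)
  data.frame(W = unname(res$statistic),
             p.value = res$p.value,
             reject.normality = res$p.value < alpha)
}

results <- rbind(raw = sw_summary(population),
                 log2 = sw_summary(log2(population)))
print(results)
```

The `reject.normality` column is only a convenience and should be read together with the plots.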
Moreover, another point is the sample size. With a small sample, the
data can depart substantially from normality even when the normality
test passes, and the estimated mean and SD can deviate from the true
values. When your sample size is big (say, hundreds of observations),
a small deviation from normality is acceptable. But things become more
complicated with a huge data set (thousands or hundreds of thousands
of observations). Normality testing then faces ‘Kurtosis risk’ because
there are always some observations extremely far from the average, so
the test tends to reject normality.
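The effect of sample size on the test can be illustrated with a small simulation. This is a sketch under stated assumptions, not a result from the original post: a t distribution with 10 degrees of freedom stands in for data with slightly heavy tails, and the seed, the sample sizes, the number of repetitions and the 5% level are all arbitrary choices.

```r
# Sketch: how often the Shapiro-Wilk test rejects normality at different sample sizes.
set.seed(1)                 # arbitrary seed, for reproducibility only
sizes <- c(20, 200, 5000)   # shapiro.test() accepts 3 to 5000 observations
n_rep <- 100                # simulated data sets per sample size

rejection_rate <- sapply(sizes, function(n) {
  # Slightly heavy-tailed data: t distribution with 10 degrees of freedom (assumption)
  p <- replicate(n_rep, shapiro.test(rt(n, df = 10))$p.value)
  mean(p < 0.05)            # fraction of runs that reject at the 5% level
})
names(rejection_rate) <- sizes
print(rejection_rate)
```

The same mild departure from normality is rejected more often as the sample grows, which is why the test should not be the only criterion.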
Writing date:
2014.04.10, 2015.02.10