Tuesday, February 10, 2015

R: Normality Test



Abstract: How to check whether a data set is normally distributed in R, and some related concerns.


A normality test determines whether a data set is well modeled by a normal distribution, which is worth checking before applying parametric significance tests. Graphical methods, including histograms and Q-Q plots, are the most common way to examine whether the data are consistent with the null hypothesis of normality.
Histogram
Check whether the shape is roughly symmetrical. The left panel shows the distribution of the data (state population figures), and the right panel shows the distribution of random numbers drawn from a normal distribution with the same mean and standard deviation.
> population <- state.x77[,'Population']   # state populations from the built-in data set
> par(mfrow=c(1,2), mar=c(6,4,2,1), oma=c(4,3,2,1))
> hist(population, prob=T); lines(density(population), col='red')
> random.norm <- rnorm(1000, mean=mean(population), sd=sd(population))
> hist(random.norm, prob=T)
> lines(density(random.norm), col='red')


Quantile-Quantile plot (Q-Q plot)
When most of the points fall close to the reference line, the data are approximately normal. A logarithmic transformation can be applied to right-skewed data to improve normality.
> qqnorm(population); qqline(population, col=2)
> qqnorm(log2(population))   # after log transformation
> qqline(log2(population), col=2)


The Shapiro-Wilk test is one way to test univariate normality. Its null hypothesis is that the data come from a normal distribution, so a small p-value (rejecting the null) indicates that the data are sufficiently inconsistent with normality. It is not advisable to rely on the Shapiro-Wilk test alone when deciding whether a parametric procedure is appropriate.
> shapiro.test(population)   # not consistent with a normal distribution
Shapiro-Wilk normality test
data: population
W = 0.77, p-value = 1.906e-07
> shapiro.test(log2(population))   # consistent with normality after log transformation
Shapiro-Wilk normality test
data: log2(population)
W = 0.9748, p-value = 0.3585
> shapiro.test(rnorm(1000, mean=mean(population), sd=sd(population)))
Shapiro-Wilk normality test
data: rnorm(1000, mean = mean(population), sd = sd(population))
W = 0.9989, p-value = 0.8189

Another point to consider is the sample size. With a small sample, there can be a substantial departure from normality even when the normality test passes, and the estimated mean and SD may be far from the true values. When the sample size is large (in the hundreds), small deviations from normality are usually acceptable in practice. Things become more complicated with very large data sets (thousands or hundreds of thousands of observations): normality testing then faces 'kurtosis risk', because there are always some observations extremely far from the average, so formal tests tend to reject even when the deviation is practically unimportant.
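As an illustration of the sample-size effect (a minimal sketch with simulated data, not part of the original example), the same mildly heavy-tailed distribution that passes the Shapiro-Wilk test at a small sample size is usually rejected at a large one:

> # Simulated illustration: a t distribution with 10 degrees of freedom is
> # only slightly heavier-tailed than the normal distribution.
> set.seed(1)
> shapiro.test(rt(50, df=10))$p.value     # small sample: usually well above 0.05
> shapiro.test(rt(5000, df=10))$p.value   # large sample: usually below 0.05
> # Note: shapiro.test() accepts at most 5000 observations, so for very large
> # data sets a Q-Q plot is more informative than any formal test.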


Writing date: 2014.04.10, 2015.02.10








