Thursday, February 12, 2015

R: Linear regression modeling



Abstract: perform linear regression with the function lm(), and optimize the model formula.


The linear regression model describes the relationship between a response variable y and one or more explanatory variables x1, x2, x3, ···. It can be expressed by the following equation, where α, β1, β2, β3, ··· are the parameters and ϵ is the error term:
y = α + β1x1 + β2x2 + β3x3 + ··· + ϵ
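As a quick sanity check of this equation, we can simulate data with known parameters (made up here: α = 2, β1 = 3, β2 = −1) and confirm that lm() recovers them, a minimal sketch with artificial data:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# true parameters: alpha = 2, beta1 = 3, beta2 = -1, plus noise eps
y   <- 2 + 3*x1 - 1*x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2)
round(coef(fit), 1)  # estimates close to 2, 3, -1
```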

#data set
> library(car)
>
> order.prestige<-Prestige[order(Prestige[,'prestige']), ]
> head(order.prestige)
                   education income women prestige census type
newsboys                9.62    918  7.00     14.8   5143 <NA>
janitors                7.11   3472 33.57     17.3   6191   bc
elevator.operators      7.58   3582 30.08     20.1   6193   bc
bartenders              8.50   3930 15.51     20.2   6123   bc
launderers              7.33   3000 69.31     20.8   6162   bc
farm.workers            8.60   1656 27.75     21.5   7182   bc


Observing the data set
> #check associations between the response and explanatory variables
> library(ggplot2)
> library(gridExtra)
> p1<-ggplot(aes(x=prestige,y=education),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> p2<-ggplot(aes(x=prestige,y=log2(income)),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> p3<-ggplot(aes(x=prestige,y=women),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> p4<-ggplot(aes(x=prestige,y=census),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> grid.arrange(p1, p2, p3, p4, ncol=1)
Figure 1. Linear correlations between variables
Figure 1 shows that education and the logarithm of income are linearly correlated with prestige, while women and census appear only weakly correlated with it. This suggests a formula with two explanatory variables:
prestige = α + β1*education + β2*log2(income) + ϵ

We can also quantify the correlations between the variables. The results are consistent with the observations from the scatter plots above.
> cor(order.prestige[,1:5])
            education     income       women   prestige     census
education  1.00000000  0.5775802  0.06185286  0.8501769 -0.8230882
income     0.57758023  1.0000000 -0.44105927  0.7149057 -0.3610023
women      0.06185286 -0.4410593  1.00000000 -0.1183342 -0.2270028
prestige   0.85017689  0.7149057 -0.11833419  1.0000000 -0.6345103
census    -0.82308821 -0.3610023 -0.22700277 -0.6345103  1.0000000
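The significance of a single correlation can be checked with cor.test(). A minimal sketch using the built-in mtcars data, so the snippet stands alone without the car package:

```r
# test whether one pairwise correlation differs significantly from zero
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate  # strong negative correlation
ct$p.value   # far below 0.05
```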

Linear regression

In the output of the linear regression model, '(Intercept)' corresponds to α in the formula, the education coefficient to β1, and the log2(income) coefficient to β2; the residual standard error, an estimate of the spread of ϵ, is 7.145.
> pres<-lm(prestige~education+log2(income), data=order.prestige)
> summary(pres)

Call:
lm(formula = prestige ~ education + log2(income), data = order.prestige)

Residuals:
     Min       1Q   Median       3Q      Max
-17.0346  -4.5657  -0.1857   4.0577  18.1270

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -95.1940    10.9979  -8.656 9.27e-14 ***
education       4.0020     0.3115  12.846  < 2e-16 ***
log2(income)    7.9278     0.9961   7.959 2.94e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.145 on 99 degrees of freedom
Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16

Individual parts of these results can be read separately with the following functions:
> coefficients(pres) # model coefficients
> confint(pres, level=0.95) # confidence intervals for model parameters
> fitted(pres) # predicted values
> residuals(pres) # residuals
> anova(pres) # anova table
> vcov(pres) # covariance matrix for model parameters
> influence(pres) # regression diagnostics
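For example, coefficients() and fitted() together confirm that a fitted value is just the linear combination of the coefficients and the predictors. A sketch with the built-in mtcars data so it runs on its own:

```r
# rebuild one fitted value by hand: alpha + beta1*x1 + beta2*x2
m <- lm(mpg ~ wt + hp, data = mtcars)
b <- coefficients(m)                                   # alpha, beta1, beta2
manual <- b[[1]] + b[[2]]*mtcars$wt[1] + b[[3]]*mtcars$hp[1]
all.equal(manual, unname(fitted(m)[1]))                # TRUE
```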

We can also examine the regression diagnostics graphically.
> layout(matrix(c(1,2,3,4),2,2)) #arrange the usual 4 diagnostic plots in a 2x2 grid
> plot(pres)

Figure 2. Linear regression modelling

Optimizing the formula of the linear regression
Adding an interaction between the variables women and census improves the model only slightly, as the anova() comparison below shows.
> #optimize formula
> fit1<-lm(prestige~education+log2(income), data=order.prestige)
> fit2<-lm(prestige~education+log2(income)+women:census, data=order.prestige)
> anova(fit1, fit2)
Analysis of Variance Table

Model 1: prestige ~ education + log2(income)
Model 2: prestige ~ education + log2(income) + women:census
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     99 5053.6
2     98 4867.4  1    186.28 3.7505 0.05567 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
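Nested models can also be compared by AIC, where a lower value indicates a better trade-off between fit and complexity. A minimal sketch with the built-in mtcars data:

```r
# compare a simpler and a richer model by AIC; lower is better
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)
```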

Here is the new model. Note that the coefficient of women:census is very small and only marginally significant.
> pres2<-lm(prestige~education+log2(income)+women:census, data=order.prestige)
> summary(pres2)

Call:
lm(formula = prestige ~ education + log2(income) + women:census,
data = order.prestige)

Residuals:
     Min       1Q   Median       3Q      Max
-17.1916  -3.5684  -0.4087   4.1356  19.8425

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.136e+02  1.441e+01  -7.880 4.58e-12 ***
education     3.899e+00  3.118e-01  12.506  < 2e-16 ***
log2(income)  9.373e+00  1.234e+00   7.596 1.82e-11 ***
women:census  1.031e-05  5.325e-06   1.937   0.0557 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.047 on 98 degrees of freedom
Multiple R-squared: 0.8372, Adjusted R-squared: 0.8322
F-statistic: 168 on 3 and 98 DF, p-value: < 2.2e-16
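Once a formula is settled, the fitted model can be applied to new observations with predict(), optionally with confidence intervals. A sketch using mtcars as a stand-in, since Prestige requires the car package:

```r
# predict a response for new data, with a confidence interval for the mean
m <- lm(mpg ~ wt + hp, data = mtcars)
newcar <- data.frame(wt = 3.0, hp = 120)
pred <- predict(m, newdata = newcar, interval = "confidence")
pred  # columns: fit, lwr, upr
```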



Writing date: 2014.10.10, 2015.02.10
