The linear regression model describes the relationship between a response variable y and one or more explanatory (independent) variables x1, x2, x3, etc. It can be expressed by the following equation, where α, β1, β2, β3, ... are called the parameters and ϵ is the error term:

y = α + β1*x1 + β2*x2 + β3*x3 + ··· + ϵ
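As a quick sketch of this idea (an illustration added here, not from the original post, using simulated data rather than the Prestige set), lm() in base R recovers estimates close to the true parameters:

```r
# Minimal illustration with simulated data (not from the original post):
# generate y = alpha + beta1*x1 + beta2*x2 + error, fit with lm(), and
# check that the estimated coefficients are close to the true parameters.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3*x1 - 1.5*x2 + rnorm(n, sd = 0.5)  # true alpha=2, beta1=3, beta2=-1.5
fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates should be close to 2, 3, -1.5
```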
#data set
> library(car)
>
> order.prestige<-Prestige[order(Prestige[,'prestige']), ]
> head(order.prestige)
education income women prestige census type
newsboys 9.62 918 7.00 14.8 5143 <NA>
janitors 7.11 3472 33.57 17.3 6191 bc
elevator.operators 7.58 3582 30.08 20.1 6193 bc
bartenders 8.50 3930 15.51 20.2 6123 bc
launderers 7.33 3000 69.31 20.8 6162 bc
farm.workers 8.60 1656 27.75 21.5 7182 bc
Observation of the data set
> #check the association between the response and the explanatory variables
> library(ggplot2)   # needed for ggplot()
> library(gridExtra) # needed for grid.arrange()
> p1<-ggplot(aes(x=prestige,y=education),data=Prestige) +
facet_wrap(~type, ncol=4) + geom_point()
> p2<-ggplot(aes(x=prestige,y=log2(income)),data=Prestige) +
facet_wrap(~type, ncol=4) + geom_point()
> p3<-ggplot(aes(x=prestige,y=women),data=Prestige) +
facet_wrap(~type, ncol=4) + geom_point()
> p4<-ggplot(aes(x=prestige,y=census),data=Prestige) +
facet_wrap(~type, ncol=4) + geom_point()
> grid.arrange(p1, p2, p3, p4, ncol=1)
Figure 1 shows that education and the logarithm of income are linearly correlated with prestige, while women and census appear only weakly correlated with it. Here is the formula with two explanatory variables:

prestige = α + β1*education + β2*log2(income) + ϵ
We can also examine the correlations between the variables numerically. The results are consistent with the observations from the scatter plots above.
> cor(order.prestige[,1:5])
           education     income       women   prestige     census
education 1.00000000  0.5775802  0.06185286  0.8501769 -0.8230882
income    0.57758023  1.0000000 -0.44105927  0.7149057 -0.3610023
women     0.06185286 -0.4410593  1.00000000 -0.1183342 -0.2270028
prestige  0.85017689  0.7149057 -0.11833419  1.0000000 -0.6345103
census   -0.82308821 -0.3610023 -0.22700277 -0.6345103  1.0000000
Linear regression
In the output of the linear regression model, ‘(Intercept)’ corresponds to α in the formula, the education coefficient to β1, and the log2(income) coefficient to β2; the residual standard error (the estimated standard deviation of ϵ) is 7.145.
> pres<-lm(prestige~education+log2(income),
data=order.prestige)
> summary(pres)
Call:
lm(formula = prestige ~ education + log2(income), data =
order.prestige)
Residuals:
Min 1Q Median 3Q Max
-17.0346 -4.5657 -0.1857 4.0577 18.1270
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -95.1940    10.9979  -8.656 9.27e-14 ***
education       4.0020     0.3115  12.846  < 2e-16 ***
log2(income)    7.9278     0.9961   7.959 2.94e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.145 on 99 degrees of freedom
Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
These results can also be read out individually with the following functions:
> coefficients(pres) # model coefficients
> confint(pres, level=0.95) # CIs for model parameters
> fitted(pres) # predicted values
> residuals(pres) # residuals
> anova(pres) # anova table
> vcov(pres) # covariance matrix for model parameters
> influence(pres) # regression diagnostics
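As a small extra illustration (added here, not in the original post; the occupation values below are made up), the fitted model can also be used with predict() to estimate prestige for new data:

```r
# Hypothetical example (values invented for illustration): predict the
# prestige of an occupation with 12 years of education and income 8000,
# using the same model refitted on the Prestige data from the car package
# (the row ordering used earlier does not affect the fit).
library(car)  # provides the Prestige data set
pres <- lm(prestige ~ education + log2(income), data = Prestige)
new.job <- data.frame(education = 12, income = 8000)
predict(pres, newdata = new.job, interval = "confidence")
```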
We can also inspect the fitted model graphically with the standard diagnostic plots:
> layout(matrix(c(1,2,3,4),2,2)) # arrange the four diagnostic plots in a 2x2 grid
> plot(pres)
Figure 2. Linear regression modelling
Optimizing the formula of the linear regression
Adding the interaction term between women and census improves this linear model only slightly.
> #optimize formula
> fit1<-lm(prestige~education+log2(income),
data=order.prestige)
> fit2<-lm(prestige~education+log2(income)+women:census,
data=order.prestige)
> anova(fit1, fit2)
Analysis of Variance Table
Model 1: prestige ~ education + log2(income)
Model 2: prestige ~ education + log2(income) + women:census
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     99 5053.6                              
2     98 4867.4  1    186.28 3.7505 0.05567 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here is the new model. Note that the coefficient of women:census is tiny and only marginally significant.
> pres2<-lm(prestige~education+log2(income)+women:census,
data=order.prestige)
> summary(pres2)
Call:
lm(formula = prestige ~ education + log2(income) + women:census,
data = order.prestige)
Residuals:
Min 1Q Median 3Q Max
-17.1916 -3.5684 -0.4087 4.1356 19.8425
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.136e+02  1.441e+01  -7.880 4.58e-12 ***
education    3.899e+00  3.118e-01  12.506  < 2e-16 ***
log2(income) 9.373e+00  1.234e+00   7.596 1.82e-11 ***
women:census 1.031e-05  5.325e-06   1.937   0.0557 .  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.047 on 98 degrees of freedom
Multiple R-squared: 0.8372, Adjusted R-squared: 0.8322
F-statistic: 168 on 3 and 98 DF, p-value: < 2.2e-16
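Beyond the F test above, a quick way to weigh the two fits against each other (an addition to the original post) is to compare their AIC values, which penalize extra parameters:

```r
# Compare the two models by AIC (lower is better); assumes the car package
# is installed, and refits both models on the Prestige data directly.
library(car)
fit1 <- lm(prestige ~ education + log2(income), data = Prestige)
fit2 <- lm(prestige ~ education + log2(income) + women:census, data = Prestige)
AIC(fit1, fit2)  # the interaction model has a slightly lower AIC
```

This echoes the anova comparison: the interaction term buys a small, marginal improvement.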
Writing date: 2014.10.10, 2015.02.10