Thursday, February 12, 2015

R: Linear regression modeling



Abstract: perform linear regression with the function lm(), and optimize the model formula.


The linear regression model describes the relationship between a response variable y and one or more explanatory variables x1, x2, x3, ···. It can be expressed by the following equation, where α, β1, β2, β3, ··· are the parameters and ϵ is the error term:
y = α + β1x1 + β2x2 + β3x3 + ··· + ϵ
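As a quick sanity check of this equation, we can simulate data with known parameters (made up here: α = 2, β1 = 3, β2 = −1) and confirm that lm() recovers them, a minimal sketch with artificial data:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# true parameters: alpha = 2, beta1 = 3, beta2 = -1, plus noise eps
y   <- 2 + 3*x1 - 1*x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2)
round(coef(fit), 1)  # estimates close to 2, 3, -1
```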

#data set
> library(car)
>
> order.prestige<-Prestige[order(Prestige[,'prestige']), ]
> head(order.prestige)
                   education income women prestige census type
newsboys                9.62    918  7.00     14.8   5143 <NA>
janitors                7.11   3472 33.57     17.3   6191   bc
elevator.operators      7.58   3582 30.08     20.1   6193   bc
bartenders              8.50   3930 15.51     20.2   6123   bc
launderers              7.33   3000 69.31     20.8   6162   bc
farm.workers            8.60   1656 27.75     21.5   7182   bc


Observing the data set
> #check associations between the response and explanatory variables
> library(ggplot2)
> library(gridExtra)
> p1<-ggplot(aes(x=prestige,y=education),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> p2<-ggplot(aes(x=prestige,y=log2(income)),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> p3<-ggplot(aes(x=prestige,y=women),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> p4<-ggplot(aes(x=prestige,y=census),data=Prestige) + facet_wrap(~type, ncol=4) + geom_point()
> grid.arrange(p1, p2, p3, p4, ncol=1)
Figure 1. Linear correlations between variables
Figure 1 shows that education and the logarithm of income are linearly correlated with prestige, while women and census appear only weakly correlated with it. This suggests a formula with two explanatory variables:
prestige = α + β1*education + β2*log2(income) + ϵ

We can also quantify the correlations between the variables. The results are consistent with the observations from the scatter plots above.
> cor(order.prestige[,1:5])
            education     income       women   prestige     census
education  1.00000000  0.5775802  0.06185286  0.8501769 -0.8230882
income     0.57758023  1.0000000 -0.44105927  0.7149057 -0.3610023
women      0.06185286 -0.4410593  1.00000000 -0.1183342 -0.2270028
prestige   0.85017689  0.7149057 -0.11833419  1.0000000 -0.6345103
census    -0.82308821 -0.3610023 -0.22700277 -0.6345103  1.0000000
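The significance of a single correlation can be checked with cor.test(). A minimal sketch using the built-in mtcars data, so the snippet stands alone without the car package:

```r
# test whether one pairwise correlation differs significantly from zero
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate  # strong negative correlation
ct$p.value   # far below 0.05
```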

Linear regression

In the output of the linear regression model, '(Intercept)' corresponds to α in the formula, the education coefficient to β1, and the log2(income) coefficient to β2; the residual standard error, an estimate of the spread of ϵ, is 7.145.
> pres<-lm(prestige~education+log2(income), data=order.prestige)
> summary(pres)

Call:
lm(formula = prestige ~ education + log2(income), data = order.prestige)

Residuals:
     Min       1Q   Median       3Q      Max
-17.0346  -4.5657  -0.1857   4.0577  18.1270

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -95.1940    10.9979  -8.656 9.27e-14 ***
education       4.0020     0.3115  12.846  < 2e-16 ***
log2(income)    7.9278     0.9961   7.959 2.94e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.145 on 99 degrees of freedom
Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16

Individual parts of these results can be read separately with the following functions:
> coefficients(pres) # model coefficients
> confint(pres, level=0.95) # confidence intervals for model parameters
> fitted(pres) # predicted values
> residuals(pres) # residuals
> anova(pres) # anova table
> vcov(pres) # covariance matrix for model parameters
> influence(pres) # regression diagnostics
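For example, coefficients() and fitted() together confirm that a fitted value is just the linear combination of the coefficients and the predictors. A sketch with the built-in mtcars data so it runs on its own:

```r
# rebuild one fitted value by hand: alpha + beta1*x1 + beta2*x2
m <- lm(mpg ~ wt + hp, data = mtcars)
b <- coefficients(m)                                   # alpha, beta1, beta2
manual <- b[[1]] + b[[2]]*mtcars$wt[1] + b[[3]]*mtcars$hp[1]
all.equal(manual, unname(fitted(m)[1]))                # TRUE
```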

We can also examine the regression diagnostics graphically.
> layout(matrix(c(1,2,3,4),2,2)) #arrange the usual 4 diagnostic plots in a 2x2 grid
> plot(pres)

Figure 2. Linear regression modelling

Optimizing the formula of the linear regression
Adding an interaction between the variables women and census improves the model only slightly, as the anova() comparison below shows.
> #optimize formula
> fit1<-lm(prestige~education+log2(income), data=order.prestige)
> fit2<-lm(prestige~education+log2(income)+women:census, data=order.prestige)
> anova(fit1, fit2)
Analysis of Variance Table

Model 1: prestige ~ education + log2(income)
Model 2: prestige ~ education + log2(income) + women:census
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     99 5053.6
2     98 4867.4  1    186.28 3.7505 0.05567 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
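Nested models can also be compared by AIC, where a lower value indicates a better trade-off between fit and complexity. A minimal sketch with the built-in mtcars data:

```r
# compare a simpler and a richer model by AIC; lower is better
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)
```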

Here is the new model. Note that the coefficient of women:census is very small and only marginally significant.
> pres2<-lm(prestige~education+log2(income)+women:census, data=order.prestige)
> summary(pres2)

Call:
lm(formula = prestige ~ education + log2(income) + women:census,
data = order.prestige)

Residuals:
     Min       1Q   Median       3Q      Max
-17.1916  -3.5684  -0.4087   4.1356  19.8425

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.136e+02  1.441e+01  -7.880 4.58e-12 ***
education     3.899e+00  3.118e-01  12.506  < 2e-16 ***
log2(income)  9.373e+00  1.234e+00   7.596 1.82e-11 ***
women:census  1.031e-05  5.325e-06   1.937   0.0557 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.047 on 98 degrees of freedom
Multiple R-squared: 0.8372, Adjusted R-squared: 0.8322
F-statistic: 168 on 3 and 98 DF, p-value: < 2.2e-16
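Once a formula is settled, the fitted model can be applied to new observations with predict(), optionally with confidence intervals. A sketch using mtcars as a stand-in, since Prestige requires the car package:

```r
# predict a response for new data, with a confidence interval for the mean
m <- lm(mpg ~ wt + hp, data = mtcars)
newcar <- data.frame(wt = 3.0, hp = 120)
pred <- predict(m, newdata = newcar, interval = "confidence")
pred  # columns: fit, lwr, upr
```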



Writing date: 2014.10.10, 2015.02.10
