Multiple Linear regression: Residuals
> lm2<-lm(wt~age+ht, data=d)
> summary(lm2)
Call:
lm(formula = wt ~ age + ht, data = d)
Residuals:
Min 1Q Median 3Q Max
-2.48498 -0.53548 0.01508 0.51986 2.77917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.297442 0.865929 -9.582 <2e-16 ***
age 0.005368 0.010169 0.528 0.598
ht 0.228086 0.014205 16.057 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9035 on 182 degrees of freedom
Multiple R-squared: 0.9163, Adjusted R-squared: 0.9154
F-statistic: 995.8 on 2 and 182 DF, p-value: < 2.2e-16
The estimated coefficient data:image/s3,"s3://crabby-images/1d608/1d6088fd32097dfa6aa3bbd3e1cd39a5653a1372" alt=""
using OLS
- What is residual denoted as ε.
Consider a multiple variate linear regression model (MvLR). Suppose that there is linear relationship between X and Y.
In another word, data:image/s3,"s3://crabby-images/ae436/ae436cdc274bd47376ddf98578709f13b075d6a3" alt=""
is independent random variable (iid).
There are equations if there is column matrix εnx1:
So column matrix εnx1 follows multivariates Gaussian distribution: ε ~ MVG(0, σ2I), which indicates a multivariate normal distribution of ε with mean 0nx1 and the variance-covariance matrix σ2Inxn.
There is residuals sum of squares (RSS):
- Estimation of residual using OLS.
Because the population residuals are unknown, so consider OLS residuals based on observed data.
Consider OLS residuals data:image/s3,"s3://crabby-images/17baf/17baf86c9ab43ca52cc794b17b636cbe3f3cce6d" alt=""
data:image/s3,"s3://crabby-images/31fd7/31fd70879f4cbb8c109bc35588328abd483c73bc" alt=""
Consider another equation of data:image/s3,"s3://crabby-images/ea834/ea8340af1be334679e19c5d2651e47b51f9bd77e" alt=""
data:image/s3,"s3://crabby-images/fd411/fd411edc3e879c363a263b319c9cb793a0d3ecff" alt=""
OLS residual data:image/s3,"s3://crabby-images/7a73b/7a73b4ae2d6b7c3d1c405c5eb7092ae5f437ee0c" alt=""
is unbiased estimator of ε.
Proof data:image/s3,"s3://crabby-images/fb84c/fb84c7d419f48c074bc6ef9d109ecc3a0397c179" alt=""
data:image/s3,"s3://crabby-images/c03a2/c03a2a7ed7a4fbae5a4ab869649a3a6fc34880a9" alt=""
Proof variance of data:image/s3,"s3://crabby-images/016a1/016a18653987fa5f9e3dd45f1ef7037612ca7ef0" alt=""
: data:image/s3,"s3://crabby-images/e8091/e80917b5b24fb84e915414d09dd439b2b066c321" alt=""
data:image/s3,"s3://crabby-images/7e907/7e907d9b1501782d38988deffaad056a746ede41" alt=""
There is linear algebra theorem Var(Xv)=XVar(v)X' where v is a column and X is a matrix.
So data:image/s3,"s3://crabby-images/3d8a1/3d8a129312d6bc1ff1cadd90d65fd7a7c36764a7" alt=""
data:image/s3,"s3://crabby-images/988ac/988acb8e4738a47727cc1e9b27bc995af963778e" alt=""
- Estimation of variance of residuals
It is supposed that ε follow gaussian distribution, ε ~ MVG(0, σ2I), σ2 is the population variance, but it is unknown. We then estimate σ2 with data:image/s3,"s3://crabby-images/eede2/eede21940d6482f2245e6837d0fcebb84e6dad85" alt=""
. So there is a definition on data:image/s3,"s3://crabby-images/af9d7/af9d712d6d2e20159ce0beaf3b71fa2ada13218e" alt=""
, n is number of observations, and p is number of coefficients:
Proof data:image/s3,"s3://crabby-images/e47c0/e47c07824e0291c14fac4997a285086c8fe8a2ee" alt=""
is an unbiased estimator of the variance of the disturbance.
There is prediction error sum of squared (PESS) or residuals sum of squares (RSS):
So residuals standard error data:image/s3,"s3://crabby-images/9d928/9d928d90a007bfcf7c82a8de9534c49ed21cdfeb" alt=""
:
95% CI of data:image/s3,"s3://crabby-images/862b6/862b6fcd57ad5fd4c8781ffbca5676c57dca56e8" alt=""
:
The PESS estimate is useful as summary measure of a model's ability to predict new observations.
- R Code
R code:
#matrix of independent variables
> X<-cbind(rep(1, nrow(d)), as.matrix(d[,c('age','ht')]))
> head(X)
age ht
446 1 1 52.4
671 1 1 53.6
241 1 2 56.2
261 1 2 54.7
301 1 2 54.1
196 1 3 57.0
> dim(X)
[1] 185 3
#matrix of depencent variable:
> Y<-as.matrix(d$wt)
> dim(Y)
[1] 185 1
#degree of freedom
> (df=nrow(X)-length(coef(lm2)))
[1] 182
#OLS_beta: beta0, beta1, beta2
> (OLS_beta<-ginv(t(X)%*%X)%*%t(X)%*%Y)
[,1]
[1,] -8.297442239
[2,] 0.005368228
[3,] 0.228085501
#hat matrix
> H<-X%*%ginv(t(X)%*%X)%*%t(X)
> dim(H)
[1] 185 185
Here is R code to calculate OLS residuals:
> residuals<-(diag(1, nrow(X))-H)%*%Y
> dim(residuals)
> #standard error of residuals
> (se_residuals<-sqrt(t(Y)%*%(diag(1, nrow(X))-H)%*%Y/df))
[,1]
[1,] 0.9034802
> #95% CI of residuals
> (t95<-qt(0.975, df=182, lower.tail=F))
[1] 1.973084
> df<-data.frame('residuals'=residuals,
+ 'lower_bound'=as.numeric(residuals)-t95*se_residuals,
+ 'upper_bound'=as.numeric(residuals)+t95*se_residuals)
> head(df)
residuals lower_bound upper_bound
446 0.44039334 -1.342249 2.223036
671 -0.13330852 -1.915951 1.649334
241 0.06830038 -1.714342 1.850943
261 -0.08957137 -1.872214 1.693071
301 0.04728045 -1.735362 1.829923
196 0.48046383 -1.302179 2.263106
Then histogram of residuals
> hist(residuals, main='OLS residuals')
> abline(v=c(-t95*se_residuals, 0, t95*se_residuals),
+ col=c('blue', 'green', 'red'), lwd=3)
No comments:
Post a Comment