Thursday, May 24, 2018

Multiple Linear Regression: Predicted Values

Here is a linear regression model in R:
> lm2<-lm(wt~age+ht, data=d)
> summary(lm2)

Call:
lm(formula = wt ~ age + ht, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.48498 -0.53548  0.01508  0.51986  2.77917 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.297442   0.865929  -9.582   <2e-16 ***
age          0.005368   0.010169   0.528    0.598    
ht           0.228086   0.014205  16.057   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9035 on 182 degrees of freedom
Multiple R-squared:  0.9163,    Adjusted R-squared:  0.9154
F-statistic: 995.8 on 2 and 182 DF,  p-value: < 2.2e-16


The coefficients can also be estimated directly with the closed-form OLS formula β̂ = (XᵀX)⁻¹XᵀY.
R code:
#matrix of independent variables
> X<-cbind(rep(1, nrow(d)), as.matrix(d[,c('age','ht')]))
> head(X)
        age   ht
446 1     1 52.4
671 1     1 53.6
241 1     2 56.2
261 1     2 54.7
301 1     2 54.1
196 1     3 57.0
> dim(X)
[1] 185   3
#matrix of dependent variable:
> Y<-as.matrix(d$wt)
> dim(Y)
[1] 185   1
#degree of freedom
> (df=nrow(X)-length(coef(lm2)))
[1] 182
#OLS_beta: beta0, beta1, beta2 (ginv() is from the MASS package)
> library(MASS)
> (OLS_beta<-ginv(t(X)%*%X)%*%t(X)%*%Y)
            [,1]
[1,] -8.297442239
[2,]  0.005368228
[3,]  0.228085501
#hat matrix
> H<-X%*%ginv(t(X)%*%X)%*%t(X)
> dim(H)
[1] 185 185
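As a cross-check of the closed-form OLS formula, here is a minimal NumPy sketch (Python rather than R, with a synthetic stand-in for the dataset d, since the post's data is not reproduced here). It confirms that (XᵀX)⁻¹XᵀY matches a generic least-squares solver:

```python
import numpy as np

# Synthetic stand-in for the post's dataset d (age, ht, wt values are made up).
rng = np.random.default_rng(0)
n = 185
age = rng.integers(1, 60, size=n).astype(float)
ht = 50.0 + 0.5 * age + rng.normal(0.0, 2.0, size=n)
wt = -8.3 + 0.005 * age + 0.23 * ht + rng.normal(0.0, 0.9, size=n)

# Design matrix with an intercept column, as in cbind(rep(1, n), age, ht).
X = np.column_stack([np.ones(n), age, ht])
Y = wt.reshape(-1, 1)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# The same coefficients from a generic least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

On well-conditioned data the two agree to numerical precision; with a rank-deficient X the pseudoinverse (as with ginv in R) would be the safer choice.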

  1. Response variable
Consider the multiple linear regression model

    Y = Xβ + ε,   or   yᵢ = β₀ + β₁·ageᵢ + β₂·htᵢ + εᵢ

Here εᵢ ~ N(0, σ²), or ε ~ MVN(0, σ²I), so E(ε) = 0 and Var(ε) = σ²I. Now consider the values of the response variable yᵢ, or Yₙ×₁. Xβ is the fixed part, and the random error ε follows a normal distribution, so Y = Xβ + ε also follows a normal distribution:

    Y ~ MVN(Xβ, σ²I)
  2. Predicted values of dependent variable
So the expected value of Y is

    E(Y) = E(Xβ + ε) = Xβ

and the OLS fitted values are Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY. Set the hat matrix H = X(XᵀX)⁻¹Xᵀ, so Ŷ = HY.
H is symmetric (by the linear algebra identity (ABC)ᵀ = CᵀBᵀAᵀ, and XᵀX is symmetric):

    Hᵀ = (X(XᵀX)⁻¹Xᵀ)ᵀ = X((XᵀX)⁻¹)ᵀXᵀ = X(XᵀX)⁻¹Xᵀ = H

H is idempotent:

    HH = X(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Xᵀ = X(XᵀX)⁻¹Xᵀ = H
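Both properties are easy to verify numerically. A minimal NumPy sketch (arbitrary made-up design matrix, not the post's data):

```python
import numpy as np

# Arbitrary design matrix: intercept plus two random predictors.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20), rng.normal(size=20)])

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))    # symmetric: H' = H   -> True
print(np.allclose(H @ H, H))  # idempotent: HH = H  -> True
```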

R code:
> OLS_Y=H%*%Y
> head(OLS_Y)
       [,1]
446 3.659607
671 3.933308
241 4.531700
261 4.189571
301 4.052719
196 4.719536

The n × n matrix H is known as the hat matrix simply because it maps y into ŷ, and H is determined by X alone, regardless of Y. The element hᵢⱼ of H has a direct interpretation as the amount of influence of yⱼ on the fitted value ŷᵢ: a large hᵢⱼ indicates that yⱼ has a large impact on the fit.
The diagonal elements hᵢᵢ of H are called leverages and satisfy

    0 ≤ hᵢᵢ ≤ 1   and   Σᵢ hᵢᵢ = p

where p is the number of coefficients, and n is the number of observations (rows of X) in the regression model.
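The leverage properties can also be checked numerically; the NumPy sketch below (random design matrix with the same dimensions as the post's, n = 185 and p = 3) verifies that each hᵢᵢ lies in [0, 1] and that the leverages sum to p (the trace of H):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 185, 3  # same dimensions as the post: 185 observations, 3 coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
H = X @ np.linalg.inv(X.T @ X) @ X.T

lev = np.diag(H)  # leverages h_ii
print(bool(lev.min() >= 0 and lev.max() <= 1))  # each h_ii lies in [0, 1]
print(bool(np.isclose(lev.sum(), p)))           # trace(H) equals p
```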
  3. Confidence interval of predicted value
The OLS predicted values are linear in the coefficients: Ŷ = Xβ̂, where Var(β̂) = σ²(XᵀX)⁻¹. Applying the delta method (the variance of a linear transformation),

    Var(Ŷ) = X·Var(β̂)·Xᵀ = σ²X(XᵀX)⁻¹Xᵀ = σ²H

If σ² is unknown, estimate it with s² = RSS/(n − p), where p is the number of coefficients and n is the number of observations. The 95% CI of ŷᵢ is then

    ŷᵢ ± t(0.975, df = n − p) · √(s²·hᵢᵢ)
> #residuals sum of squares
> (RSS<-t(Y)%*%(diag(1,nrow(H))-H)%*%Y)
        [,1]
[1,] 148.5623
> vcov_Y<-as.numeric(RSS/(nrow(H)-2-1))*H
> #variance-covariance matrix of OLS-Y
> dim(vcov_Y)
[1] 185 185
> #standard error of OLS-Y
> se_Y<-sqrt(diag(vcov_Y))
> #95%CI
> (t95<-qt(0.975, df=182))
[1] 1.973084
> CI<-data.frame('exp_Y'=OLS_Y, 'lower_bound'=OLS_Y-t95*se_Y, 'upper_bound'=OLS_Y+t95*se_Y)
> head(CI)
      exp_Y lower_bound upper_bound
446 3.659607    3.311358 4.007855
671 3.933308    3.607185 4.259432
241 4.531700    4.242070 4.821329
261 4.189571    3.875997 4.503145
301 4.052719    3.728534 4.376904
196 4.719536    4.437238 5.001834
> CI<-CI[order(CI$exp_Y),]
> plot(x=1:nrow(CI), y=CI$exp_Y, type='l', ylab='Expected value of OLS-Y',
+ main='95% CI of OLS-Y')
> points(x=1:nrow(CI), y=CI$lower_bound, type='l', col='blue' )
> points(x=1:nrow(CI), y=CI$upper_bound, type='l', col='red' )



Because Ŷ = HY is a linear combination of Y, the n × n matrix H also carries the variance–covariance structure of the fitted values: Var(ŷᵢ) = σ²hᵢᵢ and Cov(ŷᵢ, ŷⱼ) = σ²hᵢⱼ. Given the dependent variable Y, the fitted value ŷᵢ can be written as

    ŷᵢ = Σⱼ hᵢⱼ·yⱼ
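A quick NumPy sketch (arbitrary synthetic data) confirms that each fitted value is exactly this weighted sum of all the observed y values:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Fitted values are weighted sums of all observations: yhat_i = sum_j h_ij * y_j
yhat = H @ y
print(bool(np.isclose(yhat[0], np.dot(H[0], y))))  # True
```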

Here is a plot of the residuals (ê = (I − H)Y) against the predicted values (Ŷ = HY), where the residuals are scattered around 0:
#plot of expected value of Y~ OLS-residuals
plot(x=H%*%Y, y=(diag(1,nrow(H))-H)%*%Y,
    xlab='Predicted value (OLS_Y)', ylab='Residuals(OLS_e)')
abline(h=0)



