Thursday, May 24, 2018

Logistic regression: Introduction

Logistic regression is a generalized linear model (GLM) for a binary (dichotomous) 
dependent variable Y (Y=1,0) regressed on independent variables X1,X2,…,Xp. The expected value of Y given X is

E(Y|X) = 1×Pr(Y=1|X) + 0×Pr(Y=0|X) = Pr(Y=1|X)
A linear model would take the form
E(Y|X) = β0 + β1X1 + … + βpXp
So the question is whether Pr(Y=1|X) can be a linear function β0 + β1X1 + … + βpXp.
In general it cannot: the linear predictor is unbounded, whereas Pr(Y=1|X) ranges from 0 to 1. 
The logistic (logit) transformation resolves this. Logistic regression relates the binary 
dependent variable to the independent variables by modelling the log odds, 
logit(P) = log(P/(1-P)), as a linear function of the predictors:
logit(Pr(Y=1|X)) = β0 + β1X1 + … + βpXp
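
As a minimal sketch of the transformation, base R provides qlogis() for the logit and plogis() for its inverse. The probabilities below are illustrative values, not taken from the data used in this post.

```r
# The logit maps a probability in (0, 1) onto the whole real line;
# its inverse maps any linear predictor back to a probability.
p <- c(0.01, 0.25, 0.50, 0.75, 0.99)   # illustrative probabilities (assumption)

logit_manual <- log(p / (1 - p))       # log odds computed by hand
logit_base   <- qlogis(p)              # the same quantity via base R

# Inverse logit, 1 / (1 + exp(-x)), is available as plogis()
p_back <- plogis(logit_base)

print(cbind(p, logit = logit_base, p_back))
```

A probability of 0.5 corresponds to a log odds of 0, and the log odds are symmetric around that point.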

  1. Logistic regression on one dummy variable and one continuous variable
The dependent variable is the binary indicator of major smoking-caused disease (mscd=1,0). The predictors 
are binary self-reported smoking (eversmk=1,0) and the continuous covariate age (lastage).
Consider the logistic regression model with an interaction term:
logit P(mscd|eversmk, lastage) =
β0 +β1×eversmk + β2×lastage+β3×eversmk×lastage

Here is R code:
> lr<-glm(mscd~eversmk*lastage, data=data1, family=binomial(link='logit'))
> (ce<-summary(lr)$coefficients)
                    Estimate  Std. Error     z value      Pr(>|z|)
(Intercept)     -7.031639241 0.315431065 -22.2921583 4.402490e-110
eversmk          0.885018245 0.386372041   2.2905856  2.198739e-02
lastage          0.069656351 0.004350929  16.0095333  1.096290e-57
eversmk:lastage -0.001384426 0.005483971  -0.2524495  8.006937e-01
> #odds or odds ratio
> exp(ce[,1])
    (Intercept)         eversmk         lastage eversmk:lastage
   0.0008834824    2.4230285981    1.0721396778    0.9986165323
> #95%CI of odds or odds ratio
> lw<-exp(ce[,1]-1.96*ce[,2])
> up<-exp(ce[,1]+1.96*ce[,2])
> paste(round(lw,5), round(up,5), sep='~')
[1] "0.00048~0.00164" "1.13625~5.16708" "1.06304~1.08132" "0.98794~1.00941"
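
The interval above is a Wald interval (estimate ± 1.96 × standard error, exponentiated). As a sketch of an equivalent shortcut, confint.default() computes the same interval directly. Since data1 is not available here, the example uses simulated data; the variable names x1, x2 and y and the true coefficients are assumptions made only for illustration.

```r
set.seed(1)                                  # simulated data, for illustration only
n   <- 500
x1  <- rbinom(n, 1, 0.5)                     # hypothetical binary predictor
x2  <- rnorm(n, mean = 50, sd = 10)          # hypothetical continuous predictor
y   <- rbinom(n, 1, plogis(-3 + 0.5 * x1 + 0.04 * x2))
fit <- glm(y ~ x1 * x2, family = binomial(link = "logit"))

ce <- summary(fit)$coefficients
z  <- qnorm(0.975)                           # about 1.96

# Manual Wald interval on the odds (ratio) scale
manual <- exp(cbind(ce[, 1] - z * ce[, 2], ce[, 1] + z * ce[, 2]))

# Same interval from confint.default(), which also uses estimate +/- z * SE
builtin <- exp(confint.default(fit))

print(round(cbind(OR = exp(coef(fit)), builtin), 4))
```

Note that confint() on a glm object returns profile-likelihood intervals instead, which can differ slightly from the Wald intervals.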

The coefficient β0 is defined by:
log-odds(mscd=1|eversmk=0, lastage=0) = β0
It is the log odds of disease for a non-smoker at age 0. The odds of disease for a non-smoker at age 0 
are 0.00088 (95%CI 0.00048-0.00164); that is, the probability of disease is about 0.00088 times the 
probability of no disease.
The coefficient β1 is defined by:
log-OR(mscd=1|eversmk=1 vs 0, lastage=0) = log-odds(mscd=1|eversmk=1, lastage=0) -
 log-odds(mscd=1|eversmk=0, lastage=0) = (β0 + β1) - β0 = β1
It is the log odds ratio of disease comparing smokers with non-smokers at age 0. At age 0, the odds of disease 
for a smoker are 2.42 times (95%CI 1.14-5.17) the odds of disease for a non-smoker.
The coefficient β2 is defined by:
log-OR(mscd=1|eversmk=0, lastage=age+1 vs age) = log-odds(mscd=1|eversmk=0, lastage=age+1) -
 log-odds(mscd=1|eversmk=0, lastage=age) = (β0 + β2×(age+1)) - (β0 + β2×age) = β2
It is the log odds ratio of disease for a one-unit increase in age among non-smokers. Among non-smokers, 
the odds of disease increase by 7% (95%CI 6-8%) for each additional year of age.
The coefficient β3 is defined by the equations below:
log odds(mscd=1|eversmk=1, lastage=age) = β0 +β1 + β2×age+β3×age
log odds(mscd=1|eversmk=1, lastage=age+1) = β0 +β1 + β2×(age+1) + β3×(age+1)
Δ1= log odds(mscd=1|eversmk=1, lastage=age+1) - log odds(mscd=1|eversmk=1, lastage=age)= 
 [β0 +β1 + β2×(age+1) + β3×(age+1)] - (β0 +β1 + β2×age+β3×age) = β2 + β3
log odds(mscd=1|eversmk=0, lastage=age) = β0 + β2×age
log odds(mscd=1|eversmk=0, lastage=age+1) = β0 + β2×(age+1)
Δ2= log odds(mscd=1|eversmk=0, lastage=age+1) - log odds(mscd=1|eversmk=0, lastage=age)=
  β0 + β2×(age+1) - (β0 + β2×age) =β2
So Δ1 - Δ2 = β2 + β3 - β2 = β3
So β3 is the difference between the log odds ratio of disease for a one-year increase in age
 among smokers and the log odds ratio of disease for a one-year increase in age 
among non-smokers. On the odds scale, the per-year odds ratio of disease among smokers 
is 1.00 times (95%CI 0.99-1.01) the per-year odds ratio among non-smokers, so there is no 
evidence that the effect of age differs by smoking status (p=0.80).
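
The derivation can be checked numerically. The sketch below plugs the coefficients reported in the summary table into the model; the helper log_odds() and the chosen age of 60 are assumptions made for illustration.

```r
# Coefficients copied from the summary table above
b <- c(b0 = -7.031639241, b1 = 0.885018245,
       b2 = 0.069656351,  b3 = -0.001384426)

# Log odds of disease for a given smoking status and age (hypothetical helper)
log_odds <- function(eversmk, age) {
  unname(b["b0"] + b["b1"] * eversmk + b["b2"] * age + b["b3"] * eversmk * age)
}

age <- 60                                    # illustrative age (assumption)

# Odds ratio per extra year of age, within each smoking group
or_age_nonsmk <- exp(log_odds(0, age + 1) - log_odds(0, age))   # exp(b2)
or_age_smk    <- exp(log_odds(1, age + 1) - log_odds(1, age))   # exp(b2 + b3)

# Their ratio recovers exp(b3), whatever age is chosen
ratio <- or_age_smk / or_age_nonsmk

# The smoker vs non-smoker odds ratio depends on age: exp(b1 + b3 * age)
or_smk_at_age <- exp(log_odds(1, age) - log_odds(0, age))

print(c(or_age_nonsmk = or_age_nonsmk, or_age_smk = or_age_smk,
        ratio = ratio, or_smk_at_age = or_smk_at_age,
        p_smoker = plogis(log_odds(1, age))))
```

Because of the interaction term, exp(β1) describes the smoking effect only at age 0; at any other age the smoker versus non-smoker odds ratio is exp(β1 + β3×age).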


  2. Logistic regression on two dummy variables
Consider the logistic regression of big medical expenditure (bigexp=1,0) on two binary variables,
 major smoking-caused disease (mscd=1,0) and gender (male=1,0):
logit P(bigexp|mscd, male) = β0 +β1×mscd + β2×male+β3×mscd×male


R code:
> lr<-glm(bigexp~mscd*male, data=data1, family=binomial(link='logit'))
> (ce<-summary(lr)$coefficients)
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) -0.5996118 0.02708850 -22.135288 1.445870e-108
mscd         1.6901596 0.09371319  18.035451  1.026544e-72
male        -0.3409380 0.04309174  -7.911911  2.534679e-15
mscd:male    0.3308752 0.13388727   2.471297  1.346241e-02
> #odds or odds ratio
> exp(ce[,1])
(Intercept)        mscd        male   mscd:male
  0.5490247   5.4203455   0.7111030   1.3921860
> #95%CI of odds or odds ratio
> lw<-exp(ce[,1]-1.96*ce[,2])
> up<-exp(ce[,1]+1.96*ce[,2])
> paste(round(lw,5), round(up,5), sep='~')
[1] "0.52064~0.57896" "4.51083~6.51324" "0.65351~0.77377" "1.07085~1.80994"

The coefficient β0 is the log odds (bigexp=1|mscd=0, male=0), the log odds of big expenditure 
for females without disease. The odds of big expenditure versus lower expenditure 
are 0.55 (95%CI 0.52-0.58) among females without disease.
The coefficient β1 is the log odds ratio of big expenditure comparing disease with no disease 
among females (male=0). Among females, the odds of big expenditure 
with disease are 5.42 times (95%CI 4.51-6.51) the odds without disease.
The coefficient β2 is the log odds ratio of big expenditure comparing men with women 
among persons without disease (mscd=0). Among persons without disease, the odds of big expenditure 
for men are 0.71 times (95%CI 0.65-0.77) the odds for women.
For the coefficient β3, rewrite the model as:
logit P(bigexp|mscd, male) = β0 + β1×mscd + (β2 + β3×mscd)×male
So (β2 + β3×mscd) is the log odds ratio of big expenditure comparing men with women at a given 
disease status. β3 is the difference between the log odds ratio of big expenditure comparing men 
with women among persons with disease and the same log odds ratio among persons without disease.
 On the odds scale, the men versus women odds ratio among persons with disease is 
1.39 times (95%CI 1.07-1.81) the men versus women odds ratio among persons without disease.
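
To make the four groups concrete, the sketch below rebuilds the odds of big expenditure in each mscd × male cell from the coefficients reported in the summary table. The matrix layout and variable names are assumptions made for illustration.

```r
# Coefficients copied from the summary table above
b0 <- -0.5996118; b1 <- 1.6901596; b2 <- -0.3409380; b3 <- 0.3308752

# Odds of big expenditure in each of the four mscd x male cells
odds <- matrix(
  exp(c(b0,      b0 + b1,
        b0 + b2, b0 + b1 + b2 + b3)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("female", "male"), c("no_disease", "disease"))
)

# Male vs female odds ratio within each disease group
or_sex_no_disease <- odds["male", "no_disease"] / odds["female", "no_disease"]  # exp(b2)
or_sex_disease    <- odds["male", "disease"]    / odds["female", "disease"]     # exp(b2 + b3)

# The ratio of these two odds ratios is exp(b3)
ratio_of_ors <- or_sex_disease / or_sex_no_disease

print(round(odds, 4))
print(c(or_sex_no_disease = or_sex_no_disease,
        or_sex_disease    = or_sex_disease,
        ratio_of_ors      = ratio_of_ors))
```

With two dummy variables and their interaction, the model is saturated: the four coefficients reproduce the four cell odds exactly.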

  3. Logistic regression on two continuous predictors
Consider the logistic regression of major smoking-caused disease (mscd=1,0) on two continuous variables,
 last self-reported age (lastage) and total expenditure (totalexp):
logit P(mscd|lastage,totalexp) = β0 +β1×lastage + β2×totalexp + β3×lastage×totalexp


R code:
> lr<-glm(mscd~lastage*totalexp, data=data1, family=binomial(link='logit'))
> (ce<-summary(lr)$coefficients)
                      Estimate   Std. Error    z value      Pr(>|z|)
(Intercept)      -6.314547e+00 1.960597e-01 -32.207272 1.396089e-227
lastage           6.263994e-02 2.860232e-03  21.900300 2.580737e-106
totalexp          1.403753e-04 1.979799e-05   7.090382  1.337425e-12
lastage:totalexp -1.223882e-06 2.805894e-07  -4.361825  1.289823e-05
> #odds or odds ratio
> exp(ce[,1])
     (Intercept)          lastage         totalexp lastage:totalexp
     0.001809785      1.064643435      1.000140385      0.999998776
> #95%CI of odds or odds ratio
> lw<-exp(ce[,1]-1.96*ce[,2])
> up<-exp(ce[,1]+1.96*ce[,2])
> paste(round(lw,5), round(up,5), sep='~')
[1] "0.00123~0.00266" "1.05869~1.07063" "1.0001~1.00018"  "1~1"
The coefficient β0 is the log odds (mscd=1|lastage=0, totalexp=0), the log odds of disease at age 0 
with total expenditure of 0. The odds of disease are 
0.0018 (95%CI 0.0012-0.0027) when age and total expenditure are both zero.
The coefficient β1 is the log odds ratio (mscd=1|lastage, totalexp=0), 
the log odds ratio of disease for a one-year increase in age when total expenditure is 0. 
The odds of disease are multiplied by 1.06 (95%CI 1.06-1.07) for each additional year of age 
at zero total expenditure.
The coefficient β2 is the log odds ratio (mscd=1|lastage=0, totalexp), 
the log odds ratio of disease for a one-unit increase in total expenditure at age 0. 
The odds of disease are multiplied by 1.00014 (95%CI 1.00010-1.00018) for each additional unit 
of total expenditure at age 0.
Regarding the coefficient β3, the model can be rewritten as:
logit P(mscd|lastage, totalexp) = β0 + β1×lastage + (β2 + β3×lastage)×totalexp
So (β2 + β3×lastage) is the log odds ratio of disease for a one-unit increase in total expenditure 
at a given age, and β3 is the change in that log odds ratio for each additional year of age.
 On the odds scale, the odds ratio per unit of total expenditure is multiplied by 
exp(β3) = 0.999998776 for each additional year of age. This rounds to 1.00, but the interaction 
is statistically significant (p=1.3e-05), so the effect of expenditure weakens slightly as age increases.
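
Because the per-unit coefficients are tiny, the interaction is easier to see on a larger step. The sketch below evaluates the odds ratio for a 1,000-unit increase in totalexp at a few ages, using the coefficients from the summary table. The helper or_totalexp(), the step of 1,000 units and the chosen ages are assumptions made for illustration.

```r
# Coefficients copied from the summary table above
b1 <- 6.263994e-02    # lastage
b2 <- 1.403753e-04    # totalexp
b3 <- -1.223882e-06   # lastage:totalexp

# Odds ratio of disease for a `step`-unit increase in totalexp at a given age
# (hypothetical helper): exp(step * (b2 + b3 * age))
or_totalexp <- function(age, step = 1) exp(step * (b2 + b3 * age))

ages <- c(40, 60, 80)                        # illustrative ages (assumption)
or_by_age <- sapply(ages, or_totalexp, step = 1000)
names(or_by_age) <- paste0("age_", ages)
print(round(or_by_age, 4))

# Age at which the fitted totalexp effect would reach zero (log OR = 0)
age_zero_effect <- -b2 / b3
print(age_zero_effect)
```

Since β3 is negative, the odds ratio for expenditure shrinks with age, although it stays above 1 over the ages considered here.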
