Logistic regression: Introduction
Logistic regression is a generalized linear model (GLM) for a binary (dichotomous)
dependent variable Y (Y=1,0) regressed on independent variables X1,X2,…,Xp. The expected value of Y
given X is
E(Y|X) = 1×Pr(Y=1|X) + 0×Pr(Y=0|X) = Pr(Y=1|X)
A linear model would take the form
E(Y|X) = β0+β1X1+…+βpXp
This raises the question of whether Pr(Y=1|X) can be linear in β0+β1X1+…+βpXp.
It cannot be equated to it directly: the linear predictor is unbounded, but Pr(Y=1|X) ranges only
from 0 to 1. Logistic regression therefore applies the logit transformation,
logit(P) = log(P/(1-P)), which maps a probability in (0, 1) onto the whole real line, and models
the relationship between the binary dependent variable and the independent variables on that scale:
logit(Pr(Y=1|X)) = β0+β1X1+…+βpXp
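The logit and its inverse are available in base R as qlogis() and plogis(). A minimal sketch, using illustrative probabilities only, of how the transformation maps values in (0, 1) onto the real line and back:

```r
# Probabilities strictly between 0 and 1 (illustrative values)
p <- c(0.01, 0.25, 0.5, 0.75, 0.99)

# logit(p) = log(p / (1 - p)); qlogis() is the base R equivalent
logit_manual <- log(p / (1 - p))
logit_base <- qlogis(p)

# The inverse logit, plogis(), maps any real number back into (0, 1)
p_back <- plogis(logit_base)

print(cbind(p, logit = logit_base, p_back))
```

A probability of 0.5 corresponds to a logit of 0, and probabilities near 0 or 1 map to large negative or positive values.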
- Logistic regression on one dummy variable and one continuous variable
The dependent variable is the binary indicator of major smoking-caused disease (mscd=1,0). The
predictors are binary self-reported smoking (eversmk=1,0) and the continuous covariate age (lastage).
Consider the logistic regression model:
logit P(mscd|eversmk, lastage) =
β0 +β1×eversmk + β2×lastage+β3×eversmk×lastage
Here is R code:
> lr<-glm(mscd~eversmk*lastage, data=data1, family=binomial(link='logit'))
> (ce<-summary(lr)$coefficients)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.031639241 0.315431065 -22.2921583 4.402490e-110
eversmk 0.885018245 0.386372041 2.2905856 2.198739e-02
lastage 0.069656351 0.004350929 16.0095333 1.096290e-57
eversmk:lastage -0.001384426 0.005483971 -0.2524495 8.006937e-01
> #odds or odds ratio
> exp(ce[,1])
(Intercept) eversmk lastage eversmk:lastage
0.0008834824 2.4230285981 1.0721396778 0.9986165323
> #95%CI of odds or odds ratio
> lw<-exp(ce[,1]-1.96*ce[,2])
> up<-exp(ce[,1]+1.96*ce[,2])
> paste(round(lw,5), round(up,5), sep='~')
[1] "0.00048~0.00164" "1.13625~5.16708" "1.06304~1.08132" "0.98794~1.00941"
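The dataset data1 is not included here, so the following sketch simulates a hypothetical data set with the same variable names (the sample size, age range and true coefficients are assumptions chosen only for illustration). It suggests that the manual interval computed above should agree closely with confint.default(), base R's Wald interval for fitted models.

```r
set.seed(1)
n <- 2000

# Hypothetical data: names mirror the model above, values are simulated
sim <- data.frame(
  eversmk = rbinom(n, 1, 0.5),
  lastage = round(runif(n, 40, 90))
)
# Assumed true model without interaction (illustrative coefficients)
eta <- -7 + 0.9 * sim$eversmk + 0.07 * sim$lastage
sim$mscd <- rbinom(n, 1, plogis(eta))

fit <- glm(mscd ~ eversmk * lastage, data = sim,
           family = binomial(link = "logit"))
ce <- summary(fit)$coefficients

# Manual Wald interval on the odds scale, as in the session above
manual <- exp(cbind(lower = ce[, 1] - 1.96 * ce[, 2],
                    upper = ce[, 1] + 1.96 * ce[, 2]))

# confint.default() uses qnorm(0.975) (about 1.96) rather than the rounded 1.96
wald <- exp(confint.default(fit))

print(cbind(OR = exp(ce[, 1]), manual))
```

The two intervals differ only through the rounding of 1.96. Profile-likelihood intervals, which some packages report by default, can differ more noticeably in small samples.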
The coefficient β0 is the log odds in the reference group:
log-odds(mscd=1|eversmk=0, lastage=0)= β0
It is the log odds of disease for a non-smoker at age 0. The estimated odds of disease versus
no disease in this group are 0.00088 (95%CI 0.00048-0.00164).
The coefficient β1 is given by:
log-OR(eversmk=1 vs 0|lastage=0)= log-odds(mscd=1|eversmk=1, lastage=0) -
log-odds(mscd=1|eversmk=0, lastage=0)= (β0 + β1) -β0 =β1
It is the log odds ratio of disease comparing smokers with non-smokers at age 0. The odds of disease
among smokers are 2.42 times (95%CI 1.14-5.17) the odds among non-smokers at age 0.
The coefficient β2 is given by:
log-OR(age+1 vs age|eversmk=0) = log-odds(mscd=1|eversmk=0, lastage=age+1) -
log-odds(mscd=1| eversmk=0, lastage=age) = (β0 + β2×(age+1)) - (β0 +β2×age) = β2
It is the log odds ratio for a one-year increase in age among non-smokers. Among non-smokers,
the odds of disease increase by 7% (95%CI 6-8%) for each additional year of age.
The coefficient β3 is defined by the equations below:
log odds(mscd=1|eversmk=1, lastage=age) = β0 +β1 + β2×age+β3×age
log odds(mscd=1|eversmk=1, lastage=age+1) = β0 +β1 + β2×(age+1) + β3×(age+1)
Δ1= log odds(mscd=1|eversmk=1, lastage=age+1) - log odds(mscd=1|eversmk=1, lastage=age)=
[β0 +β1 + β2×(age+1) + β3×(age+1)] - (β0 +β1 + β2×age+β3×age) = β2 + β3
log odds(mscd=1|eversmk=0, lastage=age) = β0 + β2×age
log odds(mscd=1|eversmk=0, lastage=age+1) = β0 + β2×(age+1)
Δ2= log odds(mscd=1|eversmk=0, lastage=age+1) - log odds(mscd=1|eversmk=0, lastage=age)=
β0 + β2×(age+1) - (β0 + β2×age) =β2
So Δ1 -Δ2 =β2 + β3 - β2 = β3
So β3 is the difference between the log odds ratio of disease for a one-year increase in age
among smokers and the same log odds ratio among non-smokers. On the odds scale, the per-year
odds ratio for age among smokers is 1.00 times (95%CI 0.99-1.01) that among non-smokers, so there is
no evidence that the effect of age differs by smoking status (p=0.80).
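As a small worked check of this derivation, the printed estimates can be combined directly. The sketch below copies the coefficients from the summary output of the first model (the vector name b is only for illustration) and computes the per-year odds ratio for age in each smoking group.

```r
# Estimates copied from the summary output of mscd ~ eversmk * lastage
b <- c(b0 = -7.031639241, b1 = 0.885018245,
       b2 = 0.069656351, b3 = -0.001384426)

# Per-year odds ratio for age among non-smokers: exp(b2)
or_age_nonsmokers <- exp(b[["b2"]])

# Per-year odds ratio for age among smokers: exp(b2 + b3)
or_age_smokers <- exp(b[["b2"]] + b[["b3"]])

# Their ratio equals exp(b3), the interaction term on the odds scale
ratio <- or_age_smokers / or_age_nonsmokers

print(round(c(nonsmokers = or_age_nonsmokers,
              smokers = or_age_smokers,
              ratio = ratio), 4))
```

Both per-year odds ratios come out at roughly 1.07, which is why their ratio is so close to 1.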
- Logistic regression on two dummy variables
Consider a logistic regression of big medical expenditure (bigexp=1,0) on two binary variables,
major smoking-caused disease (mscd=1,0) and gender (male=1,0):
logit P(bigexp|mscd, male) = β0 +β1×mscd + β2×male+β3×mscd×male
R code:
> lr<-glm(bigexp~mscd*male, data=data1, family=binomial(link='logit'))
> (ce<-summary(lr)$coefficients)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5996118 0.02708850 -22.135288 1.445870e-108
mscd 1.6901596 0.09371319 18.035451 1.026544e-72
male -0.3409380 0.04309174 -7.911911 2.534679e-15
mscd:male 0.3308752 0.13388727 2.471297 1.346241e-02
> #odds or odds ratio
> exp(ce[,1])
(Intercept) mscd male mscd:male
0.5490247 5.4203455 0.7111030 1.3921860
> #95%CI of odds or odds ratio
> lw<-exp(ce[,1]-1.96*ce[,2])
> up<-exp(ce[,1]+1.96*ce[,2])
> paste(round(lw,5), round(up,5), sep='~')
[1] "0.52064~0.57896" "4.51083~6.51324" "0.65351~0.77377" "1.07085~1.80994"
The coefficient β0 is log-odds(bigexp=1|mscd=0, male=0), the log odds of big expenditure for
females without disease. In this group the odds of big expenditure versus lower expenditure
are 0.55 (95%CI 0.52-0.58).
The coefficient β1 is the log odds ratio of big expenditure comparing persons with and without
disease among females (male=0). Among females, the odds of big expenditure with disease are
5.42 times (95%CI 4.51-6.51) the odds without disease.
The coefficient β2 is the log odds ratio of big expenditure comparing men with women among
persons without disease (mscd=0). In this group the odds of big expenditure for men are
0.71 times (95%CI 0.65-0.77) the odds for women, i.e. about 29% lower.
The coefficient β3 follows from rewriting the model as:
logit P(bigexp|mscd, male) = β0 +β1×mscd + (β2+β3×mscd)×male
So (β2+β3×mscd) is the log odds ratio of big expenditure comparing men with women at a given
disease status. β3 is the difference between that log odds ratio among persons with disease
and the same log odds ratio among persons without disease. On the odds scale, the men-versus-women
odds ratio among persons with disease is 1.39 times (95%CI 1.07-1.81) that among persons
without disease.
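With two dummy variables the model has only four cells, so the fitted odds can be tabulated directly. A minimal sketch using the estimates copied from the summary output of the second model (the names b and cells are illustrative):

```r
# Estimates copied from the summary output of bigexp ~ mscd * male
b <- c(b0 = -0.5996118, b1 = 1.6901596, b2 = -0.3409380, b3 = 0.3308752)

# Log odds of big expenditure for each combination of mscd and male
cells <- expand.grid(mscd = 0:1, male = 0:1)
cells$log_odds <- b[["b0"]] + b[["b1"]] * cells$mscd +
  b[["b2"]] * cells$male + b[["b3"]] * cells$mscd * cells$male
cells$odds <- exp(cells$log_odds)
cells$prob <- plogis(cells$log_odds)

# Men-versus-women odds ratio within each disease group
or_no_disease <- exp(b[["b2"]])            # about 0.71
or_disease <- exp(b[["b2"]] + b[["b3"]])   # about 0.99

print(cells)
print(c(or_no_disease = or_no_disease, or_disease = or_disease,
        ratio = or_disease / or_no_disease))
```

The ratio of the two group-specific odds ratios reproduces exp(β3), about 1.39: men have lower odds of big expenditure than women among persons without disease, but nearly equal odds among persons with disease.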
- Logistic regression on two continuous predictors
Consider a logistic regression of major smoking-caused disease (mscd=1,0) on two continuous
variables, last self-reported age (lastage) and total expenditure (totalexp):
logit P(mscd|lastage,totalexp) = β0 +β1×lastage + β2×totalexp + β3×lastage×totalexp
R code:
> lr<-glm(mscd~lastage*totalexp, data=data1, family=binomial(link='logit'))
> (ce<-summary(lr)$coefficients)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.314547e+00 1.960597e-01 -32.207272 1.396089e-227
lastage 6.263994e-02 2.860232e-03 21.900300 2.580737e-106
totalexp 1.403753e-04 1.979799e-05 7.090382 1.337425e-12
lastage:totalexp -1.223882e-06 2.805894e-07 -4.361825 1.289823e-05
> #odds or odds ratio
> exp(ce[,1])
(Intercept) lastage totalexp lastage:totalexp
0.001809785 1.064643435 1.000140385 0.999998776
> #95%CI of odds or odds ratio
> lw<-exp(ce[,1]-1.96*ce[,2])
> up<-exp(ce[,1]+1.96*ce[,2])
> paste(round(lw,5), round(up,5), sep='~')
[1] "0.00123~0.00266" "1.05869~1.07063" "1.0001~1.00018" "1~1"
The coefficient β0 is log-odds(mscd=1|lastage=0, totalexp=0), the log odds of disease at age 0
with total expenditure of 0. The odds of disease versus no disease are
0.0018 (95%CI 0.0012-0.0027) when age and total expenditure are both zero.
The coefficient β1 is the log odds ratio of disease for a one-year increase in age
when totalexp=0. The odds of disease are multiplied by 1.06 (95%CI 1.06-1.07)
for each additional year of age when total expenditure is zero.
The coefficient β2 is the log odds ratio of disease for a one-unit increase in total expenditure
when lastage=0. The odds of disease are multiplied by 1.00014 (95%CI 1.00010-1.00018)
for each additional unit of expenditure at age 0. For an increase of $1,000 (assuming totalexp
is recorded in dollars) this corresponds to exp(1000×β2) ≈ 1.15.
The coefficient β3 follows from rewriting the model as:
logit P(mscd|lastage, totalexp) = β0 +β1×lastage + (β2+β3×lastage)×totalexp
So (β2+β3×lastage) is the log odds ratio of disease for a one-unit increase in total expenditure
at a given age. β3 is the change in that log odds ratio for each additional year of age.
On the odds scale, the odds ratio per unit of total expenditure is multiplied by
exp(β3) = 0.9999988 for each additional year of age. This rounds to 1.00, but the interaction is
statistically significant (p<0.001), so the association between expenditure and disease weakens
slightly with age.
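Because the per-unit effect is tiny, the interaction is easier to see over a larger step and at realistic ages. The sketch below uses the estimates copied from the summary output of the third model. The helper or_totalexp(), the default step of 1,000 units and the three ages are assumptions made for illustration.

```r
# Estimates copied from the summary output of mscd ~ lastage * totalexp
b2 <- 1.403753e-04    # totalexp
b3 <- -1.223882e-06   # lastage:totalexp

# Odds ratio of disease for an increase of `step` units of totalexp
# at a given age: exp((b2 + b3 * age) * step)
or_totalexp <- function(age, step = 1000) {
  exp((b2 + b3 * age) * step)
}

ages <- c(40, 60, 80)  # illustrative ages
print(round(setNames(or_totalexp(ages), ages), 3))
```

Under these estimates the odds ratio per 1,000 units of expenditure declines as age increases, which is what the negative sign of β3 implies.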