## Friday, February 10, 2012

### A textbook logistic regression example in R.

Copy the following data into a file named test.dat. This particular textbook example is from Berenson, Levine, and Krehbiel, p. 605. We are interested in checking whether the results obtained in this reference (computed with Minitab) will match those from the R statistical software. Here is the data.

obs purchase spending additional
1 0 32.1007 0
2 1 34.3706 1
3 0 4.8749 0
4 0 8.1263 0
5 0 12.9783 0
6 0 16.0471 0
7 0 20.6648 0
8 1 42.0483 1
9 0 42.2264 1
10 1 37.99 1
11 1 53.6063 1
12 0 38.7936 0
13 0 27.9999 0
14 1 42.1694 0
15 1 56.1997 1
16 0 23.7609 0
17 0 35.0338 1
18 1 49.7388 1
19 0 24.7372 0
20 1 26.1315 1
21 0 31.322 1
22 1 40.1967 1
23 0 35.3899 0
24 0 30.228 0
25 1 50.3778 0
26 0 52.7713 0
27 0 27.3728 0
28 1 59.2146 1
29 1 50.0686 1
30 1 35.4234 1


Then fire up R from the command line in the directory where you saved the file and read the data into a data frame.

Always make it a habit to verify your data. If it has only a few rows, you can dump it by typing mydat and the contents will be displayed on the screen.
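For example, assuming the file was saved as test.dat in the current working directory, the data can be read into a data frame named mydat (the name used in the rest of this post) with read.table:

```r
# Read the whitespace-separated file; header=TRUE picks up the column names
mydat <- read.table("test.dat", header = TRUE)

mydat        # dump the whole data frame (fine for only 30 rows)
dim(mydat)   # 30 rows, 4 columns
```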

To see the column names, type names(mydat). It will print

[1] "obs"        "purchase"   "spending"   "additional"


For the data, obs is just an index, "purchase" is purchasing behavior, "spending" is annual credit card spending, and "additional" is possession of additional credit cards.

R has a convenient function to print summary statistics. Try summary(mydat) and the following information will be printed to the terminal:

> summary(mydat)
      obs           purchase         spending        additional
Min.   : 1.00   Min.   :0.0000   Min.   : 4.875   Min.   :0.0000
1st Qu.: 8.25   1st Qu.:0.0000   1st Qu.:26.442   1st Qu.:0.0000
Median :15.50   Median :0.0000   Median :35.212   Median :0.0000
Mean   :15.50   Mean   :0.4333   Mean   :34.732   Mean   :0.4667
3rd Qu.:22.75   3rd Qu.:1.0000   3rd Qu.:42.212   3rd Qu.:1.0000
Max.   :30.00   Max.   :1.0000   Max.   :59.215   Max.   :1.0000


We see that the minimum value, first quartile, median, arithmetic mean, third quartile and maximum value are shown for each variable column.

Note that purchase and additional are binary variables (only two values are allowed), whereas spending is a continuous variable.
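This can be checked quickly by counting the distinct values in each column (a small sketch, assuming mydat has been loaded as above); the binary columns will show exactly 2:

```r
# Number of distinct values per column; purchase and additional give 2
sapply(mydat, function(x) length(unique(x)))
```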

Assume that the variable purchase is determined by the variables spending and additional.

We try to fit the model purchase ~ spending + additional.
We cannot properly use the ordinary regression function lm(), which assumes that the dependent variable is continuous. But to illustrate, we go ahead and obtain the following results:

> attach(mydat)
> model <- purchase ~ spending + additional
> lmfit <- lm(model)
> summary(lmfit)

Call:
lm(formula = model)

Residuals:
Min       1Q   Median       3Q      Max
-0.78345 -0.14671  0.03262  0.20923  0.68850

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.244500   0.177063  -1.381  0.17864
spending     0.013185   0.005473   2.409  0.02309 *
additional   0.471204   0.151516   3.110  0.00438 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3538 on 27 degrees of freedom
Multiple R-squared: 0.5411, Adjusted R-squared: 0.5071
F-statistic: 15.92 on 2 and 27 DF,  p-value: 2.711e-05



We note that only spending and additional (not the intercept) are significant at the 5 percent level. To see why the above linear model may not be appropriate, we compare the original response variable with the fitted values.

> M <- cbind(purchase, lmfit$fitted)
> M
   purchase
1         0  0.17874336
2         1  0.67987541
3         0 -0.18022484
4         0 -0.13735561
5         0 -0.07338270
6         0 -0.03292102
7         0  0.02796269
8         1  0.78110477
9         0  0.78345299
10        1  0.72759667
11        1  0.93349531
12        0  0.26698827
13        0  0.12467491
14        1  0.31149770
15        1  0.96768891
16        0  0.06878431
17        0  0.68861961
18        1  0.88250289
19        0  0.08165669
20        1  0.57124408
21        0  0.63968006
22        1  0.75669169
23        0  0.22211098
24        0  0.15405208
25        1  0.41972426
26        0  0.45128221
27        0  0.11640669
28        1  1.00743993
29        1  0.88685125
30        1  0.69375643
>

The major complaint about why linear regression is not much used for analyzing models with 0-1 response variables? There are negative values and a value greater than 1 in the fitted values!

Now let us try the built-in logistic fitting function in R.

> logisfit <- glm(model, family=binomial, data=mydat)
> summary(logisfit)

Call:
glm(formula = model, family = binomial, data = mydat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.94321  -0.34398  -0.09305   0.52174   1.64704

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.94043    2.94746  -2.355   0.0185 *
spending     0.13948    0.06807   2.049   0.0405 *
additional   2.77448    1.19276   2.326   0.0200 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 41.054  on 29  degrees of freedom
Residual deviance: 20.076  on 27  degrees of freedom
AIC: 26.076

Number of Fisher Scoring iterations: 6

In the logistic fit, we now see that all computed coefficients are significant at the 5 percent level. We are confident that the results of R are accurate since the values of the coefficients match well with the published reference text. But some results displayed by the Minitab program for this example are not displayed by default in R, such as odds ratios for the coefficients (except for the intercept), confidence intervals for the coefficients, the log-likelihood, and various goodness-of-fit tests like the Pearson, Deviance, and Hosmer-Lemeshow tests.
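Most of those missing quantities can still be produced in R with a few extra commands. A sketch, assuming the logisfit object from above (the ResourceSelection package named in the last comment is a suggestion, not something used in the reference):

```r
exp(coef(logisfit))       # odds ratios for the coefficients
confint(logisfit)         # profile-likelihood confidence intervals
exp(confint(logisfit))    # the same intervals on the odds-ratio scale
logLik(logisfit)          # log-likelihood of the fitted model

# The Hosmer-Lemeshow test is not built into base R; it is available in
# add-on packages, e.g. hoslem.test() in the ResourceSelection package.
```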
Let us now combine and see side by side the computed responses from ordinary least squares and from logistic regression.

> M <- cbind(M, logisfit$fitted)
> M
   purchase
1         0  0.17874336 0.078497943
2         1  0.67987541 0.652071987
3         0 -0.18022484 0.001906712
4         0 -0.13735561 0.002997568
5         0 -0.07338270 0.005880704
6         0 -0.03292102 0.008994244
7         0  0.02796269 0.016989149
8         1  0.78110477 0.845412887
9         0  0.78345299 0.848631689
10        1  0.72759667 0.756392820
11        1  0.93349531 0.964811415
12        0  0.26698827 0.178083337
13        0  0.12467491 0.045872822
14        1  0.31149770 0.257592205
15        1  0.96768891 0.975227725
16        0  0.06878431 0.025927231
17        0  0.68861961 0.672753313
18        1  0.88250289 0.941130613
19        0  0.08165669 0.029597672
20        1  0.57124408 0.372605728
21        0  0.63968006 0.550559546
22        1  0.75669169 0.808578627
23        0  0.22211098 0.118768370
24        0  0.15405208 0.061563718
25        1  0.41972426 0.521589224
26        0  0.45128221 0.603546179
27        0  0.11640669 0.042192923
28        1  1.00743993 0.983592497
29        1  0.88685125 0.943628146
30        1  0.69375643 0.684603007
>


We see that there are no negative values nor values greater than 1 in the third column! In fact, the third column can be interpreted as probabilities. These probabilities determine the odds, defined as
$$\mathrm{odds}=\frac{p}{q}=\frac{p}{1-p}$$
where p is the probability of success of an event and q = 1 - p. Conversely, to recover p from the odds, we use the formula

$$p=\frac{\mathrm{odds}}{1+\mathrm{odds}}.$$
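As a quick numerical check that these two formulas are inverses of each other (a sketch, assuming the logisfit object from the fit above):

```r
p    <- logisfit$fitted        # fitted probabilities from the logistic fit
odds <- p / (1 - p)            # probability -> odds
p2   <- odds / (1 + odds)      # odds -> probability
all.equal(p, p2)               # TRUE: we recover the original probabilities
```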

In a future article, we shall try to write a Python program to perform logistic regression.

Logits: Logistic Regression Using R. An earlier post using Gujarati's example.

R-bloggers.com: logistic regression, a tutorial from R-bloggers.

Wikipedia entry for logistic regression.