Friday, February 10, 2012

A textbook logistic regression example in R.

Copy the following data into a file named test.dat. This particular textbook example is from Berenson, Levine and Krehbiel, p. 605. We are interested in checking whether the results reported in this reference (obtained using Minitab) match those we get with the R statistical software. Here is the data.

obs purchase spending additional
1 0 32.1007 0
2 1 34.3706 1
3 0 4.8749 0
4 0 8.1263 0
5 0 12.9783 0
6 0 16.0471 0
7 0 20.6648 0
8 1 42.0483 1
9 0 42.2264 1
10 1 37.99 1
11 1 53.6063 1
12 0 38.7936 0
13 0 27.9999 0
14 1 42.1694 0
15 1 56.1997 1
16 0 23.7609 0
17 0 35.0338 1
18 1 49.7388 1
19 0 24.7372 0
20 1 26.1315 1
21 0 31.322 1
22 1 40.1967 1
23 0 35.3899 0
24 0 30.228 0
25 1 50.3778 0
26 0 52.7713 0
27 0 27.3728 0
28 1 59.2146 1
29 1 50.0686 1
30 1 35.4234 1

Then start R from the command line in the directory where you saved the file and type the following command.


mydat <- read.table("test.dat", header=T)


Always make it a habit to verify your data. If it has only a few rows, you can dump it by typing mydat and the contents will be displayed on the screen.
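For larger data sets, a few quick checks are more convenient than dumping everything. A minimal sketch using standard base R functions:

head(mydat)   # show only the first six rows
dim(mydat)    # number of rows and columns
str(mydat)    # column names, types and a preview of the values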

To see the column names, type names(mydat). It will print

[1] "obs"        "purchase"   "spending"   "additional"

In this data, "obs" is just a row index, "purchase" is purchasing behavior, "spending" is annual credit card spending, and "additional" is possession of additional credit cards.

R has a convenient function to print summary statistics. Try summary(mydat) and the following information will be printed to the terminal:

> summary(mydat)
      obs           purchase         spending        additional    
 Min.   : 1.00   Min.   :0.0000   Min.   : 4.875   Min.   :0.0000  
 1st Qu.: 8.25   1st Qu.:0.0000   1st Qu.:26.442   1st Qu.:0.0000  
 Median :15.50   Median :0.0000   Median :35.212   Median :0.0000  
 Mean   :15.50   Mean   :0.4333   Mean   :34.732   Mean   :0.4667  
 3rd Qu.:22.75   3rd Qu.:1.0000   3rd Qu.:42.212   3rd Qu.:1.0000  
 Max.   :30.00   Max.   :1.0000   Max.   :59.215   Max.   :1.0000  

We see that the minimum value, first quartile, median, arithmetic mean, third quartile and maximum value are shown for each variable column.

Note that purchase and additional are binary variables (only two values are allowed), whereas spending is a continuous variable.
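A quick tabulation confirms this (an optional check, not part of the textbook example):

table(mydat$purchase)     # should show only the values 0 and 1
table(mydat$additional)   # likewise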

Assume that the variable purchase is determined by the variables spending and additional.

We try to fit the model purchase ~ spending + additional.
Strictly speaking, we cannot use the ordinary regression function lm(), which assumes that the dependent variable is continuous. But to illustrate the point, we go ahead anyway and get the following results:

> attach(mydat)
> model <- purchase ~ spending + additional
> lmfit <- lm(model)
> summary(lmfit)

Call:
lm(formula = model)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78345 -0.14671  0.03262  0.20923  0.68850 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.244500   0.177063  -1.381  0.17864   
spending     0.013185   0.005473   2.409  0.02309 * 
additional   0.471204   0.151516   3.110  0.00438 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.3538 on 27 degrees of freedom
Multiple R-squared: 0.5411, Adjusted R-squared: 0.5071 
F-statistic: 15.92 on 2 and 27 DF,  p-value: 2.711e-05 


We note that only spending and additional are significant at the 5 percent level. To see why the linear model above may not be appropriate, we compare the original response variable with the fitted values.

> M <- cbind (purchase, lmfit$fitted)
> M
purchase            
1         0  0.17874336
2         1  0.67987541
3         0 -0.18022484
4         0 -0.13735561
5         0 -0.07338270
6         0 -0.03292102
7         0  0.02796269
8         1  0.78110477
9         0  0.78345299
10        1  0.72759667
11        1  0.93349531
12        0  0.26698827
13        0  0.12467491
14        1  0.31149770
15        1  0.96768891
16        0  0.06878431
17        0  0.68861961
18        1  0.88250289
19        0  0.08165669
20        1  0.57124408
21        0  0.63968006
22        1  0.75669169
23        0  0.22211098
24        0  0.15405208
25        1  0.41972426
26        0  0.45128221
27        0  0.11640669
28        1  1.00743993
29        1  0.88685125
30        1  0.69375643
> 

This is the major complaint against using linear regression to analyze models with a 0-1 response variable: there are negative values and a value greater than 1 among the fitted values, so they cannot be interpreted as probabilities.
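A quick way to confirm this from the fitted object itself (a small check, not part of the textbook example):

range(lmfit$fitted)                       # extends below 0 and above 1
sum(lmfit$fitted < 0 | lmfit$fitted > 1)  # number of fitted values outside [0, 1]

Now let us try the built-in logistic fitting function in R.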

> logisfit <- glm(model, family=binomial, data=mydat)
> summary(logisfit)

Call:
glm(formula = model, family = binomial, data = mydat)

Deviance Residuals:
     Min       1Q   Median       3Q      Max  
-1.94321 -0.34398 -0.09305  0.52174  1.64704  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -6.94043    2.94746  -2.355   0.0185 *
spending     0.13948    0.06807   2.049   0.0405 *
additional   2.77448    1.19276   2.326   0.0200 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 41.054 on 29 degrees of freedom
Residual deviance: 20.076 on 27 degrees of freedom
AIC: 26.076

Number of Fisher Scoring iterations: 6




In the logistic fit, we now see that all computed coefficients are significant at the 5 percent level. We are confident that the results from R are accurate, since the values of the coefficients match those published in the reference text.

But some results displayed by the Minitab program for this example are not displayed by R by default, such as the odds ratios for the coefficients (except for the intercept), confidence intervals for the coefficients, the log-likelihood, and various goodness-of-fit tests such as the Pearson, deviance and Hosmer-Lemeshow tests.
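These quantities can still be obtained in R with a few extra commands. The following is a minimal sketch using base R; the Hosmer-Lemeshow test is assumed to come from the hoslem.test() function in the contributed ResourceSelection package (check its documentation before relying on it):

# odds ratios: exponentiate the estimated coefficients
exp(coef(logisfit))

# Wald confidence intervals for the coefficients
# (confint(logisfit) would give profile-likelihood intervals instead)
confint.default(logisfit)

# log-likelihood of the fitted model
logLik(logisfit)

# Pearson and deviance goodness-of-fit statistics and their degrees of freedom
sum(residuals(logisfit, type = "pearson")^2)
deviance(logisfit)
df.residual(logisfit)

# Hosmer-Lemeshow test (assumes install.packages("ResourceSelection") was run)
library(ResourceSelection)
hoslem.test(mydat$purchase, fitted(logisfit))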



Let us now combine the computed responses from ordinary least squares and logistic regression and view them side by side.

> M <- cbind(M, logisfit$fitted)
> M
purchase                        
1         0  0.17874336 0.078497943
2         1  0.67987541 0.652071987
3         0 -0.18022484 0.001906712
4         0 -0.13735561 0.002997568
5         0 -0.07338270 0.005880704
6         0 -0.03292102 0.008994244
7         0  0.02796269 0.016989149
8         1  0.78110477 0.845412887
9         0  0.78345299 0.848631689
10        1  0.72759667 0.756392820
11        1  0.93349531 0.964811415
12        0  0.26698827 0.178083337
13        0  0.12467491 0.045872822
14        1  0.31149770 0.257592205
15        1  0.96768891 0.975227725
16        0  0.06878431 0.025927231
17        0  0.68861961 0.672753313
18        1  0.88250289 0.941130613
19        0  0.08165669 0.029597672
20        1  0.57124408 0.372605728
21        0  0.63968006 0.550559546
22        1  0.75669169 0.808578627
23        0  0.22211098 0.118768370
24        0  0.15405208 0.061563718
25        1  0.41972426 0.521589224
26        0  0.45128221 0.603546179
27        0  0.11640669 0.042192923
28        1  1.00743993 0.983592497
29        1  0.88685125 0.943628146
30        1  0.69375643 0.684603007
> 

We see that there are no negative values and no values greater than 1 in the third column! In fact, the entries of the third column can be interpreted as probabilities. Writing p for the probability of success of an event, the odds are defined as
$$\mathrm{odds}=\frac{p}{q}=\frac{p}{1-p},$$
and to recover p from the odds we use the formula

$$p=\frac{\mathrm{odds}}{1+\mathrm{odds}}.$$
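As a quick consistency check (a minimal sketch, reusing the logisfit object from above), converting the fitted probabilities to odds and back recovers the same values:

p <- fitted(logisfit)            # fitted probabilities from the logistic model
odds <- p / (1 - p)              # the corresponding odds
all.equal(odds / (1 + odds), p)  # converting back recovers the probabilities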

In a future article, we shall try to write a Python program to perform logistic regression.

Further reading:


Logits: Logistic Regression Using R. An earlier post using Gujarati's example.


R-bloggers.com: logistic regression, a tutorial from R-bloggers.

Wikipedia entry for logistic regression.





