obs purchase spending additional
1   0   32.1007   0
2   1   34.3706   1
3   0    4.8749   0
4   0    8.1263   0
5   0   12.9783   0
6   0   16.0471   0
7   0   20.6648   0
8   1   42.0483   1
9   0   42.2264   1
10  1   37.99     1
11  1   53.6063   1
12  0   38.7936   0
13  0   27.9999   0
14  1   42.1694   0
15  1   56.1997   1
16  0   23.7609   0
17  0   35.0338   1
18  1   49.7388   1
19  0   24.7372   0
20  1   26.1315   1
21  0   31.322    1
22  1   40.1967   1
23  0   35.3899   0
24  0   30.228    0
25  1   50.3778   0
26  0   52.7713   0
27  0   27.3728   0
28  1   59.2146   1
29  1   50.0686   1
30  1   35.4234   1
Then fire up R from the command line in the directory where you saved the file and type the following command:
mydat <- read.table("test.dat", header=T)
Always make it a habit to verify your data. If it has only a few rows, you can dump it by typing mydat, and the contents will be displayed on the screen.
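If the data has many rows, dumping everything is unwieldy. A minimal check, using standard base R functions, is to inspect the first few rows and the structure of the data frame:

head(mydat)  # print the first six rows
str(mydat)   # show each column's type and a preview of its values
dim(mydat)   # number of rows and columns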
To see the column names, type names(mydat). It will print
[1] "obs" "purchase" "spending" "additional"
In the data, "obs" is just a row index, "purchase" is purchasing behavior, "spending" is annual credit card spending, and "additional" is possession of additional credit cards.
R has a convenient function for printing summary statistics. Try summary(mydat) and the following information will be printed to the terminal:
> summary(mydat)
      obs           purchase         spending        additional
 Min.   : 1.00   Min.   :0.0000   Min.   : 4.875   Min.   :0.0000
 1st Qu.: 8.25   1st Qu.:0.0000   1st Qu.:26.442   1st Qu.:0.0000
 Median :15.50   Median :0.0000   Median :35.212   Median :0.0000
 Mean   :15.50   Mean   :0.4333   Mean   :34.732   Mean   :0.4667
 3rd Qu.:22.75   3rd Qu.:1.0000   3rd Qu.:42.212   3rd Qu.:1.0000
 Max.   :30.00   Max.   :1.0000   Max.   :59.215   Max.   :1.0000
We see that the minimum value, first quartile, median, arithmetic mean, third quartile and maximum value are shown for each variable column.
Note that purchase and additional are binary variables (only two values are allowed), whereas spending is a continuous variable.
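A quick way to verify this in R (a small sketch using base functions) is to tabulate the distinct values of each column:

table(mydat$purchase)            # only the values 0 and 1 occur
table(mydat$additional)          # likewise binary
length(unique(mydat$spending))   # many distinct values: continuous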
Assume that the variable purchase is determined by the variables spending and additional.
We try to fit the model purchase ~ spending + additional.
We cannot simply use an ordinary regression function, which assumes that the dependent variable is continuous. But to illustrate why, we go ahead anyway and get the following results:
> attach(mydat)
> model <- purchase ~ spending + additional
> lmfit <- lm(model)
> summary(lmfit)

Call:
lm(formula = model)

Residuals:
     Min       1Q   Median       3Q      Max
-0.78345 -0.14671  0.03262  0.20923  0.68850

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.244500   0.177063  -1.381  0.17864
spending     0.013185   0.005473   2.409  0.02309 *
additional   0.471204   0.151516   3.110  0.00438 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3538 on 27 degrees of freedom
Multiple R-squared: 0.5411,	Adjusted R-squared: 0.5071
F-statistic: 15.92 on 2 and 27 DF,  p-value: 2.711e-05
We note that only the spending and additional coefficients are significant at the 5 percent level. To see why the above linear model may not be appropriate, we compare the original response variable with the fitted values.
> M <- cbind(purchase, lmfit$fitted)
> M
   purchase
1         0  0.17874336
2         1  0.67987541
3         0 -0.18022484
4         0 -0.13735561
5         0 -0.07338270
6         0 -0.03292102
7         0  0.02796269
8         1  0.78110477
9         0  0.78345299
10        1  0.72759667
11        1  0.93349531
12        0  0.26698827
13        0  0.12467491
14        1  0.31149770
15        1  0.96768891
16        0  0.06878431
17        0  0.68861961
18        1  0.88250289
19        0  0.08165669
20        1  0.57124408
21        0  0.63968006
22        1  0.75669169
23        0  0.22211098
24        0  0.15405208
25        1  0.41972426
26        0  0.45128221
27        0  0.11640669
28        1  1.00743993
29        1  0.88685125
30        1  0.69375643
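Before reading the complaint below, a one-line check (a minimal sketch using the fitted values already stored in lmfit) confirms that the least squares fit strays outside the unit interval:

range(lmfit$fitted.values)  # smallest and largest fitted values:
                            # -0.18022484 (row 3) and 1.00743993 (row 28)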
This is the major complaint, and the reason why linear regression is not much used for analyzing models with 0-1 response variables: there are negative values and a value greater than 1 among the fitted values! Now let us try the built-in logistic fitting function in R.
> logisfit <- glm(model, family=binomial, data=mydat)
> summary(logisfit)
Call:
glm(formula = model, family = binomial, data = mydat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.94321 -0.34398 -0.09305 0.52174 1.64704
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.94043 2.94746 -2.355 0.0185 *
spending 0.13948 0.06807 2.049 0.0405 *
additional 2.77448 1.19276 2.326 0.0200 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 41.054 on 29 degrees of freedom
Residual deviance: 20.076 on 27 degrees of freedom
AIC: 26.076
Number of Fisher Scoring iterations: 6
Now, in the logistic fit, we see that all computed coefficients are significant at the 5 percent level. We are confident that the results from R are accurate, since the values of the coefficients match well with those in the published reference text.
However, some results displayed by the Minitab program for this example are not displayed by R by default, such as the odds ratios for the coefficients (except for the intercept), confidence intervals for the coefficients, the log-likelihood, and various goodness-of-fit tests such as the Pearson, deviance, and Hosmer-Lemeshow tests.
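Most of these can still be obtained in R with a few extra commands. The following is a minimal sketch using standard extractor functions (exponentiating the coefficients gives the odds ratios, confint computes profile-likelihood confidence intervals, and logLik extracts the log-likelihood); the overall deviance test is computed by hand from the quantities in the summary above, while the Hosmer-Lemeshow test is not in base R and requires an add-on package:

exp(coef(logisfit))      # odds ratios for the fitted coefficients
exp(confint(logisfit))   # 95 percent confidence intervals on the odds-ratio scale
logLik(logisfit)         # log-likelihood of the fitted model
# likelihood-ratio (deviance) test of the whole model against the null model
pchisq(logisfit$null.deviance - logisfit$deviance,
       df = logisfit$df.null - logisfit$df.residual, lower.tail = FALSE)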
Let us now combine the computed responses from ordinary least squares and from logistic regression and view them side by side:
> M <- cbind(M, logisfit$fitted)
> M
   purchase
1         0  0.17874336 0.078497943
2         1  0.67987541 0.652071987
3         0 -0.18022484 0.001906712
4         0 -0.13735561 0.002997568
5         0 -0.07338270 0.005880704
6         0 -0.03292102 0.008994244
7         0  0.02796269 0.016989149
8         1  0.78110477 0.845412887
9         0  0.78345299 0.848631689
10        1  0.72759667 0.756392820
11        1  0.93349531 0.964811415
12        0  0.26698827 0.178083337
13        0  0.12467491 0.045872822
14        1  0.31149770 0.257592205
15        1  0.96768891 0.975227725
16        0  0.06878431 0.025927231
17        0  0.68861961 0.672753313
18        1  0.88250289 0.941130613
19        0  0.08165669 0.029597672
20        1  0.57124408 0.372605728
21        0  0.63968006 0.550559546
22        1  0.75669169 0.808578627
23        0  0.22211098 0.118768370
24        0  0.15405208 0.061563718
25        1  0.41972426 0.521589224
26        0  0.45128221 0.603546179
27        0  0.11640669 0.042192923
28        1  1.00743993 0.983592497
29        1  0.88685125 0.943628146
30        1  0.69375643 0.684603007
We see that there are no negative values and no values greater than 1 in the third column! In fact, the third column can be interpreted directly as probabilities. Closely related to the probability p of success of an event are the odds, defined as
$$\text{odds}=\frac{p}{q}=\frac{p}{1-p}$$
where q = 1 - p is the probability of failure. To recover p from the odds, we use the inverse formula
$$p=\frac{\text{odds}}{1+\text{odds}}.$$
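As a quick numerical check (a small sketch using the fitted values already stored in logisfit), we can convert the fitted probabilities to odds and back, confirming that the two formulas are inverses:

p <- logisfit$fitted.values   # fitted probabilities from the logistic model
odds <- p / (1 - p)           # probability to odds
p.back <- odds / (1 + odds)   # odds back to probability
all.equal(p, p.back)          # TRUE: the two formulas are inverses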
In a future article, we shall try to write a Python program to perform logistic regression.
Further reading:
Logits: Logistic Regression Using R. An earlier post using Gujarati's example.
R-bloggers.com: logistic regression, a tutorial from R-bloggers.
Wikipedia entry for logistic regression.