Sunday, June 20, 2010

Python, Econometrics, MWD test (release 0.0.1, with a possible bug in the p-values? 06.21): Choosing between linear and log-linear models

A practical problem arises when choosing between alternative model specifications such as the basic linear model:

$$Y_t = \alpha_1 + \alpha_2 X_{2t} + \alpha_3 X_{3t} + \alpha_4 X_{4t} + u_t$$

versus

$$\log Y_t = \beta_1 + \beta_2 \log X_{2t} + \beta_3 \log X_{3t} + \beta_4 \log X_{4t} + u_t$$

The MWD test, named after MacKinnon, White and Davidson, allows us to use ordinary least squares to help choose the specification that better fits the data: the null hypothesis is that the linear model is correct, and the alternative is the log-linear model.

Our main reference is Gujarati, 4e, pp. 280-282. The outline of the method can be traced in the Python code below, though a beginner to Python will need some time to work through it. Our mwdtest expects the matrix of regressors X, whose first column must consist of all ones (a constant column), the response vector Y, and a zero-comparison tolerance ztol. All values of the model matrix and of Y must be strictly positive, since the logarithm is defined only on the positive real numbers.
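
For reference, the test proceeds roughly as follows (paraphrasing Gujarati):

1. Fit the linear model by OLS and save the fitted values Yhat.
2. Fit the log-linear model by OLS and save the fitted values logYhat.
3. Form the difference

$$Z_{1t} = \log \hat{Y}_t - \widehat{\log Y_t}$$

4. Regress Y on the original regressors plus Z1; a statistically significant coefficient on Z1 rejects the linear specification.
5. Form the difference

$$Z_{2t} = \exp\left(\widehat{\log Y_t}\right) - \hat{Y}_t$$

6. Regress log Y on the logged regressors plus Z2; a statistically significant coefficient on Z2 rejects the log-linear specification.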


"""
file     mwdtest.py
author   Dr. Ernesto P. Adorio
         U.P. Clarkfield
         ernesto . adorio @ gmail . com
description
         Performs the MacKinnon, White and Davidson test for choosing
         between competing linear and log-linear models.
version
         0.0.1  2010.06.21 initial                   
"""

from matlib    import *
from lm        import *
from dataframe import *
from math      import *


def mwdtest(X, Y, ztol):
    """
    Arguments.
      X - matrix of regressors including constant column of 1's.
      Y - output vector.
      ztol - comparison with zero tolerance.
    Output:
      the augmented linear and log-linear fit objects.
    Warning:
      The input X matrix may be modified by this routine.
    """
    # Is the first column of X all ones?
    for row in X:
        if abs(row[0] - 1.0) > ztol:
            raise ValueError, "mwdtest(): expects first column of all ones"

    # Step I: fit the linear model by OLS and save its fitted values.
    print "@@@ debug: linear fit model."
    mataugprint(X, Y)

    linearfit1 = ols(X, Y)
    Yhat = linearfit1.Yhat

    # Step II: fit the log-linear model (keep the constant column, log
    # the remaining regressors and the response).
    logX = []
    for row in X:
        a = [1.0]
        a.extend([log(x) for x in row[1:]])
        logX.append(a)
    logY = [log(y) for y in Y]
    print "@@@ debug:log linear model"
    mataugprint(logX, logY)

    logfit1 = ols(logX, logY)
    logYhat = logfit1.Yhat
   

    # Steps III-IV: augment the linear model with Z1 = log(Yhat) - logYhat
    # and refit; a significant Z1 coefficient rejects the linear model.
    Z1 = [log(yhat) - logyhat for yhat, logyhat in zip(Yhat, logYhat)]
    for x, z1 in zip(X, Z1):
        x.append(z1)
    linearfit2 = ols(X, Y)

    # Steps V-VI: augment the log-linear model with Z2 = exp(logYhat) - Yhat
    # and refit; a significant Z2 coefficient rejects the log-linear model.
    Z2 = [exp(logy) - y for logy, y in zip(logYhat, Yhat)]
    for x, z2 in zip(logX, Z2):
        x.append(z2)
    logfit2 = ols(logX, logY)
    return linearfit2, logfit2


def Test(X, Y, ztol):
    """Run the MWD test and print both augmented fits."""
    lfit, logfit = mwdtest(X, Y, ztol)
    print "testing linear fit:"
    for i, (t, beta,pvalue) in enumerate(zip(lfit.tbetas, lfit.betas, lfit.pvalues)):
        print "beta(", i,")=", beta,"t=", t, "pvalue=", pvalue
    print "F=", lfit.F, "R^2 = ", lfit.R2
    print
    print "testing log-linear fit:"
    for i, (t, beta,pvalue) in enumerate(zip(logfit.tbetas, logfit.betas, logfit.pvalues)):
        print "beta(", i,")=", beta,"t=", t, "pvalue=", pvalue
    print "F=", logfit.F, "R^2 = ", logfit.R2
    print

 
if __name__ == "__main__":
    #Rose demand data from Gujarati.
    data = """
# "Y" "X2" "X3" "X4" "X5"
# "Y" = Quantity of Roses Sold, Dozens
# "X2" = Average Wholesale Price of Roses, $ Per Dozen
# "X3" = Average Wholesale Price of Carnations, $ Per Dozen
# "X4" = Average Weekly Family Disposable Income, $ Per Week
# "X5" = Trend Variable: 1, 2, 3, ...
Y     X2   X3   X4  X5
11484 2.26 3.49 158.11 1 
9348 2.54 2.85 173.36 2 
8429 3.07 4.06 165.26 3 
10079 2.91 3.64 172.92 4 
9240 2.73 3.21 178.46 5 
8862 2.77 3.66 198.62 6 
6216 3.59 3.76 186.28 7 
8253 3.23 3.49 188.98 8 
8038 2.60 3.13 180.49 9 
7476 2.89 3.20 183.33 10 
5911 3.77 3.65 181.87 11 
7950 3.64 3.60 185.00 12 
6134 2.82 2.94 184.00 13 
5868 2.96 3.12 188.20 14 
3160 4.24 3.58 175.67 15 
5872 3.69 3.53 188.00 16 
""" 
    D = dataframe(data, header=True)
    X = matSelectCols(D.data, [1, 2])  # use only the X2 and X3 columns

    # Prepend a constant column of 1's to the model matrix.
    for x in X:
        x.insert(0, 1.0)
    Y = [d[0] for d in D.data]
 
    # See input data
    print "Input matrix."
    mataugprint(X, Y)

    ztol = 1.0e-5
    Test(X, Y, ztol)

Let us compare the results with Gujarati's output on pp. 281-282.


python mwdtest.py
Input matrix.
1.0000 2.2600 3.4900 | 11484.0000
1.0000 2.5400 2.8500 | 9348.0000
1.0000 3.0700 4.0600 | 8429.0000
1.0000 2.9100 3.6400 | 10079.0000
1.0000 2.7300 3.2100 | 9240.0000
1.0000 2.7700 3.6600 | 8862.0000
1.0000 3.5900 3.7600 | 6216.0000
1.0000 3.2300 3.4900 | 8253.0000
1.0000 2.6000 3.1300 | 8038.0000
1.0000 2.8900 3.2000 | 7476.0000
1.0000 3.7700 3.6500 | 5911.0000
1.0000 3.6400 3.6000 | 7950.0000
1.0000 2.8200 2.9400 | 6134.0000
1.0000 2.9600 3.1200 | 5868.0000
1.0000 4.2400 3.5800 | 3160.0000
1.0000 3.6900 3.5300 | 5872.0000

@@@ debug: linear fit model.
1.0000 2.2600 3.4900 | 11484.0000
1.0000 2.5400 2.8500 | 9348.0000
1.0000 3.0700 4.0600 | 8429.0000
1.0000 2.9100 3.6400 | 10079.0000
1.0000 2.7300 3.2100 | 9240.0000
1.0000 2.7700 3.6600 | 8862.0000
1.0000 3.5900 3.7600 | 6216.0000
1.0000 3.2300 3.4900 | 8253.0000
1.0000 2.6000 3.1300 | 8038.0000
1.0000 2.8900 3.2000 | 7476.0000
1.0000 3.7700 3.6500 | 5911.0000
1.0000 3.6400 3.6000 | 7950.0000
1.0000 2.8200 2.9400 | 6134.0000
1.0000 2.9600 3.1200 | 5868.0000
1.0000 4.2400 3.5800 | 3160.0000
1.0000 3.6900 3.5300 | 5872.0000

@@@ debug:log linear model
1.0000 0.8154 1.2499 | 9.3487
1.0000 0.9322 1.0473 | 9.1429
1.0000 1.1217 1.4012 | 9.0394
1.0000 1.0682 1.2920 | 9.2182
1.0000 1.0043 1.1663 | 9.1313
1.0000 1.0188 1.2975 | 9.0895
1.0000 1.2782 1.3244 | 8.7349
1.0000 1.1725 1.2499 | 9.0183
1.0000 0.9555 1.1410 | 8.9919
1.0000 1.0613 1.1632 | 8.9195
1.0000 1.3271 1.2947 | 8.6846
1.0000 1.2920 1.2809 | 8.9809
1.0000 1.0367 1.0784 | 8.7216
1.0000 1.0852 1.1378 | 8.6773
1.0000 1.4446 1.2754 | 8.0583
1.0000 1.3056 1.2613 | 8.6780

testing linear fit:
beta( 0 )= 9727.56629203 t= 3.21783350732 pvalue= 0.191817531244
beta( 1 )= -3783.06271822 t= -6.33375547952 pvalue= 0.0996893146221
beta( 2 )= 2817.71671158 t= 2.83663692683 pvalue= 0.215767665862
beta( 3 )= 85.3186515319 t= 0.0207244164541 pvalue= 0.986808315114
F= 13.4410359246 R^2 = 0.770655824729

testing log-linear fit:
beta( 0 )= 9.14861065241 t= 17.0825302798 pvalue= 0.0372248169624
beta( 1 )= -1.96990658676 t= -6.41899335191 pvalue= 0.0983866546526
beta( 2 )= 1.58915441923 t= 3.07287928323 pvalue= 0.200292472074
beta( 3 )= -0.00012938402573 t= -1.6612759925 pvalue= 0.344952324446
F= 14.1664502895 R^2 = 0.779813891197


The important values are those for the last coefficient in each fit, i.e. the added Z1 and Z2 terms. Except for the published p-value for the log-linear fit, 0.1225 versus our own 0.34495, all the other values match. Taken at face value, neither Z coefficient is significant at the 5% level, so neither specification is rejected for this data set.
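
In code, the decision rule amounts to checking the last p-value of each augmented fit. Below is a minimal sketch; mwd_decision is a hypothetical helper, not part of the module above, and it assumes the fit objects expose pvalues as used in Test():

def mwd_decision(lfit, logfit, alpha=0.05):
    # The added Z term is the last regressor in each augmented fit.
    reject_linear    = lfit.pvalues[-1] < alpha    # Z1 significant?
    reject_loglinear = logfit.pvalues[-1] < alpha  # Z2 significant?
    return reject_linear, reject_loglinear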

REMINDER TO SELF: The lm code needs to be checked for the right computation of the p-values.
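
As a quick sanity check while hunting for that bug, the two-sided p-value can be recomputed directly from the reported t statistic. A minimal sketch, assuming scipy is available and that the residual degrees of freedom of the augmented fits are n - k = 16 - 4 = 12:

from scipy.stats import t

def pvalue_from_t(tstat, df):
    # Two-sided p-value for a t statistic with df degrees of freedom.
    return 2.0 * t.sf(abs(tstat), df)

# t statistic on Z2 in the augmented log-linear fit above.
print pvalue_from_t(-1.6612759925, 12)

If those degrees of freedom are right, this should come out near Gujarati's 0.1225 rather than the 0.34495 printed above.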
