## Saturday, June 19, 2010

### Python, Econometrics (revised 06.20.2010): Computing the Augmented Dickey-Fuller test statistic.

A time-series vector Y is non-stationary if in the model $$Y_t = \rho Y_{t-1} + \epsilon$$, $$\rho=1$$.
Given a time-series vector Y, and corresponding time vector T, the Dickey-Fuller test for the presence of a unit root or non-stationarity works with the following test equations (models):

basic: $$\Delta y_t = \tau y_{t-1}$$
with a constant: $$\Delta y_t = \beta + \tau + y_{t-1}$$
with constant and trend: $$\Delta y_t = \beta_1 + \beta_2 T_t + \tau y_{t-1}$$

Each model require its own particular Dickey-Fuller table of critical values which is related to but different from the t-statistical distribution.

The full Augmented Dickey-Fuller test works with the following models given a specified max lag of the differenced vector $$\Delta Y$$

basic: $$\Delta y_t = \tau y_{t-1} + \rho_1 \Delta Y_{t-1}+ \rho_2 \Delta y_{t-2} +\ldots + \rho_p \Delta y_{t-p}$$
with a constant: $$\Delta y_t = \beta + \tau y_{t-1} + \rho \Delta y_{t-1}+ \rho_2 \Delta y_{t-2} + \ldots + \rho_p \Delta y_{t-p}$$

with constant and trend: $$\Delta y_t = \beta_1 + \beta_2 T_t + \tau y_{t-1}+\rho \Delta Y_{t-1}+ \rho_2 \Delta y_{t-2} + \ldots + \rho_p \Delta y_{t-p}$$

Thus if we have to be careful in specifying the input X matrix and the independent response vector to an OLS routine.

The test statistic Dickey-Fuller test in all models is the value $${\frac{\hat{\tau}}{se (\hat{\tau})}$$ where the denominator is the standard error of the computed coefficient of the lagged term $$y_{t-1}.$$

There are published tables for the critical values of the D-F test.
@@@ Place references here!!!@@@

Unlike the well known standard distribution such as the normal, t, F and $$\chi^2$$ distributions, the critical values are obtained by simulation.

The NUll hypothesis is that $$H_0: \tau = 0$$, i.e, the series $$Y$$ is non-stationary. If the absolute values of the estimated $\tau$$is less than the absolute value of the critical values for specified$$\alpha$\$, we do not reject the Null hypothesis, otherwise we reject the Null hypothesis.

Time for an example: Let us review Danao's Dicker-Fuller test applied on the GNP data in page 323. For convenience we copy it here:

YEAR GNPR PCER PGNP
1940 772.9 502.6 0.1299
1941 909.4 531.1 0.138003
1942 1080.3 527.6 0.147181
1943 1276.2 539.9 0.150995
1944 1380.6 557.1 0.153122
1945 1354.8 592.7 0.157514
1946 1096.9 655 0.193637
1947 1066.7 666.6 0.220493
1948 1108.7 681.8 0.235952
1949 1109 695.4 0.234806
1950 1203.7 733.2 0.239512
1951 1328.2 748.7 0.251016
1952 1380 771.4 0.254783
1953 1435.3 802.5 0.258901
1954 1416.2 822.7 0.263028
1955 1494.9 873.8 0.271523
1956 1525.6 899.8 0.280676
1957 1551.1 919.7 0.290761
1958 1539.2 932.9 0.296778
1959 1629.1 979.4 0.30434
1960 1665.3 1005.1 0.309434
1961 1708.7 1025.2 0.312401
1962 1799.4 1069 0.319329
1963 1873.3 1108.4 0.323974
1964 1973.3 1170.6 0.329296
1965 2087.6 1236.4 0.337756
1966 2208.3 1298.9 0.34959
1967 2271.4 1337.7 0.359426
1968 2365.6 1405.9 0.377367
1969 2423.3 1456.7 0.397763
1970 2416.2 1492 0.420288
1971 2484.8 1538.8 0.443778
1972 2608.5 1621.9 0.464942
1973 2744.1 1689.6 0.495354
1974 2729.3 1674 0.539626
1975 2695 1711.9 0.593098
1976 2826.7 1803.9 0.6307
1977 2958.6 1883.8 0.672784
1978 3115.2 1961 0.722169
1979 3192.4 2004.4 0.785678
1980 3187.1 2000.4 0.857206
1981 3248.8 2024.2 0.939608
1982 3166 2050.7 1
1983 3279.1 2146 1.038608
1984 3489.9 2246.3 1.078827
1985 3585.2 2324.5 1.115168
1986 3676.5 2418.6 1.144703

Using the GNP and Year columns only, the model to fit in R notation is Delta(GNPR)_{t-1} ~ 1 + Year + GNPR_{t-1}. The response variable is the lagged differenced of GNPR and the independent variable are a constant, the year and the lagged GNPR. When we use our Python program, it outputs the following:

[-12794.909494518186, 6.6503686622997975, -0.098105000002848719]
[8538.0875917954763, 4.4232851542101859, 0.073658549799719808]
With drift-trend: df test_statistic -1.33188883395

The first output array are the computed estimated slope coefficients. The next array list the corresponding standard errors. The computed DF statistic matches that of Danao.

Now let us specify a max lag for the differenced responsse variable. The computed input X matrix is given by the following (we added labels for clarity):

C Year Ylag delYlag delY
1.0000 1942.0000 909.4000 34.4000 | 170.9000
1.0000 1943.0000 1080.3000 25.0000 | 195.9000
1.0000 1944.0000 1276.2000 -91.5000 | 104.4000
1.0000 1945.0000 1380.6000 -130.2000 | -25.8000
1.0000 1946.0000 1354.8000 -232.1000 | -257.9000
1.0000 1947.0000 1096.9000 227.7000 | -30.2000
1.0000 1948.0000 1066.7000 72.2000 | 42.0000
1.0000 1949.0000 1108.7000 -41.7000 | 0.3000
1.0000 1950.0000 1109.0000 94.4000 | 94.7000
1.0000 1951.0000 1203.7000 29.8000 | 124.5000
1.0000 1952.0000 1328.2000 -72.7000 | 51.8000
1.0000 1953.0000 1380.0000 3.5000 | 55.3000
1.0000 1954.0000 1435.3000 -74.4000 | -19.1000
1.0000 1955.0000 1416.2000 97.8000 | 78.7000
1.0000 1956.0000 1494.9000 -48.0000 | 30.7000
1.0000 1957.0000 1525.6000 -5.2000 | 25.5000
1.0000 1958.0000 1551.1000 -37.4000 | -11.9000
1.0000 1959.0000 1539.2000 101.8000 | 89.9000
1.0000 1960.0000 1629.1000 -53.7000 | 36.2000
1.0000 1961.0000 1665.3000 7.2000 | 43.4000
1.0000 1962.0000 1708.7000 47.3000 | 90.7000
1.0000 1963.0000 1799.4000 -16.8000 | 73.9000
1.0000 1964.0000 1873.3000 26.1000 | 100.0000
1.0000 1965.0000 1973.3000 14.3000 | 114.3000
1.0000 1966.0000 2087.6000 6.4000 | 120.7000
1.0000 1967.0000 2208.3000 -57.6000 | 63.1000
1.0000 1968.0000 2271.4000 31.1000 | 94.2000
1.0000 1969.0000 2365.6000 -36.5000 | 57.7000
1.0000 1970.0000 2423.3000 -64.8000 | -7.1000
1.0000 1971.0000 2416.2000 75.7000 | 68.6000
1.0000 1972.0000 2484.8000 55.1000 | 123.7000
1.0000 1973.0000 2608.5000 11.9000 | 135.6000
1.0000 1974.0000 2744.1000 -150.4000 | -14.8000
1.0000 1975.0000 2729.3000 -19.5000 | -34.3000
1.0000 1976.0000 2695.0000 166.0000 | 131.7000
1.0000 1977.0000 2826.7000 0.2000 | 131.9000
1.0000 1978.0000 2958.6000 24.7000 | 156.6000
1.0000 1979.0000 3115.2000 -79.4000 | 77.2000
1.0000 1980.0000 3192.4000 -82.5000 | -5.3000
1.0000 1981.0000 3187.1000 67.0000 | 61.7000
1.0000 1982.0000 3248.8000 -144.5000 | -82.8000
1.0000 1983.0000 3166.0000 195.9000 | 113.1000
1.0000 1984.0000 3279.1000 97.7000 | 210.8000
1.0000 1985.0000 3489.9000 -115.5000 | 95.3000
1.0000 1986.0000 3585.2000 -4.0000 | 91.3000

Our program returned the following slopes/st. errors:

[466.8367900006233, -0.22813001176738901, 0.020909396153443754, 0.50373782051647442]
[8182.7301885576253, 4.2390077773331614, 0.070306560324028558, 0.12685872804481083]
With drift-trend: df test_statistic 0.297403201879

The result does not match that of Danao's published result of -2.436537!! Testing this on R we get the following summary:

> fit <-lm( dely ~ 1 + year + ylag +dely1 ) > summary(fit)

Call:
lm(formula = dely ~ 1 + year + ylag + dely1)

Residuals:
Min 1Q Median 3Q Max
-192.206 -32.704 3.383 39.123 137.138

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 466.83679 8182.73019 0.057 0.954781
year -0.22813 4.23901 -0.054 0.957342
ylag 0.02091 0.07031 0.297 0.767660
dely1 0.50374 0.12686 3.971 0.000282 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 68.08 on 41 degrees of freedom
Multiple R-squared: 0.3353, Adjusted R-squared: 0.2866
F-statistic: 6.893 on 3 and 41 DF, p-value: 0.000728

Our slope(beta) and se(std.error) for ylag matches that of R. This is contrast with Danao's published output:

Std.Error
(Intercept) 124.1005 49.00592
@TREND(1940) 10.71098 4.127310
GNPR(-1) -0.168378 0.969105
D(GNPR(-1) 0.452307 0.140006

and with computed ADF test statistic of -2,436537.(Read p.331). We will find out if there are errors.

Here is the complete Python program to compute the test statistic of the DF and ADF test.

"""

U.P. Clark, Pampanga

Version 0.0.1   June 20, 2010
"""

from lm import *

"""
Description
Computes Dickey-Fuller Unit test values.
Arguments
Y - input vector
t - time vector, None if no trend is to be incorporated.
withintercept - true if there is a constant in model.
maxp  maximum lag for differences.
Return
tau, coefficient of lagged variable Y(-1).
se(tau), standard error of coefficient.
n- number of data  points used.
"""
# Form matrix of columns corresponding to delY-1 to delY-p
# Be simple!
n        = len(Y)

delyp    = [0]* (maxp+1)
delyp[0] = [Y[i]-Y[i-1] for i in range(1, n)]

for p in range(1, maxp+1):
delyp[p] = [delyp[p-1][i] - delyp[p-1][i-1] for i in range(1,len(delyp[p-1]))]

# Adjust differences to same length.
for p in range(0, maxp+1):
delyp[p] = delyp[p][maxp-p:]

# Form the X design matrix.
X = []
for i in range(maxp + 1, n):
x = []
if withintercept:
x.append(1)
if t:
x.append(t[i])
x.append(Y[i-1])
for p in range(1, maxp+1):
x.append(delyp[p][i-maxp-1])
X.append(x)

mataugprint(X, delyp[0])

fit = ols( X,  delyp[0])

ylagindex = 0
if withintercept:
ylagindex+=1
if t:
ylagindex+=1
print "yindex = ", ylagindex
tau = fit.betas[ylagindex]
se  = fit.sdbetas[ylagindex]
print fit.betas
print fit.sdbetas
return tau, se, len(X)

if __name__ == "__main__":
data = """YEAR    GNPR    PCER    PGNP
1940    772.9   502.6   0.1299
1941    909.4   531.1   0.138003
1942    1080.3  527.6   0.147181
1943    1276.2  539.9   0.150995
1944    1380.6  557.1   0.153122
1945    1354.8  592.7   0.157514
1946    1096.9  655     0.193637
1947    1066.7  666.6   0.220493
1948    1108.7  681.8   0.235952
1949    1109    695.4   0.234806
1950    1203.7  733.2   0.239512
1951    1328.2  748.7   0.251016
1952    1380    771.4   0.254783
1953    1435.3  802.5   0.258901
1954    1416.2  822.7   0.263028
1955    1494.9  873.8   0.271523
1956    1525.6  899.8   0.280676
1957    1551.1  919.7   0.290761
1958    1539.2  932.9   0.296778
1959    1629.1  979.4   0.30434
1960    1665.3  1005.1  0.309434
1961    1708.7  1025.2  0.312401
1962    1799.4  1069    0.319329
1963    1873.3  1108.4  0.323974
1964    1973.3  1170.6  0.329296
1965    2087.6  1236.4  0.337756
1966    2208.3  1298.9  0.34959
1967    2271.4  1337.7  0.359426
1968    2365.6  1405.9  0.377367
1969    2423.3  1456.7  0.397763
1970    2416.2  1492    0.420288
1971    2484.8  1538.8  0.443778
1972    2608.5  1621.9  0.464942
1973    2744.1  1689.6  0.495354
1974    2729.3  1674    0.539626
1975    2695    1711.9  0.593098
1976    2826.7  1803.9  0.6307
1977    2958.6  1883.8  0.672784
1978    3115.2  1961    0.722169
1979    3192.4  2004.4  0.785678
1980    3187.1  2000.4  0.857206
1981    3248.8  2024.2  0.939608
1982    3166    2050.7  1
1983    3279.1  2146    1.038608
1984    3489.9  2246.3  1.078827
1985    3585.2  2324.5  1.115168
1986    3676.5  2418.6  1.144703
"""
X= []
for row in data.split("\n"):
xvars = row.split()
if len(xvars) == 4:
else:
cases = [float(v) for v in xvars]
X.append(cases)
TIME = [x[0] for x in X]
GNPR = [x[1] for x in X]

tau, se, n = adftest(GNPR, t=TIME, withintercept=True, maxp = 1)
print "With drift-trend: df test_statistic", tau/se


We will try to find out why our results do not match with that of Danao when the lag of the differenced response is not zero.

We only saw the following reference after we wrote our code.