## Wednesday, April 28, 2010

### Python, econometrics: Computing The Sample Partial Autocorrelation Function

April 29, 2010

I was wondering for a long time how to implement a partial autocorrelation function (pacf)in Python. Maybe it was the temptation to choose the fastest, and freely available algorithm like the Levinson-Durbin algorithm and its variants but the more I tried to read the references, the more I got discouraged. Even the description of the Levinson-Durbin algorithm in a wikipedia article I find dense.

The simplest explanation I could find and follow and from which I wrote a Python function for the spacf is by Gidon Eshel, "Yule-Walker Equations for the AR coefficients

In teaching econometrics, we try to go for the simplest and easiest to describe algorithm that works. Then we will point out to students references to faster algorithms. No, I have not gotten around to teaching the time series part of basic econometrics, but I will do so in the second semester.

Basically we are given a time series data and from it we can compute the sample autocorrelation (sacf) function for various lags.

### AutoRegressive model coefficients

Assume that we have data vector X indeXed from 0 to n-1 and consider the the $$AR(p)$$, AutoRegressive model of order $$p$$, equation:

$$X_{i+1}= \phi_1 X_i + \phi_2 X_{i-1} + \cdots \phi_pX_{i-p+1} + \epsilon_{i+1}$$

where $X_{i+k}$ is the time series value with lag $k$. In terms of least squares. As a particular example, assume that n = 5
and p is 3. then the model data matriX $X$ and the dependent variable $Y$ written in augmented matriX form as

Note that $Y = [X_p, X_{p+1} \ldots ... X_{n-1}]$. The column vector of lag $k$ is
$X_k= [X_{p-k} \ldots X_{n-(k)}]$,
and the least squares normal equations would be given by $(X^tX) \Phi = X^t Y$. The unknown
$\Phi$ 's may then be obtained from the solution $\Phi = (X^tX)^{-1} X^t Y$.

### Autocorrelations

This is already implemented in the sacf function in a previous post. For convenience we repeat it here. The function expects the vector X and the lag for the other vector.

def sacf(X, k=1):
"""
Computes the sample autocorrelation function coeffficient.
for given lag k.
"""
if k == 0:
return 1.0
flen = float(len(X))
ybar = float(sum([x for x in X])) / flen
D = sum([(x-xbar)**2 for x in X])
N = sum([(x-xbar)* (xtpk -xbar) for (x, xtpk) in zip(X[:-k],X[k:])])
return N/D


### Relation between autocorrelations and AR coefficients $$\Phi$$

The relation between autocorrelations and the $$\Phi$$ coefficients of the AR(p) models.

or more compactly as $$r = R \Phi$$. The matrix $$R$$ is symmetric and is full rank and is invertible, which allows us to solve for $$\Phi = R^{-1} r.$$

### Computing Partial Correlation Coefficients from autocorrelations and AR(p) coefficients

The partial autocorrelations are now obtained by the following algorithm implementing the one described by Eshel:

Given acf - autocorrelation function values up to lag maxlag.
Find pcf - corresponding partial correlation function vector.

set up array pcf up to size maxlag
pcf = 1.0
pcf = acf
for lag from 2 upto maxlag:
Compute R
Compute r
Compute Phi = R**(-1) r
pcf[lag] =  Phi[lag]
return pcf


### Python code

And finally here is our Python code, which has taken a long time to develop:

# -*- coding: utf-8 -*-
"""
file     : spacf.py
version: 0.0.1   april 29, 2010
reference: Gidon Eshel,
Yule-Walker Equations for the AR
coefficients.
"""
import math
import matlib
from   gausselim import Solve

def  acf(X, k=1):
"""
Computes the sample autocorrelation function coeffficient.
for given lag k and input data X.
"""
if k == 0:
return 1.0
flen = float(len(X))
xbar = float(sum([x for x in X])) / flen
D = sum([(x-xbar)**2 for x in X])
N = sum([ (x-xbar)* (xtpk -xbar) for (x, xtpk) in zip(X[:-k],X[k:])])
return N/D

def sacf(X,  maxlag):
return [acf(X, i) for i in range(maxlag+1)] #includes 0 lag

def Rmatrix(r, maxlag):
"""
Creates R matrix (Yule-Walker correlation matrix)
Input:
r           - sample acf values up to maxlag.
maxlag- maximum lag
"""
R = matlib.matIdentity(maxlag)
for i in range(0, maxlag):
for j in range(i+1, maxlag):
R[i][j] = R[j][i]= r[j-i]
return R

def spacf(acf, maxlag):
pacfs=acf[:]
for i in range(2,maxlag+1):
R = Rmatrix(acf, i)
r = acf[1:i+1]
Phi = Solve(R, r)   # solution of R Phi = r
pacfs[i] = Phi[-1]   # last element of vector Phi.
return pacfs

#data from Danao's text. page 323.
X = ['772.9', '909.4', '1080.3', '1276.2', '1380.6', '1354.8',
'1096.9', '1066.7', '1108.7', '1109', '1203.7', '1328.2', '1380',
'1435.3', '1416.2', '1494.9', '1525.6', '1551.1', '1539.2',
'1629.1', '1665.3', '1708.7', '1799.4', '1873.3', '1973.3',
'2087.6', '2208.3', '2271.4', '2365.6', '2423.3', '2416.2',
'2484.8', '2608.5', '2744.1', '2729.3', '2695', '2826.7',
'2958.6', '3115.2', '3192.4', '3187.1', '3248.8', '3166',
'3279.1', '3489.9', '3585.2', '3676.5']

def Test1(maxlag):
"""
Prints out values of acf and pacf for the Danao data set.
"""
Y = [float(x) for x in X]
r  = sacf(Y,  maxlag)
pacfs = spacf(r,  maxlag)
print "The autocorrelations and partisl acorrelation"
for i ,  (acf,  pacf) in enumerate(zip(r, pacfs)):
print i,  acf,  pacf

if __name__== "__main__":
Test1(17)


Given the vector acf's for various lags, the corresponding pacf values will be computed by the sacf function. Here is the output from the terminal when running python sacf.py:

The autocorrelations and partisl acorrelation
0 1.0 1.0
1 0.925682317386 0.925682317386
2 0.852706579655 -0.0292160394675
3 0.787096604484 0.0122016750938
4 0.737850083142 0.0784204182616
5 0.697253316633 0.0358005082792
6 0.64842031925 -0.071567073034
7 0.587527096625 -0.0988314678858
8 0.519141887224 -0.0843023226042
9 0.450228026064 -0.0629615959671
10 0.384896320219 -0.0480711212172
11 0.32584304195 -0.0201806609446
12 0.273845336962 0.00621787084071
13 0.216766465976 -0.0631790415256
14 0.156888401912 -0.0477940531702
15 0.0992408085419 -0.0204290294085
16 0.0477462812535 -0.0101131483561
17 -0.00206714577028 -0.0495417448475

The computed pacf values match those with page 324 of Danao's text and we are confident the implementation above is correct. In practice the maximum lag to use should not exceed N/4 where N is the vector length of the time series data.

To be revised. Latex formulas will be displayed as images in a future version of this post.

The gaussian elimination routine may be found in Python: gaussian elimination which includes the Solve routine for solving linear systems $$A x = b$$.

We will try to implement the recursive Durbin algorithm in a future post.