## Monday, April 5, 2010

### Python, Statistics: The non-parametric Runs test for randomness.

Top

 Top Intro Pmf Python Normal Swed Bottom

### Introduction

Consider the sequence ABBAAABBABABBB. A subsequence of consecutive same letters is called a run. For our example, we use vertical bars to separate each run: A|BB|AAA|BB|A|B|A|BBB consists of 8 runs. This sequence contains 6 A's and 8 B's which we denote by (m,n)= (6,8). We would be suspicious of people claiming that the sequences AAAAAAABBBBBBB or ABABABABABABAB are random. The runs test was developed precisely to test the null hypothesis whether a sample of a sequence of binary symbols is randomly generated. The minimum number of runs of two symbols $m$ of one kind and $n$ of another kind is 2. On the other hand, the maximum possible number of runs is $2\, min(m, n)$ plus 1 if both m and n are different.

### The probability mass function of the distribution of runs

The distribution of the number of runs has been "derived" or explained in [Mood, Graybill and Boes,"Introduction to Statistics" McGraw Hill, 3e, 1974] p.519-521. The probability mass function is given for any even number of runs $(z = 2k)$ as $prob(Z = z)=\frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}}$ and for an odd number of runs, $(z = 2k+1)$ as $prob(Z = z)=\frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1} \binom{n-1}{k}}{\binom{m+n}{m}}$ Recall that $\binom{n}{r}$ is the number of combinations of $n$ things taken $r$ at a time.It can be computed as $\binom{n}{r} = \frac{n!}{r!(n-r)!}$ but this is inefficient, and instead we use the formula $\binom{n}{r} = \frac{n(n-1)\cdots (n-r+1)}{1\cdot 2\cdots r}$ It would be foolish to do by hand even with aid of a non-programmable scientific calculators, especialy when calculating the cumulative mass distribution function and is the reason we are writing Python codes.

 Top Intro Pmf Normal Python Swed Bottom

### The normal approximation to the distribution of runs

When m and n are both large, say m and n are both greater than 10, then an asymptotic normal approximation might be good enough. The mean is given by $\mu = \frac{2mn}{m+n} +1$ and the variance would be given by $\sigma^2 = \frac{2mn(2mn -m -n)}{(m+n)^2 (m+n-1}$

 Top Intro Pmf Normal Python Swed Bot

### Python codes

Here is our latest version as of April 6. It is more complete now but as usual, users beware. It has not been thoroughly debugged.
FunctionDescription
nCr(n, r) binomial coefficient
maxruns(m,n) maximum runs is (m,n) sequence
pmfruns(z, m, n) probability mass function
cmftableruns(m, n) cumulative mass function table
pvalue(z, m, n, side=-1) $p-$value of test statistic z
normrunstest(z, m, n, alpha, side) normal approximation test
exactrunstest(z,m,n,alpha,side=0)exact test for runs
swedeisenhart Swed and Eisenhart (m,n)cumulative distribution table
isvalidruns(S) return (m,n) if S is a valid runs sequence, else None

# -*- coding: utf-8 -*-
"""
file    testofruns.py
version 0.0.1 April 5, 2010

"""

from math import sqrt
from basictests import ztest,  ttest

def  nCr(n, r):
"""
Computes the binomial coefficient, number of combinations of n things taken r at a time.
"""
if r == 1:  return n
if r == 0 or r == n: return 1

if n < r   or n  < 0 or r  < 0: return 0
if r > n-r: # find value of less looping.
r = n-r
f = 1
for i in range(1, r+1):
f *= (n - i+1)
f /= i
return f

def maxruns(m,n):
# maximum runs with m  symbols of the first kind and n symbols of the second.kind.
return 2 * min(m, n) + (1 if m!=n else 0)

def pmfruns(z, m, n):
"""
Computes the probability mass function for the distribution of runs.
"""
if z < 2:
raise ValueError,  "in pmfruns, z must satisfy z >= 2"
if  z % 2 ==0: #even z.
k = z // 2
return 2.0 * nCr(m-1, k-1) * nCr (n-1, k-1)/ nCr(m+n, m)
else : #odd z
k = (z-1) // 2
return float(nCr(m-1, k) * nCr(n-1, k-1) + nCr(m-1, k-1) * nCr(n-1, k)) / float(nCr(m + n, m))

def cmftableruns(m, n):
"""
Returns a cumulative masss function for the distribution of runs.
You must note that that the random variable z starts at 2 !
"""
tot = 0
result=[]

maxz = maxruns(m,  n)
for i in range(2, maxz+1):
tot += pmfruns(i, m, n)
result.append(tot)
return  result

def pvalue(z, m,  n,  side=-1):
"unchecked. to be tested first!"
upperz = maxruns(m,n)
if not (2 <= z <= maxruns) :
raise ValueError, "in pvalue(), z must lie in (2, %d)" % upperz
table = cmftableruns(m,n)
if side == 0:
p = table[z-2]
if p > 0.5:
p = 1-0.5
return p * 2
elif side == -1:
return table[z-2]
elif side == 1:
return 1 - table[z-2]

def normrunstest(z,  m,  n,  alpha,  side):
#Test using normal approximation, when both m and n are greater than 10.
mean = 2.0 * m *n / (m+n) + 1
variance = 2. * m * n * (2 *m *n -m -n)/((m+n)**2 * (m+n - 1))
return ztest(z - mean,  sqrt(variance/n), alpha,  side)

def swedeisenhart():
print
print """"""      for i in range(2, 11):         print '' % i,     print "
"     for m in range(2, 11):         for n in range(m, 11):             rowvals = cmftableruns(m, n)             maxcol = len(rowvals)             print "
" % (m,n),             for j in range(min(maxcol, 9)):                 print "" % rowvals[j],             for j in range(maxcol, 9):                 print "",             print "
"     print "P(Z <= a) when H_0 is true in the Runs test
(m, n)%d(%d,%d)%5.3f"

def exactrunstest(z, m, n, alpha, side=0):
#returns pvalue, test stat z and lower and upper limits.
cmf = cmftableruns(m,n)
print cmf,
a = b = None
if side ==0:
alpha /= 2.0
alpha1 = alpha
alpha2 = 1-alpha1
elif side == 1:
alpha2 = 1-alpha
elif side == 0:
alpha1 = alpha
for i in range(len(cmf)):
if cmf[i] <= alpha1:
a  = i+2
if cmf[i] >= alpha2:
b =  i+2
p = pvalue(z,  m,n, side)
if side == 0:
return (p, z, (a,b))
elif side == 1:
return (p, z, b)
elif side == -1:
return (p, z, a)

def isvalidruns(S):
D = {}
for s in S:
if s in D:
D[s] +=1
else:
D[s] = 1
if len(D) != 2:
raise  ValueError, "in isvalidruns(): more than two symbols"

counts = []
for (i,s) in enumerate(D):
counts.append(D[s])
a,b  = counts, counts
return min(a, b), max(a,b)

def Test():
swedeisenhart() # decomment line to turn on table generation test.

#print "Exact runs test for z = 5, (m, n)= (5,7)"
#print exactrunstest(5, 5,7, 0.05, side=0)
"""
print pmfruns(3, 5,4)
print normrunstest(5, 4,9,0.05, -1)
print "Table for runs test."
m, n = 3,5
print cmftableruns(m,n)
"""

if __name__ == "__main__":
Test()


The Python module above calls the ztest which was already described in previous posts. We will add all files required in one zip file for convenience later.

 Top Intro Pmf Python Normal Swed Bot

### Swed and Eisenhart Table

Here is an html table obtained by running swedeisenhart() routine.
$P(Z <= a)$ when $H_0$ is true in the Runs test
(m, n) 2 3 4 5 6 7 8 9 10
(2,2) 0.333 0.667 1.000
(2,3) 0.200 0.500 0.900 1.000
(2,4) 0.133 0.400 0.800 1.000
(2,5) 0.095 0.333 0.714 1.000
(2,6) 0.071 0.286 0.643 1.000
(2,7) 0.056 0.250 0.583 1.000
(2,8) 0.044 0.222 0.533 1.000
(2,9) 0.036 0.200 0.491 1.000
(2,10) 0.030 0.182 0.455 1.000
(3,3) 0.100 0.300 0.700 0.900 1.000
(3,4) 0.057 0.200 0.543 0.800 0.971 1.000
(3,5) 0.036 0.143 0.429 0.714 0.929 1.000
(3,6) 0.024 0.107 0.345 0.643 0.881 1.000
(3,7) 0.017 0.083 0.283 0.583 0.833 1.000
(3,8) 0.012 0.067 0.236 0.533 0.788 1.000
(3,9) 0.009 0.055 0.200 0.491 0.745 1.000
(3,10) 0.007 0.045 0.171 0.455 0.706 1.000
(4,4) 0.029 0.114 0.371 0.629 0.886 0.971 1.000
(4,5) 0.016 0.071 0.262 0.500 0.786 0.929 0.992 1.000
(4,6) 0.010 0.048 0.190 0.405 0.690 0.881 0.976 1.000
(4,7) 0.006 0.033 0.142 0.333 0.606 0.833 0.955 1.000
(4,8) 0.004 0.024 0.109 0.279 0.533 0.788 0.929 1.000
(4,9) 0.003 0.018 0.085 0.236 0.471 0.745 0.902 1.000
(4,10) 0.002 0.014 0.068 0.203 0.419 0.706 0.874 1.000
(5,5) 0.008 0.040 0.167 0.357 0.643 0.833 0.960 0.992 1.000
(5,6) 0.004 0.024 0.110 0.262 0.522 0.738 0.911 0.976 0.998
(5,7) 0.003 0.015 0.076 0.197 0.424 0.652 0.854 0.955 0.992
(5,8) 0.002 0.010 0.054 0.152 0.347 0.576 0.793 0.929 0.984
(5,9) 0.001 0.007 0.039 0.119 0.287 0.510 0.734 0.902 0.972
(5,10) 0.001 0.005 0.029 0.095 0.239 0.455 0.678 0.874 0.958
(6,6) 0.002 0.013 0.067 0.175 0.392 0.608 0.825 0.933 0.987
(6,7) 0.001 0.008 0.043 0.121 0.296 0.500 0.733 0.879 0.966
(6,8) 0.001 0.005 0.028 0.086 0.226 0.413 0.646 0.821 0.937
(6,9) 0.000 0.003 0.019 0.063 0.175 0.343 0.566 0.762 0.902
(6,10) 0.000 0.002 0.013 0.047 0.137 0.287 0.497 0.706 0.864
(7,7) 0.001 0.004 0.025 0.078 0.209 0.383 0.617 0.791 0.922
(7,8) 0.000 0.002 0.015 0.051 0.149 0.296 0.514 0.704 0.867
(7,9) 0.000 0.001 0.010 0.035 0.108 0.231 0.427 0.622 0.806
(7,10) 0.000 0.001 0.006 0.024 0.080 0.182 0.355 0.549 0.743
(8,8) 0.000 0.001 0.009 0.032 0.100 0.214 0.405 0.595 0.786
(8,9) 0.000 0.001 0.005 0.020 0.069 0.157 0.319 0.500 0.702
(8,10) 0.000 0.000 0.003 0.013 0.048 0.117 0.251 0.419 0.621
(9,9) 0.000 0.000 0.003 0.012 0.044 0.109 0.238 0.399 0.601
(9,10) 0.000 0.000 0.002 0.008 0.029 0.077 0.179 0.319 0.510
(10,10) 0.000 0.000 0.001 0.004 0.019 0.051 0.128 0.242 0.414
If you have any suggestions, or corrections, or features request do not hesitate to email the author at ernesto.adorio @ gmail . com.

Bottom

 Top Intro Pmf Python Normal Swed Bot

I have encountered the html problem: Your HTML cannot be accepted: Closing tag has no matching opening tag: A ???!!! I wonder where its !!

#### 1 comment:

1. where is the basictest module? did you create it or is it a python module? "from basictests import ztest, ttest"