Doc Ernie's Adventures in the Web: Python, Statistics: The non-parametric Runs test for randomness.

Top

Introduction
Probability mass function
Normal Approximation
Python Code
Swed and Eisenhart table

Top

Intro

Introduction

Consider the sequence ABBAAABBABABBB. A subsequence of consecutive same letters is called a run. For our example, we use vertical bars to separate each run: A|BB|AAA|BB|A|B|A|BBB consists of 8 runs. This sequence contains 6 A's and 8 B's which we denote by (m,n)= (6,8). We would be suspicious of people claiming that the sequences AAAAAAABBBBBBB or ABABABABABABAB are random. The runs test was developed precisely to test the null hypothesis whether a sample of a sequence of binary symbols is randomly generated. The minimum number of runs of two symbols $m$ of one kind and $n$ of another kind is 2. On the other hand, the maximum possible number of runs is $2\, min(m, n)$ plus 1 if both m and n are different.

The probability mass function of the distribution of runs

The distribution of the number of runs has been "derived" or explained in [Mood, Graybill and Boes,"Introduction to Statistics" McGraw Hill, 3e, 1974] p.519-521. The probability mass function is given for any even number of runs $(z = 2k)$ as \[ prob(Z = z)=\frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}}\] and for an odd number of runs, $(z = 2k+1)$ as \[ prob(Z = z)=\frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1} \binom{n-1}{k}}{\binom{m+n}{m}}\] Recall that $\binom{n}{r}$ is the number of combinations of $n$ things taken $r$ at a time.It can be computed as $\binom{n}{r} = \frac{n!}{r!(n-r)!}$ but this is inefficient, and instead we use the formula $\binom{n}{r} = \frac{n(n-1)\cdots (n-r+1)}{1\cdot 2\cdots r}$ It would be foolish to do by hand even with aid of a non-programmable scientific calculators, especialy when calculating the cumulative mass distribution function and is the reason we are writing Python codes.

The normal approximation to the distribution of runs

When m and n are both large, say m and n are both greater than 10, then an asymptotic normal approximation might be good enough. The mean is given by \[\mu = \frac{2mn}{m+n} +1\] and the variance would be given by \[\sigma^2 = \frac{2mn(2mn -m -n)}{(m+n)^2 (m+n-1}\]

Python codes

Here is our latest version as of April 6. It is more complete now but as usual, users beware. It has not been thoroughly debugged.

Function	Description
nCr(n, r)	binomial coefficient
maxruns(m,n)	maximum runs is (m,n) sequence
pmfruns(z, m, n)	probability mass function
cmftableruns(m, n)	cumulative mass function table
pvalue(z, m, n, side=-1)	$p-$value of test statistic z
normrunstest(z, m, n, alpha, side)	normal approximation test
exactrunstest(z,m,n,alpha,side=0)	exact test for runs
swedeisenhart	Swed and Eisenhart (m,n)cumulative distribution table
isvalidruns(S)	return (m,n) if S is a valid runs sequence, else None

# -*- coding: utf-8 -*-
"""
file    testofruns.py
author  Ernesto P. Adorio
version 0.0.1 April 5, 2010
         
"""

from math import sqrt
from basictests import ztest,  ttest


def  nCr(n, r):
     """
     Computes the binomial coefficient, number of combinations of n things taken r at a time.
     """
     if r == 1:  return n
     if r == 0 or r == n: return 1
     
     if n < r   or n  < 0 or r  < 0: return 0
     if r > n-r: # find value of less looping.
        r = n-r
     f = 1
     for i in range(1, r+1):
        f *= (n - i+1) 
        f /= i
     return f
     
def maxruns(m,n):
     # maximum runs with m  symbols of the first kind and n symbols of the second.kind.
     return 2 * min(m, n) + (1 if m!=n else 0)
  
def pmfruns(z, m, n):
     """
     Computes the probability mass function for the distribution of runs.
     """
     if z < 2: 
        raise ValueError,  "in pmfruns, z must satisfy z >= 2"
     if  z % 2 ==0: #even z.
        k = z // 2 
        return 2.0 * nCr(m-1, k-1) * nCr (n-1, k-1)/ nCr(m+n, m)
     else : #odd z
        k = (z-1) // 2
        return float(nCr(m-1, k) * nCr(n-1, k-1) + nCr(m-1, k-1) * nCr(n-1, k)) / float(nCr(m + n, m))

def cmftableruns(m, n):
    """
    Returns a cumulative masss function for the distribution of runs.
    You must note that that the random variable z starts at 2 !
    """
    tot = 0
    result=[]
    
    maxz = maxruns(m,  n)
    for i in range(2, maxz+1):
       tot += pmfruns(i, m, n)
       result.append(tot)
    return  result


def pvalue(z, m,  n,  side=-1):
    "unchecked. to be tested first!"
    upperz = maxruns(m,n)
    if not (2 <= z <= maxruns) :
       raise ValueError, "in pvalue(), z must lie in (2, %d)" % upperz
    table = cmftableruns(m,n)
    if side == 0:
       p = table[z-2]
       if p > 0.5: 
          p = 1-0.5
       return p * 2
    elif side == -1:
        return table[z-2]
    elif side == 1:
       return 1 - table[z-2] 
        
def normrunstest(z,  m,  n,  alpha,  side):
    #Test using normal approximation, when both m and n are greater than 10.
    mean = 2.0 * m *n / (m+n) + 1
    variance = 2. * m * n * (2 *m *n -m -n)/((m+n)**2 * (m+n - 1))
    return ztest(z - mean,  sqrt(variance/n), alpha,  side) 


def swedeisenhart():
    print
    print """"""      for i in range(2, 11):         print '' % i,     print "
"     for m in range(2, 11):         for n in range(m, 11):             rowvals = cmftableruns(m, n)             maxcol = len(rowvals)             print "
" % (m,n),             for j in range(min(maxcol, 9)):                 print "" % rowvals[j],             for j in range(maxcol, 9):                 print "",             print "
"     print "P(Z <= a) when H_0 is true in the Runs test
(m, n) %d
(%d,%d) %5.3f "

def exactrunstest(z, m, n, alpha, side=0):
    #returns pvalue, test stat z and lower and upper limits.
    cmf = cmftableruns(m,n)
    print cmf, 
    a = b = None
    if side ==0:
       alpha /= 2.0
       alpha1 = alpha
       alpha2 = 1-alpha1
    elif side == 1:
       alpha2 = 1-alpha
    elif side == 0:
       alpha1 = alpha
    for i in range(len(cmf)):
       if cmf[i] <= alpha1:
          a  = i+2 
       if cmf[i] >= alpha2:
          b =  i+2
    p = pvalue(z,  m,n, side)
    if side == 0:
       return (p, z, (a,b))
    elif side == 1:
       return (p, z, b)
    elif side == -1:
       return (p, z, a)

   
def isvalidruns(S):
        D = {}
        for s in S:
           if s in D:
                D[s] +=1
           else:
                D[s] = 1
                if len(D) != 2:
                   raise  ValueError, "in isvalidruns(): more than two symbols"
        
        counts = [] 
        for (i,s) in enumerate(D):
             counts.append(D[s])
        a,b  = counts[0], counts[1]
        return min(a, b), max(a,b)
 
  
  
 
def Test():
   swedeisenhart() # decomment line to turn on table generation test.

   #print "Exact runs test for z = 5, (m, n)= (5,7)"
   #print exactrunstest(5, 5,7, 0.05, side=0)
   """
   print pmfruns(3, 5,4)
   print normrunstest(5, 4,9,0.05, -1)
   print "Table for runs test."
   m, n = 3,5
   print cmftableruns(m,n)
   """

if __name__ == "__main__":
   Test()

P(Z <= a) when H_0 is true in the Runs test
(m, n)	%d
(%d,%d)	%5.3f

The Python module above calls the ztest which was already described in previous posts. We will add all files required in one zip file for convenience later.

Swed and Eisenhart Table

Here is an html table obtained by running swedeisenhart() routine.

$P(Z <= a)$ when $H_0$ is true in the Runs test
(m, n)	2	3	4	5	6	7	8	9	10
(2,2)	0.333	0.667	1.000
(2,3)	0.200	0.500	0.900	1.000
(2,4)	0.133	0.400	0.800	1.000
(2,5)	0.095	0.333	0.714	1.000
(2,6)	0.071	0.286	0.643	1.000
(2,7)	0.056	0.250	0.583	1.000
(2,8)	0.044	0.222	0.533	1.000
(2,9)	0.036	0.200	0.491	1.000
(2,10)	0.030	0.182	0.455	1.000
(3,3)	0.100	0.300	0.700	0.900	1.000
(3,4)	0.057	0.200	0.543	0.800	0.971	1.000
(3,5)	0.036	0.143	0.429	0.714	0.929	1.000
(3,6)	0.024	0.107	0.345	0.643	0.881	1.000
(3,7)	0.017	0.083	0.283	0.583	0.833	1.000
(3,8)	0.012	0.067	0.236	0.533	0.788	1.000
(3,9)	0.009	0.055	0.200	0.491	0.745	1.000
(3,10)	0.007	0.045	0.171	0.455	0.706	1.000
(4,4)	0.029	0.114	0.371	0.629	0.886	0.971	1.000
(4,5)	0.016	0.071	0.262	0.500	0.786	0.929	0.992	1.000
(4,6)	0.010	0.048	0.190	0.405	0.690	0.881	0.976	1.000
(4,7)	0.006	0.033	0.142	0.333	0.606	0.833	0.955	1.000
(4,8)	0.004	0.024	0.109	0.279	0.533	0.788	0.929	1.000
(4,9)	0.003	0.018	0.085	0.236	0.471	0.745	0.902	1.000
(4,10)	0.002	0.014	0.068	0.203	0.419	0.706	0.874	1.000
(5,5)	0.008	0.040	0.167	0.357	0.643	0.833	0.960	0.992	1.000
(5,6)	0.004	0.024	0.110	0.262	0.522	0.738	0.911	0.976	0.998
(5,7)	0.003	0.015	0.076	0.197	0.424	0.652	0.854	0.955	0.992
(5,8)	0.002	0.010	0.054	0.152	0.347	0.576	0.793	0.929	0.984
(5,9)	0.001	0.007	0.039	0.119	0.287	0.510	0.734	0.902	0.972
(5,10)	0.001	0.005	0.029	0.095	0.239	0.455	0.678	0.874	0.958
(6,6)	0.002	0.013	0.067	0.175	0.392	0.608	0.825	0.933	0.987
(6,7)	0.001	0.008	0.043	0.121	0.296	0.500	0.733	0.879	0.966
(6,8)	0.001	0.005	0.028	0.086	0.226	0.413	0.646	0.821	0.937
(6,9)	0.000	0.003	0.019	0.063	0.175	0.343	0.566	0.762	0.902
(6,10)	0.000	0.002	0.013	0.047	0.137	0.287	0.497	0.706	0.864
(7,7)	0.001	0.004	0.025	0.078	0.209	0.383	0.617	0.791	0.922
(7,8)	0.000	0.002	0.015	0.051	0.149	0.296	0.514	0.704	0.867
(7,9)	0.000	0.001	0.010	0.035	0.108	0.231	0.427	0.622	0.806
(7,10)	0.000	0.001	0.006	0.024	0.080	0.182	0.355	0.549	0.743
(8,8)	0.000	0.001	0.009	0.032	0.100	0.214	0.405	0.595	0.786
(8,9)	0.000	0.001	0.005	0.020	0.069	0.157	0.319	0.500	0.702
(8,10)	0.000	0.000	0.003	0.013	0.048	0.117	0.251	0.419	0.621
(9,9)	0.000	0.000	0.003	0.012	0.044	0.109	0.238	0.399	0.601
(9,10)	0.000	0.000	0.002	0.008	0.029	0.077	0.179	0.319	0.510
(10,10)	0.000	0.000	0.001	0.004	0.019	0.051	0.128	0.242	0.414