Monday, April 5, 2010

Python, Statistics: The non-parametric Runs test for randomness.

Top

  1. Introduction
  2. Probability mass function
  3. Normal Approximation
  4. Python Code
  5. Swed and Eisenhart table







TopIntroPmfPythonNormalSwed Bottom



Introduction



Consider the sequence ABBAAABBABABBB. A subsequence of consecutive same letters is called a run. For our example, we use vertical bars to separate each run: A|BB|AAA|BB|A|B|A|BBB consists of 8 runs. This sequence contains 6 A's and 8 B's which we denote by (m,n)= (6,8). We would be suspicious of people claiming that the sequences AAAAAAABBBBBBB or ABABABABABABAB are random. The runs test was developed precisely to test the null hypothesis whether a sample of a sequence of binary symbols is randomly generated. The minimum number of runs of two symbols $m$ of one kind and $n$ of another kind is 2. On the other hand, the maximum possible number of runs is $2\, min(m, n)$ plus 1 if both m and n are different.



The probability mass function of the distribution of runs



The distribution of the number of runs has been "derived" or explained in [Mood, Graybill and Boes,"Introduction to Statistics" McGraw Hill, 3e, 1974] p.519-521. The probability mass function is given for any even number of runs $(z = 2k)$ as \[ prob(Z = z)=\frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}}\] and for an odd number of runs, $(z = 2k+1)$ as \[ prob(Z = z)=\frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1} \binom{n-1}{k}}{\binom{m+n}{m}}\] Recall that $\binom{n}{r}$ is the number of combinations of $n$ things taken $r$ at a time.It can be computed as $\binom{n}{r} = \frac{n!}{r!(n-r)!}$ but this is inefficient, and instead we use the formula $\binom{n}{r} = \frac{n(n-1)\cdots (n-r+1)}{1\cdot 2\cdots r}$ It would be foolish to do by hand even with aid of a non-programmable scientific calculators, especialy when calculating the cumulative mass distribution function and is the reason we are writing Python codes.



TopIntroPmf NormalPythonSwedBottom




The normal approximation to the distribution of runs



When m and n are both large, say m and n are both greater than 10, then an asymptotic normal approximation might be good enough. The mean is given by \[\mu = \frac{2mn}{m+n} +1\] and the variance would be given by \[\sigma^2 = \frac{2mn(2mn -m -n)}{(m+n)^2 (m+n-1}\]




TopIntroPmf NormalPythonSwedBot





Python codes



Here is our latest version as of April 6. It is more complete now but as usual, users beware. It has not been thoroughly debugged.
FunctionDescription
nCr(n, r) binomial coefficient
maxruns(m,n) maximum runs is (m,n) sequence
pmfruns(z, m, n) probability mass function
cmftableruns(m, n) cumulative mass function table
pvalue(z, m, n, side=-1) $p-$value of test statistic z
normrunstest(z, m, n, alpha, side) normal approximation test
exactrunstest(z,m,n,alpha,side=0)exact test for runs
swedeisenhart Swed and Eisenhart (m,n)cumulative distribution table
isvalidruns(S) return (m,n) if S is a valid runs sequence, else None



# -*- coding: utf-8 -*-
"""
file    testofruns.py
author  Ernesto P. Adorio
version 0.0.1 April 5, 2010
         
"""

from math import sqrt
from basictests import ztest,  ttest


def  nCr(n, r):
     """
     Computes the binomial coefficient, number of combinations of n things taken r at a time.
     """
     if r == 1:  return n
     if r == 0 or r == n: return 1
     
     if n < r   or n  < 0 or r  < 0: return 0
     if r > n-r: # find value of less looping.
        r = n-r
     f = 1
     for i in range(1, r+1):
        f *= (n - i+1) 
        f /= i
     return f
     
def maxruns(m,n):
     # maximum runs with m  symbols of the first kind and n symbols of the second.kind.
     return 2 * min(m, n) + (1 if m!=n else 0)
  
def pmfruns(z, m, n):
     """
     Computes the probability mass function for the distribution of runs.
     """
     if z < 2: 
        raise ValueError,  "in pmfruns, z must satisfy z >= 2"
     if  z % 2 ==0: #even z.
        k = z // 2 
        return 2.0 * nCr(m-1, k-1) * nCr (n-1, k-1)/ nCr(m+n, m)
     else : #odd z
        k = (z-1) // 2
        return float(nCr(m-1, k) * nCr(n-1, k-1) + nCr(m-1, k-1) * nCr(n-1, k)) / float(nCr(m + n, m))

def cmftableruns(m, n):
    """
    Returns a cumulative masss function for the distribution of runs.
    You must note that that the random variable z starts at 2 !
    """
    tot = 0
    result=[]
    
    maxz = maxruns(m,  n)
    for i in range(2, maxz+1):
       tot += pmfruns(i, m, n)
       result.append(tot)
    return  result


def pvalue(z, m,  n,  side=-1):
    "unchecked. to be tested first!"
    upperz = maxruns(m,n)
    if not (2 <= z <= maxruns) :
       raise ValueError, "in pvalue(), z must lie in (2, %d)" % upperz
    table = cmftableruns(m,n)
    if side == 0:
       p = table[z-2]
       if p > 0.5: 
          p = 1-0.5
       return p * 2
    elif side == -1:
        return table[z-2]
    elif side == 1:
       return 1 - table[z-2] 
        
def normrunstest(z,  m,  n,  alpha,  side):
    #Test using normal approximation, when both m and n are greater than 10.
    mean = 2.0 * m *n / (m+n) + 1
    variance = 2. * m * n * (2 *m *n -m -n)/((m+n)**2 * (m+n - 1))
    return ztest(z - mean,  sqrt(variance/n), alpha,  side) 


def swedeisenhart():
    print
    print """"""      for i in range(2, 11):         print '' % i,     print "
"     for m in range(2, 11):         for n in range(m, 11):             rowvals = cmftableruns(m, n)             maxcol = len(rowvals)             print "
" % (m,n),             for j in range(min(maxcol, 9)):                 print "" % rowvals[j],             for j in range(maxcol, 9):                 print "",             print "
"     print "
P(Z <= a) when H_0 is true in the Runs test
(m, n)%d
(%d,%d)%5.3f
" def exactrunstest(z, m, n, alpha, side=0): #returns pvalue, test stat z and lower and upper limits. cmf = cmftableruns(m,n) print cmf, a = b = None if side ==0: alpha /= 2.0 alpha1 = alpha alpha2 = 1-alpha1 elif side == 1: alpha2 = 1-alpha elif side == 0: alpha1 = alpha for i in range(len(cmf)): if cmf[i] <= alpha1: a = i+2 if cmf[i] >= alpha2: b = i+2 p = pvalue(z, m,n, side) if side == 0: return (p, z, (a,b)) elif side == 1: return (p, z, b) elif side == -1: return (p, z, a) def isvalidruns(S): D = {} for s in S: if s in D: D[s] +=1 else: D[s] = 1 if len(D) != 2: raise ValueError, "in isvalidruns(): more than two symbols" counts = [] for (i,s) in enumerate(D): counts.append(D[s]) a,b = counts[0], counts[1] return min(a, b), max(a,b) def Test(): swedeisenhart() # decomment line to turn on table generation test. #print "Exact runs test for z = 5, (m, n)= (5,7)" #print exactrunstest(5, 5,7, 0.05, side=0) """ print pmfruns(3, 5,4) print normrunstest(5, 4,9,0.05, -1) print "Table for runs test." m, n = 3,5 print cmftableruns(m,n) """ if __name__ == "__main__": Test()


The Python module above calls the ztest which was already described in previous posts. We will add all files required in one zip file for convenience later.

TopIntroPmf Python NormalSwedBot

Swed and Eisenhart Table

Here is an html table obtained by running swedeisenhart() routine.
$P(Z <= a)$ when $H_0$ is true in the Runs test
(m, n) 2 3 4 5 6 7 8 9 10
(2,2) 0.333 0.667 1.000
(2,3) 0.200 0.500 0.900 1.000
(2,4) 0.133 0.400 0.800 1.000
(2,5) 0.095 0.333 0.714 1.000
(2,6) 0.071 0.286 0.643 1.000
(2,7) 0.056 0.250 0.583 1.000
(2,8) 0.044 0.222 0.533 1.000
(2,9) 0.036 0.200 0.491 1.000
(2,10) 0.030 0.182 0.455 1.000
(3,3) 0.100 0.300 0.700 0.900 1.000
(3,4) 0.057 0.200 0.543 0.800 0.971 1.000
(3,5) 0.036 0.143 0.429 0.714 0.929 1.000
(3,6) 0.024 0.107 0.345 0.643 0.881 1.000
(3,7) 0.017 0.083 0.283 0.583 0.833 1.000
(3,8) 0.012 0.067 0.236 0.533 0.788 1.000
(3,9) 0.009 0.055 0.200 0.491 0.745 1.000
(3,10) 0.007 0.045 0.171 0.455 0.706 1.000
(4,4) 0.029 0.114 0.371 0.629 0.886 0.971 1.000
(4,5) 0.016 0.071 0.262 0.500 0.786 0.929 0.992 1.000
(4,6) 0.010 0.048 0.190 0.405 0.690 0.881 0.976 1.000
(4,7) 0.006 0.033 0.142 0.333 0.606 0.833 0.955 1.000
(4,8) 0.004 0.024 0.109 0.279 0.533 0.788 0.929 1.000
(4,9) 0.003 0.018 0.085 0.236 0.471 0.745 0.902 1.000
(4,10) 0.002 0.014 0.068 0.203 0.419 0.706 0.874 1.000
(5,5) 0.008 0.040 0.167 0.357 0.643 0.833 0.960 0.992 1.000
(5,6) 0.004 0.024 0.110 0.262 0.522 0.738 0.911 0.976 0.998
(5,7) 0.003 0.015 0.076 0.197 0.424 0.652 0.854 0.955 0.992
(5,8) 0.002 0.010 0.054 0.152 0.347 0.576 0.793 0.929 0.984
(5,9) 0.001 0.007 0.039 0.119 0.287 0.510 0.734 0.902 0.972
(5,10) 0.001 0.005 0.029 0.095 0.239 0.455 0.678 0.874 0.958
(6,6) 0.002 0.013 0.067 0.175 0.392 0.608 0.825 0.933 0.987
(6,7) 0.001 0.008 0.043 0.121 0.296 0.500 0.733 0.879 0.966
(6,8) 0.001 0.005 0.028 0.086 0.226 0.413 0.646 0.821 0.937
(6,9) 0.000 0.003 0.019 0.063 0.175 0.343 0.566 0.762 0.902
(6,10) 0.000 0.002 0.013 0.047 0.137 0.287 0.497 0.706 0.864
(7,7) 0.001 0.004 0.025 0.078 0.209 0.383 0.617 0.791 0.922
(7,8) 0.000 0.002 0.015 0.051 0.149 0.296 0.514 0.704 0.867
(7,9) 0.000 0.001 0.010 0.035 0.108 0.231 0.427 0.622 0.806
(7,10) 0.000 0.001 0.006 0.024 0.080 0.182 0.355 0.549 0.743
(8,8) 0.000 0.001 0.009 0.032 0.100 0.214 0.405 0.595 0.786
(8,9) 0.000 0.001 0.005 0.020 0.069 0.157 0.319 0.500 0.702
(8,10) 0.000 0.000 0.003 0.013 0.048 0.117 0.251 0.419 0.621
(9,9) 0.000 0.000 0.003 0.012 0.044 0.109 0.238 0.399 0.601
(9,10) 0.000 0.000 0.002 0.008 0.029 0.077 0.179 0.319 0.510
(10,10) 0.000 0.000 0.001 0.004 0.019 0.051 0.128 0.242 0.414
If you have any suggestions, or corrections, or features request do not hesitate to email the author at ernesto.adorio @ gmail . com.

Bottom



Top IntroPmfPython NormalSwedBot

I have encountered the html problem: Your HTML cannot be accepted: Closing tag has no matching opening tag: A ???!!! I wonder where its !!

1 comment:

  1. where is the basictest module? did you create it or is it a python module? "from basictests import ztest, ttest"

    ReplyDelete