## Tuesday, April 6, 2010

### Python: can you program something better or differently? Counting number of runs and occurrences of two symbols.

Here is something to occupy beginning and intermediate Python programmers:
Given a string S supposedly consisting only of two different symbols,
count the number of runs(a sequence of the same character), and the number of occurrences of the two symbols.
For example, the string "aaaabbbbbbabababbbbbbbbbbbbbbbbbb" may be rewritten
as "aaaa|bbbbbb|a|b|a|b|a|bbbbbbbbbbbbbbbbbb". has 8 runs and 7 'a's and 26 'b's.

def countruns(S):
# returns number of runs, and (m,n) in string S
m =n = 0
runs = 1
a = S[0]
b = None
pc = a  # previous character.
for s in S:
if s == a:
m += 1
if pc != a:
runs += 1
pc = a
elif b is None:
b = s
n += 1
pc = s
runs += 1
elif s == b:
n += 1
if pc !=  b:
runs +=1
pc = b
else:
raise ValueError, "in countruns():more than two diff. symbols."
if m > n:
m,n = n,m
return runs, (m, n)

S = "aaaabbbbbbabababbbbbbbbbbbbbbbbbb"
print countruns(S)


When the above program runs it outputs 8,(7,26).
Suppose you have very large strings(length >1000000)?
The specifications are hazy. But one of them is correctness.

For an application, visit The Non-parametric Runs Test.

1. Since this is a Python problem, using a built-in module is valid. The code below has an added benefit of not limiting the number of symbols to two. The for loop can be removed (vectorized) by using two groupby() calls instead of one. BTW, indentation doesn't work. :(

import itertools
def countruns(S):
# returns number of runs, and (m,n) in string S
runs = 0
counts = {}
for k, v in itertools.groupby(S):
runs += 1
counts[k] = counts.get(k, 0) + sum(1 for i in v)
return runs, counts.values()

S = "aaaabbbbbbabababbbbbbbbbbbbbbbbbb"
print countruns(S)

2. Thanks to your elegant and shorter Python solutio, Ianalis. Itertools is one module which one needs to be more familiar with as it has quite powerful functions.