Sunday, May 16, 2010

Ranking an array

In reviewing nonparametric statistical procedures, we see that we are missing the Mann-Whitney test. Scipy has a routine, mannwhitneyu(x, y), which returns the U statistic and the p-value of the test. The null hypothesis is that the two independent samples come from the same distribution, which is often stated as the two samples having the same median.

The test requires data measured on at least an ordinal scale (ordinal, interval or ratio); in other words, the data must be sortable so that ranks can be assigned. Since the test works on the ranks rather than on the raw values, we looked for our old routine for ranking and decided to revamp it. See Version 0.0.1
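To make the role of the ranks concrete, here is a minimal sketch of how a U statistic can be computed from pooled ranks. The helper names average_ranks and mann_whitney_u are our own and are not the scipy routine; the sketch assumes 1-based ranks with ties averaged. Which of the two U values a given library reports may differ, so treat this as an illustration only.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'last' to the end of the run of equal values.
        last = pos
        while last + 1 < len(order) and values[order[last + 1]] == values[order[pos]]:
            last += 1
        # Sorted positions are 0-based, ranks are 1-based.
        mean_rank = (pos + last) / 2.0 + 1
        for k in range(pos, last + 1):
            ranks[order[k]] = mean_rank
        pos = last + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) computed from the pooled ranks of x and y."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(ranks[:n1])          # ranks belonging to sample x
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    u_y = n1 * n2 - u_x                   # the two U values sum to n1 * n2
    return u_x, u_y


x = [1, 3, 5]
y = [2, 4, 6, 8]
print(mann_whitney_u(x, y))  # (3.0, 9.0)
```

Here U_x counts the pairs in which a value from x exceeds a value from y (a tie counts one half), which is why the ranking routine has to average tied ranks.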


def rankarray(X, rankstartvalue = 1,  averageTies=True):
    """
    Returns the ranks of the elements of X, in the original order of X.
    Ranks start at rankstartvalue. If averageTies is True, tied values
    share the average of the ranks they occupy.
    version 0.0.2 may 16, 2010
    """
    # R[j] is the rank of the j-th smallest value.
    R = [rankstartvalue + i for i in range(len(X))]
    # Pair each value with its original index, then sort by value.
    xi =  [(x,i) for i, x in enumerate(X)]
    xi.sort()
    if averageTies:
        start = 0
        end   = 0
        for i in range(1, len(X)):
            if xi[i][0] == xi[start][0]:
                end = i
            else:
                if start != end:
                    # Average the ranks, not the values; 2.0 avoids integer division.
                    avgRank = (R[start] + R[end]) / 2.0
                    for j in range(start, end+1):
                        R[j] = avgRank
                start = i
                end   = i

        # Adjust for any trailing tied ranks.
        if start != end:
            avgRank = (R[start] + R[end]) / 2.0
            for j in range(start, end+1):
                R[j] = avgRank

    # Put the ranks back in the original order of X.
    RR = [0] * len(X)
    for j, (x, i) in enumerate(xi):
        RR[i] = R[j]
    return RR

if __name__ == "__main__":
    X = [7, 7, 4,4,4,4,4, 8,  6, 5, 1, 1]

    print("X= %s" % X)
    print("Rank start value=%d, averageTies=%s" % (0, True))
    R = rankarray(X,  rankstartvalue = 0,  averageTies = True)
    print(R)

    print("Rank start value=%d, averageTies=%s" % (0, False))
    R = rankarray(X,  rankstartvalue = 0,  averageTies = False)
    print(R)

    print("Rank start value=%d, averageTies=%s" % (1, True))
    R = rankarray(X,  rankstartvalue = 1,  averageTies = True)
    print(R)

    print("Rank start value=%d, averageTies=%s" % (1, False))
    R = rankarray(X,  rankstartvalue = 1,  averageTies = False)
    print(R)


When the above script is run, it outputs:

Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
[GCC 4.4.3] on toto-laptop, Standard
>>> X= [7, 7, 4, 4, 4, 4, 4, 8, 6, 5, 1, 1]
Rank start value=0, averageTies=True
[9.5, 9.5, 4.0, 4.0, 4.0, 4.0, 4.0, 11, 8, 7, 0.5, 0.5]
Rank start value=0, averageTies=False
[9, 10, 2, 3, 4, 5, 6, 11, 8, 7, 0, 1]
Rank start value=1, averageTies=True
[10.5, 10.5, 5.0, 5.0, 5.0, 5.0, 5.0, 12, 9, 8, 1.5, 1.5]
Rank start value=1, averageTies=False
[10, 11, 3, 4, 5, 6, 7, 12, 9, 8, 1, 2]

We cannot, however, guarantee that the routine is free of errors.
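One way to reduce that uncertainty is to compare the results against a slower ranking written directly from the definition. The sketch below is our own cross-check, and the name brute_force_ranks is hypothetical, not part of the routine above. It gives each value the start rank, plus the number of strictly smaller values, plus half the number of other values equal to it.

```python
def brute_force_ranks(values, start=1):
    """Average-tie ranks computed directly from the definition (O(n^2))."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)  # includes v itself
        # Tied values occupy ranks start+smaller .. start+smaller+equal-1;
        # their mean is the first of those ranks plus (equal - 1) / 2.
        ranks.append(start + smaller + (equal - 1) / 2.0)
    return ranks


X = [7, 7, 4, 4, 4, 4, 4, 8, 6, 5, 1, 1]
print(brute_force_ranks(X, start=0))
```

The printed list can be compared element by element with rankarray(X, rankstartvalue = 0, averageTies = True); any disagreement points to a bug in one of the two.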
