Sunday, August 15, 2010

Statistics and Python: Creating a frequency counts array for data.

Given a list of recorded data X and a list of upper class marks UCM, we generate a list of frequency counts for the data such that each x in X is counted in freq[i], where i is the lowest index with x <= ucm[i].





Our frequency counting function automatically reserves an excess entry that counts every x greater than the last upper class mark. It is considered bad design if this entry is ever nonzero, as it may indicate outliers in the data, and the compressed frequency count table may then not truly represent the original data.
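As a minimal sketch of the counting rule and the excess entry (this is not the code listed later in this post), the standard library function bisect.bisect_left returns the first index i with x <= ucm[i], and returns len(ucm) when x is above every mark. The name count_by_ucm and the sample numbers are made up for illustration, and ucm is assumed to be sorted in increasing order.

```python
from bisect import bisect_left

def count_by_ucm(X, ucm):
    """Count each x in the first class i with x <= ucm[i]; ucm must be sorted."""
    freq = [0] * (len(ucm) + 1)   # the last slot collects values above ucm[-1]
    for x in X:
        freq[bisect_left(ucm, x)] += 1
    return freq

# 1, 2, 2 fall in the first class, 5 in the third, 9 in the excess slot.
print(count_by_ucm([1, 2, 2, 5, 9], [2, 4, 6]))
```

A nonzero last entry in the result is the sign that the upper class marks do not cover the data.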






Later we will handle the way to design the upper class marks list. We hasten to add that computers now allow statistical analysis of the ORIGINAL data, even when it runs to millions of values, and have made the manual frequency count table largely obsolete. Nevertheless, for historical curiosity and for teachers who insist that students know how to create such tables, we will write Python programs as aids in such laborious constructions.






Here is our frequency counting Python code. It expects the original list of data X and a list of upper class marks. Each upper class mark represents a class interval (ucm[i-1], ucm[i]]. A data point x falls in such a class whenever $$ucm[i-1] < x \le ucm[i]$$. For good results, all data points must fall in some class interval. The minimum value of X should fall within (ucm[0]-cw, ucm[0]], where cw = ucm[1] - ucm[0] is the class width, and the maximum value of X must fall within (ucm[n-2], ucm[n-1]]. Recall that Python lists use zero-based indexing.






"""
file    distrib.py
author  Ernesto P. Adorio
        ernesto.adorio
version 0.0.1  2010.08.16  first version for counting only.
                           this will be added to xfreqdescstats.py after
                           thorough testing.
"""

def distrib(X, ucm):
    """
    Given upper class marks ucm (in increasing order) and an input array
    of values X, returns a table of frequencies with len(ucm) + 1 entries.
    The last entry counts the excess values greater than ucm[-1].
    """
    n = len(ucm)
    print("len(ucm) = %d" % n)
    freq = [0] * (n + 1)
    for x in X:
        if x < ucm[0]:        # below the first mark: first class
            freq[0] += 1
        elif x > ucm[n-1]:    # excess frequencies!
            freq[n] += 1
        else:
            # find the lowest i with x <= ucm[i]
            i = 0
            while x > ucm[i]:
                i += 1
            freq[i] += 1
    return freq

def disttable(ucm, freq):
    """
    Creates a frequency table with one row per class: ucm, class
    representative x, frequency count, relative frequency and
    cumulative relative frequency.
    WARNING: all classes are assumed to have the same width, ucm[1] - ucm[0].
    """
    n    = len(ucm)
    cw   = ucm[1] - ucm[0]
    xrep = 0.5 * (ucm[1] + ucm[0]) - cw  # class rep (midpoint) of first class interval.
    # Adjust for excess frequency counts: add one more class above the last mark.
    if len(freq) > n and freq[n] > 0:
        ucm = list(ucm) + [ucm[n-1] + cw]
        n = n + 1
    totfreq = sum(freq)

    T = [[0.0] * 5 for i in range(n)]
    crf = 0.0
    print("totfreq= %d" % totfreq)
    for i in range(n):
        # columns: ucm, class rep x, freq, rf, crf
        rf = freq[i] / float(totfreq)
        crf += rf
        T[i] = [ucm[i], xrep, freq[i], rf, crf]
        xrep += cw
    return T
 

if __name__ == "__main__":
    # arange is a NumPy function; import it from numpy directly.
    from numpy import arange
    import scipy.stats as stat

    X = stat.norm.rvs(size=100)
    ucm = arange(-3.0, 3.1, 0.25).tolist()  # plain Python floats
    freq = distrib(X, ucm)
    T = disttable(ucm, freq)
    print(freq)
    print("Frequency distribution table for example:")
    for row in T:
        print(row)
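The requirement stated before the listing, that every data point fall in some class interval, can be checked before counting. The following is only a sketch: check_coverage is a hypothetical helper that is not part of distrib.py, and it assumes sorted upper class marks with equal class widths. It tests coverage only, not whether the minimum and maximum land exactly in the first and last classes.

```python
def check_coverage(X, ucm):
    """Return True if every x lies in (ucm[0] - cw, ucm[-1]], assuming equal class widths."""
    cw = ucm[1] - ucm[0]     # class width, taken from the first two marks
    lowest = ucm[0] - cw     # open lower limit of the first class
    return min(X) > lowest and max(X) <= ucm[-1]

print(check_coverage([0.5, 1.0, 2.9], [1.0, 2.0, 3.0]))  # all points are covered
print(check_coverage([0.0, 1.0, 3.5], [1.0, 2.0, 3.0]))  # 0.0 and 3.5 fall outside
```

If the check fails, the upper class marks should be redesigned rather than relying on the excess entry.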
    





Here is one run of the program, for 100 generated normal random numbers.

len(ucm) = 25
totfreq= 100
[0, 0, 0, 0, 0, 0, 2, 4, 4, 7, 6, 10, 8, 9, 8, 12, 6, 9, 7, 2, 2, 2, 1, 1, 0, 0]
Frequency distribution table for example:
[-3.0, -3.125, 0, 0.0, 0.0]
[-2.75, -2.875, 0, 0.0, 0.0]
[-2.5, -2.625, 0, 0.0, 0.0]
[-2.25, -2.375, 0, 0.0, 0.0]
[-2.0, -2.125, 0, 0.0, 0.0]
[-1.75, -1.875, 0, 0.0, 0.0]
[-1.5, -1.625, 2, 0.02, 0.02]
[-1.25, -1.375, 4, 0.040000000000000001, 0.059999999999999998]
[-1.0, -1.125, 4, 0.040000000000000001, 0.10000000000000001]
[-0.75, -0.875, 7, 0.070000000000000007, 0.17000000000000001]
[-0.5, -0.625, 6, 0.059999999999999998, 0.23000000000000001]
[-0.25, -0.375, 10, 0.10000000000000001, 0.33000000000000002]
[0.0, -0.125, 8, 0.080000000000000002, 0.41000000000000003]
[0.25, 0.125, 9, 0.089999999999999997, 0.5]
[0.5, 0.375, 8, 0.080000000000000002, 0.57999999999999996]
[0.75, 0.625, 12, 0.12, 0.69999999999999996]
[1.0, 0.875, 6, 0.059999999999999998, 0.76000000000000001]
[1.25, 1.125, 9, 0.089999999999999997, 0.84999999999999998]
[1.5, 1.375, 7, 0.070000000000000007, 0.91999999999999993]
[1.75, 1.625, 2, 0.02, 0.93999999999999995]
[2.0, 1.875, 2, 0.02, 0.95999999999999996]
[2.25, 2.125, 2, 0.02, 0.97999999999999998]
[2.5, 2.375, 1, 0.01, 0.98999999999999999]
[2.75, 2.625, 1, 0.01, 1.0]
[3.0, 2.875, 0, 0.0, 1.0]




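Each row lists, in order, the upper class mark (ucm), the class representative (the class midpoint), the frequency, the relative frequency (rf), and the cumulative relative frequency (crf) of one class. The rows are not pretty printed, and it is left to the reader to format the output.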
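For readers who want a starting point, here is one possible way to format the rows, offered as a sketch only. The function name format_table, the column widths, and the number of decimals are arbitrary choices made for illustration. The two sample rows are taken from the run above, with rf and crf rounded.

```python
def format_table(T):
    """Return the rows of T as aligned text lines under a header line."""
    lines = ["%8s %8s %6s %8s %8s" % ("ucm", "x", "freq", "rf", "crf")]
    for ucm_i, xrep, f, rf, crf in T:
        lines.append("%8.3f %8.3f %6d %8.4f %8.4f" % (ucm_i, xrep, f, rf, crf))
    return "\n".join(lines)

sample = [[-1.5, -1.625, 2, 0.02, 0.02],
          [-1.25, -1.375, 4, 0.04, 0.06]]
print(format_table(sample))
```

The same function can be applied directly to the table T returned by disttable.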

I am wondering why blogger removes our br tags and removes indents from our code!
