Sunday, April 4, 2010

Testing the significance of the correlation coefficient r

The sample correlation coefficient $r$ measures the strength of the linear relationship between two variables X and Y. (Strictly speaking, the coefficient of determination $r^2$ is the better measure of how much variation is explained.) It is computed by the formula

$$ r = \frac{\sum_{i = 1}^n (x_i - \overline{x})(y_i - \overline{y})}{(n-1)\, S_x S_y} $$

where $S_x$ and $S_y$ are the sample standard deviations of the X and Y samples, with $S_x = \sqrt{\frac{\sum (x_i -\overline{x})^2}{n-1}}$ and a similar formula for $S_y$. The value of $r$ ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). A negative correlation means that as one variable increases, the other decreases; in other words, the two variables move in opposite directions. A positive correlation arises when both variables move together in the same direction: if one increases, the other also increases, and if one decreases, the other also decreases.
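To make the formula concrete, here is a minimal sketch that applies it directly to a small made-up data set (the numbers are purely hypothetical; the complete module later in this post wraps the same computation in reusable cov, Sx and cor functions):

from math import sqrt

X = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical sample
Y = [2.1, 3.9, 6.2, 8.1, 9.8]    # roughly linear in X

n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n
Sx = sqrt(sum((x - xbar) ** 2 for x in X) / (n - 1.0))
Sy = sqrt(sum((y - ybar) ** 2 for y in Y) / (n - 1.0))
r = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ((n - 1.0) * Sx * Sy)
print r    # close to +1, since Y increases almost linearly with X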

Correlation is generally used when it is difficult to determine the causal direction: does X cause Y, or does Y cause X? It is heavily used in psychology, for example. In this post, we describe how to test the significance of the computed correlation coefficient.

The sample test statistic $t= \frac{r-\rho}{\sqrt{\frac{1-r^2}{n-2}}}$ follows, at least asymptotically, the t-distribution with $df=n-2$ degrees of freedom (under the null hypothesis $\rho = 0$ with bivariate normal data the distribution is exact), so tables of the t-distribution may be used. Thus, to write our Python program using the ttest routine described previously on this blog, we only need to specify the sample test statistic and the standard error.

def corrtest(rho, r, n, alpha=0.05, side=0):
    stattest = r - rho
    se = sqrt((1 - r * r) / (n - 2.0))
    return ttest(stattest, se, n - 2, alpha, side)



Neat, isn't it? We obtain the p-value, the distribution test statistic and the critical limits (values). For side = 0, a double-sided test, the two limits do not strictly form the true confidence interval (a proper confidence interval for $\rho$ is usually built with Fisher's z-transformation instead). Here is the complete testofcorr.py Python file, where we include the ttest function. It calls the rstats.py module we have written before so that we can use R-style names for the Python statistical functions.

# -*- coding: utf-8 -*-
"""
file     testofcorr.py
author   Ernesto P. Adorio
         ernesto.adorio @ gmail.com
         UP at Clarkfield
         Angeles, Pampanga

version  0.0.1 April 4, 2010
"""

from math import *
from rstats import *



def ttest(samplestat, se, df, alpha=0.05, side=0):
    """
    T-test of a sample statistic.
    Arguments:
     samplestat - sample statistic
     se    - standard error of the sample statistic
     df    - degrees of freedom
     alpha - significance level
     side  - -1 left-sided test
              0 double-sided test
             +1 right-sided test
    """
    Ttest = samplestat / se
    if side == 0:
        pvalue = pt(Ttest, df)
        if Ttest > 0:
            pvalue = 1 - pvalue
        pvalue *= 2
        tcrit1 = qt(alpha / 2.0, df)
        tcrit2 = qt(1 - alpha / 2.0, df)
        return pvalue, Ttest, (tcrit1, tcrit2)
    elif side == -1:
        pvalue = pt(Ttest, df)
        tcrit = qt(alpha, df)
        return pvalue, Ttest, tcrit
    else:
        pvalue = 1 - pt(Ttest, df)
        tcrit = qt(1.0 - alpha, df)
        return pvalue, Ttest, tcrit


def cov(X, Y):
    # sample covariance of X and Y
    n = len(X)
    if n != len(Y):
        raise ValueError("in cov: len(X) != len(Y)")
    xbar = float(sum(X)) / n
    ybar = float(sum(Y)) / n
    return sum([(x - xbar) * (y - ybar) for x, y in zip(X, Y)]) / (n - 1.0)

def Sx(X):
    # sample standard deviation of X
    n = len(X)
    xbar = float(sum(X)) / n
    return sqrt(sum([(x - xbar) ** 2 for x in X]) / (n - 1.0))
 
def cor(X,Y): 
    # correlation coefficient of X and Y. 
    return cov(X,Y)/(Sx(X)*Sx(Y))


def corrtest(rho, r, n, alpha=0.05, side=0):
    # test H0: the population correlation equals rho, given sample r of size n
    stattest = r - rho
    se = sqrt((1 - r * r) / (n - 2.0))
    return ttest(stattest, se, n - 2, alpha, side)


def xycorrtest(X, Y, rho, alpha=0.05, side=0):
    # compute r from the raw data, then test it against rho
    if len(X) != len(Y):
        raise ValueError("in xycorrtest: X and Y must have same length")
    r = cor(X, Y)
    return corrtest(rho, r, len(X), alpha, side)


def Test():
    # Berenson, p. 546
    rho = 0
    r = 0.951
    alpha = 0.05
    side = 1
    n = 14
    print "Berenson's example, p. 546"
    print corrtest(rho, r, n, alpha, side)

if __name__ == "__main__":
    Test()



When we invoke the Python interpreter on the file, it runs the Berenson (p. 546) example. The output is


toto@toto-laptop:~/Blogs/statistics$ python testofcorr.py
Berenson's example, p. 546
(8.9865653363219167e-08, 10.654779470653486, 1.7822875556491591)
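As a quick sanity check on the test statistic in that tuple, plugging the Berenson values $r = 0.951$ and $n = 14$ into the formula given earlier reproduces it by hand:

$$ t = \frac{0.951 - 0}{\sqrt{\frac{1 - 0.951^2}{14 - 2}}} = \frac{0.951}{\sqrt{0.095599/12}} \approx \frac{0.951}{0.0893} \approx 10.65 $$

and the right-sided 5% critical value of the t-distribution with 12 degrees of freedom is about 1.782, matching the last entry.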


The extremely low p-value means that we can reject the null hypothesis that the correlation coefficient is zero.
To support this decision, the obtained test statistic is greater than the critical value of 1.78, i.e.,
it falls in the rejection region.
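When the raw data are available, xycorrtest can be called directly instead of supplying a precomputed r. Here is a minimal sketch, assuming the module above is saved as testofcorr.py with rstats.py importable, and using the same made-up X and Y lists as earlier (hypothetical numbers, not the Berenson data):

from testofcorr import xycorrtest

X = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data, not from Berenson
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

# right-sided test of H0: rho = 0 at the 5% significance level
pvalue, tstat, tcrit = xycorrtest(X, Y, 0, alpha=0.05, side=1)
print pvalue, tstat, tcrit

Because Y is nearly a linear function of X, the resulting p-value is very small and the null hypothesis of zero correlation is again rejected.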


I still need to set up a darker background for <code> blocks and a central download repository for
the files. Stay tuned. We will add more useful technical content soon!
