## Sunday, April 4, 2010

### Testing the significance of the correlation coefficient r

The sample correlation coefficient is a measure of the strength of the linear relationship between two variables X and Y. Actually it is the coefficient of determination $r^2$ which is a better measure. It is computed by the formula

$$r =\frac{\sum_{i = 1}^n (x_i - \overline{x} ) (y_i - \overline{y})}{S_x S_y}$$

where $S_x$ and $S_y$ are the standard deviations of the X and Y samples. We have For $S_x = \sqrt{\frac{\sum (x_i -\overline{x})^2}{n-1}}$ with a similar formula for $S_y$. The value of $r$ varies from -1 (perfect negative correlation) to +1 (perfect positive correlation). A negative correlation means that as one variable increases, the other other variable decreases, or in other words, both variables varies in opposite directions. On the other hand a positive correlation arises when both variables varies together in the same direction: if one variable increases, the other also increases and if one variable decreases, the other also decreases.

Correlation is generally used when it is difficult to determine the causative order. Does X causes Y or does Y causes X? It is heavily used in the Psychology field for example. In this post, we describe how to test the significance of the computed correlation coefficient.

The sample test statistic $t= \frac{r-\rho}{\sqrt{\frac{1-r^2}{n-2}}}$ follow assymptotically the t-distribution with $df=n-2$ degree of freedom and tables of the t-distribution may be used. Thus to write our Python program to use the ttest routine described previously in our blog, we specify the sample test statistic and the standard error.

def corrtest( rho, r, n, alpha = 0.5, side = 0):
stattest = (r - rho)
se = sqrt( (1-r*r)/(n-2.0))
return ttest(r-rho, se, alpha, side)



Neat, is not it? We obtain the p-value, the distribution test statistic and the critical limits(values). For side = 0, a double sided test, the two limits do not strictly form the true confidence interval. Here is the complete testforcorr.py Python file where we included the ttest function. It calls the rstats.py module we have written before to use R-names for the Python statistical functions.

# -*- coding: utf-8 -*-
"""
file     testofcorr.py
author   Ernesto P. Adorio
ernesto.adorio @ gmail.com
UP at Clarkfield
Angeles, Pampanga

version  0.0.1 april 4, 2010
"""

from math import *
from rstats import *

def ttest(samplestat, se, df, alpha= 0.05, side=0):
"""
T-test of a sample statistic.
Arguments:
samplestat- sample statistic
se -    standard error of sample statistic
df      degree of freedom
alpha - significance level
side  - -1 left-sided test
0  double sided test
+1 right-sided test
"""
Ttest = samplestat/se
if side ==0:
pvalue = pt(Ttest, df)
if Ttest > 0:
pvalue = 1-pvalue
pvalue *=2
tcrit1 =    qt(alpha/2,   df)
tcrit2 =    qt(1-alpha/2.0,df)
return pvalue, Ttest, (tcrit1,tcrit2)
elif side == -1:
pvalue = pt(Ttest,  df)
tcrit = qt(alpha,  df)
return pvalue, Ttest, tcrit
else:
pvalue = 1- pt(Ttest,  df)
tcrit  = qt(1.0-alpha,  df)
return pvalue, Ttest, tcrit

def cov(X,Y):
n = len(X)
if n != len(Y):
raise "ArgumentError", "in cov: len(X) != len(Y)"
xbar = float(sum(X))/ n
ybar = float(sum(Y))/n
return sum([(x -xbar)*(y-ybar) for x,y in zip(X,Y)])/(n-1)

def Sx(X):
xbar = float(sum([x for x in X]))/n
return sqrt( sum([(x - xbar)**2 for x in X]) /(n-1.0))

def cor(X,Y):
# correlation coefficient of X and Y.
return cov(X,Y)/(Sx(X)*Sx(Y))

def corrtest( rho, r, n, alpha = 0.5, side = 0):
stattest = (r - rho)
se = sqrt( (1-r*r)/(n-2.0))
return ttest(r-rho, se, n-2,  alpha, side)

def xycorrtest(X,Y, rho, alpha = 0.5, side= 0):
if len(X) != len(Y):
raise ArgumentError, "in xycorrtest: X and Y must have same length"
r = cor(X,Y)
return corrtest(rho, r, len(X), alpha, side)

def Test():
#Berenson , p. 546
rho = 0
r = 0.951
alpha = 0.05
side = 1
n = 14
print "Berenson's example, p. 546"
print corrtest( rho, r, n, alpha, side)

if __name__ == "__main__":
Test()


When we invoke the Python interpreter it will run the Berenson, p. 546 example. The output is

 toto@toto-laptop:~/Blogs/statistics\$ python testofcorr.py Berenson's example, p. 546 (8.9865653363219167e-08, 10.654779470653486, 1.7822875556491591) 

The extremely low p-value means that we can reject the Null hypothesis that the correlation coefficient is zero.
To support this decision, the obtained distribution test statistic is greater then 1.78 critical value limit, i.e.,
is in the rejection region.

I need to specify a darker background for <code> blocks and a central download repository for
the files. Stay tuned. We will add more useful technical contents soon!