Wednesday, 22 February 2017

Correlation when outliers in the data

This is the second post in a series of six that deals with mathematics for calculation of correlation and trend in data with outliers. The posts are numbered 1 to 6. They should be read consecutively.

Post 1  Introduction to Statistical analysis of data with outliers
Post 2  Correlation when outliers in the data.
Post 3  Trend when outliers in the data.
Post 4  Correlation and trend when an outlier is added.   Example.
Post 5  Compare Kendall-Theil and OLS trends.             Simulations.
Post 6  Detect serial correlation when outliers.                Simulations.

The posts are gathered in this pdf document.

Start of post 2: Correlation when outliers in the data

The method most commonly used to estimate the correlation between two data sets is to calculate the correlation coefficient from the values in the two data sets. Calculating it from the ranks of the data is, however, more robust against outliers. This blog post discusses the mathematics behind both methods.

2.1 Pearson

The Pearson correlation coefficient between x and y is the covariance between them divided by the square root of the product of their variances. It is usually denoted by r. It may be calculated as shown in (2.1). n is the number of x and y values.

A large outlier in either x or y affects the numerator and the denominator in (2.1) differently. The Pearson correlation coefficient is therefore sensitive to outliers in the data; it is not robust against them.
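The sensitivity can be seen in a small numerical sketch. The code below assumes the textbook form of the Pearson coefficient (covariance divided by the square root of the product of the variances, as described above); the function name pearson_r and the data are made up for illustration and are not taken from the original post.

```python
import math

def pearson_r(x, y):
    """Pearson r: covariance over sqrt(var_x * var_y).

    The common 1/(n-1) factors cancel, so plain sums are used.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data (hypothetical): y rises steadily with x.
x = list(range(1, 11))
y = list(range(1, 11))
r_clean = pearson_r(x, y)          # perfectly linear, so r is 1

# Replace the last y value with one large negative outlier.
y_outlier = y[:-1] + [-100]
r_outlier = pearson_r(x, y_outlier)  # about -0.45: the sign has flipped

print(r_clean, r_outlier)
```

A single outlier among ten points is enough here to turn a perfect positive correlation into a negative one.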


The null hypothesis H0 is that the true correlation is zero, and the alternative hypothesis H1 is that it differs from zero, in either the positive or the negative direction. The p-value of r is the probability of obtaining a correlation coefficient at least this large in absolute value if the null hypothesis were true.

The t-value (2.2) is distributed approximately as a Student's t distribution with n-2 degrees of freedom under the null hypothesis. The F() function in (2.3) is the cumulative distribution function of the Student's t distribution with n-2 degrees of freedom.
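Equations (2.2) and (2.3) are not reproduced in this text. Assuming they follow the usual textbook form for a two-sided test of a correlation coefficient, they would read as below, with r, n and F() as defined above:

```latex
t = r \sqrt{\frac{n-2}{1-r^{2}}}
\qquad\qquad
p = 2\,\bigl(1 - F(|t|)\bigr)
```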

The Pearson equations are included in this document to demonstrate that the r value is not robust against outliers, and because the section on Spearman's rank correlation coefficient refers to them.

Reference for the mathematics

Hans von Storch, Francis W. Zwiers: Statistical Analysis in Climate Research, ISBN 0 511 01018 4 virtual (netLibrary Edition), shows the equations for the Pearson correlation coefficient in chapter 8.2.

2.2 Spearman

The Spearman rank correlation coefficient between x and y is calculated based on the ranks of the x and y data instead of on their data values. It is usually denoted by the Greek letter ρ (rho).

It is calculated just as the Pearson correlation coefficient in (2.1), except that the x and y values are replaced with their ranks. The p-value is calculated as the Pearson p-value in (2.2) and (2.3), except that the correlation coefficient applied in (2.2) is Spearman's rho and not Pearson's r.

Ranks are assigned to the values in ascending order. Equal values form a set of ties. With fractional ranking, every value in a set of ties gets the same rank, namely the mean of the ranks that the set would otherwise occupy, as illustrated in the example below.
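A minimal sketch of fractional ranking and of Spearman's rho as "Pearson on the ranks" follows. The function names and the data are hypothetical, chosen only for this illustration.

```python
import math

def fractional_ranks(values):
    """Rank in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is equal.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + 1 + end + 1) / 2  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson coefficient of the fractional ranks."""
    return pearson_r(fractional_ranks(x), fractional_ranks(y))

# The three values of 20 would occupy ranks 2, 3 and 4, so each gets 3.
ranks = fractional_ranks([10, 20, 20, 30, 20, 40])
print(ranks)  # [1.0, 3.0, 3.0, 5.0, 3.0, 6.0]

# Hypothetical data with one large outlier in y.
x = list(range(1, 11))
y_outlier = list(range(1, 10)) + [-100]
rho = spearman_rho(x, y_outlier)
print(rho)  # about 0.45
```

For this made-up data the outlier only moves one point to the lowest rank, so rho remains positive (about 0.45), whereas the Pearson coefficient of the raw values is negative.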

2.3 Kendall

The calculation of the Kendall rank correlation coefficient compares each xy pair with all the xy pairs that follow it. A change in the same direction for both x and y contributes towards positive correlation, a change in opposite directions contributes towards negative correlation, and no change in x, in y, or in both contributes towards no correlation. This holds regardless of how far apart the two pairs are in rank, which is what distinguishes Kendall from Spearman. The Kendall coefficient is usually denoted by the Greek letter τ (tau).

First the S value is calculated (2.4). n is the number of xy pairs. The sign() function returns +1 when its input parameter is positive, -1 when it is negative, and 0 when it is zero. In the latter case either the x or the y values or both of them are equal. Equal values are denoted as tied values.

T0 is the number of xy pairs that are compared in (2.4). It is the maximum value of S, and -T0 is the minimum value of S. T0 is defined in (2.5).

The correlation coefficient is calculated as either tau-a or tau-b. Tau-b compensates for tied values, while tau-a does not do that. (2.6) shows how tau-a is calculated.

T0 (2.5) is the binomial coefficient 'n choose 2'. It gives the number of ways to choose a subset of 2 elements, disregarding their order, from a set of n elements. It is equal to the total number of terms summed in (2.4). If every one of these terms adds 1 to S, the numerator and the denominator in (2.6) are equal and tau-a becomes 1.
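The description of S, T0 and tau-a can be sketched in a few lines. The code assumes that (2.4) sums sign(xj - xi) * sign(yj - yi) over all pairs with i < j, that T0 = n(n-1)/2, and that tau-a = S / T0; the function names and the data are hypothetical.

```python
def sign(v):
    """Return +1 for positive input, -1 for negative input, 0 for zero."""
    return (v > 0) - (v < 0)

def kendall_s(x, y):
    """Compare each xy pair with all the pairs that follow it."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s += sign(x[j] - x[i]) * sign(y[j] - y[i])
    return s

def tau_a(x, y):
    n = len(x)
    t0 = n * (n - 1) // 2  # 'n choose 2': the number of compared pairs
    return kendall_s(x, y) / t0

x = [1, 2, 3, 4, 5]

# Every pair changes in the same direction: S equals T0 = 10.
s_clean = kendall_s(x, [1, 2, 3, 4, 5])
tau_clean = tau_a(x, [1, 2, 3, 4, 5])

# One outlier in the last y value: 6 pairs agree, 4 disagree, so S = 2.
s_outlier = kendall_s(x, [1, 2, 3, 4, -100])
tau_outlier = tau_a(x, [1, 2, 3, 4, -100])

print(s_clean, tau_clean, s_outlier, tau_outlier)
```

The size of the outlier does not matter here: replacing -100 with -1 gives the same S, because only the direction of each change is counted.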

Tied values do not contribute to S in the numerator in (2.6), but they are part of n and they therefore contribute to T0 in the denominator. When calculating tau-b this mismatch is compensated for, as shown in (2.7) to (2.9). In these equations the tied x and y values reduce the denominator too.

Tx (2.7) is calculated from the px groups of tied x values, where group i consists of tx,i data values. Tx is the number of times that tied x values cause a zero in (2.4). Ty (2.8) is calculated in the same way from the tied y values.
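The tie correction can be sketched as follows. The code assumes the form given in the SAS documentation cited below: Tx is the sum of t(t-1)/2 over the groups of tied x values, Ty likewise for y, and tau-b = S / sqrt((T0 - Tx)(T0 - Ty)). The function names and the data are hypothetical.

```python
import math
from collections import Counter

def sign(v):
    return (v > 0) - (v < 0)

def kendall_s(x, y):
    n = len(x)
    return sum(sign(x[j] - x[i]) * sign(y[j] - y[i])
               for i in range(n - 1) for j in range(i + 1, n))

def tie_term(values):
    """Sum of t*(t-1)/2 over groups of equal values (group size t).

    A group of t tied values produces t*(t-1)/2 zero terms in S.
    Untied values form groups of size 1 and contribute nothing.
    """
    return sum(t * (t - 1) // 2 for t in Counter(values).values())

def tau_b(x, y):
    n = len(x)
    t0 = n * (n - 1) // 2
    tx, ty = tie_term(x), tie_term(y)
    return kendall_s(x, y) / math.sqrt((t0 - tx) * (t0 - ty))

# One pair of tied y values: S = 5, T0 = 6, Tx = 0, Ty = 1.
x = [1, 2, 3, 4]
y = [1, 1, 2, 3]
tau_a_value = kendall_s(x, y) / 6                    # 5/6, about 0.83
tau_b_value = tau_b(x, y)                            # 5/sqrt(30), about 0.91
print(tau_a_value, tau_b_value)
```

With no ties at all, Tx and Ty are zero and tau-b reduces to tau-a.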

I will now discuss how to determine whether a calculated tau-b is statistically significantly different from zero.

The Central Limit Theorem states that the distribution of a sum of random variables tends toward a normal distribution when the number of terms is large, even if the variables themselves are not normally distributed. Therefore, when the number of xy pairs is greater than 10, S approximately follows a normal distribution with mean zero under the null hypothesis.

(2.10) to (2.14) define the variables that are used to calculate the standard deviation of S in (2.15). tx,i, px, ty,i and py in (2.11) to (2.14) have the same meaning as explained in connection with (2.7) and (2.8).

zS, taub (2.16) is the standardized form of S. When the null hypothesis is true and n is greater than 10, zS, taub approximately follows a standard normal distribution N(0,1).

The p-value of tau-b is the probability of obtaining a correlation coefficient at least this large in absolute value, positive or negative, if the null hypothesis were true. It is calculated as shown in (2.17). F() is the cumulative distribution function of the standard normal distribution. The one-tailed probability is multiplied by 2 because the hypothesis test is two-tailed.
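The significance test can be sketched as below. Equations (2.10) to (2.17) are not reproduced in this text, so the code assumes the variance formula with tie corrections as documented for SAS in the reference below; the variable names v0, vt, vu, v1 and v2, the function names and the data are choices made for this sketch. It requires n greater than 2.

```python
import math
from collections import Counter

def sign(v):
    return (v > 0) - (v < 0)

def kendall_s(x, y):
    n = len(x)
    return sum(sign(x[j] - x[i]) * sign(y[j] - y[i])
               for i in range(n - 1) for j in range(i + 1, n))

def var_s(x, y):
    """Variance of S under H0, corrected for ties (assumed SAS form)."""
    n = len(x)
    tx = list(Counter(x).values())  # sizes of the groups of equal x values
    ty = list(Counter(y).values())  # sizes of the groups of equal y values
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(t * (t - 1) * (2 * t + 5) for t in tx)
    vu = sum(u * (u - 1) * (2 * u + 5) for u in ty)
    v1 = sum(t * (t - 1) for t in tx) * sum(u * (u - 1) for u in ty)
    v2 = (sum(t * (t - 1) * (t - 2) for t in tx)
          * sum(u * (u - 1) * (u - 2) for u in ty))
    return ((v0 - vt - vu) / 18
            + v1 / (2 * n * (n - 1))
            + v2 / (9 * n * (n - 1) * (n - 2)))

def normal_cdf(z):
    """Cumulative distribution function of N(0,1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def tau_b_test(x, y):
    """Return (z, two-tailed p-value) for the standardized S."""
    z = kendall_s(x, y) / math.sqrt(var_s(x, y))
    return z, 2 * (1 - normal_cdf(abs(z)))

# Hypothetical data: 12 pairs, rising trend, one large outlier in y.
x = list(range(1, 13))
y = list(range(1, 12)) + [-100]
z, p = tau_b_test(x, y)
print(z, p, p < 0.05)
```

Without ties the variance reduces to n(n-1)(2n+5)/18, which gives a quick sanity check of the implementation.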

When the p-value is less than 0.05 the correlation is statistically significant at the 0.05 level.

References for the mathematics

Base SAS 9.2 Procedures Guide: Statistical Procedures, Third Edition, ISBN 978-1-60764-451-4, documents the formulas that are used in the SAS/STAT Software. See the section 'Kendall’s Tau-b Correlation Coefficient' in that book. The same formulas are used by the Wikipedia article Kendall rank correlation coefficient and by the blog post Kendall’s Correlation Testing with Ties written by Dr. Charles Zaiontz.

