How to interpret correlation and R-Squared (15:42)
Videos are available in the organized Quant 101 playlist on YouTube.
Welcome. Today's goal is to interpret the correlation coefficient and the related R-Squared measure.
I'm Paul, and it surprises me how often financial model builders find highly correlated data but don't take the extra step to look at a scatter plot to see if problems exist.
So here we will also see issues that arise like outliers, curvilinear relationships and hidden variables for all forms of correlation analysis, not just related to stocks.
Our focus here, in Quant 101, is on building financial models used for risk analysis and portfolio optimization. By the end of Chapter 4, we will use statistical concepts to evaluate portfolio performance using linear regression.
This is my 237th video, and subscribers asked for video transcripts including the Excel formulas, so if you want to follow along, the first link in the video description goes to the web page.
Let's talk about our plan for the day.
First, we will cover seven important points about correlation.
Second, we discuss the interpretation and how it may differ for the social sciences like Economics and the natural sciences like Physics.
Third, we cover R-Squared and its interpretation with an exercise.
Fourth, we run through five issues that often arise in correlational studies.
And in our next episode we will dive into linear regression, a natural progression from correlation.
For Step 1 let's cover seven important aspects about correlation.
First, we know correlation is short for the correlation coefficient which is a calculated measure that conveys how closely related two data sets are. The coefficient is sometimes simplified with the letter r.
Second, we know there are many different calculation methods for correlation depending on the type of data, with the two most common being Pearson's and Spearman's.
The Pearson method is the most common. In fact, if you mention correlation without being specific, most people will assume it is the Pearson version. It is the default calculation in many software programs, including Excel with its =CORREL() function. The Pearson method measures the strength and direction of a linear relationship, and it will be our focus here.
Likely the second most common correlation measure is called Spearman's rank correlation coefficient which is better suited to measure variables that are ranked.
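To make the Pearson-versus-Spearman distinction concrete, here is a minimal sketch in Python rather than Excel. The data set is my own illustration (not from the video): a monotonic but non-linear relationship, where the two measures disagree. The rank-based Spearman implementation shown is valid here because the sample has no ties.

```python
import numpy as np

# Illustrative data: y grows monotonically with x, but not linearly.
x = np.arange(1, 11, dtype=float)
y = x ** 3

def pearson(a, b):
    # Pearson correlation via NumPy's correlation-matrix helper
    return np.corrcoef(a, b)[0, 1]

def rank(a):
    # Rank positions 0..n-1 (no ties in this sample, so this is safe)
    return np.argsort(np.argsort(a)).astype(float)

pearson_r = pearson(x, y)               # below 1: the fit to a line is imperfect
spearman_r = pearson(rank(x), rank(y))  # 1.0: the rank ordering agrees perfectly
```

Spearman's coefficient is simply Pearson's coefficient computed on the ranks, which is why ranked or merely monotonic data suits it better.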
Third, we know the resulting numerical measure falls along the range from -1 to +1. The endpoints represent a very strong or tight linear relationship between the two variables. Those in the middle, around zero represent no linear relationship.
Fourth, the sign of the resulting measure matters in that it describes the direction of the relationship.
Let's use the first example as an illustration. Let's say we have an x-variable and a y-variable with a table of observations in the range F25:G34 in Excel.
Here and in the accompanying chart we see that both variables move in the same direction. As x increases, so does y. This is a positively sloping line, so these variables are said to be positively correlated. The sign on the resulting Pearson correlation coefficient is positive as a result.
We can see this in cell G23, where the formula =CORREL(F25:F34,G25:G34) generates the result of 1.00. To interpret, this means that the x-variable and y-variable are perfectly correlated, which means that all observations fall on a line.
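If you would like to check the spreadsheet result outside Excel, here is a minimal Python sketch. The data is a hypothetical stand-in for the range F25:G34, assuming ten observations with y = 2x so every point falls on a line:

```python
import numpy as np

# Hypothetical stand-in for the spreadsheet range F25:G34
x = np.arange(1.0, 11.0)
y = 2.0 * x

# Same value that =CORREL(F25:F34,G25:G34) would return
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # 1.0
```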
Conversely, if the y-variable had moved in the negative space, meaning if you put negative signs in front of all of the y-values, then the correlation would come out to -1.00. So it is still perfectly correlated, meaning all observations fall on the line; however, it would be perfectly negatively correlated, as the minus sign indicates.
Okay, let's throw out two examples and decide whether the correlation coefficient would be positive or negative. Here goes. In Scenario 1 you have two variables: an athlete's height and his or her success as a basketball player. As a person's height increases, we would expect a higher corresponding success as a basketball player, however that is defined.
Now for Scenario 2, your two variables are: height and success as a horse-racing jockey. Here we expect the sign to be negative as we would expect taller people to have less success as a horse-racing jockey.
So as a takeaway, the sign is positive when both variables increase or decrease together and negative when one increases while the other decreases.
Fifth, people often mistakenly assume that correlation is related to the slope of the line, so let's use that same example data set to dispel that assumption.
As you can see, the y-variable increases 2 times for every 1 increase in the x-variable, so using rise-over-run we know that the slope of this line is 2. The correlation is 1 because all observations fall on the line.
Remember, correlation captures the extent or strength of the linear relationship between two variables and the relationship between the two here couldn't be any closer to a linear relationship, so the resulting correlation is 1.00.
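The slope-versus-correlation distinction is easy to verify numerically. This Python sketch (my own illustration, not from the video) scales y by several different slopes and shows that the correlation stays at 1 throughout:

```python
import numpy as np

x = np.arange(1.0, 11.0)
for slope in (0.5, 2.0, 10.0):
    # Each slope changes the steepness of the line, not its "lineness"
    r = np.corrcoef(x, slope * x)[0, 1]
    assert round(r, 2) == 1.0  # correlation is 1 regardless of slope
```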
Sixth, the phrase "correlation does not imply causation" sums up a confusion many have with correlation. Basically, because two variables are correlated doesn't mean that one causes another to change.
We said earlier that if there is some reason to believe that changes in one variable 'help to predict' changes in another then we can move on to linear regression.
With correlation we are only quantifying the relationship between two variables; since there is no 'causal relationship', it doesn't matter which variable you put on the x-axis and which on the y-axis. Technically you don't need to graph the variables on a scatter plot, but as we will see in a few moments, it often helps.
To prove that it doesn't matter which variable is x and which is y, let's go to cell E25 and input the y-variable first, where the x-variable normally goes. So the formula would read =CORREL(G25:G34,F25:F34), and the correlation should match at 1.00.
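The same symmetry check can be sketched in Python (illustrative data of my own; any second variable would do):

```python
import numpy as np

x = np.arange(1.0, 11.0)
y = 2.0 * x + np.sin(x)  # arbitrary second variable

r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
assert abs(r_xy - r_yx) < 1e-12  # the order of the arguments doesn't matter
```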
Again, a caveat about the word 'causal'. In the social sciences, like economics, we don't truly know if there is a 'cause and effect' relationship between variables, so we are cautious about using that term. You know, couch it in all the appropriate disclaimers. But, at the same time we need a word, so you typically see 'causal' when referring to regression.
Seventh, for visualization: while we could simply use the Excel function =PEARSON() to generate the correlation coefficient and be done, it often helps to visualize the data on a scatter plot.
Visualizing data is a good first step because it helps us spot errors and patterns in the data. We will cover two additional scatter plots in Example #2 and #3 shortly when discussing issues that can potentially negate results.
Let's move on to Step 2 and interpret correlation.
Okay, so we said earlier that correlations closer to the ends of the range represent a tight linear relationship. This bar represents the whole range of correlations, and it includes descriptive names along the spectrum.
It starts with 'very strong', to 'strong', then 'moderate' and 'weak' to 'none', or no linear relationship in the center near zero. Notice how the terms are repeated in the positive space as well.
I should note that interpretations differ for the natural sciences and the social sciences.
The natural sciences, often called the "hard sciences", include fields like biology, chemistry, materials, earth sciences and physics. Here "laws of nature" are more quantifiable and testable and results often fall within tighter bounds.
As an example, a chemist may be able to isolate a mix of chemicals, and because molecules will be exactly the same from one test to another, fewer outside variables need to be accounted for.
On the other hand, in the social sciences, or "soft sciences", like Economics, because there is a human element and more uncontrolled variables, interpretation requires more subjectivity. For example, how one investor reacts to uncertainty may be completely different from another. As a result, interpretation differs.
The descriptions shown here are more typical of those used in the social sciences.
Now moving on to Step 3, let's talk about R-Squared and its interpretation. Its formal name is the coefficient of determination, but most people use R-Square or R-Squared because that exactly describes the procedure.
We saw earlier that another name for the correlation coefficient is r, and if we square it we have R-Squared, simple as that.
So using the input range of -1 to +1 for correlations, once squared we know two things. First, the sign goes away and second the new range goes from 0 to +1.
Now for Exercise 2, let's use that same range of nine correlations which we gave descriptive names to earlier and create a table in the range from P19:Q27.
In column P we input the correlations, and in column Q we square column P. As with most things in Excel, there are several ways to perform calculations, and here I used the shorthand version with ^2. I also rounded to two decimal places, which is fairly standard for both correlation and R-Squared. Now, if we weren't starting with correlations and instead used a data set, we could use the =RSQ() function and input the two columns of data.
So this shows how R-Squared rises with the square of the correlation, climbing faster as you approach the endpoints, which aligns with the descriptive terms we used with correlation earlier.
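The exercise can be replicated in a few lines of Python. The nine correlations below are my own illustrative set spanning -1 to +1 (the video's actual table sits in P19:Q27), and the squaring-plus-rounding mirrors the spreadsheet's ^2 shorthand:

```python
# Illustrative set of nine correlations spanning -1 to +1
correlations = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]

# Square each correlation and round to two decimals, like ^2 in Excel
r_squared = [round(r ** 2, 2) for r in correlations]

for r, r2 in zip(correlations, r_squared):
    print(f"r = {r:+.2f}  ->  R-Squared = {r2:.2f}")
```

Note how the sign disappears and the output range runs from 0 to +1, just as described above.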
Next, let's talk about issues so you know what to look for when reviewing your scatter plots.
Outliers are data points that skew summary statistics. They can arise from outright errors, or valid data points that are "in the tails" as they say. Think about the average height of 10 people in a room, with 9 horse-racing jockeys and 1 NBA center.
The inclusion of the NBA center in the sample will skew the average up, right? And visually, the scatterplot in the middle clearly illustrates this point of outlier data. Here we have the exact same data as in the first chart, except that our first data point appears to be an outlier.
Now look at the difference between the correlation coefficients, from 1.00 to -0.14. Instead of perfect correlation, we now have a very low reading of -0.14 in cell J23. This is an example of how outliers skew the results of summary data, and why we need to view the scatter plot.
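You can reproduce the effect with a quick Python sketch. The clean data assumes y = 2x as before; the corrupted first value is my own illustrative outlier (not the video's exact figure), but the collapse in correlation is the same story:

```python
import numpy as np

x = np.arange(1.0, 11.0)
y = 2.0 * x                  # perfectly correlated: r = 1.00

y_bad = y.copy()
y_bad[0] = 200.0             # illustrative outlier in the first observation

r_clean = np.corrcoef(x, y)[0, 1]
r_bad = np.corrcoef(x, y_bad)[0, 1]  # a single bad point flips r negative
```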
What if it was just a data error? Take a second to think of the possible ramifications of that for a portfolio manager. Remember from earlier tutorials, stocks with low correlations offer diversification benefits. Granted, this is an extreme example, but surely you can see how errors can lead to unintended consequences.
A curvilinear relationship is depicted in the scatterplot for Example #3. A simple correlation calculation won't capture this, despite the fact that there is a logical relationship.
Notice how the calculated correlation in cell M23 shows zero. Imagine if this was built into one of the cells of our covariance matrix. To the naked eye, we can see there was some type of relationship here, and we should really investigate the cause.
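A symmetric parabola (my own illustrative data, not the video's) shows how stark this can be: the pattern is obvious to the eye, yet the Pearson calculation returns zero because there is no linear component at all.

```python
import numpy as np

x = np.arange(-5.0, 6.0)  # -5 through 5
y = x ** 2                # a clear curvilinear (parabolic) relationship

r = np.corrcoef(x, y)[0, 1]
# r is 0 by symmetry: Pearson sees no linear relationship at all
```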
Of course, the goal here is better analysis, and sometimes curves describe your data better than lines do. For this we need more advanced statistical measures found in statistical software like SPSS, R, or Stata, for example.
Next, non-normal distributions are those with peaks and valleys, or skewed distributions. Think about a distribution of commute times for individuals in an office. You can say three things about this. First, more people live closer to their workplace, second, commute times are never below zero, and third, there are fewer occurrences of people who travel great distances.
Next, hidden variables. The example we used for hidden variables in the previous tutorial was the relationship between high sales of umbrellas and traffic accidents. What was the missing variable? The amount of rainfall.
The term spurious correlation refers to when there is high correlation between variables but the relationship is actually based on a third variable.
We used this example before: the sale of cars and dishwashers demonstrates a high correlation, but both are related to economic growth rather than to each other.
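A small simulation makes the hidden-variable mechanism visible. The numbers below are fabricated for illustration only: both "cars" and "dishwashers" are driven by a shared "growth" series plus independent noise, and they end up highly correlated with each other despite neither one driving the other.

```python
import numpy as np

rng = np.random.default_rng(42)
growth = rng.normal(0.0, 1.0, 1000)                # hidden variable

cars = growth + rng.normal(0.0, 0.5, 1000)         # car sales track growth
dishwashers = growth + rng.normal(0.0, 0.5, 1000)  # so do dishwasher sales

r = np.corrcoef(cars, dishwashers)[0, 1]
# r comes out high even though the two series only share a common driver
```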
So as you can see, a correlation figure may not be as meaningful as it appears.
By way of summary, we learned how to interpret a range of correlation coefficients, plus the related R-Squared measure. Visualizing the co-movements between variables is always a first step, helping to spot outliers or patterns in your data that may not be evident when reviewing numerical statistics alone.
In the end it all boils down to confidence, and how confident we are in our estimates. So as we move forward we will uncover statistical measures that quantify this so we have an automated way to increase the accuracy of our models.
In the next episode we will press on with linear regression in an attempt to predict or forecast a dependent variable given changes in an independent variable. This will be very similar to correlation but will allow us to assign a variable to the x-axis and another to the y-axis. So stay tuned for that.
If you were curious, the horse-racing theme here was inspired by the recent running of the 144th Kentucky Derby. That negatively sloping example earlier brings to mind a winning horse coming down the stretch in the Run for the Roses, manned by none other than Shaquille O'Neal. Now that's one outlier I would like to see. I'll leave you with that image.
Feel free to join us any time and have a nice day.
The visuals for this tutorial are provided in the video alone, so make sure you watch it there.
I encourage you to check out our YouTube Channel. Subscribe straight from here.