Correlation vs Regression for Data Analysis (18:58)
Videos are available at one organized Quant 101 Playlist on YouTube (opens in a new browser window).
Welcome. Today's goal is to review correlation versus regression.
I'm Paul, and if you're like me, then you find statistics to be a great way to summarize and make statements about data, but are often a bit unsure as to whether you are doing tests correctly or describing them properly.
So here we will compare and contrast correlation with regression with a light exercise before moving on to other financial modeling problems.
This tutorial sits in the middle of a series of 30 tutorials in Quant 101 where our focus is on portfolio optimization, risk analysis and performance attribution. This is the first tutorial in Chapter 4 where our end goal is to analyze portfolio performance using a free-to-download data set found in System Setup, but here we take a break from the data to focus on interpretation.
Some people prefer video and text, so if you would like to follow along the first link in the video Description goes to a web page with the video transcript.
Here is our plan for today.
First, we will summarize and define correlation plus regression with a little background on variables and measurement scales.
Second, we walk through a comparison with examples of each.
Third, we back up and talk about the line of best fit.
Fourth, we highlight two common correlation calculations.
And in our next episode we more fully detail common issues that arise with correlation analysis.
For all forms of data analysis a fundamental knowledge of both correlation and linear regression is vital.
The chart on the right (see video) is a visual depiction of a linear regression, but we can also use it to describe correlation. This may be confusing, so let me explain.
The chart is called a scatter plot in Excel. This one shows the relationship between two series of returns, but it wouldn't have to be returns, as it could be any variable.
The x-axis represents the returns for the Market and on the y-axis are the returns for one of our stocks Merck.
The 60 dots here represent points over 60 months on our Returns data tab. Each point corresponds to a return on the Market and Merck, for the same month.
For much of this discussion we will step back from stock returns and use the generic term 'variable'. That way this discussion applies to any form of data analysis.
Just think of a variable as a name, or a label, like 'x', and it holds a number, and that number or quantity in 'x' is assumed to change.
Now when we measure data we use scales.
All data can be categorized into one of four measurement scales (nominal, ordinal, interval and ratio), sorted from the weakest level of measurement to the strongest. Each one builds on the previous one.
So if we were to categorize these four measurement scales using a scale, what would it be? Ordinal. They are ranked in order of levels of measurement, weakest to strongest, but the differences are not equal and there is no zero point.
Next, correlation is a measure of two variables and whether their co-movements are related. So the data points on a scatter plot can be expressed as a number, the correlation coefficient.
Correlation measures two things. First, the strength of the linear relationship, meaning how closely related the two variables are. Dots close to the line represent a tight relationship, and if the plot looked like a shotgun pattern then there would be little to no relationship.
This relationship is expressed in a number that ranges from -1 to +1, with values near the ends of the range, so near -1 or +1, indicating a stronger relationship.
Second, correlation measures the direction and that is expressed in whether the sign is positive or negative.
So if the values for the x-variable increased at the same time as the y-variable, then the resulting correlation coefficient would be positive. If the y-variable decreases at the same time then the line would be negatively sloping and the correlation coefficient would be negative.
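To make the strength and direction ideas concrete, here is a minimal sketch of the Pearson correlation coefficient in Python. The return series below are made-up illustrative numbers, not the tutorial's actual Market and Merck data.

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance of x and y scaled by the
    # product of their standard deviations, giving a value in [-1, +1]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical monthly returns: the stock loosely tracks the market
market = [0.01, 0.03, -0.02, 0.04, -0.01]
stock  = [0.02, 0.05, -0.03, 0.06, -0.02]

r = pearson(market, stock)
print(round(r, 4))                               # close to +1: tight, positive
print(round(pearson(market, [-s for s in stock]), 4))  # sign flips to negative
```

Because the stock moves up and down with the market, the coefficient lands near +1; flipping the stock's returns flips only the sign, illustrating that the magnitude captures strength and the sign captures direction.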
Now let's move on to linear regression.
Let's say a researcher wanted to study whether, for some logical reason, changes in one variable impact or cause a second variable to change.
Let's use a non-returns example and say you are studying the relationship between the height of a plant and the amount of water you feed it.
In this experiment you give each of 20 seedlings a different amount of water on a daily basis and measure the resulting plant height after 60 days. In this case, plant height is likely related to another variable, milliliters of water. Here we have some justification or rationale for there to be a cause-and-effect relationship. Not enough water can kill a plant and too much water can as well.
In this case we call the milliliters of water the independent variable. It is the one we change and then the dependent variable is 'dependent' on what we do with the independent variable. We normally put the independent variable on the x-axis and the dependent variable on the y-axis.
Controlled experiments such as the plant example are common in the natural sciences, but in the social sciences, like Economics, it is difficult to determine whether there is a true cause-and-effect relationship. So saying changes in one variable 'causes' another to change is often too strong.
Other terms you may hear in this context besides cause-and-effect, include 'related to', 'impacted by', 'follows', 'is explained by', 'helps to predict' or 'is dependent on'. So be careful with how you say these things. You can find great statistical relationships but if you describe them improperly your audience may tune you out.
So as we move on to the realm of Finance, here we will focus instead on whether there is economic justification for changes in the dependent variable to correspond with changes in the independent variable. We likely won't use the term 'cause', instead preferring something like 'has been shown to predict'.
So for example, you may want to study the impact of company earnings on its future stock price. Here you would have economic justification to state that the level of earnings has been shown to predict future stock price. Of course this is all based on estimates and presented with a range of confidence levels.
Let's go back to our stock example. Here we are looking at the impact on Merck returns, the dependent variable, when the returns on the whole Market, the independent variable, change.
The economic rationale is that investors, by and large, are more comfortable paying more for all stocks at times, and less at other times. So if you are able to predict changes in the x-variable, returns on the Market, then there is some likelihood that you will be able to predict changes in the y-variable, returns on Merck.
With that let's move to Step 2 and compare and contrast correlation and regression with further examples.
First, the goal. Recall that both correlation and regression measure the relationship between two variables.
Correlation comes first and simply quantifies the relationship or co-movements between variables.
Next, if the researcher assumes that changes in one variable can help predict changes in another, then we can use linear regression.
Okay, let's look at a few examples to solidify this. Let's say you are performing a study on elephants and determine there is a high correlation between skull size and leg length. This seems logical as larger elephants typically have larger skulls and longer legs. However, you couldn't assume that one causes the other, right?
In our second example, high umbrella sales might be correlated with the number of traffic accidents, but we can't assume one causes or predicts the other. What might 'cause' both is a third variable, level of precipitation, right?
Now let's look at two areas where we assume potential causality, and use regression.
Crop yield and temperature. If it is freezing it will be difficult for crops to grow, right? As a result, here I have identified 'x' as the temperature and 'y' as the crop yield.
And similarly, I think it is fair to assume that traffic accidents and rainfall are related, and we can reasonably justify that rainfall is the cause because with higher rainfall come slicker road surfaces, right?
The data you measure with correlation is simply the co-movements between variables. In linear regression, the researcher alters or monitors the change in one variable, and measures the impact on the other, as we said.
So does it matter which is x and which is y? For correlation, it doesn't matter; you just want to measure co-movements, and you get the same correlation coefficient either way.
For regression, identifying one as the 'independent' variable and the other as the 'dependent' variable does matter. Think about it like this: in a regression, if they are swapped then the interpretation changes. You wouldn't logically see a traffic accident and assume it's raining.
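The symmetry point above can be sketched in a few lines of Python. The rainfall and accident figures are invented for illustration; the point is that swapping the inputs leaves the correlation unchanged but produces a different regression slope.

```python
def mean(v):
    return sum(v) / len(v)

def corr(xs, ys):
    # correlation is symmetric: corr(x, y) == corr(y, x)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def slope(xs, ys):
    # least-squares slope of y on x: cov(x, y) / var(x) -- not symmetric
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    return cov / vx

rain = [0, 5, 10, 20, 40]        # mm of rainfall (made-up)
accidents = [2, 3, 5, 8, 15]     # accidents that day (made-up)

print(corr(rain, accidents) == corr(accidents, rain))   # order doesn't matter
print(round(slope(rain, accidents), 4), round(slope(accidents, rain), 4))
```

The two slopes differ because each one divides the same covariance by a different variance, which is exactly why swapping x and y changes the interpretation of a regression.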
The topic of assumptions can open up a whole can of worms and is beyond our scope here so I'll touch on it lightly.
With correlation, to make observations, like setting a confidence level, we might be concerned with whether each variable fits a normal distribution.
This might not be the case for controlled scientific studies like our plant height and water example from earlier. Here a normal distribution for the x-variable need not be assumed, because the researcher sets the amount of water to feed each plant and then measures the results on the y-variable, plant height.
Next, we will discuss correlation measures and the line of best fit in a moment, but as a technical note, for correlation with data on a ratio scale we normally use the Pearson version, and for non-normal data or rank data like the ordinal scale, the Spearman method is quite common. Here, if you think about it, we aren't concerned with the intercept, or where the line crosses the y-axis.
With a regression however, we get the full equation for line of best fit and that intercept is meaningful, as we will see in a moment.
Next, multiple x's. Correlation measures the co-movements between exactly two variables, no more.
For more advanced versions of linear regression, called multi-variable linear regression, you can have multiple x's. This is difficult to visualize, because instead of a two-dimensional depiction on an x and y axis, it moves to three dimensions and beyond. For this reason, results are often presented in tables instead of charts.
We will stick with single-variable linear regressions in this series, but keep in mind, multiple-variables are more common in academic and practical studies in Finance.
Another useful procedure associated with multi-variable linear regressions is the use of category or dummy variables.
As an example, let's say you want to evaluate race-course driving times, so that is your dependent variable. You have two independent variables: the first is the number of hours the driver slept the night before the race, and the second, 'gender', is an example of a category variable. Category variables might be coded 0 for female and 1 for male.
A multi-variable linear regression like this would allow you to strip out, or neutralize, the effect of sleep and see which of the two sexes drove faster, offering results at a more granular level.
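Here is a toy sketch of that race-course regression with a 0/1 dummy variable. All the numbers are constructed for illustration, and the solver mimics what a stats package would do by solving the ordinary least-squares normal equations directly.

```python
def ols(X, y):
    # ordinary least squares via the normal equations (X'X) b = X'y,
    # solved by Gaussian elimination with partial pivoting
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for j in range(i, k):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    b = [0.0] * k
    for i in range(k - 1, -1, -1):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

# lap time (seconds) ~ intercept + hours_slept + dummy (0 = female, 1 = male)
# constructed so that time = 68 - 1*sleep + 2*dummy exactly
sleep = [8, 7, 6, 5, 8, 7, 6, 5]
dummy = [0, 0, 0, 0, 1, 1, 1, 1]
time  = [60, 61, 62, 63, 62, 63, 64, 65]

b = ols([[1, s, d] for s, d in zip(sleep, dummy)], time)
print([round(v, 4) for v in b])  # recovers [68, -1, 2]
```

With sleep held in the model, the dummy coefficient isolates the gap between the two groups, which is exactly the "strip out the effect of sleep" idea.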
Of course the researcher must look out for issues that arise during the review of the data and the resulting measures. With correlation we talked about causation. A second issue arises with spurious correlations, which refer to cases where there is high correlation between two variables but the relationship is actually driven by a third variable.
For example, sales of cars and dishwashers demonstrate a high correlation, but both are more related to economic growth than to each other.
Like, you don't buy a dishwasher because you just bought a car. Right?
Next, two other issues can apply to correlation as well, but are more commonly analyzed with linear regression.
First, linear regression only captures linear relationships. In the example earlier on temperature and crop yield, you can imagine that on both ends of the temperature spectrum, you might see crop yields drop dramatically, right? Think about it.
Corn plants grow best between 77 and 91 degrees Fahrenheit, or 25 and 33 Celsius. Above about 100 degrees Fahrenheit, or 38 Celsius, production drops off at a greater rate. The resulting chart would best be described by a curve instead of a line. This makes sense: at some point, as heat increases, plants just give up and die off.
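A quick sketch shows why a straight line misses a curved relationship. The temperature and yield numbers are invented so that yield peaks mid-range and falls at both extremes; the fitted slope comes out near zero even though the two variables are strongly related.

```python
def fit_line(xs, ys):
    # least-squares slope and intercept for y on x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

temp = [15, 20, 25, 30, 35, 40]   # Celsius (made-up)
crop = [2, 5, 8, 8, 5, 1]         # yield peaks near the middle (made-up)

a, b = fit_line(temp, crop)
pred = [a + b * t for t in temp]
sse = sum((y - p) ** 2 for y, p in zip(crop, pred))
print(round(b, 4), round(sse, 2))  # slope near zero, large leftover error
```

The near-zero slope and the large sum of squared errors are the telltale signs that a linear model is the wrong shape for this data.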
Also, as with correlation, outliers can skew relationships and all of the conclusions that go with it. We will tackle these and other issues more fully in coming tutorials.
Before we press on let's take a step back with the visual of the linear regression and discuss the equation for the line of best fit.
The term linear, in linear regression, is important because the line estimates the relationship between the variables. This straight line mathematically minimizes the sum of the squared vertical distances between the points and the line itself.
Conveniently here Excel automatically calculates the slope and intercept for this line and additional statistics, like R-squared, which too can be plotted on the chart.
For this chart, we have two series of monthly returns, one for the x-axis that represents the Market and a second for Merck.
Later we will use linear regression for two purposes. First, to evaluate whether a stock like Merck is cheap or expensive relative to its risk. Second, and as mentioned earlier we will also use it to evaluate the performance of a portfolio against a benchmark. So linear regression has many uses.
So we will need to go beyond Excel's default formula for a line that you see on the scatter plot.
You probably remember this one from algebra. It uses the format y = mx + b, where m is the slope of the line, b is the intercept and x and y are the variable values.
With our second equation, I want to make sure we don't forget the error term, as denoted by e. Later we will become intimately familiar with this error term as it has a lot of meaning in portfolio analysis and risk measurement.
Third, we have an equation that relates more to our data set and the modeling of investments. Here it is basically the same thing but the terms are given different labels and are rearranged.
We have 'a' first, which is the intercept, and it conveniently aligns with the term alpha. Then 'b' for the slope of the line, which in Finance we call beta. Finally, we have 'e' for the error term.
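Putting the finance labels to work, here is a minimal sketch that estimates alpha (intercept) and beta (slope) for a stock against the market, and backs out the error terms. The return series are made-up monthly figures, not the tutorial's actual data.

```python
def mean(v):
    return sum(v) / len(v)

def alpha_beta(market, stock):
    # beta is the least-squares slope: cov(market, stock) / var(market)
    # alpha is the intercept that makes the line pass through the means
    mx, my = mean(market), mean(stock)
    cov = sum((x - mx) * (y - my) for x, y in zip(market, stock))
    var = sum((x - mx) ** 2 for x in market)
    beta = cov / var
    alpha = my - beta * mx
    return alpha, beta

# hypothetical monthly returns
market = [0.02, -0.01, 0.03, 0.00, -0.02, 0.04]
stock  = [0.025, -0.005, 0.035, 0.004, -0.018, 0.045]

a, b = alpha_beta(market, stock)
# the error terms e = y - (a + b*x); least squares forces them to sum to ~0
resid = [y - (a + b * x) for x, y in zip(market, stock)]
print(round(a, 4), round(b, 4), round(sum(resid), 10))
```

These are the same numbers Excel's SLOPE and INTERCEPT functions would return for the two series, and the residual list is the error term we will dig into later.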
Let's back up a minute and clear up one technical point before we move on. There are many ways to calculate correlation.
For financial modeling we typically focus on two: Pearson's and Spearman's. The math inside the function is different for each.
The default calculation in Excel is the Pearson product-moment correlation coefficient. It is best used for ratio data and when we can assume that the distribution is normal.
Here we are working with stock returns data and technically the distribution isn't perfectly normal, but we, and most investment practitioners use the normal distribution for its convenience.
If you don't make this assumption, then you have to track down alternative correlation and regression formulas that work with your data. Few people want to be that precise, especially when working in a spreadsheet. Advanced, and maybe exotic, statistical procedures are often found in statistical programming languages like R, MATLAB or Python. So in a spreadsheet we use what we have.
By the way, Excel offers two functions for the Pearson calculation, CORREL and PEARSON, that basically do the same thing.
When data comes in the form of ranks, so the ordinal measurement scale, or when interval data doesn't fit a normal distribution, we typically use the Spearman's rank version.
Here our data likely doesn't fit a normal distribution. As an example, let's use the rank of 30 student test scores, aligning scores by rank, from 1 to 30, so we can see this is anything but normal. Spearman's version can easily be performed in statistical analysis packages, but takes a little extra work in a spreadsheet.
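The "extra work" is mostly the ranking step: Spearman's coefficient is just the Pearson coefficient applied to ranks instead of raw values. Here is a minimal sketch with invented numbers, where a curved but strictly increasing relationship gets a perfect Spearman score.

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ranks(v):
    # 1-based ranks, with tied values sharing their average rank
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    # Spearman = Pearson on the ranks
    return pearson(ranks(xs), ranks(ys))

x = [1, 4, 9, 16, 25]   # monotonically related to y, but not linearly
y = [2, 3, 5, 8, 13]
print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

Because y always rises with x, the ranks line up exactly and Spearman's returns 1.0, while Pearson's comes in slightly lower since the relationship isn't a straight line.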
For the most part, Spearman's will be out of our scope here because our data is in the form of returns which we assume to be normally distributed and uses that ratio measurement scale.
By way of summary, we took a step back from stocks here to broaden the discussion about variables, scatter plots and lines. We broke out various distinctions between correlation and regression by starting with measurement scales and saw why correlation does not imply causation.
However, when we have some rationale or economic justification for changes in one variable to predict changes in another, then we can move on to linear regression.
We also touched on issues with studies of this type and will circle back to them in coming tutorials. This to me is the overlooked aspect of data analysis. Many people can generate numbers and create algorithms without thinking critically. Thoughtful analysis requires quality checks of the data to explain why, not just how, variables are related.
In my view this is what differentiates those who can generate data from those who can synthesize it to make models that are predictive and lead to better decision making. Isn't that the goal in the first place?
In the next episode we will stick with a conceptual discussion of correlation, further expanding on those issues like spurious correlations and outliers. After that we will come back to our returns data set and pick apart why that error term is so important as we more fully explore linear regression.
This and other content sits here to help you become a better data analyst, so feel free to join us any time, and have a nice day.
This tutorial in particular is helpful to watch on YouTube as several of the visuals are not replicated here.
See what else you can learn for free at our YouTube Channel.