An ad-free and cookie-free website.
Videos are available at one organized Quant 101 Playlist on YouTube.
Create a covariance matrix and interpret a correlation matrix (16:46)
Welcome. Today's goal is to create a covariance matrix, translate that to a more interpretable correlation matrix and see how these topics relate to other forms of data analytics.
I'm Paul, and for years I've been speaking to and analyzing correlations in the context of investments, realizing full well that I need a periodic refresher.
So here we will post one after walking through the calculation of covariance and correlation which in the end will enhance our ability to analyze data as the game gets bigger.
Our purpose here is to complete another one of thirty tutorials in Quant 101 where our focus is on analyzing stock risk and building portfolios. In Chapter 3 we cover topics related to plotting portfolios on a risk and return plot while building skills to do all of this in a more automated fashion.
These tutorials are offered in both video and text formats. The first link in the video description goes to a web page with the transcript and the key cell formulas in Excel.
Now what is our plan for the day?
First, we will review our sample data in context with data analytics in other fields and industries.
Second, we collect a sample variance for four stocks and translate that to standard deviation.
Third, we define and create a covariance matrix using named ranges to save time.
Fourth, we translate that to an interpretable correlation matrix and analyze co-movements between stocks.
And in our next episode we chart portfolios on a risk and return plot.
As mentioned, the goal here isn't to just build a covariance matrix to store stock risk measures but to become better data analysts.
On the first point, if you want a free data set to play with, head back to the tutorial titled System Setup. The end product sits on a tab called Returns, providing the only data set we need for all 30 tutorials. Here we summarize that data in a covariance matrix.
On the second point, there are similarities between what we are doing here and with other forms of statistical analysis performed across industries. Any form of correlation analysis sits on a covariance matrix.
Whether it be the analysis of marketing spend by advertisers or the evaluation of trends from social media data, we are all trying to model data from the past and make better decisions in the future.
The job market for those with knowledge in this space is hot too. Recent surveys of job satisfication and pay for the data scientist and data analyst back this up.
The trend looks to continue as the amount of data shared in the world balloons. Buzzwords like big data analytics, machine learning and artificial intelligence capture the attention, but in the end the objective is the same, analyzing data.
Here, in our study of stocks, we use data analytics for portfolio risk analysis, mean variance optimization and performance attribution, but I hope you see that the application of this is more broad reaching, and that is why the fundamentals covered here are so important.
As we set out to build and interpret these two matrices, let's review and introduce newcomers to how we prepared for this.
In the tutorial called Stock Risk we calculated four risk measures. We learned that each stock has its own variance and related standard deviation. We also saw that every 'pair' of stocks has a covariance and related correlation that measure co-movements.
Then in one called Portfolio Risk we calculated portfolio variance using a simplified two-stock case and saw our first covariance matrix, a place to store our neat new risk measures.
As we look to analyze large portfolios of potentially thousands of stocks, this covariance matrix will grow and be the central place for many calculations.
Then, because the covariance matrix isn't interpretable, we turn it into a correlation matrix that is appealing to the eye and easier to analyze.
Okay let's play with some data.
We have been working with a 2-stock data set up to this point, so let's try to make it more realistic. First by increasing the number of stocks to four the byproduct will be a 4-by-4 matrix containing 16 cells.
Second, we increase the number of periods measured from 6 months to 60. This allows us to make better estimates of the co-movements between stocks, improving our confidence in our statistical estimates of variability over multiple market cycles.
Another distinction from previous tutorials is instead of
using the population versions of spreadsheet functions, we will use
sample statistics. This is common when you are trying to use sample
data to make a statement about the future. The sample versions of our
Excel functions are denoted with the symbol dot-S at the end, like
Our first excercise asks us to calculate the sample variance and standard deviation for all stocks and dates as of 3/31/08.
Okay, so let's start by calculating variance and standard deviation
for all four stocks. We will use the sample versions of Excel
functions here, so in cell F8 use the formula
=VAR.S(Returns!C7:C66) and this gives
us 0.0038 for the sample variance of Microsoft.
When using the sample version of the functions, Excel automatically divides by n-1, or observations less one. This has the conservative effect of raising the variance, or making the variance higher than it was in the past.
So if you think about it, the smaller the number of observations, the larger the upward adjustment.
Now, on your end, do the same for eBay, Abbott Labs and Merck in column F.
Because variance isn't interpretable and the units are in returns-squared, we won't bother changing the format from four decimal places.
As you recall, the standard deviation is the square-root of variance and it is the interpretable measure.
We calculate this one of several ways. We can point to column
F and raise it to the one-half power, in cell
by using the Excel function method in cell G9 with
=SQRT(F9). We could also use the
Also, after taking the square root, the units for standard deviation changed to percent, from percent-squared. So we format the resulting number accordingly in Excel.
Let's review the results. Abbott Labs was the lowest risk stock at 4.87%, while eBay had the highest risk at 10.36%. Remember to use the past tense. It is incorrect to say that one 'has' higher risk than another because it implies that these relationships from the past will recur in the future. We aren't certain that will happen, right?
For financial modeling when looking at backward-looking data, we call this the Historical period.
Later, when moving to the present time frame we call it the Expected timeframe.
Then, we use the term Forecast when looking to the future as active portfolio managers and risk model providers do.
One of the main points of even going through this exercise in risk measurement is to take historical patterns from the past and make assumptions about the future. You don't prove anything with statistics; rather, you use statistics to help you estimate the probability of events.
Next, let's move on to Step 3 and discuss the covariance matrix and its many uses.
First, so what is a covariance matrix? It is basically a mathematically convenient place to store covariances. Think of it as having the dimensions of a square. So the matrix for 2 stocks would be 2-by-2 or 4 spreadsheet cells. For 3 stocks, a 3-by-3 matrix grows to 9 cells. So here in our 4-stock example, it is 16.
The 4-by-4 matrix in Exercise 2 (see below) stores data for our four stocks. The bottom left cell F19 shows the co-movements between Merck and Microsoft.
Now, let's move up directly up to cell F16, corresponding to Microsoft and Microsoft, which represents the covariance between Microsoft and itself. Yes, this sounds confusing, but think about it, if you take the demeaned returns for Microsoft and multiply them by the demeaned returns for Microsoft what does that give you? The square of the demeaned returns, which in the end is variance. So the covariance between a stock and itself is variance.
This is both confusing and convenient at the same time. Confusing because you have multiple data items stored in one matrix, but convenient because all the information you need to calculate portfolio risk sits in one place.
The covariance matrix can then be used for applications such as portfolio construction, risk analysis and performance attribution.
For a refresher on portfolio variance head back to the tutorial called Portfolio Risk.
Finally, take a moment to look at the covariance matrix again. Across the diagonals sit variances and the off-diagonals are covariances. This is the reason some people call this a variance-covariance matrix.
A neat Excel trick will help us create the matrix. It uses a convention similar to what you see in a programming language. This helps when performing multiple analyses using the same range of data. Once you name the data you can use the name instead of wasting time selecting the same block over and over again.
Let's head over to the Returns tab and name our four
ranges. Select the range C7:C66 then in the Name Box
type a unique name for that range. We'll call it
MSFT followed by
Enter. Now do the same on your end
for the range of returns on EBAY, ABT and MRK.
A couple points to add. First, the names by default apply to the whole spreadsheet, but not across to other spreadsheets, which is one reason we carry this Returns data tab with us in each Chapter of this tutorial series.
Second, if you make a mistake and want to modify the range that has been named, but not the data, go to Formulas, then Name Manager. Here you can see that I named ranges already, and now is a good time to do that on your end, because we will use these in a moment.
Okay, Exercise 2 asks us to create that covariance matrix and for that
we will use Excel's
In the top left cell F16 we calculate the covariance
between Microsoft and itself using
=COVARIANCE.S(MSFT,MSFT) using those
In cell F17 we use
=COVARIANCE.S(EBAY,MSFT). It is
common to input the row first then the column to keep it
straight. Plus this is the same convention used in programming
languages; row first, then column.
If you are following along in your own spreadsheet, this is a good time to complete the rest of the covariance matrix.
It will be tedious, but by the end the concept of what we're doing here will be ingrained.
With that, let's move on to the more interpretable version, the correlation matrix.
In Step 4, the correlation matrix will look a bit different, and for this let's bring back that formula for correlation.
This is how we learned it in a previous tutorial, right? The correlation between two stocks is the covariance between the pair divided by the product of the two standard deviations.
With a simple algebraic adjustment we can generate a formula to go the other way. So with any two of the three terms: correlation, covariance and the pair of standard deviations, you can always derive the missing number.
Under the section called Rearrange the terms I simplified it to x and y, if you prefer seeing it that way.
You will see both of these appear in textbooks and on exam day, especially for certification testing in the CFA Program or for the FRM designation. So you should be comfortable going from correlation to covariance and back.
For Exercise 3, here we are going to set up a table but fill it with correlations. We will also drop repeated calculations because they aren't necessary. This is a common practice.
You could do this in one of two ways. You could use data from two
tables above, by dividing each cell by the product of the two standard
deviations. This process is more laborious and prone to error, so
instead, we will use the
In cell F24 we have
=CORREL(MSFT,MSFT) using our newly
named ranges for Microsoft. Then in cell F25 we
=CORREL(EBAY,MSFT). Here we
use the two-decimal place formatting to aid with visualization.
For those following along, this is a good time to complete the correlation matrix so your numbers match mine.
This matrix is used to convey information about stock-by-stock co-movements and is not used for risk calculations. So when showing this matrix, dropping the duplicates helps the reader focus on only the important information.
This matrix is very helpful for client presentations and when educating your boss. I say that half jokingly, but in reality, many executives in finance still fail to understand what you now know. Sure they get diversification, but most can't quantify it, like you now can.
Let's now interpret a few cells here.
Remember a correlation of close to one, or negative one, has a high relationship, and figures around zero represent no relationship.
Second, down the diagonals are 1's. This basically says that a stock's correlation with itself is 1. That's logical.
Next, we can see correlations for each pair of stocks in the off-diagonals. So the highest correlation was Microsoft and Merck, in cell F27 at 0.35, and the most diversifying, or lowest correlation here is -0.05 in cell G26 for Abbott Labs and eBay.
Intuitively, you would expect the co-movements for stocks in the same sector to be related, right? Health stocks in general move in tandem. Technology stocks often move in tandem as well. Let's see if this showed up in our matrix of historical data.
Cell F25 depicts the correlation between the two technology stocks, and 0.26 is one of the higher correlations in the matrix. So that rhymes with our expectations.
Also, cell H27 represents the co-movements between our two health care companies and this figure was even higher at 0.31.
So our intuition was correct. Let's take a quick look at the correlations between pairs in different sectors. Abbott had very low correlations with Microsoft and against eBay the correlation was negative.
Merck and Microsoft are from different sectors, however their correlation during this period was the highest in our entire matrix, at 0.35.
This may have something to do with another factor. Could there be a factor other than sector that explains why the correlation between the two is so high?
If you were thinking size, then you are on it. We would have to import other data to be certain, but I wanted to illustrate this point to show you how important the correlation matrix is not only in finance, but in other fields of data analytics.
In fact, I am so passionate about this topic of factor analysis that I put the term factor right in the company name, FactorPad.
By way of summary, we walked through the covariance matrix that is used for raw figures and calculations in risk applications like mean variance optimization and financial risk management performed by Institutional investors. We also saw how these analyses are similar across industries.
We learned about naming ranges of data which is a huge timesaver and something you will see as you transition to programming.
We saw how the interpretable correlation matrix improves supplemental reporting and analysis. While we didn't work with demeaned returns here, we keep seeing their impact. And because we often see the portfolio variance formula taking different forms, we walked through the algebra helpful for transitioning from one to the other, so you can ace your exams.
In the next episode we will take this understanding of portfolios and head back to our simple 2-stock case, calculate return and risk, then chart all eleven portfolios.
Until next time, we'll keep plugging away over here, making risk easier to understand and analyze.
Join us any time and have a nice day.
See other free learning resources at our YouTube Channel. Subscribe to keep learning the sought after scientific approach.
A newly-updated free resource. Connect and refer a friend today.