Generate and interpret a linear regression in Excel (18:59)
Videos are available at one organized Quant 101 Playlist on YouTube.
Welcome. Today's goal is to generate regression output in Excel and then interpret results for financial model building.
I'm Paul, and if you're like me, then you have seen regression output using Excel's Data Analysis method but find it bulky and hard to work with.
So here we will walk through a second method using Excel functions and make the output easier to understand and replicate for analyzing stocks.
This tutorial sits within a Playlist called Quant 101 and if you would like a free-to-download data set and instructions for how to install the Data Analysis Toolpak head back to the tutorial called System Setup. Also, in Chapter 4 we started by comparing correlation vs regression and we will build a bridge from the tutorial on correlation to this one, so while not required, having that background will be helpful.
Also, if you'd like to slow all of this down, our video tutorials come with a related web page so you can read and see all of the Excel cell formulas and follow along in your own spreadsheet.
Let's look at our plan for the day.
First, we will summarize our data set and generate a scatter plot in Excel.
Second, we generate regression output using a method that is part of the Excel Data Analyis ToolPak.
Third, we focus on the five most useful measures and pull them using Excel regression functions.
Fourth, we tie it all together building that bridge from correlation to where we are going next with time-series analysis of stocks and portfolios.
And in our next episode we will use our takeaways to evaluate performance of active stock portfolios against a benchmark.
Okay, for Step 1 we need some data and we saw how to put it on a tab called Returns in that System Setup tutorial.
So we have that and all we need for this exercise is returns for 60 months for one stock and a Market benchmark sitting in column F and G. (See the top four rows in the table below). The rest of the columns will be used to create and analyze portfolio performance using regression in the next tutorial.
So with that data, let's dive in to Exercise 1. It says, "Create a scatter plot for MRK and the Market for the 60-month period ending 3/31/08".
Okay, first click where you'd like to put the chart, then go to Insert | Scatter then right click the mouse, hit Select Data, then Add.
Next, click on Series X values and input the range
Returns!G7:G66 for the Market and then
hit TAB and enter the Series Y values
Returns!F7:F66 followed by
Ok. Excel created the scatter plot, so hit the
Ok button again to accept.
I took the long way around here because our data was
aligned with the y-axis data range first for the stock Merck in
column F and then the x-axis, the Market,
in column G. If we had selected the whole range from
Returns!F7:G66 Excel would have
assumed the x-axis range was first and our axes would be swapped, which
we don't want.
As you can tell, I customized the output quite a bit and I won't go through every adjustment here.
On your end edit the scale of the axes by clicking Layout and Axis. This gives you the option to edit the Primary Horizontal Axis and then select More Primary Horizontal Axis Options. After that you can do the same for the Primary Vertical Axis.
There are two ways to add a trendline. Generate it by left-clicking a data point, then right-clicking. Next select Add Trendline... and this will open the Format Trendline dialog box. In the Trend/Regression Type section the Linear type is automatically selected and that's what we want for a linear regression, right?
While we're here check the boxes Display Equation on chart and Display R-squared value on chart, then click the Close button.
The second way to edit a trendline would be to click on the chart, then the Layout menu and Trendline followed by More Trendline Options....
You can click Layout, then Legend and None to remove the Legend, or just click them and hit the Delete key.
Again, here I suggest brushing up on your chart editing skills and customize it how you like, including the title and axis labels.
Now that we have a chart, just as we said in the last tutorial on correlation it is always a good idea as a first step to visualize the data because you will often spot errors and learn more about patterns in your distribution, which will lead to better financial modeling.
Let's spend a minute talking about the trendline which in linear regression is the line that mathematically minimizes the distance between all points and the line itself. It is also called the line of best fit.
There are three different formats for the equation of the line.
Excel uses the first format on the scatter plot, as provided in y = mx + b format. Where y is the dependent variable, m is the slope of the line, multiplied by the x-variable, plus the b intercept, which is where the line crosses the y-axis. As you see here, just below zero at -0.0074.
The second equation includes the error term because, for us, the error term is very important, as we will see in a moment.
If it were up to me, I'd focus exclusively on the third formula, because it is more relevant to investment modelers. This version using y = a + bx + e changes some of the labels and order, but the meaning is the same. Here 'b' is the slope and 'a' is the intercept. And those letters line up better with the words used in finance, beta and alpha. Because you will run across all three, here's a case where we need remember all three.
The error term in the equation is important because it represents the portion that can't be explained by the regression. Instead of ignoring it, like the scatter plot trendline does by default we need this figure later to separate systemetic risk from stock-specific risk.
Now let's move on to Step 2 and run a regression using Excel's Data Analysis method and for this you will need to have installed the Data Analysis ToolPak as demonstrated in System Setup.
So there are three ways you can access regression statistics in Excel, and each comes with strengths and weaknesses.
First Data Analysis here, as Exercise 2 shows (see video). This is the most comprehensive and is the way most people learn regression in Excel. The problem is that it is rather clunky and isn't easily automated. Also, because it throws so much data at you, beginners often get confused and inundated.
The second approach is to use functions, which is a quick-and-dirty approach. We will use these often, but the approach is limited in that it assumes you want your regression run using an intercept, or a constant. You know, the 'b' in the first and second formula, or the 'a' in the formula for Finance, which is where the line crosses the y-axis. So using functions can be easiest but offer less granularity, which we will need later.
And the third, and least known method is called
=LINEST(). It is basically another
function, a specialized one, so we'll save it for later. Both the Data
=LINEST() methods, give
you the option to run a regression with or without the intercept. In a
later tutorial, I'll show you if this worth worrying about. One last
=LINEST() is most similar to
what you would see in a programming language.
Now let's use this Data Analysis method by pointing to that same 60-period example.
Start by clicking on the Data menu, then
Data Analysis, then
Regression followed by the OK button.
Here we input the y-range,
While we're here the one option I'd like to mention is this Constant is zero checkbox. This is where we could elect to ignore the constant forcing the line to go straight through the axis at zero.
Next, under Output options this is where we'll dump the regression analysis. This is a bit quirky, so I'll click New Worksheet Ply, which tells Excel to create a new tab for the output.
On my end, I copied this output to the Regression tab and highlighted the relevant measures that align with the equation for our line and several correlation stats.
We'll interpret these five measures after Exercise 3 and will return to explain the rest when we cover Regression Statistics in Chapter 6.
Ok, next let's review the Excel regression function approach.
Here I'll just ask you to pull the most important five items as highlighted in blue (see video). In the table on the right, I put the term associated with Data Analysis, followed by the term we often use here and then the name of the function.
I won't go over each one now but will walk through the first one using the time-saving convention of naming ranges of data in Excel.
First, go to the Returns tab and select the returns
for Merck and in the Name Box type
MRK followed by Enter.
Then do the same for the 60 returns on the
Then in cell M21 type
=CORREL(MRK,Market) and it returns
0.6962, which matches what the Data Analysis method gave us.
From there do the same for the other four measures. You can see how naming these ranges saves time and minimizes the chance for errors.
Let's finish Exercise 3 with a little interpretation.
First, the label Multiple R refers to what we interpreted several times earlier, correlation.
Next, R-squared is correlation squared. Then we have the Std Error that aligns with what we call error for the whole regression, and it is hugely important in Finance.
The term Intercept, also called the constant, in our context is alpha, and we use beta as the term for what Excel calls the X Variable 1. This coefficient may also be called a factor loading, regression coefficient or even scaling coefficient depending on context, particularly in scholarly articles.
Speaking of that, for historians in the audience, let's quickly give credit to and see where that original and odd term 'regression' came from with an example.
In the early 19th century Sir Francis Galton studied genetics, and was curious about the height of people's offspring, as it related to their parent's height. So offspring height would be the dependent variable and parent's height would be the independent variable, right? In his study, he concluded that the height of the children "regress" towards a "mediocre point."
Today we call this "mediocre point" the "mean". So it was Galton who developed this procedure and at the same time the name 'regression' stuck. Pretty interesting huh?
Now let's change gears and discuss regression as it relates to time-series analysis of Financial data.
First, we have correlation. We discussed correlation versus regression and how if one variable helps you predict another then we can put the predictive variable, often called the independent variable on the x-axis and the dependent variable on the y-axis.
If you noticed when using the
function Excel didn't require that we specify an x-variable and a
y-variable. However, when we use the
=SLOPE() function for example Excel
asked for both the known_y's and
This illustrates a subtle but important distinction between the two.
Earlier we mentioned how in Finance we need some economic justification to move from correlation to regression, so let's add a little economic justification for our stock data here.
First, since the 1960s, academic literature has backed the rationale for there to be a relationship between the performance of an individual stock and the performance on the broader market.
On top of that, we know investor appetite for risk comes and goes, making equities more and less desirable at times, and that can affect the returns on all stocks at the same time. Meaning we could expect co-movements to be positive for stocks.
So there you have it, two reasons for why we feel comfortable pressing on with linear regression and from there, we have economic justification to set one variable, monthly returns on the Market as the independent variable, and for the dependent variable, we have returns on Merck. We can monitor the pattern historically, then use that to set our expectations.
We just covered the basis for many investment modeling processes as linear regression is used widely to develop risk models and alpha models.
So I mentioned setting expectations for stocks and for that let me back up with what we are actually doing here.
So in financial modeling we have three timeframes: Historical, Expected and Forecast. Here, we looked at Historical data, or data that occurred in the past, ran a regression and with the beta in particular we can realistically set expectations for the y-variable Merck if we have an expectation for the Market in the present timeframe, which we call Expected.
So, using the line of best fit and the slope of this line in particular we could say that if the Market changes by x%, we can now say we 'expect' Merck to change by y%. That's a big step and linear regression helps us answer this question.
Additionally, this isn't just our personal expectation for Merck, but the theoretical expectation that all investors have of Merck, as discussed previously in Portfolio Theory. So this is baked-in to market expectations for all investors theoretically which is an important point to remember.
The Passive index fund manager uses this expectation, and that's fine because they aren't trying to beat a benchmark, but instead match it. For active portfolio managers they set forecasts that differ for expectations on stocks using Fundamental, Technical or Quantitative processes and optimize portfolios in their attempt to outperform their benchmark.
Next, to illustrate let's take these figures and plug them in to the formula for the line of best fit.
In cell O20 let's say we have an expecation for the market return of 10%, using the equation for the line we can determine what all investors expect for the return on Merck. I'm taking a shortcut here and leaving out a nuance about risk free returns just to simplify the math, but we will introduce that later.
So in cell P20 using the equation for a line, input
=M24+M25*O20. So the expectation for
Merck is to increase by 11.53%.
Because Merck and the Market are not perfectly correlation then there will always be an error, but since we can't predict the error we have to leave it out.
These errors can be seen graphically in the Historical data. To
4.08% in cell
O20 and we would expect Merck to go up by 4.27%.
Looking at the chart, Merck actually went up 13.01%, so it exceeded
expectations by about 9%.
Let's look at another point, during a month in which the Market
-4.02% we would have
expected Merck to decline of -5.68%, and we can see it declined by
-26.00%, falling about 20% more than expected.
I looked. This was the month the Merck stock price felt the full weight of legal charges associated with the drug Vioxx in 2004. This is an example of what is called stock-specific risk and quantifiying it will be the focus of later tutorials.
One last point I should mention here. There are two types of regressions common in Finance.
The first, and most common, is called time-series. It is called time-series because we look over many time periods, comparing returns as we have done here. It's the same thing as with correlation, so nothing new here except the name, time-series.
The second type is called a cross-sectional regression. Here data is measured for groups of the population, during one time period. Let's say we grouped stocks into sectors and then performed a regression for one period of time. This would allow us to estimate how much of the performance was associated with the sector. Cross-sectional regressions are less common but very useful for multi-factor regressions.
Technically the regression we ran here was a single-factor linear regression where the one factor was the returns on the Market, which gave us one independent x-variable.
In a multi-factor linear regression we could use multiple factors, or multiple x-variables. On way to think of this is if you were trying to predict a student's GPA in college you could have two independent variables, one the ratio of homework assignments turned in and a second for hours of study time before tests. The regression of two factors would likely provide more accurate predictions of the student's GPA.
So for now, what is the takeaway? Time-series is for stocks over many time periods, and cross-sectional is for a group of stocks over one time period.
How does that sounds?
By way of summary, we started by visualizing our data using a scatter plot. We checked the box on having economic justification and learned that regression will be required to set expectations. We learned we can generate regression statistics using three methods, and here we employed two.
For financial modeling an often overlooked aspect is making this all useful and that depiction of Historical, Expected and Forecast timeframes keeps coming up. When working through problems in school or at the office, make sure to ask what the data is for. Many times people don't clarify this point, which leads to confusion. In fact, it's often ignored or assumed by those who write academic papers, which is part of the reason they're so difficult to read for practitioners.
We didn't cover and interpret confidence intervals around our expectations but will in future tutorials.
In the next episode we will build an active portfolio and use all of these new linear regression skills to measure performance using the Sharpe Ratio, Jensen's Alpha and the Treynor Ratio.
If this is your first visit we'd love to have you be a part of our journey to become better data analysts. Please feel free to join us any time and have a nice day.
To learn faster make sure you are watching the videos because the graphics and visuals from Excel are not duplicated here.
See more free learning opportunities at our YouTube Channel and join the conversation there with comments. Subscribe straight from here.