/ factorpad.com / fin / quant-101 / excel-linest-function.html
An ad-free and cookie-free website.
Intermediate
Videos are available at one organized Quant 101 Playlist on YouTube.
Customize a regression with LINEST and Excel INDEX functions (19:57)
Welcome. Today's objective is to learn how and when to use the
=LINEST()
function versus other
regression methods in Excel.
I'm Paul, and if you are like me then you find regression settings confusing and hard to remember.
So here we will perform regressions in Excel using both the standard
Excel regression functions and then with the more-flexible
=LINEST()
method. We will walk through
the procedures first and at the end we will interpret the results.
If you would prefer to see a web page for the video transcript and key Excel formulas the first link in the video Description will take you to it. There will be a lot going on here so you will also find links so you can zoom straight to the section you are interested in.
Let's cover the plan for the day.
First, we cover uses cases for the three ways to run linear regressions in Excel.
Second, we start with an exercise using the function method and discuss its default parameters.
Third, we explore how =LINEST()
allows
us to customize regression output.
Fourth, we combine =INDEX()
with
=LINEST()
to specify statistics from an
output array.
Fifth, we will interpret all of the resulting tables we create.
And in our next episode we will use these functions to set expectations for stocks using CAPM.
Let's kick off Step 1 with those three different methods for getting regression data out of Excel.
Now let's discuss why, how and when you would use each method.
Let's start with standard regression functions in Excel.
With functions like =SLOPE()
and
=INTERCEPT()
we can pull common
regression statistics on a cell-by-cell basis. Others include
=CORREL()
,
=RSQ()
which is R-Squared, or
correlation squared, and
=STEYX()
which is the standard error
for the regression.
The benefit of functions is that they are quick and easy to memorize. Also, as we saw in the tutorial on Time Series Modeling functions like this can be arranged down columns so you can monitor changes over time. These are called rolling regressions.
The downside is that these functions do not give you the option to customize a regression without the intercept, or constant. While in most regressions using the intercept is appropriate, sometimes we need the flexibility.
This takes us to the Data Analysis method which is very flexible and is useful for pulling detailed regression statistics in a very customized way. Plus it gives you the option to inspect and chart what are called residuals, which can be helpful for spotting errors in the data.
The drawback here is that Data Analysis in Excel requires that you to output a large block of data in your spreadsheet. This is good for one-time regressions, but makes it more difficult to see statistics change, like when analyzing rolling regressions mentioned earlier.
Most people learn regression in Excel by reading the Data Analysis ANOVA output so in another tutorial we walk through each of the regression statistics. It has been my experience that interpreting ANOVA output is confusing which is why we will demystify that there.
The =LINEST()
function method is the
most flexible function to access regression statistics. It allows you
run regressions and pull statistics in place rather than using a large
output block. It also allows you to include or exclude the intercept.
Also, when combined with the =INDEX()
function you can set up columns for rolling regression measures, as
we will see in the next tutorial.
The drawback here is that you need to understand the output and how to pull out the appropriate statistic, which is tricky, so we will nail the procedures in Step 3.
Now for Step 2, let's practice with real data, starting with the function method.
We said standard functions weren't as flexible, meaning by default the regression is run with an intercept. This works fine in our first example here because we want the intercept anyway.
Here our focus is on the financial modeling of stocks, but of course this works with any type of data where you require the intercept.
Specific to our data set here, in our last tutorial on the CAPM model we left off with two unanswered questions and here in Exercise 1 we prepare a table to answer the first one, later in Step 5.
For the regression we used a small 24-month sample data set for a regression of returns on the Market as our x-variable and returns on the stock Merck as the y-variable.
Let's grab the visual on our Returns data tab to see the data we used. We have a table with Monthly Raw Returns that can be downloaded sign-up free as shown in our System Setup tutorial.
From that we calculated Excess Returns which here is the raw return on each stock and the Market minus the risk-free rate of return in column H.
Let's now run through the instructions for Exercise 1.
Let's grab data using the convention of naming ranges because it will speed this up.
Start by highlighting the range from F7:F30 and in the
Name Box type MRK_r24
for Merck. Then from G7:G30 name it
Market_r24
.
Now for excess returns highlight the range from N7:N30
and name it MRK_e24
and name
O7:O30 Market_e24
.
Now, returning to our table, in cell G9 for the slope
we can use the formula
=SLOPE(MRK_r24,Market_r24)
giving us a
beta of 1.431 using raw returns.
We can use
=SLOPE(MRK_e24,Market_e24)
in cell
H9 to find the slope of Merck with excess returns of
1.427.
F | G | H | I | |
---|---|---|---|---|
8 | A. Raw Return With Intercept | B. Excess Return With Intercept | ||
9 | Slope | 1.431 | 1.427 | =SLOPE() |
10 | Intercept | -0.024 | -0.024 | =INTERCEPT() |
11 | R-Squared | 0.548 | 0.547 | =RSQ() |
12 | Std. Error | 0.067 | 0.067 | =STEYX |
From there you can use the functions indicated to populate the rest of the table. Again, we will return in Step 5 for an interpretation.
With that for Step 3, let's run a similar regression using
=LINEST()
but this time we can modify
it to run the regression without an intercept.
For most regressions including the intercept is recommended; however, an RTO or regression through the origin is used for applications of survey data and is common in the fields of agriculture and econometrics.
Now let's walk through the seven-step procedure for
=LINEST()
using the two excess returns
data sets we already named.
First, select the 5 row by 2 column output block in cells F18:G22.
Second, type =LINEST(
and you will see
the instructions. This function, as we will use it, asks for two ranges
of data and two others for how to calculate and present the data.
Third, enter or select the range for known_y's for the dependent
variable, in our case MRK_e24
followed
by a comma.
Fourth, enter the known_x's, for the independent variable for the
returns on the Market with Market_e24
,
followed by a comma again.
Fifth, to include or exclude the constant or intercept, type
TRUE
or in our case here with
FALSE
, followed by a comma. The
default is TRUE
if none is specified.
Sixth, again you would enter either
TRUE
to include the full table of
5 rows of output. If you only want the first row coefficients, then type
FALSE
followed by a closed parenthesis.
With the default here of FALSE
Excel
will only return the coefficients for the slope and constant.
Seventh, because this is an array calculation in Excel, it requires
the keystroke combination
Ctrl+Shift+Enter
. In an array
calculation, like with lineaer algebra involving vectors and matrices,
you need to tell Excel to calculate it all at once. This is convenient
because you only need one formula and Excel performs and outputs the
results from multiple calculations.
This regression run without the intercept populates the range F18:G22.
F | G | H | I | |
---|---|---|---|---|
17 | C. Excess Returns Without Intercept | D. Excess Returns With Intercept | ||
18 | 1.352 | 0.000 | 1.427 | -0.024 |
19 | 0.285 | #N/A | 0.277 | 0.014 |
20 | 0.495 | 0.070 | 0.547 | 0.067 |
21 | 22.505 | 23.000 | 26.563 | 22.000 |
22 | 0.111 | 0.113 | 0.120 | 0.100 |
Next, let's duplicate all of the same steps, replacing the third
parameter as TRUE
in the range
H18:I22 for the same regression, thereby including the
intercept.
Here four of the items for the output in section D should match what we ran in case B earlier.
Output arrays will take two shapes depending on whether the
[stats]
was set to
TRUE
or
FALSE
.
For the default of FALSE
the output
will only be one row of coefficients, as mentioned.
Output arrays will take two shapes depending on whether the
[stats]
parameter was set to
TRUE
or
FALSE
.
For single-variable linear regressions the output when
[stats]
is set to
TRUE
the output is a 5-by-2 array. The
output is specified in the table here.
E | F | G | H |
---|---|---|---|
27 | Column 1 | Column 2 | |
28 | Row 1 | Coefficient x1 | Intercept |
29 | Row 2 | Standard error x1 | Standard error intercept |
30 | Row 3 | R-squared | Standard error (all) |
31 | Row 4 | F-statistic | Degrees of freedom |
32 | Row 5 | Sum of Squares Regression | Sum of Squares Residual |
For a multi-variable regression the output block changes. To illustrate
this a second table shows outputs for a 2-variable regression, again
assuming the [stats] parameter was set to
TRUE
.
27 | Column 1 | Column 2 | Column 3 | |
---|---|---|---|---|
28 | Row 1 | Coefficient x2 | Coefficient x1 | Intercept |
29 | Row 2 | Standard error x2 | Standard error x1 | Standard error intercept |
30 | Row 3 | R-squared | Standard error (all) | |
31 | Row 4 | F-statistic | Degrees of freedom | |
32 | Row 5 | Sum of Squares Regression | Sum of Squares Residual |
Notice how the intercept and first x-variable shifts to the right and then the first column is now for the second x-variable.
All cells from rows 3 through 5 of the output table remain the same, meaning the R-squared, F-statistic, standard error for the whole regression plus other measures will continue to sit in this block.
This is nice from the standpoint that we always know where they are, but inconvenient in that when running multi-variable regressions, we have to track down the location of the intercept and x-variable coefficients as they shift to the right.
We will interpret the meaning of several of these measures in Step 5, when we answer the second of two questions left over from the previous tutorial.
Moving on to Step 4, now that we know what the output will look like,
we need a way to automate this. Recall, we said
=LINEST()
was the most flexible
method for running regressions and that is true, but we need help from
the =INDEX()
function.
Our output block here is fine for a one-off situation, but what if we
wanted to pull a whole column of standard errors for different time
periods? This wouldn't be ideal, and that is where the
=INDEX()
function comes into play.
I should note that this procedure is similar to indexing in statistical programming languages as well.
Let's first look at the format for the
=INDEX()
function which can be found
by going to an empty cell and inputting
=INDEX(
.
The result is
=INDEX(array,row_num,[column_num])
indicating that the first two parameters are required and the third is
optional. Here we will input all three.
First is the array, where you pass the array, and it can be a range of
cells, a named array or the result of an array calculation like we have
with a =LINEST()
function.
It's easiest if we walk through an example in Exercise 3, with the final output in the range from N16:Q19.
N | O | P | Q | |
---|---|---|---|---|
16 | R,C | Statistic | C. Stats | D. Stats |
17 | 1,1 |
Coefficient 1 | 1.352 | 1.427 |
18 | 2,1 |
Standard error x1 | 0.285 | 0.277 |
19 | 3,2 |
Standard error all | 0.070 | 0.067 |
In column N we have a guide for the row-by-column lookup and O lists the statistics we want to pull using those coordinates.
I should mention that as with arrays and matrices in linear algebra, we always use the convention, row then column. So 1,1 is row 1, column 1, or in this case the x-coefficient.
In cell P17 let's pull the x-variable coefficient for
the one-variable regression from earlier with excess returns and
without an intercept. Here we input the formula
=INDEX(LINEST(MRK_e24,Market_e24,FALSE,TRUE),1,1)
.
The result of 1.352 matches what we had earlier.
We will go through one more example and then you can finish the rest of the table on your end.
To pull the x-variable coefficient for excess returns with an intercept
use the formula
=INDEX(LINEST(MRK_e24,Market_e24,TRUE,TRUE),1,1)
,
and again we match at 1.427.
If you noticed we aren't using the
Ctrl+Shift+Enter
keystroke
combination as we normally do with array functions. Technically
=INDEX()
is not an array function, so
we can simply pass in the array and get the answer with
Enter
.
With all of our tables populated now let's interpret these results as they relate to the modeling of stocks.
Okay, let's start with the results from our first Exercise.
First, a quick word about the sample size. We used a sample of only 24-monthly periods, and we should use at least 30 periods, or better yet 60 observations if this were a real study.
Second, using the standard function method, we see that the difference between the beta coefficients is minor, from 1.431 to 1.427.
Third, the intercept was -0.24 for both regressions and you would have to go to four decimal places to see a difference. Again, this is not a material difference.
And last, with respect to the R-Squared and standard error figures, which describe the pattern of data points around the line of best fit, these figures were almost identical.
If you were with us for the last tutorial on the CAPM model, there we were curious how the slope coefficient would change if we used raw returns as opposed to the technically and academically correct method of using raw returns versus excess returns.
So in the end, to come full circle we see that the difference between using raw versus excess is quite small, offering rationale for why some practitioners use raw returns instead of excess returns for their convenience, thereby skipping a step.
In our second exercise we ran a regression for the same period
except this time using =LINEST()
but
altering the use of an intercept. In both cases we used excess returns.
Here we do note differences of 1.352 for the regression without and intercept and 1.427 with the intercept. In addition, measures indicate that the tightness of fit around the line of best fit are lower. We see an R-squared of 0.495 and 0.547, and the standard error of the whole regression 0.070 and 0.067, for the without and with intercept regressions, respectively.
So this answers the second question from the previous tutorial. It appears that yes, there is a material difference between going with or without an intercept.
In our specific case here, for the financial modeling of stocks, one of the assumptions of CAPM, is that you can't assume that alpha in the past will repeat in the future, so some researchers perform regressions without the intercept, or RTO, for regressions through the origin.
For forward-looking studies, like that with the estimation of stock risk, you may elect to eliminate the constant, but this is dependent on the person or firm modeling the data.
However, for backwards-looking calculation of past performance on portfolio performance however, the intercept represents the alpha, or outperformance of a portfolio after taking risk into consideration. This is used to evaluate portfolio alpha. So this is obviously a financial modeling scenario where you would not want to use an RTO approach.
In the table titled Excel LINEST Output Array, we have the structure of
outputs for the =LINEST()
function
output and I have assumed you understand the meaning of coefficients,
intercept and R-squared.
At this time it might be a good idea to touch on the definition of the statistics in the first three rows.
Don't worry too much if the standard errors don't make much sense now. We will return to them at a later time including adding confidence levels to all of our estimates, which is another nice feature to add to any financial model.
By way of summary, we have taken a deep dive into the
=LINEST()
function and covered what
it provides relative to the other two methods for accessing regression
statistics.
Yes, there are added steps that come with it; however, the added flexibility to customize the type of regression and to pull just one item from the regression, or create a column of just that one measure can be helpful.
Also specific to our data set revolving around stocks, we saw how the difference between using raw returns versus excess returns was not material. We did however see the intercept and how it can impact the regression results.
In the next episode we will get more experience with beta and making estimates using CAPM. We will be able to measure ranges we are comfortable with for not only our beta estimate which is so important for setting expectations for stocks, but for other measures as we near the end of Quant 101.
This has been a long tutorial, thanks for sticking with it. Make sure you connect with us if you found value here as we're on a mission to improve free resources for the modeling of return and risk on stocks and portfolios.
Please feel free to join us at any time and have a nice day.
To learn faster make sure you see the videos the visuals on Excel are difficult to duplicate one a web page.
See more videos about the institutional approach our YouTube Channel.
/ factorpad.com / fin / quant-101 / excel-linest-function.html
A newly-updated free resource. Connect and refer a friend today.