Customize a Regression with LINEST and Excel INDEX Functions

Learn a new technique for automating reressions in Excel

Intermediate

Video

Videos are available at one organized Quant 101 Playlist on YouTube.

Customize a regression with LINEST and Excel INDEX functions (19:57)

Video Script

Welcome. Today's objective is to learn how and when to use the =LINEST() function versus other regression methods in Excel.

I'm Paul, and if you are like me then you find regression settings confusing and hard to remember.

So here we will perform regressions in Excel using both the standard Excel regression functions and then with the more-flexible =LINEST() method. We will walk through the procedures first and at the end we will interpret the results.

If you would prefer to see a web page for the video transcript and key Excel formulas the first link in the video Description will take you to it. There will be a lot going on here so you will also find links so you can zoom straight to the section you are interested in.

Let's cover the plan for the day.

Outline

First, we cover uses cases for the three ways to run linear regressions in Excel.

Second, we start with an exercise using the function method and discuss its default parameters.

Third, we explore how =LINEST() allows us to customize regression output.

Fourth, we combine =INDEX() with =LINEST() to specify statistics from an output array.

Fifth, we will interpret all of the resulting tables we create.

And in our next episode we will use these functions to set expectations for stocks using CAPM.

Step 1 - Three Ways to Run Regressions in Excel

Let's kick off Step 1 with those three different methods for getting regression data out of Excel.

Standard functions
Data Analysis
LINEST function

Now let's discuss why, how and when you would use each method.

a. Standard functions

Let's start with standard regression functions in Excel.

With functions like =SLOPE() and =INTERCEPT() we can pull common regression statistics on a cell-by-cell basis. Others include =CORREL(), =RSQ() which is R-Squared, or correlation squared, and =STEYX() which is the standard error for the regression.

The benefit of functions is that they are quick and easy to memorize. Also, as we saw in the tutorial on Time Series Modeling functions like this can be arranged down columns so you can monitor changes over time. These are called rolling regressions.

The downside is that these functions do not give you the option to customize a regression without the intercept, or constant. While in most regressions using the intercept is appropriate, sometimes we need the flexibility.

b. Data Analysis

This takes us to the Data Analysis method which is very flexible and is useful for pulling detailed regression statistics in a very customized way. Plus it gives you the option to inspect and chart what are called residuals, which can be helpful for spotting errors in the data.

The drawback here is that Data Analysis in Excel requires that you to output a large block of data in your spreadsheet. This is good for one-time regressions, but makes it more difficult to see statistics change, like when analyzing rolling regressions mentioned earlier.

Most people learn regression in Excel by reading the Data Analysis ANOVA output so in another tutorial we walk through each of the regression statistics. It has been my experience that interpreting ANOVA output is confusing which is why we will demystify that there.

c. LINEST function

The =LINEST() function method is the most flexible function to access regression statistics. It allows you run regressions and pull statistics in place rather than using a large output block. It also allows you to include or exclude the intercept. Also, when combined with the =INDEX() function you can set up columns for rolling regression measures, as we will see in the next tutorial.

The drawback here is that you need to understand the output and how to pull out the appropriate statistic, which is tricky, so we will nail the procedures in Step 3.

Step 2 - Standard Excel Regression Functions

Now for Step 2, let's practice with real data, starting with the function method.

We said standard functions weren't as flexible, meaning by default the regression is run with an intercept. This works fine in our first example here because we want the intercept anyway.

Here our focus is on the financial modeling of stocks, but of course this works with any type of data where you require the intercept.

a. Raw returns or excess returns?

Specific to our data set here, in our last tutorial on the CAPM model we left off with two unanswered questions and here in Exercise 1 we prepare a table to answer the first one, later in Step 5.

For the regression we used a small 24-month sample data set for a regression of returns on the Market as our x-variable and returns on the stock Merck as the y-variable.

Let's grab the visual on our Returns data tab to see the data we used. We have a table with Monthly Raw Returns that can be downloaded sign-up free as shown in our System Setup tutorial.

From that we calculated Excess Returns which here is the raw return on each stock and the Market minus the risk-free rate of return in column H.

b. Access statistics using named ranges

Let's now run through the instructions for Exercise 1.

Perform two regressions of Merck against the Market
Both use 24 monthly periods, ending 3/31/05
The first regression is for raw returns
The second is for excess returns
Include an intercept in both cases, which is the default
Use standard Excel regression functions

Let's grab data using the convention of naming ranges because it will speed this up.

Start by highlighting the range from F7:F30 and in the Name Box type MRK_r24 for Merck. Then from G7:G30 name it Market_r24.

Now for excess returns highlight the range from N7:N30 and name it MRK_e24 and name O7:O30 Market_e24.

Now, returning to our table, in cell G9 for the slope we can use the formula =SLOPE(MRK_r24,Market_r24) giving us a beta of 1.431 using raw returns.

We can use =SLOPE(MRK_e24,Market_e24) in cell H9 to find the slope of Merck with excess returns of 1.427.

	F	G	H	I
8	A. Raw Return With Intercept		B. Excess Return With Intercept
9	Slope	1.431	1.427	`=SLOPE()`
10	Intercept	-0.024	-0.024	`=INTERCEPT()`
11	R-Squared	0.548	0.547	`=RSQ()`
12	Std. Error	0.067	0.067	`=STEYX`

From there you can use the functions indicated to populate the rest of the table. Again, we will return in Step 5 for an interpretation.

Step 3 - Procedures for LINEST

With that for Step 3, let's run a similar regression using =LINEST() but this time we can modify it to run the regression without an intercept.

a. Intercept or no intercept?

For most regressions including the intercept is recommended; however, an RTO or regression through the origin is used for applications of survey data and is common in the fields of agriculture and econometrics.

b. The seven-step LINEST procedure

Now let's walk through the seven-step procedure for =LINEST() using the two excess returns data sets we already named.

First, select the 5 row by 2 column output block in cells F18:G22.

Second, type =LINEST( and you will see the instructions. This function, as we will use it, asks for two ranges of data and two others for how to calculate and present the data.

Third, enter or select the range for known_y's for the dependent variable, in our case MRK_e24 followed by a comma.

Fourth, enter the known_x's, for the independent variable for the returns on the Market with Market_e24, followed by a comma again.

Fifth, to include or exclude the constant or intercept, type TRUE or in our case here with FALSE, followed by a comma. The default is TRUE if none is specified.

Sixth, again you would enter either TRUE to include the full table of 5 rows of output. If you only want the first row coefficients, then type FALSE followed by a closed parenthesis. With the default here of FALSE Excel will only return the coefficients for the slope and constant.

Seventh, because this is an array calculation in Excel, it requires the keystroke combination Ctrl+Shift+Enter. In an array calculation, like with lineaer algebra involving vectors and matrices, you need to tell Excel to calculate it all at once. This is convenient because you only need one formula and Excel performs and outputs the results from multiple calculations.

This regression run without the intercept populates the range F18:G22.

	F	G	H	I
17	C. Excess Returns Without Intercept		D. Excess Returns With Intercept
18	1.352	0.000	1.427	-0.024
19	0.285	#N/A	0.277	0.014
20	0.495	0.070	0.547	0.067
21	22.505	23.000	26.563	22.000
22	0.111	0.113	0.120	0.100

c. Repeat the calculation with an intercept

Next, let's duplicate all of the same steps, replacing the third parameter as TRUE in the range H18:I22 for the same regression, thereby including the intercept.

Here four of the items for the output in section D should match what we ran in case B earlier.

d. Excel LINEST output arrays

Output arrays will take two shapes depending on whether the [stats] was set to TRUE or FALSE.

For the default of FALSE the output will only be one row of coefficients, as mentioned.

Output arrays will take two shapes depending on whether the [stats] parameter was set to TRUE or FALSE.

For single-variable linear regressions the output when [stats] is set to TRUE the output is a 5-by-2 array. The output is specified in the table here.

E	F	G	H
27		Column 1	Column 2
28	Row 1	Coefficient x1	Intercept
29	Row 2	Standard error x1	Standard error intercept
30	Row 3	R-squared	Standard error (all)
31	Row 4	F-statistic	Degrees of freedom
32	Row 5	Sum of Squares Regression	Sum of Squares Residual

For a multi-variable regression the output block changes. To illustrate this a second table shows outputs for a 2-variable regression, again assuming the [stats] parameter was set to TRUE.

27		Column 1	Column 2	Column 3
28	Row 1	Coefficient x2	Coefficient x1	Intercept
29	Row 2	Standard error x2	Standard error x1	Standard error intercept
30	Row 3	R-squared	Standard error (all)
31	Row 4	F-statistic	Degrees of freedom
32	Row 5	Sum of Squares Regression	Sum of Squares Residual

Notice how the intercept and first x-variable shifts to the right and then the first column is now for the second x-variable.

All cells from rows 3 through 5 of the output table remain the same, meaning the R-squared, F-statistic, standard error for the whole regression plus other measures will continue to sit in this block.

This is nice from the standpoint that we always know where they are, but inconvenient in that when running multi-variable regressions, we have to track down the location of the intercept and x-variable coefficients as they shift to the right.

We will interpret the meaning of several of these measures in Step 5, when we answer the second of two questions left over from the previous tutorial.

Step 4 - Access Statistics using INDEX

Moving on to Step 4, now that we know what the output will look like, we need a way to automate this. Recall, we said =LINEST() was the most flexible method for running regressions and that is true, but we need help from the =INDEX() function.

Our output block here is fine for a one-off situation, but what if we wanted to pull a whole column of standard errors for different time periods? This wouldn't be ideal, and that is where the =INDEX() function comes into play.

I should note that this procedure is similar to indexing in statistical programming languages as well.

a. The Excel INDEX function procedure

Let's first look at the format for the =INDEX() function which can be found by going to an empty cell and inputting =INDEX(.

The result is =INDEX(array,row_num,[column_num]) indicating that the first two parameters are required and the third is optional. Here we will input all three.

First is the array, where you pass the array, and it can be a range of cells, a named array or the result of an array calculation like we have with a =LINEST() function.

b. Use the INDEX function to pull specific cells

It's easiest if we walk through an example in Exercise 3, with the final output in the range from N16:Q19.

	N	O	P	Q
16	R,C	Statistic	C. Stats	D. Stats
17	`1,1`	Coefficient 1	1.352	1.427
18	`2,1`	Standard error x1	0.285	0.277
19	`3,2`	Standard error all	0.070	0.067

In column N we have a guide for the row-by-column lookup and O lists the statistics we want to pull using those coordinates.

I should mention that as with arrays and matrices in linear algebra, we always use the convention, row then column. So 1,1 is row 1, column 1, or in this case the x-coefficient.

In cell P17 let's pull the x-variable coefficient for the one-variable regression from earlier with excess returns and without an intercept. Here we input the formula =INDEX(LINEST(MRK_e24,Market_e24,FALSE,TRUE),1,1). The result of 1.352 matches what we had earlier.

We will go through one more example and then you can finish the rest of the table on your end.

To pull the x-variable coefficient for excess returns with an intercept use the formula =INDEX(LINEST(MRK_e24,Market_e24,TRUE,TRUE),1,1), and again we match at 1.427.

If you noticed we aren't using the Ctrl+Shift+Enter keystroke combination as we normally do with array functions. Technically =INDEX() is not an array function, so we can simply pass in the array and get the answer with Enter.

Step 5 - Interpret Regression Output

With all of our tables populated now let's interpret these results as they relate to the modeling of stocks.

a. Compare raw returns to excess returns

Okay, let's start with the results from our first Exercise.

First, a quick word about the sample size. We used a sample of only 24-monthly periods, and we should use at least 30 periods, or better yet 60 observations if this were a real study.

Second, using the standard function method, we see that the difference between the beta coefficients is minor, from 1.431 to 1.427.

Third, the intercept was -0.24 for both regressions and you would have to go to four decimal places to see a difference. Again, this is not a material difference.

And last, with respect to the R-Squared and standard error figures, which describe the pattern of data points around the line of best fit, these figures were almost identical.

If you were with us for the last tutorial on the CAPM model, there we were curious how the slope coefficient would change if we used raw returns as opposed to the technically and academically correct method of using raw returns versus excess returns.

So in the end, to come full circle we see that the difference between using raw versus excess is quite small, offering rationale for why some practitioners use raw returns instead of excess returns for their convenience, thereby skipping a step.

b. Compare results with and without an intercept

In our second exercise we ran a regression for the same period except this time using =LINEST() but altering the use of an intercept. In both cases we used excess returns.

Here we do note differences of 1.352 for the regression without and intercept and 1.427 with the intercept. In addition, measures indicate that the tightness of fit around the line of best fit are lower. We see an R-squared of 0.495 and 0.547, and the standard error of the whole regression 0.070 and 0.067, for the without and with intercept regressions, respectively.

So this answers the second question from the previous tutorial. It appears that yes, there is a material difference between going with or without an intercept.

In our specific case here, for the financial modeling of stocks, one of the assumptions of CAPM, is that you can't assume that alpha in the past will repeat in the future, so some researchers perform regressions without the intercept, or RTO, for regressions through the origin.

For forward-looking studies, like that with the estimation of stock risk, you may elect to eliminate the constant, but this is dependent on the person or firm modeling the data.

However, for backwards-looking calculation of past performance on portfolio performance however, the intercept represents the alpha, or outperformance of a portfolio after taking risk into consideration. This is used to evaluate portfolio alpha. So this is obviously a financial modeling scenario where you would not want to use an RTO approach.

c. Interpret regression output

In the table titled Excel LINEST Output Array, we have the structure of outputs for the =LINEST() function output and I have assumed you understand the meaning of coefficients, intercept and R-squared.

At this time it might be a good idea to touch on the definition of the statistics in the first three rows.

Coeff - Let's start with cell 1,1 and coefficient. In Finance another term for this is 'beta', 'exposure' or in some cases in academic literature 'factor loading'. They all mean the same thing and here it refers to the slope of the line of best fit between the independent x-variable, the Market return, and the dependent y-variable, Merck returns.
Std. err - In cell 2,1 we have the standard error of the beta estimate. This is used to determine the significance of the beta estimate, a topic we will revisit in coming tutorials.
R-squared - In cell 3,1 we have the R-squared or correlation squared. This is the percent of the variation that is explained by the estimated line of best fit.
Intercept - In cell 1,2 we have that intercept or the point on the y-axis where the line of best fit crosses. Recall that another word for intercept in our context is alpha, but often you hear the term constant.
Std. err - In cell 2,2, we have the standard error of the intercept.
Std. err all - In cell 3,2 we have the standard error for the whole regression, not for the beta or the constant alone. This is what you see as the error term in the formula for a line using the format y = a + bx + e, again a topic for another day.

Don't worry too much if the standard errors don't make much sense now. We will return to them at a later time including adding confidence levels to all of our estimates, which is another nice feature to add to any financial model.

Summary

By way of summary, we have taken a deep dive into the =LINEST() function and covered what it provides relative to the other two methods for accessing regression statistics.

Yes, there are added steps that come with it; however, the added flexibility to customize the type of regression and to pull just one item from the regression, or create a column of just that one measure can be helpful.

Also specific to our data set revolving around stocks, we saw how the difference between using raw returns versus excess returns was not material. We did however see the intercept and how it can impact the regression results.

Step 6 - Next: Expected Return with LINEST

In the next episode we will get more experience with beta and making estimates using CAPM. We will be able to measure ranges we are comfortable with for not only our beta estimate which is so important for setting expectations for stocks, but for other measures as we near the end of Quant 101.

This has been a long tutorial, thanks for sticking with it. Make sure you connect with us if you found value here as we're on a mission to improve free resources for the modeling of return and risk on stocks and portfolios.

Please feel free to join us at any time and have a nice day.