Videos are available in an organized Quant 101 Playlist on YouTube.
Linear Regression: SST, SSR, SSE, R-squared and Standard Error in Excel (37:31)
Welcome. Today's goal is to cover the eight most important measures from a linear regression, so you can ace your exam and accelerate your data analysis career.
I'm Paul, and this is my 250th YouTube tutorial and it just may be the most important. When it comes to regression, I have found that many people kind of get it, but when it comes to test day or interview time, they lose confidence. We'll solve that here.
Concepts covered apply across all statistical analysis software, but here we will use Excel, as it is likely the most frequently-used tool by beginners.
I will demonstrate how you can build a spreadsheet like this, mastering our 8 color-coded measures by calculating them 3 or 4 different ways. We create a scatter plot then run a regression with Excel's Data Analysis Toolpak. Next, we learn about and walk through each calculation completing a table showing all of the intermediate steps, while keeping our eye on formula notation.
Stick around to the end and I'll take five minutes to show you two advanced features in Excel that let you customize regression settings and access these 8 statistics on the fly.
But let's not get ahead of ourselves, we have to start with the data. Ours is limited to 15 observations so all of this will sit right in front of us. We won't worry about statistical significance here; instead, we'll focus on the concepts and the calculations.
Like any tutorial, you will learn faster if you follow along in your spreadsheet, so I suggest taking a couple minutes to key in this data, preferably in the same cells, as that will help when I walk through it.
With that as a start, let's get going.
Jump to the section you are interested in here.
For Step 1, let's introduce our data set.
The data we use for this regression revolves around returns on two stock portfolios from a data set I use to teach a college course on risk and portfolio management.
To explain, for the month of November '06, the Bench, or Benchmark portfolio declined by -0.55% and another portfolio called the Acme Fund declined by -0.88%. You can see returns for the next 14 months, down to row 19.
This will all become more clear in a moment when we visualize the data in a chart, but the whole point with linear regression is to quantify the relationship between variables, like we have here. With our first two measures we can derive a line that estimates the relationship. Second, with the remaining measures we can assess the strength of this relationship.
We call Bench and Acme variables because they hold values and these values are variable, meaning they change. The Benchmark is also labeled x and the Acme Fund is labeled y, and I'll share the reason for that in a moment.
C | D | E | |
---|---|---|---|
3 | | x | y |
4 | Date | Bench | Acme |
5 | Nov-06 | -0.55 | -0.88 |
6 | Dec-06 | -0.93 | -3.08 |
7 | Jan-07 | 5.26 | 4.95 |
8 | Feb-07 | -2.31 | -1.86 |
9 | Mar-07 | 0.92 | 1.40 |
10 | Apr-07 | 6.56 | 8.01 |
11 | May-07 | -0.35 | -1.02 |
12 | Jun-07 | -4.01 | -3.50 |
13 | Jul-07 | -1.98 | -0.92 |
14 | Aug-07 | 1.62 | 2.28 |
15 | Sep-07 | 5.70 | 7.56 |
16 | Oct-07 | 7.76 | 4.48 |
17 | Nov-07 | -2.43 | -2.75 |
18 | Dec-07 | 0.00 | -0.91 |
19 | Jan-08 | -12.15 | -16.88 |
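If you'd like a second way to sanity-check the spreadsheet as we go, the same 15 observations can be keyed into a few lines of Python. This is a side sketch, not part of the Excel workflow, and the list names are my own:

```python
# The same 15 monthly returns as the table above (Nov-06 through Jan-08),
# keyed in at the two decimals shown, so derived figures can differ from
# the full-precision spreadsheet by a hair.
bench = [-0.55, -0.93, 5.26, -2.31, 0.92, 6.56, -0.35, -4.01,
         -1.98, 1.62, 5.70, 7.76, -2.43, 0.00, -12.15]   # x, the Benchmark
acme  = [-0.88, -3.08, 4.95, -1.86, 1.40, 8.01, -1.02, -3.50,
         -0.92, 2.28, 7.56, 4.48, -2.75, -0.91, -16.88]  # y, the Acme Fund

# Sample means, the anchors for every deviation-based measure later on.
x_bar = sum(bench) / len(bench)   # about 0.21
y_bar = sum(acme) / len(acme)     # about -0.21
```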
Now let me touch on four points about linear regression before we calculate our eight measures.
First, there are two broad types of linear regressions: single-variable and multiple-variable. In this exercise we focus exclusively on the single-variable version.
A multi-variable linear regression has multiple x-variables, but because of their added complexity, beginners start with the single-variable version.
We can easily visualize a single-variable regression with two dimensions on an XY scatter plot, but once you include other variables, we humans struggle to view the relationships because the chart might have three, four or five dimensions with axes going in all different directions. Nobody wants that, so instead output for multi-variable regressions is normally presented in tables.
The second point can be confusing at first, and may be something you just have to memorize because it is a standard in Statistics and comes up so frequently.
In a linear regression, variables can either be independent variables or dependent variables.
The x here is the label often assigned to the independent variable, meaning, it is the one we manipulate or follow as it changes independently. Sometimes it is called the predictor variable. From there, we measure changes in the dependent variable, here y, as it is customarily labeled.
In the social sciences like Economics, one variable is not literally dependent on the other, but unfortunately the terms stuck, so we're forced to use them.
As a third point, ours is time-series data, meaning the observations occurred over time but that isn't a requirement for a regression. In fact, we don't need to concern ourselves with the date, I just put it there for context.
Let's dream up a different study to illustrate the point. A non-time-series regression might involve measuring final grades in a class associated with how many times the student attended class. Here there is no time component.
So time isn't a requirement, but in financial data sets you often see one.
On the fourth point, I suggest not wasting brainpower trying to interpret the portfolio returns data here, instead think of these simply as data points with an x-variable and a y-variable.
Also, keep in mind that I'm glossing over some technical points I covered in other tutorials. For example, the sample size for this type of regression should be at least 30 data points, and preferably over 60. Also, this works best with data that has a ratio scale, like our percent return here. Head back to the tutorial called Generate and interpret a linear regression for more of the details.
Okay, with that, let's play with some data.
Now, for Step 2, let's build out the visuals.
As a byproduct of this exercise, I hope to introduce you to new skills and timesavers in Excel. The first, is naming ranges of data, which will help as you move on to working with statistical programming languages, like R or Python.
Let's name our ranges by highlighting cells D5:D19
and in the Name Box type
x
followed by Enter.
Then do the same for E5:E19 and name it
y
.
Very good, now we're all set to visualize this relationship.
Let's build that two-dimensional XY scatter plot.
Make sure your x-variable is in the first column, as we have here. Select both columns, so D5:E19, then Insert, Scatter and Scatter with only Markers.
To add the line and formulas click Layout, Trendline, More Trendline Options... and check the box next to Display Equation on chart. There you have it.
Since I added a number of customizations on my chart, I'll delete this one and focus on mine for the rest of this tutorial.
As is the norm, we have the x-variable plotted on the horizontal axis and the y-variable on the vertical axis. The 15 data points selected for each month correspond to the 15 points on the chart.
Let's look at the point far out to the right, with XY coordinates 7.76, 4.48. Here the Benchmark, our independent x-variable was up 7.76% and the Acme Fund, our dependent y-variable was up 4.48% during the month of October 2007.
The line through the data points mathematically minimizes the distance between the points and the line itself. This will become crystal clear by the end of this tutorial.
Shortly, we will walk through all of the math and see how the equation of this line, given here in y = mx + b format, was computed. Before we do that, let's collect measures using the method many people use when learning regression for the first time in Excel, called Data Analysis, or simply ANOVA in Excel.
Many people don't know this, but Excel offers three ways to gather regression statistics. Each has its own strengths and weaknesses, which we'll cover later, but the most common way is with Excel's Data Analysis tools. Before you can use them, they must be installed on your system.
If you click on the Data menu and see Data Analysis, then you are good to go.
If you don't see it, here are the steps to turn it on. Click File, then Options followed by Add-Ins. Now in the Manage dropdown select Excel Add-ins and Go..., then check Analysis ToolPak and hit OK.
Now, to use it, click Data then Data Analysis and Regression followed by OK.
There are a lot of settings to play with here, but we only need four for now. First, fill in the y-variable range with our named range y, and the x-variable range with x. Then when we click on Confidence Level, Excel automatically populates those ranges for us, illustrating why naming ranges is such a timesaver. Let's input 90 and select the radio button next to New Worksheet Ply. I specified 90% because if we don't specify anything different, Excel will give us two additional 95% columns, and we don't want that.
Now making sure no other boxes are checked, once we hit OK Excel creates a new tab with our 8 most important linear regression measures in the first three columns. Some are labeled differently, but here we have the slope, the intercept, and correlation, to name a few.
Of course having output on a separate tab can be inconvenient, so I've copied an 18 row and 9 column section, formatted it with rounding and added consistent colors in my spreadsheet, so we'll use that instead.
To me, the way Excel dumps the output is one of the drawbacks of the Data Analysis approach, so as mentioned, I will show you the other two more flexible approaches later.
Q | R | S | |
---|---|---|---|
3 | SUMMARY OUTPUT | ||
4 | Regression Statistics | ||
5 | Multiple R | 0.96 | |
6 | R Square | 0.92 | |
7 | Adj R Square | 0.92 | |
8 | Std. Error | 1.69 | |
9 | Observations | 15 | |
10 | |||
11 | ANOVA | ||
12 | | df | SS |
13 | Regression | 1 | 456.69 |
14 | Residual | 13 | 37.04 |
15 | Total | 14 | 493.73 |
16 | |||
17 | | Coefficients | Std. Error |
18 | Intercept | -0.45 | 0.44 |
19 | X Variable 1 | 1.15 | 0.09 |
By the way, if you would like me to explain the rest of the Data Analysis output, let me know in the YouTube Comments section. If enough people show interest, I'll circle back with a second tutorial.
My tutorials come with two other helpful features. First, is a web page with the transcript for this video, including formulas and an outline so you can go straight to the section you are interested in. The first link in the video's Description will take you straight to it.
Second, the same outline is included in the video Description. That way you can take a break and return to finish it off later.
That concludes Step 2.
Now, for Step 3, let's calculate and interpret these eight measures.
To do so we will incorporate three other visuals.
First, is a currently-empty table where we mark our progress as we learn about and calculate each measure on our own.
C | G | H | |
---|---|---|---|
22 | | Learn | Calculate |
23 | 1) Slope | ||
24 | 2) Intercept | ||
25 | 3) SST | ||
26 | 4) SSR | ||
27 | 5) SSE | ||
28 | 6) Correlation | ||
29 | 7) R-squared | ||
30 | 8) Standard error |
Second is the table where we walk through intermediate calculations and third is a set of flashcards with formula notation.
The formulas used here come from the Microsoft website. Because equations differ depending on the source, where I think a calculation is easier to learn a different way, I'll show you that approach as well.
Okay, with that as our plan let's categorize our 8 measures into 4 related groups.
Okay, let's cover slope and intercept.
These two measures sit in the equation for a line, which we already saw in y = mx + b format, where the slope m is 1.15 and the intercept b is -0.45.
Slope is given many different names depending on context and field of study. Here in Data Analysis it is called the X-variable coefficient, and it refers to the slope of the line of best fit. Like any slope, it is a measure of rise-over-run, meaning the change in the y-variable divided by the change in the x-variable.
So here, with a slope of 1.15, this line says that as the x-variable, the Benchmark, increased by 1%, the y-variable, the Acme Fund, rose by 1.15% for this period. So a rise of 1.15, over a run of 1.00 is 1.15, our slope.
And from earlier, the line of best fit minimizes the distance between all points and the line itself.
Now looking at formula notation, to calculate slope we sum the products of x minus the average of x times y minus the average of y. We then divide that by x minus the average of x, squared.
Okay, for this we need to make a few intermediate calculations. We need averages in columns F and G. Then we take each observation and subtract this average for both variables in columns H and I.
In column J we multiply columns H and I. Finally, in column K we square column H.
Okay, there's a lot going on there so let's break it down.
Let's start with F and G and get the averages.
C | D | E | F | G | |
---|---|---|---|---|---|
3 | | x | y | x̄ | ȳ |
4 | Date | Bench | Acme | Bench | Acme |
5 | Nov-06 | -0.55 | -0.88 | 0.21 | -0.21 |
6 | Dec-06 | -0.93 | -3.08 | 0.21 | -0.21 |
7 | Jan-07 | 5.26 | 4.95 | 0.21 | -0.21 |
8 | Feb-07 | -2.31 | -1.86 | 0.21 | -0.21 |
9 | Mar-07 | 0.92 | 1.40 | 0.21 | -0.21 |
10 | Apr-07 | 6.56 | 8.01 | 0.21 | -0.21 |
11 | May-07 | -0.35 | -1.02 | 0.21 | -0.21 |
12 | Jun-07 | -4.01 | -3.50 | 0.21 | -0.21 |
13 | Jul-07 | -1.98 | -0.92 | 0.21 | -0.21 |
14 | Aug-07 | 1.62 | 2.28 | 0.21 | -0.21 |
15 | Sep-07 | 5.70 | 7.56 | 0.21 | -0.21 |
16 | Oct-07 | 7.76 | 4.48 | 0.21 | -0.21 |
17 | Nov-07 | -2.43 | -2.75 | 0.21 | -0.21 |
18 | Dec-07 | 0.00 | -0.91 | 0.21 | -0.21 |
19 | Jan-08 | -12.15 | -16.88 | 0.21 | -0.21 |
In the range from F5 to F19 for the average of x, often referred to as x̄, we can use our named ranges with =AVERAGE(x). For ȳ in column G we use =AVERAGE(y).
If you're familiar with copying formulas in Excel, this demonstrates how much of a timesaver naming ranges can be. Now when you copy the formulas from row 5 down you won't need to update each cell reference.
Next, for (x-x̄) in column H, the calculation in cell H5 looks like =D5-F5. For (y-ȳ) in cell I5 it is =E5-G5.
H | I | J | K | L | |
---|---|---|---|---|---|
3 | (x-x̄) | (y-ȳ) | (x-x̄) | (x-x̄)^{2} | (y-ȳ)^{2}
4 | Bench | Acme | *(y-ȳ) | Bench | Acme
5 | -0.76 | -0.67 | 0.51 | 0.58 | 0.45 |
6 | -1.14 | -2.88 | 3.28 | 1.30 | 8.27 |
7 | 5.06 | 5.15 | 26.05 | 25.55 | 26.56 |
8 | -2.51 | -1.65 | 4.16 | 6.32 | 2.73 |
9 | 0.71 | 1.61 | 1.15 | 0.51 | 2.59 |
10 | 6.36 | 8.22 | 52.24 | 40.39 | 67.57 |
11 | -0.55 | -0.81 | 0.45 | 0.31 | 0.65 |
12 | -4.22 | -3.29 | 13.90 | 17.82 | 10.84 |
13 | -2.19 | -0.71 | 1.55 | 4.78 | 0.50 |
14 | 1.41 | 2.49 | 3.51 | 1.99 | 6.20 |
15 | 5.49 | 7.77 | 42.68 | 30.17 | 60.37 |
16 | 7.56 | 4.69 | 35.42 | 57.10 | 21.98 |
17 | -2.64 | -2.54 | 6.71 | 6.97 | 6.45 |
18 | -0.20 | -0.70 | 0.14 | 0.04 | 0.49 |
19 | -12.36 | -16.68 | 206.11 | 152.77 | 278.07 |
20 | 0.00 | 0.00 | 397.86 | 346.61 | 493.73 |
After we copy those down and sum them in cells H20 and I20 using =SUM(H5:H19) and =SUM(I5:I19), respectively, we see they add up to zero. If you take a moment to think about it, this is logical: deviations above the average exactly offset deviations below it.
While we're here, let's calculate the numerator in the slope formula in column J. It represents (x-x̄)(y-ȳ). In cell J5 we use =H5*I5, and copy that down.
Next, because we will need these in a moment, let's also calculate columns K and L, and then we'll slow down to make a few observations.
In column K sits (x-x̄)^{2}, calculated with =H5^2, raising each cell in column H to the second power. In column L, let's square column I to generate (y-ȳ)^{2} for the dependent variable, the Acme Fund. So in cell L5 this is =I5^2.
If I'm ever going too fast, remember that the web page with all of these calculations is there for you, found using the first link in the Description.
Now, after we sum each of these columns in row 20 we're ready to calculate the slope and intercept.
In cell J20 we have the sum of the products of (x-x̄) times (y-ȳ), or 397.86. That's the numerator of the slope calculation.
Now for the denominator, the sum of (x-x̄)^{2}, in cell K20 we have 346.61.
So when we input =J20/K20 in cell H23, it confirms the slope of 1.15 we saw in the scatter plot and in Excel's Data Analysis output. Very good.
It's worth taking a moment to observe what's going on here. The last point, with coordinates of -12.15 and -16.88, illustrates that the further x and y sit from their averages, the more those differences are magnified once you multiply and square them.
As this point is further away from its average it has a dramatic effect on the slope of this line, pulling the left side down closer to this observation, thereby increasing the slope.
Now imagine if this point wasn't included in the regression. Without it, these large figures in columns J and K wouldn't be here and the slope of the line might not be so steep. Also, for the averages, the dashed vertical line for x̄ would shift to the right and the horizontal line for ȳ would shift up.
This is an example of a data point some consider an outlier. After going through the math, we can see why outliers have such a dramatic effect on regression calculations like the slope and subsequent interpretations.
Also, up to this point all of our calculations measure how far the x and y coordinates are from their averages. The January '08 y-variable observation, for example, was -16.68 percent away from its average, as depicted by the light-green curly brace in the chart.
So a good takeaway is that for the slope and intercept we are only concerned with how far the observations are from the dashed lines, not how far they are from the line of best fit. That's a very important point, so we'll return to it later.
Okay, we're through with the slope. The rest will go much more quickly.
The intercept is the point where this line of best fit crosses the y-axis, which here appears slightly below zero.
In formula notation, the intercept is sometimes referred to as a, and it is the average of y minus the slope times the average of x, or a = ȳ - b·x̄. Note that in this notation b denotes the slope, unlike in the y = mx + b format above, where b is the intercept.
Okay, let's see if this works: in cell H24 it is =AVERAGE(y)-H23*AVERAGE(x), for a result of -0.45, another match.
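As a cross-check on these two results, here is the same deviation arithmetic in a Python sketch. Again, this is a side exercise outside the spreadsheet; small differences from Excel's figures come from keying in the returns at two decimals:

```python
# Returns from the data table (two-decimal values).
bench = [-0.55, -0.93, 5.26, -2.31, 0.92, 6.56, -0.35, -4.01,
         -1.98, 1.62, 5.70, 7.76, -2.43, 0.00, -12.15]
acme  = [-0.88, -3.08, 4.95, -1.86, 1.40, 8.01, -1.02, -3.50,
         -0.92, 2.28, 7.56, 4.48, -2.75, -0.91, -16.88]

x_bar = sum(bench) / len(bench)
y_bar = sum(acme) / len(acme)

# Numerator and denominator of the slope, mirroring cells J20 and K20.
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(bench, acme))
den = sum((x - x_bar) ** 2 for x in bench)

slope = num / den                   # about 1.15
intercept = y_bar - slope * x_bar   # about -0.45
```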
To be clear, because this point throws people: on the chart the dashed lines represent these averages, with the average of x at 0.21 shown by the vertical dashed line, and the average of y at -0.21 shown by the horizontal one. The two lines cross at those coordinates. The reason the line for x̄ is vertical is that regardless of the value of y, x̄ is 0.21.
The way I think of the intercept, is as a starting point, meaning if x has a value of 0, then y will be -0.45.
One last point is that the intercept is given different names, including constant, and in Finance, we call it alpha.
Let's shift our focus to SST, SSR and SSE, as we move from finding the equation of that line to assessing the strength of the relationship. That's the second objective of regression mentioned earlier.
As an introduction, what is consistent with all three measures is that SS means sum of squares. T refers to the total, R stands for the regression and E means the error, also referred to as the residual.
Let's visualize this with the color-coded curly braces for one observation from January '08 on the scatter plot. The first, in light green, with -16.68, refers to one of the 15 data points that goes into the calculation of SST, or the Total.
The second, in pink, with -14.19 next to it, goes into the calculation for the Regression, or SSR.
Third, in dark green, with -2.49, is the error, used for SSE.
So the Total equals the Regression plus the Error.
Now it would be much easier if we could simply sum up the deviations, but we can't, because the negative and positive ones cancel each other out, right? We saw that in columns H and I. Once squared, though, the negative signs go away and we can proceed. That's the procedure, and where the term SS comes from.
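The cancellation is easy to demonstrate with a toy example in Python (the numbers here are mine, not from our data set): deviations from a mean always sum to zero, while their squares do not.

```python
values = [2.0, 4.0, 9.0]             # any small sample works
mean = sum(values) / len(values)     # 5.0
devs = [v - mean for v in values]    # [-3.0, -1.0, 4.0]

print(sum(devs))                     # 0.0 -- positives cancel negatives
print(sum(d ** 2 for d in devs))     # 26.0 -- squaring removes the signs
```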
So there's the concept, let's now collect the data and then return to interpret.
For the three Sum of Squares measures we need columns L through O.
We already filled out column L but for the rest we need ŷ which is the predicted value of y for each observation of x.
To see this and be a little more formal, let's put the equation in cell T4, where ŷ = 1.15x - 0.45 + e. This time we include the error term, though strictly speaking the prediction ŷ is just the 1.15x - 0.45 portion; e is the gap between the observed y and that prediction.
So this is perfectly clear, let's create a ŷ Calculator, plug in three x values and see what the equation would predict. The results, of course, correspond to the points that sit on the line of best fit.
In cell T7 we have
=R19*U7+R18
. So, when we input
0
for x, as we would
expect, ŷ is -0.45, our intercept.
T | U | |
---|---|---|
6 | ŷ | x |
7 | -0.45 | 0 |
Second, let's input the January '08 observation covered earlier, of -12.15 for x, and the prediction is -14.39, and when it is a predicted value we call it ŷ, right? Cross-checking against the chart, the deviation of -14.19 plus the average of -0.21 gives -14.40, off just a hair due to rounding.
T | U | |
---|---|---|
6 | ŷ | x |
7 | -14.39 | -12.15 |
Third, circling back to another point made earlier, input the average for x, or 0.21, and the result is -0.20, which corresponds with ȳ, narrowly missing due to rounding.
T | U | |
---|---|---|
6 | ŷ | x |
7 | -0.20 | 0.21 |
So the line of best fit passes through the point where the dashed lines cross, at the coordinates (x̄, ȳ). That concept sure could make for an interesting test question, I'd say.
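The ŷ Calculator boils down to a one-line function. A quick Python sketch with the rounded coefficients reproduces the three checks we just made:

```python
slope, intercept = 1.15, -0.45   # the fitted coefficients from the chart

def y_hat(x):
    """Predicted y for a given x: a point on the line of best fit."""
    return slope * x + intercept

print(y_hat(0))        # -0.45, the intercept itself
print(y_hat(-12.15))   # about -14.4, the Jan-08 prediction
print(y_hat(0.21))     # about -0.21, so at x-bar the line passes near y-bar
```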
Now that we have a good grasp of ŷ, let's populate columns M through O. After that we'll return for the calculations and interpretations.
Notice too that for these calculations we are only concerned with the dependent variable y, or Acme returns.
In cell M5, we have =D5*$R$19+$R$18, pulling the slope and intercept from the regression. The dollar signs here allow you to copy that formula down without messing up the references.
Now for column N, here in the first row we have
=(M5-G5)^2
.
Finally, in cell O5 we have
=(E5-M5)^2
.
L | M | N | O | |
---|---|---|---|---|
3 | (y-ȳ)^{2} | ŷ | (ŷ-ȳ)^{2} | (y-ŷ)^{2}
4 | Acme | Acme | Acme | Acme |
5 | 0.45 | -1.08 | 0.76 | 0.04 |
6 | 8.27 | -1.52 | 1.72 | 2.45 |
7 | 26.56 | 5.60 | 33.67 | 0.42 |
8 | 2.73 | -3.09 | 8.33 | 1.52 |
9 | 2.59 | 0.61 | 0.67 | 0.63 |
10 | 67.57 | 7.09 | 53.22 | 0.86 |
11 | 0.65 | -0.84 | 0.40 | 0.03 |
12 | 10.84 | -5.05 | 23.48 | 2.41 |
13 | 0.50 | -2.72 | 6.30 | 3.24 |
14 | 6.20 | 1.41 | 2.62 | 0.76 |
15 | 60.37 | 6.10 | 39.76 | 2.14 |
16 | 21.98 | 8.47 | 75.23 | 15.89 |
17 | 6.45 | -3.24 | 9.19 | 0.24 |
18 | 0.49 | -0.44 | 0.06 | 0.22 |
19 | 278.07 | -14.39 | 201.29 | 6.19 |
20 | 493.73 | 456.69 | 37.04 |
Okay, we are ready to move on.
So the Total Sum of Squares, or SST, is the sum of column L, and in cell L20 we have =SUM(L5:L19). This gives us 493.73, a match with the Data Analysis output, so in cell H25 we can bring it down with =L20.
To interpret, SST refers to the sum of all of the squared differences between each observed y minus the average of y. Again, the total of all variations squared.
Okay, now that we have the total, we can break it into a part that is predicted or explained by the equation, SSR, and an unexplained part, SSE.
Now, the Sum of Squares for the Regression answers just how much of that SST the line of best fit predicts or explains, right? It measures the explained deviations from the mean.
Here we have the distances between the predictions and the mean, calculated as ŷ minus ȳ. Once we square -14.19, for example, we get 201.29.
When we sum column N, with =SUM(N5:N19) in cell N20, it gives us the Sum of Squares for the Regression of 456.69, capturing all 15 ŷ minus ȳ predictions in pink, and matching what we saw in the Data Analysis output.
Let's carry that down to H26 with =N20 and make two observations.
First, what do you think the value of SSR would be if all observations fell on the line of best fit? In other words, if the ŷs perfectly explained the observed ys.
Or described a different way, if the green curly brace that was -16.68 for the January '08 observation was instead -14.19 and sat right on the line, as did all other observations.
I think you can see where I'm going with this. The SSR would equal SST and we would have no SSE, so no errors. The point I'm trying to get across is that the closer all observations are to the line of best fit, the closer SSR is to SST, and the smaller the SSE.
And one last subtle point, in the example just mentioned the ŷ underestimated the distance the observed y is from the mean, right? But by definition, underestimates are going to be offset by overestimates.
To solidify this point, I encourage you to pause and think about why the curly braces are aligned differently for the observation on June '07, with XY coordinates -4.01, -3.50.
Okay, with that, let's tackle the calculation of SSE and return for additional interpretation.
At this point, the Sum of Squared Errors should be straightforward. It refers to the unexplained deviation from the mean, so those over and underestimates I just mentioned.
Here we are squaring all of the errors, like the -2.49, to get 6.19.
Now when we collect the sum in cell O20, with
=SUM(O5:O19)
we get 37.04, again
matching the output from Data Analysis and we can carry that down
to cell H27 with =O20
.
Great, so now that we've completed our three Sum of Squares measures, the takeaway is that the total equals the explained part plus the unexplained part.
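To see that identity numerically, here is a Python side sketch that rebuilds all three sums from the raw returns; the values land within rounding of the spreadsheet's 493.73, 456.69 and 37.04:

```python
# Returns from the data table (two-decimal values).
bench = [-0.55, -0.93, 5.26, -2.31, 0.92, 6.56, -0.35, -4.01,
         -1.98, 1.62, 5.70, 7.76, -2.43, 0.00, -12.15]
acme  = [-0.88, -3.08, 4.95, -1.86, 1.40, 8.01, -1.02, -3.50,
         -0.92, 2.28, 7.56, 4.48, -2.75, -0.91, -16.88]
n = len(bench)
x_bar, y_bar = sum(bench) / n, sum(acme) / n

slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bench, acme))
         / sum((x - x_bar) ** 2 for x in bench))
intercept = y_bar - slope * x_bar
preds = [slope * x + intercept for x in bench]          # column M, y-hat

sst = sum((y - y_bar) ** 2 for y in acme)               # column L total
ssr = sum((p - y_bar) ** 2 for p in preds)              # column N total
sse = sum((y - p) ** 2 for y, p in zip(acme, preds))    # column O total
print(sst, ssr + sse)   # the total equals regression + error
```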
Next up are the two related measures correlation and R-squared.
The correlation, or correlation coefficient, sometimes simplified with the letter R, is a measure that quantifies the strength of the relationship between two variables.
According to the formula, we have all of the pieces calculated. The numerator sits in cell J20 and for the denominator we need the square root of the product of K20 times L20.
In cell H28, let's try it out with
=J20/SQRT(K20*L20)
giving us 0.96, a
match once again.
For the interpretation, let me grab a visual from the Stock Correlation Analysis tutorial earlier in this Playlist.
First, correlation measures always fit within the range from -1 to +1.
Second, the closer the correlation is to either end, the tighter the data points are to the line of best fit. Meaning, if this looked like a shotgun pattern, the correlation would be close to zero and there would be no meaningful linear relationship between x and y.
Third, on the direction of the line: if it points up and to the right, correlation is positive. This means that as one variable increases, the other variable increases, which is our case here. Keep in mind, too, that the strength of the correlation has nothing to do with how steep the line is.
If the line points up and to the left, then the correlation coefficient is negative. As an example, let's say you had an x-variable that measured the weight of a car, and a y-variable that measured gas mileage. As you increase the x-variable weight, you would see gas mileage decrease. Here the relationship is negatively correlated.
See that tutorial I mentioned for a more in-depth exploration of correlation.
Now the R-squared measure, also called the Coefficient of Determination, is correlation squared when you have one independent variable.
So in cell H29 it can look like =(J20/SQRT(K20*L20))^2 or simply =H28^2, your choice. The answer of 0.92 again matches, so let's spend a moment to interpret.
Because we're squaring correlation, the sign goes away and the scale falls on the range from 0 to 1. The relationship between the two is important because it isn't linear: as the correlation approaches +1 or -1, R-squared climbs toward 1, while for correlations below 0.5 in absolute value, R-squared falls below 0.25 and quickly approaches zero.
One way to think of R-squared is that it shows the fraction of the total variability in the dependent variable that is explained by the independent variable.
So in this regression, R-squared represents the percent of the total variation in Acme returns, explained by the Benchmark returns.
If this is sounding familiar, it should. Let's hop over to cell
X13 and if you take the explained part, SSR, divided
by the total, SST, with =N20/L20
what will that
give you? R-squared.
In fact, algebraically you can find R-squared three ways and may see any of these on test day.
Hopefully that provides a better way of understanding R-squared than just saying it is correlation squared, as many do.
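A short Python sketch confirms the routes agree: correlation squared and SSR over SST give the same R-squared (tiny differences versus the spreadsheet trace back to the two-decimal returns):

```python
import math

# Returns from the data table (two-decimal values).
bench = [-0.55, -0.93, 5.26, -2.31, 0.92, 6.56, -0.35, -4.01,
         -1.98, 1.62, 5.70, 7.76, -2.43, 0.00, -12.15]
acme  = [-0.88, -3.08, 4.95, -1.86, 1.40, 8.01, -1.02, -3.50,
         -0.92, 2.28, 7.56, 4.48, -2.75, -0.91, -16.88]
n = len(bench)
x_bar, y_bar = sum(bench) / n, sum(acme) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bench, acme))  # cell J20
sxx = sum((x - x_bar) ** 2 for x in bench)                         # cell K20
syy = sum((y - y_bar) ** 2 for y in acme)                          # cell L20

r = sxy / math.sqrt(sxx * syy)          # correlation, about 0.96
r_squared = r ** 2                      # about 0.92
ssr_over_sst = (sxy ** 2 / sxx) / syy   # the same number by algebra
```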
Our last measure is standard error, or the standard error of the estimate. It is akin to the standard deviation of the error terms.
Again we have all of the pieces for this formula, which looks kind of scary, so afterward I'll show you what I think is an easier way.
In cell H30, using this formula,
=SQRT((L20-((J20^2)/K20))/(R14))
, we
will get another match, of 1.69.
Going a different way: if you think about it, column O is already the error squared, right? So if we divide 37.04 by the degrees of freedom, 13 here, or n-2, and then take the square root, we get 1.69, an exact match.
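That shortcut in Python form, as one more side sketch; the result lands within rounding of 1.69:

```python
import math

# Returns from the data table (two-decimal values).
bench = [-0.55, -0.93, 5.26, -2.31, 0.92, 6.56, -0.35, -4.01,
         -1.98, 1.62, 5.70, 7.76, -2.43, 0.00, -12.15]
acme  = [-0.88, -3.08, 4.95, -1.86, 1.40, 8.01, -1.02, -3.50,
         -0.92, 2.28, 7.56, 4.48, -2.75, -0.91, -16.88]
n = len(bench)
x_bar, y_bar = sum(bench) / n, sum(acme) / n

slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bench, acme))
         / sum((x - x_bar) ** 2 for x in bench))
intercept = y_bar - slope * x_bar

# Sum of squared errors (column O), then divide by n - 2 and take the root.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(bench, acme))
std_error = math.sqrt(sse / (n - 2))   # about 1.69
```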
I don't want to spend a lot of time on interpretation of standard error here other than to say if you would like to see an application of it used for Portfolio Risk Decomposition, zip back one tutorial.
There we used this measure Standard Error for the Regression, but there are also standard errors for the intercept and slope, used to compute confidence intervals. Like I said, if you want another in-depth tutorial like this one the other measures from Data Analysis, please let me know in the Comments.
Okay, now you've seen how these 8 measures are calculated manually. That was a lot to cover in one sitting, so like I said, running through it a few times will help. You can always return here when you need a refresher.
Now, let's look at two other ways to pull regression measures, using functions, because it will be easier to let Excel calculate them for you.
Okay, here's the plan for Step 4.
First, we go through the pros and cons of each regression procedure in Excel: Data Analysis, regression functions and LINEST. Then I'll do a quick review of the latter two procedures here and point you to where you can learn more.
I touched on Data Analysis earlier and the clunky nature of the Excel ANOVA table. It really wasn't meant for running multiple regressions, say over many time periods, which is where the other two procedures come in handy.
That said, what is nice about Data Analysis is that it includes the ability to generate a variety of charts, so it is highly customizable. Output is somewhat standard as you often see Excel ANOVA output printed in textbooks.
Excel regression functions are well suited for performing rolling regressions over multiple time periods, coding them is quick and Excel offers a helpful formula completion feature. The downside here is that you can't customize settings for the regression, like whether you want to include or exclude the intercept.
Now the reason the LINEST
function
works so well is that it gives you the ability to customize the
regression and keep the output to just the one cell you are looking
for. The downside is the added complexity, as I'll show you.
| Advantages | Disadvantages |
---|---|---|
Data Analysis (Excel ANOVA) | - Charting - Easily customized - Widely used | - Clunky output - Not easily automated |
Excel Regression Functions | - Rolling regressions - Quickly coded - Formula completion | - Not easily customized - Not all 8 measures available |
The LINEST Approach | - Easily customized - One measure per cell | - Added complexity - Not all 8 measures available |
Okay, so with that as a backdrop, let's look at what is likely the
second procedure people learn to pull regression measures, with
Excel's built-in regression functions like
SLOPE
,
CORREL
and
STEYX
.
Using regression functions is similar to how it works in a programming language. Here you can quickly pull 5 of our 8 measures.
=SLOPE(y,x)
=INTERCEPT(y,x)
=CORREL(y,x)
=RSQ(y,x)
=STEYX(y,x)
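If you want to see exactly what these five Excel functions compute, here is a minimal plain-Python sketch of each one, checked against Set 1 from the practice tables later in this tutorial. The function names mirror Excel's, but the Python code itself is my own illustration of the textbook formulas, not anything built into Excel or Python.

```python
from math import sqrt

# Sample data: Set 1 from the practice tables later in this tutorial
xs = [1, -1, 2, 0, -3, -1]
ys = [2, -3, 5, 1, -2, -4]

def mean(v):
    return sum(v) / len(v)

def slope(y, x):
    # SLOPE: sum of deviation cross-products over sum of squared x-deviations
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def intercept(y, x):
    # INTERCEPT: y-bar minus slope times x-bar
    return mean(y) - slope(y, x) * mean(x)

def correl(y, x):
    # CORREL: covariance term over the product of deviation norms
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def rsq(y, x):
    # RSQ: for a single-variable regression, simply the squared correlation
    return correl(y, x) ** 2

def steyx(y, x):
    # STEYX: standard error of the estimate, sqrt(SSE / (n - 2))
    sst = sum((b - mean(y)) ** 2 for b in y)
    sse = sst * (1 - rsq(y, x))
    return sqrt(sse / (len(y) - 2))

print(slope(ys, xs))      # mirrors =SLOPE(y,x)
print(intercept(ys, xs))  # mirrors =INTERCEPT(y,x)
```

Keying the same Set 1 data into Excel and comparing the five functions against these results is a quick way to convince yourself the formulas match.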
If you want to see the Excel regression formulas used here, again, access that web page.
Now that you know why the LINEST function is worth learning, let's zip through it.
In column J sits each of the 6 available calculations using the LINEST function. When used alone it too dumps an output block; however, if you combine it with the INDEX function, you can pull just the measure you're looking for.
LINEST is an array function. To generate its 5-row by 2-column output block of 10 measures from a single-variable regression, select a 5x2 output range, type =LINEST(y,x,TRUE,TRUE) for our data here, then use the Ctrl+Shift+Enter keystroke combination. Excel will populate the whole block at once.
You can tell this is an array calculation because of the curly braces that surround the formula. Notice too that the same formula sits in all cells. Also, cells in an array formula can't be changed individually, so if you make a mistake, you'll need to erase the formula and start over.
| | W | X |
|---|---|---|
| 4 | =LINEST() Output | |
| 5 | 1.15 | -0.45 |
| 6 | 0.09 | 0.44 |
| 7 | 0.92 | 1.69 |
| 8 | 160.30 | 13 |
| 9 | 456.69 | 37.04 |
Excel's INDEX function is similar to indexing in programming languages. We use it to pull data from an array arranged in rows and columns.

Here we wrap the LINEST function inside an INDEX function, thereby telling Excel that we want individual items from the regression instead of the whole output block. We just need to specify the coordinates.
So rather than taking time to explain each formula individually, you can find them all on that web page.
=INDEX(LINEST(y,x,TRUE,TRUE),1,1) returns the slope
=INDEX(LINEST(y,x,TRUE,TRUE),1,2) returns the intercept
=INDEX(LINEST(y,x,TRUE,TRUE),5,1) returns SSR
=INDEX(LINEST(y,x,TRUE,TRUE),5,2) returns SSE
=INDEX(LINEST(y,x,TRUE,TRUE),3,1) returns R-squared
=INDEX(LINEST(y,x,TRUE,TRUE),3,2) returns the standard error of the estimate
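To demystify those coordinates, here is a short plain-Python sketch that rebuilds LINEST's entire 5x2 block from the textbook formulas. The function name linest_block is my own, not an Excel or Python built-in, and note that Python indexes from zero, so Excel's INDEX(...,3,1) corresponds to block[2][0] here.

```python
from math import sqrt

def linest_block(y, x):
    # Recreate LINEST's 5x2 output block for a one-variable regression.
    # Row 1: slope, intercept       Row 2: their standard errors
    # Row 3: R-squared, SE of y     Row 4: F statistic, degrees of freedom
    # Row 5: SSR, SSE
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    b1 = sxy / sxx                       # slope
    b0 = my - b1 * mx                    # intercept
    ssr = b1 * sxy                       # explained (regression) sum of squares
    sse = sst - ssr                      # residual sum of squares
    df = n - 2                           # degrees of freedom
    sey = sqrt(sse / df)                 # standard error of the y estimate
    se1 = sey / sqrt(sxx)                # standard error of the slope
    se0 = sey * sqrt(sum(a ** 2 for a in x) / (n * sxx))  # std error of intercept
    f = ssr / (sse / df)                 # F statistic (1 regression df)
    return [[b1, b0], [se1, se0], [ssr / sst, sey], [f, df], [ssr, sse]]

blk = linest_block([2, -3, 5, 1, -2, -4], [1, -1, 2, 0, -3, -1])
print(blk[0][0])   # like =INDEX(LINEST(y,x,TRUE,TRUE),1,1), the slope
print(blk[2][0])   # like =INDEX(LINEST(y,x,TRUE,TRUE),3,1), R-squared
```

One sanity check this makes obvious: the two entries in the bottom row always add up to SST, because SSR + SSE = SST by construction.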
For a more thorough exploration see my tutorial on Excel LINEST and INDEX functions.
So hopefully this exploration of Excel regression functions has proven helpful. As you advance with Excel, these can save you a great deal of time.
Now for Step 5, let's wrap up with a summary, some homework and tips for further study.
By way of summary, we covered the mathematics behind the eight most important measures from a linear regression using Excel's Data Analysis (ANOVA) output.
We built visuals and saw two other ways to customize and access regression measures on the fly that conveniently can be copied over ranges of cells and charted to show rolling regressions. This type of reporting can be very impactful.
Now, to keep advancing and commit this to memory, I suggest taking a simple data set of your own, preferably one with small integers, create an empty table and walk through the calculations by hand. When you are done, check your work with one of Excel's three approaches covered here.
Then, you can also try it out with these three simple data sets I provided and after that you will be certain to ace your exam. This was one way I prepared for the CFA exam and now I find I don't have to pause to think about how to interpret regression measures, because they've been committed to memory.
| Observation | Set 1 x | Set 1 y | Set 2 x | Set 2 y | Set 3 x | Set 3 y |
|---|---|---|---|---|---|---|
| 1 | 1 | 2 | -3 | 1 | 2 | -2 |
| 2 | -1 | -3 | -2 | 2 | 3 | 4 |
| 3 | 2 | 5 | 2 | -4 | -2 | 3 |
| 4 | 0 | 1 | 3 | -3 | -5 | 0 |
| 5 | -3 | -2 | -1 | 2 | 0 | 0 |
| 6 | -1 | -4 | 5 | -3 | 2 | -2 |
Also, if you don't have experience with statistical programming languages, I'd suggest repeating this in either R or Python. They're free and in many surveys they rank in the top 10 programming languages to learn.
Oh, and if your numbers and mine here aren't spot on, don't sweat it; it could be the result of rounding on your end or mine.
Also, this resource will sit here as a reference so bookmark the page or video, because we all need a refresher.
As for feedback, like I said this is my 250th tutorial and I'm trying to improve each time, so kindly leave a comment with constructive criticism, or just a few words of encouragement.
If I helped you today, likes and referrals to friends or colleagues are ways you can help me out. Plus if you Subscribe you won't miss the second go around for other measures like t-stat, P-value, confidence levels and even the F-Stat for a multi-variable regression. Again, chime in if you'd like to see that.
Also if you have a different way of explaining a tricky concept that may help someone else, please share it in the Comments section so others can learn from your insights.
In the next episode we return to the structure and data set in Quant 101 to cover the Security Characteristic Line, then Portfolio Optimization.
Oh yeah, if you're studying for a test, let me know if you aced it.
Please join us any time and have a nice day.
To learn faster, make sure you are watching the videos because, unlike our more code-based tutorials, much of this you will need to follow along in Excel.
I encourage you to check out our YouTube Channel for other opportunities to advance your skills.