Videos are available in one organized Quant 101 Playlist on YouTube.
Stock return frequency distributions and histograms in Excel (18:05)
Welcome. Today's goal is to generate a histogram chart in Excel based on data sitting in a frequency distribution table.
I'm Paul, and I'm a big believer in the saying "a picture is worth a thousand words," but at the same time I know how long it can take to create a compelling picture in Excel.
So here we will walk through the steps, saving you time.
In this Quant 101 Playlist our data set is monthly returns on stocks for a 5-year period, so 60 months, and like our other financial modeling tutorials, we will use the data set downloaded in a previous tutorial. So if you want sample data to play with, head back to the System Setup tutorial.
Some people benefit from reading the text while watching the video, so we have both, a web page with everything I say here and a YouTube video Playlist for the visuals. You will also find in the video's Description a link to the web page and an outline so you can skip straight to the section you are interested in.
Let's go over what we will cover here.
First, we will take a quick peek at our data set.
Second, we will walk through a 7-step process to create a frequency distribution table using Excel's array calculation method.
Third, we create a histogram from that table.
Fourth, we will turn to interpretation and cover why visualization is important, especially for stock returns.
Fifth, we will make observations about this histogram from a portfolio manager's perspective as it is a good first step in evaluating stock risk.
And in our next episode we will talk about the history of stock markets before a further exploration of stock risk in Chapter 3.
Okay, so our data set includes seven columns of data, but in this tutorial we will use only four, columns C, D, E and F, representing returns for four stocks over 60 monthly periods.
We won't need dates here, and will ignore columns marked Market and Rf for now.
|   | C | D | E | F |
| --- | --- | --- | --- | --- |
| 6 | MSFT | EBAY | ABT | MRK |
| 7 | 0.056175 | 0.089087 | 0.083536 | 0.062066 |
| 8 | -0.037544 | 0.094070 | 0.096480 | -0.044689 |
So our range here will be C7 to F66.
Now for Step 2, let's cover how to create the frequency distribution table. In our case, a frequency distribution shows, in table form, how each stock's monthly returns fall into a set of ranges. So you can think of this table as four frequency distributions, one for Microsoft in column F, then another each for eBay, Abbott Labs and Merck.
|   | E | F | G | H | I |
| --- | --- | --- | --- | --- | --- |
| 7 | Return | MSFT | EBAY | ABT | MRK |
| 8 | <-10% | 2 | 9 | 2 | 6 |
| 9 | >-10% and <-5% | 8 | 4 | 3 | 5 |
| 10 | >-5% and <0% | 16 | 13 | 20 | 16 |
| 11 | >0% and <5% | 22 | 15 | 23 | 19 |
| 12 | >5% and <10% | 11 | 7 | 11 | 8 |
| 13 | >10% | 1 | 12 | 1 | 6 |
| 14 | Total | 60 | 60 | 60 | 60 |
We have all 60 returns for each company, placed into one of the 6 bins, or buckets. The number shows the count of returns in each bin. The first bin starts with negative returns, below -10%. The sixth bin has positive returns, above 10%.
So Microsoft had eight monthly returns between -10% and -5%. And eBay provided a return of greater than 10% in 12 of the 60 months over this historical period.
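As a quick aside, if you ever want to sanity-check one of these counts outside Excel, a filter and a count is all it takes. Below is a minimal Python sketch, assuming the 60 monthly returns have been exported to a hypothetical CSV named returns.csv with one column per ticker; note that I follow Excel's FREQUENCY convention, where a return exactly equal to a breakpoint lands in the lower bucket.

import pandas as pd

# Hypothetical export of the Returns tab: 60 monthly decimal returns per stock.
returns = pd.read_csv("returns.csv")

# Second bucket from the table above: greater than -10%, up to and including -5%.
in_bucket = (returns["MSFT"] > -0.10) & (returns["MSFT"] <= -0.05)
print(in_bucket.sum())  # should match the MSFT count of 8 shown in the table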
As is common in Excel, there are several ways to do things. You could use the tools under the Data Analysis menu item to generate a frequency distribution, but we won't do that here.
Instead, because we want the results to sit right on this page and to customize how the table looks, we will use the function method for its flexibility.
First of all, we need to estimate the breakpoints, which can require a bit of trial and error. Basically, the way I do it is by scanning the data on the Returns tab.
An easy way to do this would be to copy the data to another location, sort it by return, from low to high, and then search for logical breakpoints.
Of course, here I have created the breakpoints for you, so I won't repeat that exercise, but you should aim for logical and clean buckets, meaning round numbers. Here you can see I used 5% increments because I think the visual depiction tells an interesting story and can easily be read and interpreted by anyone looking at this spreadsheet for the first time, like your boss, upper management, or even clients.
Notice how the histogram loosely takes the shape of a bell curve? For other data sets you won't know this up front; it is specific to our data set here, stock returns, which over thousands of observations take this shape. As we press forward with this series, this shape will simplify the statistical measures we use, which is very convenient.
Other data sets may, of course, not look like this.
Okay, I mentioned we have four frequency distributions here, one for each company. So the output of the =FREQUENCY() function for Microsoft would sit in the range F8:F13. For eBay it will be G8:G13, and so on.
Around that, we put labels and add formatting so the table is easy to read and interpret. So column E and row 7 are simply for aesthetics, not for actual data output. I added borders and shading as well. Again, a picture is worth a thousand words, so think about the needs and sophistication level of your intended audience.
Of course, I've already completed this table and it was used for the histogram, so I won't write over it; instead, I'll walk through an example on the right so you can see how it works.
Before we do that, I want to show you one thing. If I click on cell F8 you will see curly braces surrounding the formula.
{=FREQUENCY(Returns!C7:C66,Distributions!K8:K12)}
If you haven't seen this before, it is called an array calculation which performs many calculations at once, rather than cell-by-cell, as you may be accustomed to in Excel. To illustrate, whether we inspect F8, F9, F10, or any cell in that range, it shows the same formula. We'll come back to this in a moment, but I wanted to introduce array calculations in case you haven't seen them.
Next, we need to set up what is called a bins_array, which is a completely separate table. I put it in column K.
|   | K |
| --- | --- |
| 7 | bins_array |
| 8 | -0.1 |
| 9 | -0.05 |
| 10 | 0 |
| 11 | 0.05 |
| 12 | 0.1 |
It is the bins_array that Excel uses to screen all 60 returns and drop each one into a bin; Excel then keeps score for us. Imagine a table with thousands of records, and you can see what a time-saver this can be.
You can, of course, set up the bins_array off the screen somewhere, or on a different page, if it annoys you or is a distraction to your audience.
The breakpoints in the bins_array act as dividing lines between the buckets in the output table. Excel counts a return in the lower bucket when it is less than or equal to a breakpoint, and in the next bucket up when it is greater.
So the -0.1 refers to a -10% return and is the dividing line between the first bin and the second bin. This throws some people, so take a few seconds to think about it. As a rule, the number of rows in the bins_array, 5 here, is always 1 less than the number of buckets, in this case 6 buckets. Remember, the bins_array simply defines the dividing lines.
Again, as mentioned earlier, you have flexibility in how you define this table, which in the end determines how the histogram looks. Also, note the bins_array must be in decimal format, just like our data.
Next, the data_array is simply the set of data points that, for us, sit on the Returns tab, again one column for each stock. Here, for Microsoft, it is C7:C66.
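If you like to double-check Excel's output in code, here is a minimal Python sketch that mirrors what =FREQUENCY() produces for one column, again assuming the hypothetical returns.csv export described above; numpy's searchsorted reproduces Excel's rule that a return equal to a breakpoint falls into the lower bucket.

import numpy as np
import pandas as pd

# Hypothetical export of the Returns tab, with the MSFT column as the data_array.
msft = pd.read_csv("returns.csv")["MSFT"].to_numpy()

# The same five breakpoints that sit in Distributions!K8:K12.
breakpoints = np.array([-0.10, -0.05, 0.0, 0.05, 0.10])

# Assign each return a bucket index from 0 to 5; side="left" sends a return
# exactly equal to a breakpoint into the lower bucket, like Excel's FREQUENCY.
bucket_index = np.searchsorted(breakpoints, msft, side="left")
counts = np.bincount(bucket_index, minlength=len(breakpoints) + 1)

print(counts)        # six counts, one per bucket, matching the output block F8:F13
print(counts.sum())  # should equal 60, the number of monthly returns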
At this point, I'll show you how to create the output for Microsoft and let you create the rest on your own.
The output block is where we will place the output from the =FREQUENCY() function. First, I'll put in a label for Microsoft. Then our output block goes in the range of 6 cells below that. Here, for now, we will highlight that range.
The output block is always one row greater than the bins_array, so five breakpoints create six bins.
Next, we input the function, =FREQUENCY, and once we get to the opening parenthesis, Excel shows us what we need, which is two arrays, the data_array and the bins_array.
For the data_array, select or type in the range Returns!C7:C66, followed by a comma.
After that, we input the bins_array by typing or selecting cells Distributions!K8:K12, then the closing parenthesis.
Now, for the tricky part. We have to tell Excel this is an array calculation type, again doing a bunch of calculations at once.
To do this, we need to close it out with a special keystroke combination, by holding down Ctrl and Shift, then hitting Enter.
Excel populates the output block with the count, or frequency, of observations in each bucket.
Let me make a few comments before moving on. First, we will perform other array calculations later when we multiply matrices and collect regression outputs, so for more advanced Excel work, array calculations are quite common.
Second, when you go to one of the cells and try to edit it, say if you wanted to change the data_array range, Excel will warn you that you "cannot change part of an array". This is logical because, as mentioned, all cells in the output block have the same formula, so Excel won't allow you to edit just one cell at a time.
So, if you make an error or want a different range for your data_array, just delete the whole output block and start over.
Let's say I wanted to use a shorter data_array, from Returns!C7:C65, with the same bins_array. Enter it, finish with that keystroke combination, and now we have binned observations for 59 periods instead of 60.
And last, let's verify the totals. This is something I like to do, because I think it makes the presentation look better and helps you spot mistakes.
To show you how this works, go to an empty cell and use the =SUM() function, typing or selecting the output block and hitting Enter.
Here we don't need to use the array keystroke combination because =SUM() is a standard Excel function sitting in one cell instead of across multiple cells.
So here we have 59 data points in the data_array. We removed the last data point for Microsoft, a return of roughly 4.34%, which means that instead of 22 observations in the range greater than 0% and less than 5%, we have a count of 21.
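Here is the same verification in code, as a minimal self-contained Python sketch using the hypothetical returns.csv export from earlier; the helper mirrors Excel's FREQUENCY, and the totals should come out to 60 for the full data_array and 59 for the shortened one.

import numpy as np
import pandas as pd

# Hypothetical export of the Returns tab, as in the earlier sketches.
msft = pd.read_csv("returns.csv")["MSFT"].to_numpy()
breakpoints = np.array([-0.10, -0.05, 0.0, 0.05, 0.10])

def frequency(data, bins):
    # Mirror Excel's FREQUENCY: a value equal to a breakpoint goes to the lower bucket.
    return np.bincount(np.searchsorted(bins, data, side="left"), minlength=len(bins) + 1)

print(frequency(msft, breakpoints).sum())       # 60, the full data_array
print(frequency(msft[:-1], breakpoints).sum())  # 59, after dropping the last month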
Okay, if you are following along in your version of Excel, I think you have enough to get going to complete the frequency distribution that matches my output table in cells F8:I13. We will use this for the histogram, so pause here if you'd like to finish that off.
Recall, you will need to do a frequency distribution for each of the remaining stocks, eBay, Abbott Labs and Merck, selecting the appropriate ranges from the Returns data tab.
Next, for the easy part, we will create a histogram using standard charting and put the resulting chart right in the spreadsheet. After that we can move it around and customize its appearance.
Start by selecting the range, including the labels but not the totals, so in this case cells E7:I13. Then, under the Insert menu item, select the Column chart type. For personal preference, I've selected the first choice in the 2-D section for a two-dimensional chart, and there you have it.
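If you ever need a similar visual outside Excel, here is a minimal Python sketch that rebuilds the chart as a grouped column chart with matplotlib, using the counts from the frequency distribution table above; the library choice, bucket labels and styling are my own, not a reproduction of the Excel chart.

import numpy as np
import matplotlib.pyplot as plt

# Counts taken from the frequency distribution table above (cells F8:I13).
labels = ["<-10%", "-10% to -5%", "-5% to 0%", "0% to 5%", "5% to 10%", ">10%"]
counts = {
    "MSFT": [2, 8, 16, 22, 11, 1],
    "EBAY": [9, 4, 13, 15, 7, 12],
    "ABT":  [2, 3, 20, 23, 11, 1],
    "MRK":  [6, 5, 16, 19, 8, 6],
}

x = np.arange(len(labels))  # one group of bars per return bucket
width = 0.2                 # width of each stock's bar within a group

fig, ax = plt.subplots(figsize=(9, 5))
for i, (ticker, freq) in enumerate(counts.items()):
    ax.bar(x + i * width, freq, width, label=ticker)

ax.set_xticks(x + 1.5 * width)  # center the ticks under each group of four bars
ax.set_xticklabels(labels)
ax.set_xlabel("Monthly return bucket")
ax.set_ylabel("Number of months")
ax.set_title("Frequency distribution of monthly returns")
ax.legend()
plt.tight_layout()
plt.show()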
Now for editing charts, as mentioned this part can take some time, and you have to make the trade-off between your free time, your graphic design skills and the importance of the tables and charts for your intended audience. If this is just for you, maybe the standard output is sufficient. But if this is for your manager, or your manager's manager then you may stand out by making this look appealing.
For me, I spent about 10 minutes customizing the color scheme and adding a title plus x and y axis labels. Fine-tuning this isn't our focus here, but to do that you would right-click on the chart and several choices show up on the menu bar, at least in my version of Excel. Yours may be a bit different.
Here you can customize Design, Layout and Format. I suggest gaining a comfort level with making this chart look nicer than the standard output.
Again, editing charts can be a major time suck. In my experience, generating and analyzing data is a fairly quick process; however, presenting it in a way that gets the point across to the viewing audience can be exhausting. That said, if you are sticking around for other tutorials in this Playlist, I won't be spending time on customizing the visual presentation, so that's up to you.
Let's move on to Step 4 and talk about visualizing this data relative to stock returns and analyze this a bit like a portfolio manager or successful financial model builder might.
Besides improving with Excel, why are we spending so much time on visualization?
In my view, visualizing data should be the first step of data analysis. If you are a visual learner, like most of us, it may be the most important step.
I can think of 5 solid reasons why visualization is important.
First, it helps you spot potential errors. Every once in a while you see a data point that just smells a little funny. So instead of scanning rows and columns of data, with a chart the potential error just pops out.
It is easy to perform calculations without spot checking; in fact many financial model builders do just that. They have hundreds or thousands of calculations going on at one time, but often don't check the distributions to see if errors would make their assumptions moot. Errors can nullify the greatest of insights.
Second, visualization allows you to identify trends, or peaks and valleys that may be interesting for further study. Also, outliers may skew results, requiring further digging.
Specific to stocks, let's say a company's price-to-earnings ratio changed over time from 10 to 20, then back to 10. The fundamental analyst would want to look into that and these trends become more evident in a chart.
Third, do the distributions built into your financial models fit the data?
Later, when statistical measures come into play, like confidence intervals, do these measures fit within your expectations? These expectations are important because they may trigger other calls to action, like using a different statistical procedure depending on whether the data follows a normal distribution or a lognormal distribution. We will see more visual examples in Chapter 4, when we cover correlation and regression.
We also know that many people dive right into the table of statistics, as if they were looking at a box score the day after a baseball game. However, smartly created charts can pack in so much more information. And having 'information' is better than just having 'data'.
Fourth, visualizing the distribution also helps us see the variability of returns, or risk. We will add many layers to our understanding of risk, using statistical measures, but sometimes visualizing the distribution gets the point across better, whether it is for yourself, your boss or paying clients.
Fifth, the trend in the career marketplace is changing. It used to be that you could get away with throwing statistics at people, and they would think you're smart and just believe what you say.
In my view, this changed in the post dot-com bubble era after 2004. This period corresponds with what is called Web 2.0, with the advent of user-generated content, interactive websites, mobile devices, Apps, and more focus on usability like in social-media apps. This, to me, raised the bar for data analysts.
So it is incumbent on us, as financial model builders, to not only know what we're doing with the numbers but to make the presentation compelling.
Now, let's move on to Step 5 and make observations about our four stocks to tie this all together. Later we will review statistical measures that should confirm what we see here. If they don't, then we either made a mistake with the frequency distribution, the histogram, or when calculating the statistical measures.
You may want to pause here and look at the histogram and jot down your own observations.
We often use the statistical measure standard deviation as a way to convey risk, as it is a measure of dispersion around a central point. It and a histogram should rhyme. We will explore standard deviation fully in the tutorial after the next one, so don't worry about the calculation just yet, but if I asked you to rank-order our four stocks by dispersion, what would you say?
Higher risk stocks, using standard deviation, have more returns farther away from the center. So here, just using the naked eye we can see that Abbott Labs has the highest frequency of observations near the middle, and fewer in the tails, so I would say it ranks as the lowest risk stock.
After that, I would say Microsoft, followed by Merck. Notice how the bars for eBay show observations out in the tails at a much higher frequency? So with that, I would say eBay ranks highest in terms of standard deviation.
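If you would like to preview the ranking we just eyeballed, here is a minimal Python sketch that computes the monthly standard deviation for each stock and sorts the result, again assuming the hypothetical returns.csv export with one column per ticker; the actual figures will come from your own data, and we will walk through the standard deviation calculation properly in a later tutorial.

import pandas as pd

# Hypothetical export of the Returns tab: 60 monthly decimal returns per stock.
returns = pd.read_csv("returns.csv")[["MSFT", "EBAY", "ABT", "MRK"]]

# Sample standard deviation of monthly returns, one value per stock,
# sorted from lowest dispersion (lowest risk) to highest.
monthly_std = returns.std().sort_values()
print(monthly_std)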
So with a histogram we can pick up these differences quickly, which was our whole goal here.
By way of summary, we've worked through the steps required to get our first look at a distribution of returns, an important topic for future discussions. We started with a frequency distribution in tabular form, then we created a chart called a histogram.
Later, we will layer on the purely statistical measure standard deviation, which provides additional information to analyze, but this is a good first step.
In the next episode we will cover the history of stock markets from both academic and practical viewpoints, also addressing different investment philosophies, which in the end will help us build better financial models.
Please join us any time and have a nice day.
Make sure you supplement this text-based tutorial with our YouTube videos as it will help these concepts sink in.