FactorPad
Faster Learning Tutorials

Data Science in a Spreadsheet?

The pros and cons of Excel versus a statistical programming language like R or Python for Data Science.
  1. Pros and cons - Discuss spreadsheets with data science in mind.
  2. Terminal - Take notes in nano on the advantages and disadvantages of spreadsheets. Cover statistical programming languages as well.
  3. The punch line - See if a spreadsheet will do.
  4. Our stack - Detail our software stack.
Paul Alan Davis, CFA, January 19, 2017
Updated: August 10, 2018
Now let's see if a spreadsheet will do for Data Science.



Spreadsheets versus Programming for Data Science

Beginner

Learn distinctions between data analysis in Excel versus data science in a statistical programming language.

Video Tutorial

Data Science in a Spreadsheet? (4:53)

Videos can also be accessed from our Full Stack Playlist 1 on YouTube.

Code Examples and Video Script

Welcome. Today's question: Is programming required for Data Science? The pros and cons of spreadsheets.

I'm Paul, and I'll admit it. I have an affliction. Maybe you have it too. My background in Finance taught me to run and visualize data myself, and my tool of choice was always a spreadsheet.

In this video (tutorial), we will talk about spreadsheets and how far you can go with them in Data Science.

To do that, we'll start with the second part of the question and return to the first.

We'll take notes in a text editor called nano at the Linux Terminal, then return for the punch line: is programming required? Then we'll conclude with a few words on our Data Science software stack.

Step 1 - Pros and Cons of Spreadsheets

Okay, if you are not familiar with the Linux command line, don't worry, I'll come back and explain each command in other videos (tutorials).

Step 2 - The Terminal

Here I am creating a file to organize four pros and four cons for each.

paul@fullstack:~$ cd notes
paul@fullstack:~/notes$ nano video0003.txt
The advantages and disadvantages of spreadsheets

Okay, the advantages (of spreadsheets). First, they're pervasive, meaning they're cheap or free and many people have memorized the functions.

Second, they're easy to learn. The grid and cell format helps you draw relationships and make quick calculations.

Third, you can visualize and edit data directly, formatting colors and fonts, making nice presentations.

Fourth, many statistical operations are included, plus links to third party software.

  GNU nano 2.2.6          File: video0003.txt

Spreadsheet Advantages
- pervasive
- easy to learn
- easily customized
- formulas included

Spreadsheet Disadvantages
- errors
- auditing
- file size
- visualization by hand

^G Get Help  ^O WriteOut  ^R Read File  ^Y Prev Page  ^K Cut Text    ^C Cur Pos
^X Exit      ^J Justify   ^W Where Is   ^V Next Page  ^U UnCut Text  ^T To Spell

Now, disadvantages. First, they're prone to errors. Studies say 80% of spreadsheets have errors, and this is particularly true for those shared across an organization, because users have varying levels of knowledge, meaning someone can be a user one minute and a programmer the next.

Second, auditing spreadsheets is costly and time-consuming, because you'd have to inspect every cell, right?

Third, file size. Spreadsheets contain repetitive information; for example, 1000 records may have 1000 independent calculations.
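To make that contrast concrete, here is a minimal Python sketch, not taken from the video, of how a program handles the same situation: the rule is written once and applied to every record, so there is one calculation to store and inspect instead of 1000. The records and the order_value function are hypothetical, made up for illustration.

```python
# Hypothetical data: 1000 records of (price, quantity).
records = [(10.0 + i * 0.01, i % 5 + 1) for i in range(1000)]


def order_value(price, quantity):
    """The single rule that every record shares."""
    return price * quantity


# One definition of the rule, applied to all 1000 records.
values = [order_value(price, quantity) for price, quantity in records]

print(len(values), "values computed from one definition of the rule")
```

If the rule changes, you edit order_value in one place, rather than copying a new formula down a column.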

Fourth, the advantage listed earlier about visualization is a double-edged sword, as charting is often painstakingly done by hand.

The advantages and disadvantages of programming

While we're at it, let's repeat this same exercise for programming. First, version control. It means you can hold everything constant, roll back updates and maintain code in one working state called a release, rather than being in a constant state of change, which helps you catch bugs.

Second is speed. Statistical programs, for example, allow you to run calculations and visualize huge data sets, so you can proceed with your analysis.

Third, repetition, meaning you can change variables and run thousands of cases, like in a Monte Carlo simulation.
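As a rough illustration of that kind of repetition, here is a small Monte Carlo sketch in Python using only the standard library. It is not from the video, and everything in it is an assumption chosen for the example: the starting value of 100, the 5% average return, the 15% volatility, the 10-year horizon and the function names.

```python
import random


def simulate_final_value(start, mean_return, volatility, years, rng):
    """Grow a starting value through `years` randomly drawn annual returns."""
    value = start
    for _ in range(years):
        value *= 1 + rng.gauss(mean_return, volatility)
    return value


def run_cases(n_cases, seed=42):
    """Run n_cases independent simulations; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [
        simulate_final_value(100.0, 0.05, 0.15, 10, rng)
        for _ in range(n_cases)
    ]


results = run_cases(5000)
average = sum(results) / len(results)
print(f"cases: {len(results)}, average final value: {average:.2f}")
```

Changing a variable, say the volatility, and rerunning thousands of cases is a one-line edit here, where a spreadsheet would need thousands of rows or a macro.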

And fourth, programming is really superior for building logic.
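For a sense of what building logic looks like, here is a short Python sketch, again not from the video. The thresholds, labels and sample returns are hypothetical. In a spreadsheet the same rule would typically be a nested IF formula repeated in every row.

```python
def classify_return(pct_return):
    """Label a percentage return using simple conditional logic."""
    if pct_return > 5.0:
        return "strong"
    elif pct_return >= 0.0:
        return "flat"
    else:
        return "loss"


# Hypothetical sample of percentage returns.
returns = [7.2, 0.4, -3.1, 5.0]
labels = [classify_return(r) for r in returns]
print(labels)
```

Each branch sits on its own line with a name attached, which makes the logic easier to read and test than a formula buried in a cell.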

  GNU nano 2.2.6          File: video0003.txt

Programming Advantages
- version control
- speed
- looping
- conditional logic

Programming Disadvantages
- cost of labor
- less standardization
- cost of time
- less visual

^G Get Help  ^O WriteOut  ^R Read File  ^Y Prev Page  ^K Cut Text    ^C Cur Pos
^X Exit      ^J Justify   ^W Where Is   ^V Next Page  ^U UnCut Text  ^T To Spell

As for disadvantages, because programming is more difficult to learn, the cost of labor is higher.

Second, the variety of programming interfaces results in less standardization, you know, with fewer experts focused on the improvement of one program. Plus the documentation suffers as a result. In other words, a spreadsheet was built for the masses while statistical analysis programs were built with scholars in mind.

Third, the cost of implementation with programming is higher.

And fourth, you have to let go of the GUI (graphical user interface) and use text editors, like nano here, which takes me back to my problem of having to view and touch the data.

Step 3 - The Punch Line

So let's go back for the punch line. Is programming required for Data Science?

paul@fullstack:~/notes$ ls
video0002.txt  video0003.txt
paul@fullstack:~/notes$ cat video0003.txt
Spreadsheet Advantages
- pervasive
- easy to learn
- easily customized
- formulas included

Spreadsheet Disadvantages
- errors
- auditing
- file size
- visualization by hand

Programming Advantages
- version control
- speed
- looping
- conditional logic

Programming Disadvantages
- cost of labor
- less standardization
- cost of time
- less visual

I'll give one big, emphatic yes. In this Playlist, we're heading from here (Beginner) to here (Advanced).

For building logic into models and algorithms, including advanced techniques like machine learning and artificial intelligence, frankly the spreadsheet won't do.

Step 4 - Our Stack

So it's time to let go of that spreadsheet. This picture here (below) is our software stack, which we will add to, and it shows one path you can take.

  • Client : HTML, CSS, JavaScript
  • Software : Python Scientific Stack
  • Data : PostgreSQL, MySQL
  • OS : Linux (command line), Debian

Summary

I suggest watching video (tutorial) 1 for an outline, and please join us in video 4 for the answer to: "Which hardware is most used in Data Science? Linux, Mac or PC?"

Have a nice day.


What's Next?

If you learned something then please consider subscribing to our YouTube Channel. Every subscription helps.

  • To access all tutorials, click Outline.
  • For a definition of Data Science and success factors, click Back.
  • To see which operating system is most common in data science, click Next.



