/ factorpad.com / tech / full-stack / programming-data-science.html
Beginner
Learn distinctions between data analysis in Excel versus data science in a statistical programming language.
Videos can also be accessed from our Full Stack Playlist 1 on YouTube.
Data Science in a Spreadsheet? (4:53)
Welcome. Today's question: Is programming required for Data Science? The pros and cons of spreadsheets.
I'm Paul, and I'll admit it. I have an affliction. Maybe you have it too. My background in Finance taught me to run and visualize data myself, and my tool of choice was always a spreadsheet.
In this video (tutorial), we will talk about spreadsheets and how far you can go with them in Data Science.
To do that, we'll start with the second part of the question and return for the first.
We'll take notes in a text editor called nano at the Linux Terminal, return for the punch line (is programming required?), then conclude with a few words on our Data Science software stack.
Okay, if you are not familiar with the Linux command line, don't worry, I'll come back and explain each command in other videos (tutorials).
Here I am creating a file to organize 4 pros and 4 cons each for spreadsheets and for programming.
Okay, the advantages (of spreadsheets). First, they're pervasive, meaning they're cheap or free and many people have memorized the functions.
Second, they're easy to learn. The grid and cell format helps you draw relationships and make quick calculations.
Third, you can visualize and edit data directly, formatting colors and fonts, making nice presentations.
Fourth, many statistical operations are included, plus links to third party software.
Now, disadvantages. First, they're prone to errors. Studies say 80% of spreadsheets have errors, and this is particularly true for those shared across an organization, because users have varying levels of knowledge and someone can be a user one minute and a programmer the next.
Second, auditing spreadsheets is costly and time-consuming, because you'd have to inspect every cell, right?
Third, spreadsheets contain repetitive information. For example, 1000 records may have 1000 independent calculations.
And fourth, the advantage listed earlier about visualization is a double-edged sword, as charting is often painstakingly done by hand.
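To make the point about repetitive information concrete, here is a minimal sketch in Python. The records and the `line_total` rule are made up for illustration and are not from the video. It shows the same 1000 calculations driven by one definition of the rule, so there is a single place to audit instead of 1000 cells.

```python
# Hypothetical data: 1000 records of (price, quantity), invented for this sketch.
records = [(10.0 + i * 0.01, i % 7 + 1) for i in range(1000)]


def line_total(price, quantity):
    """The one rule a spreadsheet would copy into every row."""
    return price * quantity


# Apply the single rule to every record instead of keeping 1000 separate formulas.
totals = [line_total(price, quantity) for price, quantity in records]
grand_total = sum(totals)

print(len(totals), round(grand_total, 2))
```

If the rule changes, say to add a tax rate, you would edit `line_total` once and rerun, rather than copying a new formula down a column.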
While we're at it, let's repeat this same exercise for programming, starting with the advantages. First, version control means you can hold everything constant, roll back updates and maintain code in one working state called a release, rather than being in a constant state of change, which helps you catch bugs.
Second is speed. Statistical programs, for example, allow you to run calculations and visualize huge data sets, so you can proceed with your analysis.
Third, repetition, meaning you can change variables and run thousands of cases, like in a Monte Carlo simulation.
And fourth, programming is really superior for building logic.
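As a rough illustration of the repetition point, the sketch below runs thousands of cases of a toy Monte Carlo simulation using only the Python standard library. Everything in it is an assumption made for the example: the function names, the starting value, the 5% mean return, the 15% volatility and the choice of normally distributed yearly returns. It is one simple way to set this up, not a recommended model.

```python
import random
import statistics


def simulate_final_value(start, mean_return, volatility, years, rng):
    """Compound one randomly drawn yearly return per year (toy model)."""
    value = start
    for _ in range(years):
        value *= 1.0 + rng.gauss(mean_return, volatility)
    return value


def run_cases(n_cases, seed=42, **params):
    """Run many independent cases; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return [simulate_final_value(rng=rng, **params) for _ in range(n_cases)]


# Hypothetical inputs: change any of them and rerun all 5000 cases at once.
results = run_cases(5000, start=100.0, mean_return=0.05, volatility=0.15, years=10)

print(round(statistics.mean(results), 2), round(statistics.median(results), 2))
```

Changing a variable means editing one argument and rerunning, which is the kind of repetition that gets painful to manage by hand in a spreadsheet.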
As for disadvantages, because programming is more difficult to learn, the cost of labor is higher.
Second, the variety of programming interfaces results in less standardization, you know, with fewer experts focused on the improvement of one program. Plus the documentation suffers as a result. In other words, a spreadsheet was built for the masses while statistical analysis programs were built with scholars in mind.
Third, the cost of implementation with programming is higher.
And fourth, you have to let go of the GUI (graphical user interface) and use text editors, like nano here, which takes me back to my problem of having to view and touch the data.
So let's go back for the punch line. Is programming required for Data Science?
I'll give one big, emphatic yes. In this Playlist, we're heading from here (Beginner) to here (Advanced).
For building logic into models and algorithms, including advanced techniques like machine learning and artificial intelligence, frankly the spreadsheet won't do.
So it's time to let go of that spreadsheet. This picture here (below) is our software stack, which we will add to, and it shows one path you can take.
I suggest watching video (tutorial) 1 for an outline and please join us for the answer to: "Which hardware is most used in Data Science? Linux, Mac or PC?" in video 4.
Have a nice day.
If you learned something today then please consider subscribing to our YouTube Channel. Every subscription helps to spread the word, because Google / YouTube's algorithms (aka, data science) wield so much power over consumer behavior.