Espy Data Science Blog

May 11, 2018 (updated May 23, 2018)

Why Traditional Analytics Don’t Cut It Anymore

Today, data is the new oil. But oil needs to be refined to be usable, and so does data. What does that mean exactly? To answer that, we need to take a slight detour and delve just a bit into the neuroscience of human cognition as it relates to processing and understanding information.

To begin, though, let’s start with the big picture and ask: why is data valuable in the first place? The simple answer is that data can inform your decisions and thereby improve their quality. Ok, cool… However, before that information can be leveraged, one minor detail needs to happen: the information needs to get inside our heads.

But this is where we run into issues, because there are two important facts about the brain you should know. First, your brain is more powerful and complex than any computer in existence, by orders of magnitude. In fact, simulating just one second of activity in roughly 1% of the brain’s neural network took the world’s fourth-fastest supercomputer a full 40 minutes. Thus, once the brain has data, it can do a lot with it, i.e. it’s pretty easy for your brain to make a good decision if it has the necessary information. The second fact, however, is that getting information INTO the brain is another story.

You’ve probably heard the line that human working memory holds only about seven chunks (letters, digits, words, etc.), which comes from a 1956 study by Miller. It turns out Miller was a little optimistic: more recent research indicates our true working memory capacity is closer to 3 – 5 chunks (on average, we’ll just say four). Now why is this important? Simple. For most types of abstract information (like numbers) to be absorbed by the brain, they must pass through and be processed by our working memory system. So if you can only process four things at a time and you have a lot of numbers/data points, well, that could take a while.

And to make matters worse, in any analysis you need to know not only what the numbers are but also how they relate to each other. For example, knowing sales for the past four quarters is nice, but if you don’t know that each quarter was 20% lower than the last, you probably won’t be around long. But a relationship is also a thing, i.e. a chunk, which means it takes up one of those four slots in your working memory. So in practice you’re really down to only 2 – 3 things.
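To make the quarterly sales example concrete, here is a minimal Python sketch. The sales figures and the function name `quarter_over_quarter_change` are made up for illustration; the point is only that four numbers plus one relationship ("each quarter is down 20%") is already most of what working memory can hold.

```python
# Hypothetical quarterly sales, chosen so each quarter is 20% below the last.
quarterly_sales = [1000.0, 800.0, 640.0, 512.0]


def quarter_over_quarter_change(sales):
    """Return the fractional change from each quarter to the next."""
    return [(curr - prev) / prev for prev, curr in zip(sales, sales[1:])]


changes = quarter_over_quarter_change(quarterly_sales)

# Print each change as a whole-number percentage.
print([f"{change:.0%}" for change in changes])
```

Four raw numbers go in; one relationship comes out. That single relationship is the chunk you actually need to remember.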

At this point, it might seem as if we’ve gone astray. Why are we talking about working memory slots again? Because all the data in the world won’t help you make a better decision if it can’t be compressed into a series of bite-sized chunks, each consisting of no more than 3 – 5 points that can be absorbed by our brains. And this is where the analytical tools of data science, which allow you to reduce large amounts of data down to a few key points, come in handy. In some sense, data science is just a data pre-processor that refines the oil that is raw data into the gasoline (insight) that our brains can run on.

March 7, 2018 (updated May 23, 2018)

Why Data Science is So Popular (and why that’s not necessarily a good thing)

Starting around 2012, interest in data science exploded, as the graph below of monthly searches for ‘data science’, ‘big data’, or ‘machine learning’ shows. Providers of data science products and services often point to the explosion of big data as the cause of this interest, quoting cognitively incomprehensible stats such as our daily production of 2.5 quintillion bytes of data, and assert that it is this mass of data that necessitates data science and, by extension, us as data scientists. After all, how else are you going to monetize all that valuable info in those 500 million tweets sent yesterday?

Source: Google Trends. Note, graph shows normalized search trend for each term.

But our society’s increasing reliance on technologies that insist upon being eternal witnesses to our digital alter egos is only partially to blame for why the world has gone data science crazy lately. And here’s why. Data science’s core value proposition ultimately has little to do with “big data,” but rather is about the ability to extract insight from data (large or small). Hence, at its core, data science’s value is centered on the algorithms we apply to data. Algorithms which are complex and require hundreds to thousands of lines of code to implement correctly. Prior to the recent maturing of open source options like R and Python, we were left with two bad options: spending countless hours implementing the models ourselves or paying for very expensive commercial implementations like MATLAB or SAS.

Now this wouldn’t have been that big an issue, IF we knew that the required temporal and financial investment would pay off. But therein lay the issue. Data science is an inherently exploratory endeavor and in the majority of cases, one cannot know if useful and relevant information is even contained in the data until a number of techniques and models have been attempted and compared. The problem of course is that in a business context, that’s a very hard sell to make (unless you work at Google).

Hence, we toiled in the darkness till at last there pierced a ray of light that was R… soon followed by the serpent of goodness that is Python. Which leads us to the happy state we find ourselves in today, where data exploration and modeling have never been easier. Need to create a linear regression? Two lines of code. What about a neural network? That’s five lines.

Now, having written my own linear algebra library in C++, all I can say is this is f***ing awesome! Now when I want to test a new model on some data, I just mosey on over to Google and in an hour or two I have a preliminary prototype up and running.

However, as great as having all these high quality open source libraries is, there was an upside to not having them. And that was, if someone was doing data science, they probably knew what they were doing, given they at least had to know the math/stats/algorithms well enough to implement them. With that barrier to entry removed, we’ve opened up the world of data science to a whole new segment of the population, which overall, I think is a good thing. But much like a chainsaw, just because you can start the thing doesn’t mean you won’t cut your foot off with it if you’re not careful and don’t know what you’re doing.

Like all forms of power, it comes with the risk of abuse. And today, given the hype of being a data scientist (remember, it’s the sexiest job of the 21st century according to HBR), there’s a huge financial incentive for people to become “data scientists” after an introductory class in Python. And for companies that’s a problem. But I’ll end with a tip that can help. Whenever I interview a data scientist, I ask them to pick a model they’re very familiar with. Then I ask them to write out the pseudocode for how to implement it. It never fails to weed out the posers 🙂
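For a sense of what an answer might look like, here is a minimal sketch of simple linear regression written from scratch in Python rather than pseudocode. This is my own illustration, not a required answer; the function name `fit_simple_linear_regression` and the sample data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Unnormalized covariance of x and y, and unnormalized variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")

    # The least squares slope is cov(x, y) / var(x); the fitted line
    # always passes through the point (mean_x, mean_y).
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
fitted_slope, fitted_intercept = fit_simple_linear_regression(
    [1, 2, 3, 4, 5], [3, 5, 7, 9, 11]
)
print(fitted_slope, fitted_intercept)
```

Someone who can write this out, and explain why the slope is a covariance divided by a variance, probably understands the model rather than just the library call.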

February 4, 2018 (updated May 23, 2018)

What Is Data Science?

“Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.”
-Wikipedia

So there you have it. The world’s foremost crowdsourced definition of what data science is. While I don’t disagree with the collective wisdom of Wikipedia, I would like to offer a slight alternative in the form of what I consider to be the archetypical illustration of what data science ought to be: Google Maps.

First, yes, I realize that an example, archetypical or not, is, by definition, not a definition. And second, can we for a moment appreciate the number of commas I was able to cram into that sentence? Now moving on to three, let’s get back to one.

The point I wish to make here is this: data science as it is typically described really offers no more insight into what it is than can be inferred from the term’s raw grammatical structure. Data science is the science of data, and science is just structured investigation, ergo data science is the structured investigation of data, i.e. the definition we just read from Wikipedia.

But while technically accurate, I think this line of explanation obscures an important aspect of science critical to its effective use, which is its connection to art. Dr. Featherstone wrote an interesting piece on this topic, which I’ll leave to you to enjoy in its entirety, but the main point I wish to steal is that both science and art arise from our need as humans to understand the world we live in and to communicate that knowledge to others.

Or as Einstein put it

While I think most data scientists would agree with Einstein (and if you don’t, it doesn’t really matter because he’s Einstein and you’re, well, not), I think that as data scientists we often forget about the aesthetics of what we do. Which brings me back to why I think Google Maps represents, to a large degree, the ideal example of how data science ought to be. Google Maps has an aesthetic that I absolutely love. And when I use the term aesthetic, I don’t so much mean the stylistic design of the app, but rather the simple elegance of how it presents its value to the end user, which it does by:

Showing you a map (context)
Showing you where you are on the map (relevance to you)
Showing you several options on how to get where you want to go (actionable info)
Showing a comparative ranking of the options in terms of drive time
Justifying its estimates of drive times by showing you visually both distance and traffic on each route

And it does all of this in an interface that takes no more than a few seconds to look at and understand. In other words, it hides all of the complexity of Dijkstra’s shortest path algorithm being run on a network whose edge weights are updating in real time off billions of data points pulled from millions of users, and instead simply shows the user the relevant state of the world (a traffic-colored map), their position in that world, and a few options for getting where they want to go. And that, in my opinion, is what data science ought to be.
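For the curious, here is a minimal sketch of the kind of shortest path computation mentioned above. It is a textbook version of Dijkstra’s algorithm in Python, not a claim about how Google Maps is actually implemented. The `dijkstra` function, the `roads` network, and its drive times are all made up for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest travel time from source to every reachable node.

    graph maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]  # min-heap of (travel time so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale entry: a shorter route to this node was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


# Hypothetical road network; edge weights are drive times in minutes.
roads = {
    "home": {"a": 5, "b": 10},
    "a": {"b": 3, "office": 12},
    "b": {"office": 4},
    "office": {},
}

times = dijkstra(roads, "home")
print(times["office"])
```

To mimic traffic updating in real time, you would change an edge weight in `roads` and rerun the function. The user never sees any of this, only the route and the drive time.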