
Stop Fooling Yourself With Data


Human beings excel at, and are prone to, finding patterns in everything we see and everything we do. We can’t help it; it’s an instinct that has been internalized over thousands of years of evolution. Our understanding of patterns brought predictability, which partly helped us to survive.

So, we find comfort in finding patterns… anywhere and everywhere. 🧐

In our quest to be data-driven or, better still, data-informed, why should it be any different? And what are the perils of our pattern-led behavior?

Without data you’re just another person with an opinion.

W. Edwards Deming, Statistician, Professor, Author, Lecturer, and Consultant

Patterns, Patterns Everywhere

We can identify three different types of data patterns:

🔎  Type 1: patterns that are present in our dataset and beyond

🔎  Type 2: patterns that are only present in our dataset

🔎  Type 3: patterns that are only in our imagination

Whether these patterns are useful or not depends on our aims. We might be searching for inspiration, trying to get to hard facts or even making decisions.


If what we are looking for is simply inspiration, then all three patterns are the perfect fit for the task at hand. We just have to grab the data and go and have fun! There’s no need to worry, we can be as creative as we like, safe in the knowledge that there is no right or wrong with creativity.


When all that is required is descriptive analytics that deals only with the data at hand, a type 1 or type 2 pattern works best. For example, a water supply company is not interested in patterns that fall outside of our water consumption for the past year. The company simply applies a formula to last year’s data to work out how much money we owe.

Decisions Under Uncertainty

We always want what we don’t have, and sometimes the facts that we do have are nothing like what we wish we had. When we really want to come to a decision but don’t have all of the lowdown required, we have to choose the best way forward under uncertainty. And that is where statistics comes in.

Statistics is the science of changing your mind under uncertainty.

We have to take that step beyond what we know without failing miserably. For that, the patterns we find must generalize beyond the data we already have.

Only patterns that are present in our dataset and beyond (Type 1) are useful if we’re going to make decisions under uncertainty. However, as with most things in life, there is a downside: the other types of pattern might be lurking somewhere close. Woe betide us if we end up less informed than before we even started looking at the data.

Generalization – Generally Speaking

Humans are not alone when it comes to magicking useless patterns out of data; machines can do the same. Machine learning (ML) or artificial intelligence (AI) is a way of making many similar decisions by finding patterns in data and then using these patterns to respond appropriately to spanking-new data.

It is of no use if it only works on old data. ML/AI is about generalizing correctly to brand-new situations. Pattern type 1 is therefore the only one that’s good for machine learning. It’s the part that is signal and everything else is just noise (our own decoys and red herrings that exist just in our old data and stop us from arriving at a generalizable model).

A solution that fits the noise in old data rather than generalizing to new data is what the machine learning world calls overfitting. It is one of the trickiest obstacles in applied machine learning… along with its opposite, underfitting.
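To make overfitting concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the "signal" is the line y = 2x, the noise is Gaussian, and the two toy models (`fit_line` and `fit_memorizer`) are hypothetical helpers, not anything from a real library. The memorizer simply replays the nearest old datapoint, noise and all.

```python
import random

def make_data(n, offset, rng, noise=1.0):
    """Points on the line y = 2x (the signal) plus Gaussian noise."""
    xs = [(i + offset) / n for i in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_memorizer(xs, ys):
    """Return the y of the nearest old x: it memorizes noise along with signal."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
old_x, old_y = make_data(500, 0.0, rng)  # the data we learn from
new_x, new_y = make_data(500, 0.5, rng)  # brand-new data from the same source

line = fit_line(old_x, old_y)
memorizer = fit_memorizer(old_x, old_y)

results = {
    "line": (mse(line, old_x, old_y), mse(line, new_x, new_y)),
    "memorizer": (mse(memorizer, old_x, old_y), mse(memorizer, new_x, new_y)),
}
for name, (old_err, new_err) in results.items():
    print(f"{name}: error on old data = {old_err:.2f}, on new data = {new_err:.2f}")
```

The memorizer is flawless on old data because it has stored every quirk of it, yet on new data it does worse than the plain line. That gap between old-data and new-data performance is the usual symptom of overfitting.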

Signal Or Just Plain Noise

If the pattern we or our machine found in our data exists outside our imagination, what sort of pattern is it? Is it signal, a real phenomenon that exists in our population of interest, or just plain noise, a peculiarity of our current dataset? How are we going to know which type of pattern we have found?

Errors using inadequate data are much less than those using no data at all.

Charles Babbage, Inventor and Mathematician

If we look at all of the available data at once, we can’t know which type of pattern it is, because we cannot see whether the pattern exists anywhere else. It would be like seeing the shape of a sheep in a tree and then testing to see whether all trees look like sheep… using the same tree.

New trees are needed to test the theory. We can therefore only use a datapoint once: we cannot use it again to test the theory it inspired. That’s the problem with having just one poor, lonely dataset.

Let’s All Do The Data Split!

A solution to the single-dataset problem is a data split. Years ago, we didn’t have so much data, so splitting a dataset would have left too little to work with. But nowadays, with the current data bonanza, it is possible to divide data into two, not necessarily equal, parts. We then have an exploratory dataset that everyone can rummage through for inspiration and a separate test dataset that specialists can use to confirm any insights found during the exploratory part. Then we are finally on to a winner.
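As a rough illustration, a data split can be sketched in a few lines of plain Python. The helper name `split_dataset`, the 80/20 ratio, the seed and the stand-in rows are all assumptions chosen for this example; the right proportions depend on how much data is available.

```python
import random

def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into (exploration, test) parts."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for real records
exploration, test = split_dataset(rows)

print(len(exploration), "rows to explore,", len(test), "rows locked away for testing")
```

The important part is the discipline rather than the code: the test rows stay untouched until there is a specific insight to confirm, and each of them is used only once. Libraries such as scikit-learn offer a ready-made equivalent (`train_test_split`) for those who prefer not to write their own.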

Redemption Is Near!

Ultimately, if it’s help we are after with predictions for the future and data-driven actions to achieve tangible results, then putting our money on (or into) visionary solutions is far better than following our noses or gut reactions, up to our waists in a murky pool of data.

For many, data science has fast become more like rocket science, with a huge array of evolving complexities; for others… it always was and always will be rocket science! But some data scientists do find the time to work on, and eventually pull out, some super-smart solutions from the sleeves of their anoraks.

RetentionX is one of them. It is an out-of-the-box solution that uses AI-driven data analysis and it can be up and running in less than two hours. Easy-to-use and affordable, the product can be fully integrated with existing systems. It reveals hidden value in data, helping to predict future outcomes and providing recommendations for action. Whether it’s insight into customer retention, product performance, customer structure or just some focus on financials, RetentionX is the definitive solution to us fooling ourselves about the information we possess.



Cassie Kozyrkov, “The most powerful idea in data science”



SaaS expert operating at the interfaces of business, data and technology – Head of Implementation at Personio, Europe's leading HR software.
