Bad Data

As owners of data assets, we have all encountered those dreaded four little words:

THE DATA IS BAD

Let me set the scene. You were heads down, working away, when all of a sudden someone says: “The data is bad.” Boom! A statement big enough to make you stop what you are doing and pay attention. It can seem harmless enough for a user to say this when they encounter an error. But the implication is that the data system as a whole is wrong and none of it can be trusted. If the statement is true, it must be dealt with immediately.


Why is this statement so frustrating?

Because to the people who create these data and analytic assets, the punishment does not fit the crime. The statement is often thrown around when a particular report or pipeline has some flawed logic. No doubt, these are frustrating scenarios for the consumer. However, looked at objectively, the flawed item likely represents a minuscule part of the analytic platform, while the rest of the data system is chugging along swimmingly.

To put this in context: people wouldn’t walk into a house, find a broken light switch, and declare the house broken.

Is it fair?

At first blush, no, it does not feel fair to the owning team, because that team understands that data breaks just like anything else. Data systems are not flawless; they are as complex and as prone to failure as any other software.

Usually the construction of a single metric relies on multiple physical systems, many lines of code, and numerous logic points. Like a light switch, it has many points of failure. So why is it acceptable for a light switch to fail, but not for a piece of a report?

Broken data is deceptive. When it breaks, it’s not obvious; it lurks among us unnoticed. A broken light switch simply fails to turn on. When data systems break, they usually keep running, with the data presenting itself as available and correct.
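To make the “silent failure” idea concrete, here is a small hypothetical sketch (the tables, region codes, and amounts are all invented for illustration): an upstream feed starts emitting a region code that is missing from a lookup table, and an inner-join-style enrichment quietly drops the row. No error is raised, and the report still shows a plausible number.

```python
# Hypothetical illustration: a revenue metric that silently loses rows
# when an upstream feed starts emitting an unmatched region code.
orders = [
    {"id": 1, "region": "US", "amount": 100.0},
    {"id": 2, "region": "EU", "amount": 250.0},
    {"id": 3, "region": "APAC", "amount": 75.0},  # new code, missing from lookup
]
region_names = {"US": "United States", "EU": "Europe"}

# Inner-join style lookup: unmatched rows vanish without any error.
enriched = [
    {**order, "region_name": region_names[order["region"]]}
    for order in orders
    if order["region"] in region_names
]

total = sum(row["amount"] for row in enriched)
# The dashboard still renders a believable total -- just the wrong one.
print(total)  # 350.0 instead of the true 425.0
```

The light switch analogy holds: a switch that fails does nothing, but this pipeline keeps producing output that looks fine.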


As such, the answer to “Is it fair?” has to be yes. The user rightfully feels misled and is now unsure whether they can trust any of the data. To them, the data really does feel collectively bad.


How should we handle data system failures?


Step 1: Accept your fate

Of course, the easiest way to handle these situations is to prevent them from happening at all. And yes, we absolutely need to put testing in place to prevent failures, or to detect them before consumers do. We need to invest in our data systems’ quality control as we would for any core piece of technology in our stack.
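As a minimal sketch of what that quality control might look like, here is a hypothetical set of automated checks run against a batch of rows before it reaches consumers. The function name, thresholds, and column names are all assumptions for illustration; real teams often reach for dedicated frameworks, but the idea is the same: cheap assertions that catch silent breakage early.

```python
# A minimal sketch of automated data quality checks. Thresholds and
# field names ("id", "amount") are illustrative assumptions.
def run_quality_checks(rows, expected_min_rows=100, required_fields=("id", "amount")):
    """Return a list of failed-check descriptions (empty list = all passed)."""
    failures = []

    # Volume check: a sudden drop in row count often signals a broken feed.
    if len(rows) < expected_min_rows:
        failures.append(f"row count {len(rows)} below minimum {expected_min_rows}")

    # Completeness check: required fields must be present and non-null.
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        if missing:
            failures.append(f"{missing} rows missing '{field}'")

    # Range check: negative amounts are a common silent-corruption symptom.
    negatives = sum(1 for row in rows if (row.get("amount") or 0) < 0)
    if negatives:
        failures.append(f"{negatives} rows with negative 'amount'")

    return failures
```

Wired into a pipeline, a non-empty result would block publishing or page the owning team, so the data system fails loudly instead of lurking.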

But if I didn’t clearly articulate this before, the reality is that the data will break.  Just as with any other complex system, we will experience failures. This is an unchangeable truth.

A study highlighted by insidebigdata.com indicates that 27% of top companies’ data is flawed. Essentially, every company has flawed data. We can’t change that, but we can affect the degree to which the data is flawed and how quickly issues are resolved.

It's time to get used to this idea. Data size and usage will continue to grow; we double our data production every two years. Corporate citizens are expected to make data-driven decisions and take data-driven actions. To do this, consumers need direct access to data in the tools appropriate for the job to be done. Often, multiple niche tools are required to accomplish the full task, and these niche SaaS tools are swapped out frequently in search of the optimal business solution.

This is our new normal. It’s great for business, but it means our systems have more data flowing through them than ever. Data is no longer protected in a highly controlled warehouse, governed by a strict schema and exposed only through mature reports. Today the data sits in a variety of internal and external sources of varying structure, exposed by an uncontrolled number of end systems and users. We can no longer cage the beast.

Step 2: Embrace the conversation

You have accepted your fate: data will break.  Now it’s time to figure out what is wrong.

The first step must involve the consumer, the person who found the issue. The immediate benefit of involving the consumer is that you gather far more detail for accurately assessing the issue. The larger and more important effect is that it drives a cultural change toward consumer empowerment. The ideal end state is a sense of community ownership of the analytic system.

Similar to our newly decentralized analytic architecture, analytic capabilities must also be decentralized. We need dashboards as a starting point of truth, but they can no longer be served up for mindless consumption, with every deeper insight delivered on a silver platter. In fact, those insights, the “why” behind the number, were likely never the full reality. Insights are infinite, and the best ones are usually found by empowered subject-matter experts.

We need to take every opportunity to empower the consumers and drive a greater culture of analytic ownership. 

Step 3: Deal with the Situation

This is a hard one. Data systems typically do not have a traditional QA team the way other software projects do. This means the system owners need to plan for unexpected debugging and fixing in their development cycles; otherwise, bug fixing will always compete directly with new features.

As someone who has lived this dream, I can tell you it is not a smart idea to let quality issues pile up in the backlog while a new feature is being worked on. The cost of not dealing with bad data can be large. When data quality issues are ignored, they cause all manner of downstream inefficiencies, such as snowflake implementations and hidden data factories. It is estimated that the inefficiencies caused by bad data amount to $3 trillion annually in the US alone.

Step 4: Invest back into the system

It is more exciting to deliver shiny new things than to do the unsexy work of fixing the old. It’s hard, and it's something every team struggles with. But without investing in maintenance, your new work will be overshadowed by the sins of the past.

There is no single solution to investing back in your system; you need to find a method that works for your team given your strategic priorities. A good middle ground our team has been exploring is to combine strengthened deliverable criteria with a few key quality projects that harden our overall analytic platform.

Regardless of how you do it, teams need to carve out development cycles to invest back in quality, big and small.

Final Word

We are all figuring this out. It is incredibly complex to set the right data-integrity protocols for highly distributed systems without compromising the speed and agility the business needs. Hopefully, the steps above will help you stay calm, stay collaborative, and handle the situation proactively. And remember: the next time you hear “The data is bad,” take a deep breath and dig in.

Written by Laura Ellis
