Data smells too

Have you heard the term “data smells”? As a developer, I was already familiar with “code smells”. In programming, code smells are combinations of signals that point to larger, deeper problems. I have started hearing “data smells” more recently. Like code, analytics can stink, and fixing the problem takes considerable effort.

When one looks into data for insightful answers, one often ends up uncovering quality problems instead. With a little scrutiny, results can change drastically. Different parts of the organization hold different versions of the truth about the same data. Here are some reasons why smelly data gets into the system in the first place.

In organizations, the consumers of data analytics are often line managers with little or no training in data science. They are not able to judge the quality of the research or to determine whether a project should take as long as it does. Less experienced data scientists, in turn, sometimes ignore the experience and insights that line managers could offer. These insights can improve the result or shorten the research process.

Analytics compiles snapshots of what happens in the world, and these may not fit into well-structured or clean models. In addition, the world continues to change even if the systems don’t. A system may have made sense in isolation, or at the time it was developed, but how do you make sense of results generated by connecting disparate sources of data?

Organizations want insights in less time, which leads to short-term solutions. Once short-term solutions are in place, it becomes tricky to find time to go back and create longer-term practices that are more robust. Do leaders have the discipline required to decide where to standardize now versus later? Does the organization have the capacity to invest resources upfront to ease later growth?

In programming, refactoring is performed as a series of small changes that alter the internal workings of a system without changing its observable behavior. This carries inherent risk: in the short term, it can break working systems and absorb resources without apparent payback. Removing the stench from data is likewise easier said than done.
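To make “without changing its observable behavior” concrete, here is a minimal Python sketch of a small refactoring applied to a data cleanup step. The function names, the alias table, and the sample records are all hypothetical, invented for illustration only. The point is the final check: the old and new versions must produce identical output on the same input.

```python
# Hypothetical cleanup step, before and after a small refactoring.

def normalize_country_old(records):
    # Original version: aliases are buried in chained conditionals.
    result = []
    for r in records:
        c = r["country"]
        if c == "USA" or c == "U.S." or c == "United States":
            result.append({**r, "country": "US"})
        elif c == "UK" or c == "United Kingdom":
            result.append({**r, "country": "GB"})
        else:
            result.append(r)
    return result


# Refactored version: the same aliases, moved into one lookup table.
COUNTRY_ALIASES = {
    "USA": "US",
    "U.S.": "US",
    "United States": "US",
    "UK": "GB",
    "United Kingdom": "GB",
}


def normalize_country_new(records):
    return [
        {**r, "country": COUNTRY_ALIASES.get(r["country"], r["country"])}
        for r in records
    ]


# Made-up sample data used only to compare the two versions.
sample = [
    {"id": 1, "country": "USA"},
    {"id": 2, "country": "United Kingdom"},
    {"id": 3, "country": "DE"},
]

# The refactoring is only safe if the observable output is unchanged.
assert normalize_country_old(sample) == normalize_country_new(sample)
```

A check like this is cheap, and it is what keeps a small internal change from quietly breaking a working system.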

  • The payback from investing in analytical refactoring is not immediate, but the costs are incurred immediately. We are back to now versus later.
  • Identify the areas that need attention. Find an approach that decouples refactoring activities from major projects. This effectively reduces the risks associated with the larger, more strategic projects.
  • Like the incremental, iterative approach to software code, an incremental strategy applies to refactoring analytics. For example, start with “Which data inconsistency do the most people struggle with?” An incremental approach can reduce risk and help people see benefits earlier.
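
One possible way to pick a starting point for the incremental approach above is to count which fields disagree most often between two copies of the same data. The sketch below is only an illustration: the `crm` and `billing` sources, their fields, and the function name `count_field_mismatches` are assumptions made up for this example, not part of any real system.

```python
from collections import Counter


def count_field_mismatches(source_a, source_b, fields):
    """Count, per field, how many shared keys have differing values."""
    mismatches = Counter()
    # Only compare records that exist in both sources.
    for key in source_a.keys() & source_b.keys():
        for field in fields:
            if source_a[key].get(field) != source_b[key].get(field):
                mismatches[field] += 1
    return mismatches


# Hypothetical records keyed by customer id.
crm = {
    1: {"email": "a@example.com", "country": "US"},
    2: {"email": "b@example.com", "country": "GB"},
    3: {"email": "c@example.com", "country": "DE"},
}
billing = {
    1: {"email": "a@example.com", "country": "USA"},
    2: {"email": "b@example.org", "country": "UK"},
    3: {"email": "c@example.com", "country": "DE"},
}

# Fields ranked from most to least inconsistent.
ranked = count_field_mismatches(crm, billing, ["email", "country"]).most_common()
print(ranked)  # [('country', 2), ('email', 1)]
```

In this made-up data, the country field would be the first candidate for cleanup. A raw count is a rough proxy for “what most people struggle with”, so it is worth confirming the ranking with the people who actually use the data.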

If you think there will be no time to clean up data, make sure that you are not spending time messing it up. Be aware that it may be far easier to avoid creating problems than to clean them up later.