WHAT IS THE BEST DEFENSE AGAINST DIRTY DATA? INCORPORATE BUSINESS KNOWLEDGE INTO YOUR DATA INTEGRATION PRACTICES
By Dan Inzana & Rich Sokolosky, Sentier Analytics
Good data equals good analytics and bad data equals bad analytics. We get it. This statement is not particularly helpful or original. It is like starting a best man speech with “Webster's defines marriage as...” or opening a mystery novel with “It was a dark and stormy night...”
What causes “bad” or “dirty” data, and how do you keep it from creeping into your analytics practice? While data engineering will get you part of the way there, it is the business knowledge incorporated into the data integration efforts that makes the difference.
Dirty data can be defined as having the following (obvious) characteristics related to processing:
Duplicate rows
Incorrect field entries
Bad formatting
From automated source profiling to capabilities inherent in the ETL software used, there are many ways to ensure these types of data issues never see the light of day. Conspicuous data issues can still happen but the fixes are straightforward and fall well within the capabilities of data engineering.
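As a rough illustration of how mechanical these checks are, here is a minimal profiling sketch in plain Python. The sample rows, field names (email_id, sent_date, opens), and the expected date format are all hypothetical; a real pipeline would more likely rely on the profiling features of its ETL tool.

```python
import re

# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"email_id": "E1", "campaign": "Spring", "sent_date": "2023-03-01", "opens": "3"},
    {"email_id": "E1", "campaign": "Spring", "sent_date": "2023-03-01", "opens": "3"},
    {"email_id": "E2", "campaign": "Spring", "sent_date": "03/02/2023", "opens": "1"},
    {"email_id": "E3", "campaign": "Spring", "sent_date": "2023-03-03", "opens": "-4"},
]

# Assumed house format for dates: YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def profile(rows):
    """Return a dict mapping each issue type to the indexes of offending rows."""
    issues = {"duplicate": [], "bad_date_format": [], "invalid_opens": []}
    seen = set()
    for i, row in enumerate(rows):
        # Duplicate rows: an exact repeat of a row already seen.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate"].append(i)
        seen.add(key)

        # Bad formatting: the date does not match the assumed format.
        if not DATE_PATTERN.fullmatch(row["sent_date"]):
            issues["bad_date_format"].append(i)

        # Incorrect field entries: opens must be a non-negative integer.
        try:
            if int(row["opens"]) < 0:
                issues["invalid_opens"].append(i)
        except ValueError:
            issues["invalid_opens"].append(i)
    return issues


report = profile(rows)
print(report)
```

None of these checks needs to know what a campaign is or how revenue is calculated, which is why they sit comfortably within data engineering.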
But what are some other (less obvious) characteristics of dirty data? These can include:
Erroneous filter or rule definitions - not all of the related IDs are used to define a campaign
Wrong calculations - an ROI calculation is missing a commonly applied factor
Missing data - 1,000 emails were sent but only 200 appear in the data
Incorrect master or reference data - the revenue per script is $125 but $150 is being used
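Unlike the obvious issues, checks for these depend on values only the business can supply. The sketch below, in plain Python, encodes three of the examples above as explicit expectations. The campaign IDs and function name are hypothetical, and the constants stand in for figures that a business owner would need to confirm; the email count and revenue figures are taken from the examples in the list.

```python
# Hypothetical expectations supplied by the business, not derivable from the data.
EXPECTED_CAMPAIGN_IDS = {"C-101", "C-102", "C-103"}  # every ID that defines the campaign
EXPECTED_EMAILS_SENT = 1000        # count reported by the system that sent the emails
APPROVED_REVENUE_PER_SCRIPT = 125  # dollars, as confirmed by the business


def check_business_rules(filter_ids, loaded_email_rows, revenue_per_script_used):
    """Compare what the integration is doing with what the business expects."""
    findings = []

    # Erroneous filter definition: some campaign IDs are left out of the filter.
    missing_ids = EXPECTED_CAMPAIGN_IDS - set(filter_ids)
    if missing_ids:
        findings.append(f"campaign filter is missing IDs: {sorted(missing_ids)}")

    # Missing data: the loaded row count does not reconcile with the source count.
    if loaded_email_rows != EXPECTED_EMAILS_SENT:
        findings.append(
            f"expected {EXPECTED_EMAILS_SENT} emails, found {loaded_email_rows}"
        )

    # Incorrect reference data: the wrong revenue-per-script value is in use.
    if revenue_per_script_used != APPROVED_REVENUE_PER_SCRIPT:
        findings.append(
            f"revenue per script should be {APPROVED_REVENUE_PER_SCRIPT}, "
            f"but {revenue_per_script_used} is being used"
        )
    return findings


findings = check_business_rules(["C-101", "C-102"], 200, 150)
for finding in findings:
    print(finding)
```

The code itself is trivial. The hard part is knowing the right values for the three constants, and that knowledge lives with the business rather than in the data.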
How do you stop these from happening to your data? Good data engineers intuitively know how to integrate data based on how the data is structured. In other words, they know that field A can be joined with field B, and that campaign names are identical across separate tables. Yet guarding against the less obvious characteristics of dirty data takes business knowledge coupled with solid data engineering. Failing to involve a business analyst in your integration efforts can be more costly and damaging than missing the “obvious” dirty data scenarios.
Providing data engineers with context-less data and asking them to spin it up introduces risk. The effort has to be coupled with a nuanced understanding of the business processes that created the data, and there needs to be clarity on how the data will be used. It is inevitable that the data will be dirty in some major way; the business analyst’s job is to catch the subtleties and provide corrective measures.
So, what are some strategies for incorporating business knowledge into data integration?
Assign data stewards from the business - Resources from the business who know specific business practices are assigned to help define and check data within the realm of their expertise. This is typically formalized in large organizations and is informally practiced in small ones.
Create a business liaison function - The liaisons most likely come from a line of business and act as a bridge between the business and data engineering to ensure that business knowledge is applied to data integration. This has the advantage of keeping the data engineers focused on their tasks while the liaisons handle the business knowledge.
Build business fact-finding into projects - Ensure the process for data integration includes a full understanding of the business context in which it will occur. This is akin to fully understanding the business question before executing the analytics.
There are a host of ways to get the required business knowledge into the data integration process. The final forms are always guided by the culture and structure of the organization. The key point is that business context and knowledge have to be part of the process. Without this link the data integrated for reporting, analytics, and AI will be prone to being dirty and not usable for the intended purpose – regardless of how solid your data engineering capabilities are.
----
Trusted by life science companies, Sentier provides a data-driven understanding of their promotional and digital efforts. For the past 5 years, we have been delivering disruptive insights through innovative AI-powered solutions. Always at the forefront of the most effective data integration and machine learning approaches, we have created unrivaled models. As a result, our clients are better able to optimize their marketing and sales resources, and build stronger customer relationships.
Sentier is committed to the concept of High-Velocity Decision Making and we believe it can only become a reality with a strong and innovative analytics component.
We strive to be the leaders in the actionable application of new and emerging data and analytics approaches and to remove the barriers that keep our clients from benefiting from these solutions.
We believe that the ethical application of our services will benefit patients and doctors as well as the pharma and biotech companies we serve.