Via TechRepublic: Fuzzy data sources have become mainstream with analytics. Here’s how to properly handle less-than-pristine big data sources.
For companies that build their competitive advantage on big data analytics, data sources are everything. If you are putting garbage into your high-powered analytic system, there’s not much value that can come out.
When defining big data for the purposes of a competitive strategy, I contend that big data should be freely available with no obligations or constraints. However, the price you pay with most freely available data is uncertainty. Great big data strategists do, in fact, consider their sources very carefully and learn how to live with this uncertainty.
Imperfect data is still usable
The axiom that you must have pristine data sources was challenged once the era of big data analytics emerged. We’ve had the idea of fuzzy math in the artificial intelligence world for quite some time, but it never applied to the business intelligence and data warehousing world until the data science movement. Now, fuzzy data sources aren’t a categorical reject—they’re something that deserves careful consideration.
For instance, with the aid of recent big data analytic technology, many companies are combing through social media feeds to conduct sentiment analysis. There is no possible way to get a high level of data quality from a Twitter feed; however, it’s still very useful, provided you’re clear on its confidence rating, a very important concept in profiling data sources. A confidence rating is an overall assessment (typically in percentage terms) of your data source’s quality.
Without the concept of a confidence rating, you’re likely to over-cleanse your data. It’s easy to arbitrarily treat imperfect data sets or sources as invalid and exclude them from your analytic system. Social media feeds aren’t the only place where you can make this mistake; unstructured data that lies in documents, video, and audio is all fair game now. Instead of ignoring it, map the data through with a confidence rating that serves as a disclaimer for what you’re analyzing.
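One way to carry a confidence rating through instead of excluding a fuzzy source is to use it as a weight. The following is a minimal sketch, not a prescribed method: the source names, the ratings in SOURCE_CONFIDENCE, and the sentiment scale are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical confidence ratings per source, as fractions of 1.0.
# These numbers are illustrative assumptions, not measured values.
SOURCE_CONFIDENCE = {
    "crm": 0.95,
    "survey": 0.80,
    "twitter": 0.40,
}


@dataclass
class Observation:
    source: str
    sentiment: float  # assumed scale: -1.0 (negative) to 1.0 (positive)


def weighted_sentiment(observations):
    """Average sentiment, weighting each observation by its source's confidence."""
    total_weight = sum(SOURCE_CONFIDENCE[o.source] for o in observations)
    if total_weight == 0:
        raise ValueError("no observations with non-zero confidence")
    weighted_sum = sum(SOURCE_CONFIDENCE[o.source] * o.sentiment for o in observations)
    return weighted_sum / total_weight


observations = [
    Observation("crm", 0.5),
    Observation("twitter", -1.0),
    Observation("twitter", 0.0),
]
score = weighted_sentiment(observations)
plain_mean = sum(o.sentiment for o in observations) / len(observations)
```

In this sketch the two tweets still influence the result, but the higher-rated CRM record counts for more, so the weighted score ends up slightly positive while the plain average is negative. Whether weighting is the right use of the rating depends on your analysis; at a minimum, store the rating next to the data so it can act as the disclaimer.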
Document your data mapping techniques
Your transformation maps with fuzzy data sources must be tight, though. I approach data mapping in these situations like I approach audit-proofing. Imagine your system could be audited at any given time, and you must be prepared to substantiate and explain all transformations from your sources to your targets. To do this, there must be a logical trace from your target data back to your—sometimes fuzzy—source.
Expert systems are the best way to handle this situation. Even if you have a transformation tool like Informatica to explain the logic of the transformations, in most cases it won’t explain the reasoning behind the logic. Comments certainly help, but it’s better to capture the knowledge of your own experts in a reliable system that details the rationale for every source-to-target transformation, including an explanation of confidence ratings.
Don’t throw away the comments though! Complementing your expert system should be a robust set of documentation that explains how your data sources are transformed to targets for analysis; this includes specifications, designs, and guides that resemble an internal auditor’s manual. You should be building these documents as part of your software development lifecycle, so it shouldn’t be hard to translate them into operational documents.
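The logical trace from target back to source can be sketched as a record that carries its own lineage. This is a toy illustration under stated assumptions, not a real expert system: the class names, the rule IDs (R-001, R-014), the source ID, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransformationStep:
    rule_id: str       # hypothetical ID that points at your written documentation
    rationale: str     # why the expert chose this rule, not just what it does
    confidence: float  # confidence rating of the value after this step (0.0 to 1.0)


@dataclass
class TargetRecord:
    source_id: str     # where the raw value came from
    raw_value: str     # the untouched source value, kept for the audit trail
    value: str         # the current, transformed value
    lineage: list = field(default_factory=list)

    def apply(self, new_value, step):
        """Record the step alongside the change so the two never drift apart."""
        self.value = new_value
        self.lineage.append(step)

    def explain(self):
        """Return a readable trace from the source to the current target value."""
        lines = [f"source {self.source_id}: {self.raw_value!r}"]
        for step in self.lineage:
            lines.append(
                f"{step.rule_id} (confidence {step.confidence:.0%}): {step.rationale}"
            )
        lines.append(f"target: {self.value!r}")
        return lines


raw = "  LOVE the new phone!! "
record = TargetRecord(source_id="tweet:001", raw_value=raw, value=raw)
record.apply(
    record.value.strip().lower(),
    TransformationStep("R-001", "Case and padding carry no sentiment signal.", 0.40),
)
record.apply(
    "positive",
    TransformationStep("R-014", "The keyword 'love' maps to positive in the expert lexicon.", 0.35),
)
trace = record.explain()
```

Calling explain() on any target record then yields the source, each rule with its rationale and confidence rating, and the final value, which is the kind of answer an auditor would ask for.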
Change data capture and impact
Another operational concern is how to handle changes to your fuzzy and unstructured source data. In the days of clear and structured data, this was much easier. The structure and clarity allowed us to identify natural keys that signaled whether to insert, update, or delete. Now, with unstructured data, you must solve the same problem, but in a different way.
This is done by identifying rules that classify your source data in a way that allows you to identify natural keys. You could use a machine learning classification system to pull this off, but then you might run into an auditability problem. So, although you might start there to give you some clues, in the end I recommend a clear rule set; if that’s not possible, go with a brute-force map, which is an exhaustive lookup that translates every possible input to an output. Either approach keeps the classification auditable and gives the whole system a dependable signal that something has changed.
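As a rough sketch of the rule-set-then-lookup idea, assume that an order reference such as "ORD-12345" somewhere in the text serves as the natural key. The pattern, the lookup entries, and the function names below are hypothetical, and deletes are left out to keep the example short.

```python
import re

# Hypothetical rule: an order reference like "ORD-12345" is the natural key.
ORDER_KEY_RULE = re.compile(r"\bORD-(\d+)\b", re.IGNORECASE)

# Brute-force map: an explicit lookup for inputs the rule cannot handle.
# A real map would have to enumerate every such input; this has one entry.
BRUTE_FORCE_KEYS = {
    "order twelve thousand": "ORD-12000",
}


def natural_key(text):
    """Derive a natural key from unstructured text, or None if nothing matches."""
    match = ORDER_KEY_RULE.search(text)
    if match:
        return f"ORD-{match.group(1)}"
    return BRUTE_FORCE_KEYS.get(text.strip().lower())


def classify_change(store, text):
    """Decide whether the text is an insert, an update, or no change at all."""
    key = natural_key(text)
    if key is None:
        return ("unclassified", None)  # route to a human or a new rule
    if key not in store:
        store[key] = text
        return ("insert", key)
    if store[key] != text:
        store[key] = text
        return ("update", key)
    return ("unchanged", key)


store = {}
first = classify_change(store, "Customer asked about ord-12345 delivery")
repeat = classify_change(store, "Customer asked about ord-12345 delivery")
changed = classify_change(store, "ORD-12345 was delivered late")
looked_up = classify_change(store, "Order twelve thousand")
unknown = classify_change(store, "General complaint, no reference")
```

Because every decision comes from a named pattern or an explicit lookup entry, you can point to the exact rule that produced a key, which is harder to do with a trained classifier.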
This change event must be published so that subscribers, like the analytic system, can take action. You must have clear communication between your change data capture system and your analytic system. A publish and subscribe model like this will allow you to optimize your analytic engine to process only the data that’s changed, instead of the entire, massive data set.
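A publish and subscribe arrangement between change data capture and the analytic system can be sketched in a few lines. This is an in-process toy with hypothetical class names (ChangeBus, AnalyticEngine) and made-up keys; a production system would more likely sit on a message broker.

```python
from collections import defaultdict


class ChangeBus:
    """Minimal in-memory publish/subscribe hub for change events."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, key):
        # Notify every handler registered for this kind of change.
        for handler in self._subscribers[event_type]:
            handler(key)


class AnalyticEngine:
    """Stand-in for the analytic system: it tracks which keys need reprocessing."""

    def __init__(self):
        self.dirty_keys = set()

    def mark_dirty(self, key):
        self.dirty_keys.add(key)

    def reprocess(self):
        # Process only the keys that changed, then reset the set.
        processed = sorted(self.dirty_keys)
        self.dirty_keys.clear()
        return processed


bus = ChangeBus()
engine = AnalyticEngine()
bus.subscribe("insert", engine.mark_dirty)
bus.subscribe("update", engine.mark_dirty)

# The change data capture side publishes events as it detects them.
bus.publish("insert", "ORD-12345")
bus.publish("update", "ORD-12000")
bus.publish("update", "ORD-12345")

processed = engine.reprocess()
```

Here the engine touches two keys rather than the full data set, and a key that changes twice between runs is still processed only once.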
Fuzzy data sources have become mainstream with big data analytics, so it’s important to treat them properly.
First, recognize that there’s value in imperfect data if you qualify it with a confidence rating. Then, be careful when mapping this data to your targets. Establish clear rules and documentation to defend against critics. And be smart about how you handle change data capture. I’ve explained several techniques for doing this correctly.
You can get started today by assessing a fuzzy data source, assigning a confidence rating, and exploring how it might map into your primary data store. You never know—that fuzzy data might bring some not-so-fuzzy benefits.