The bigger your business, the more likely your data is…interesting. Old methods of data collection put data in silos across several departments. The computing power to handle data in all its forms wasn’t there, so breaking it up in the name of agile operations was key. Now, we’ve got machine and deep learning and a hope of finally making sense of it all. The three V’s of data—volume, variety, and velocity—shape the way we think about data collection and processing. Each presents a unique challenge when moving from that jumble of legacy systems you’ve got now to the streamlined, flexible system of your dreams. Here’s what you need to know about data variety and how it fits into your data puzzle.
Why should you care about data variety?
The old method of storing data was to silo it for expediency. Not every data decision needed to go through every department, so to prioritize agility, companies sectioned off their data. Now, in the era of truly big data, we can't process what we have using traditional data methods. It's time to put it all back together again.
Except now you’ve got data coming in that doesn’t match. One department labels its columns and rows differently. You’ve got repeat data in different forms. And that’s before you count all the other ways data enters your organization. It’s a mess. You must plan to break down those silos and improve your data quality, whether it’s in lake form or otherwise.
So what are some of the ways your data variety appears?
- Unstructured data
- Structural differences between departments or companies
- Natural language
- Media (pictures, video, etc.)
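To make the first two varieties concrete, here is a minimal sketch of harmonizing mismatched column labels and catching repeat records across departments. The department names, column mappings, and records are hypothetical examples, not a real schema.

```python
# Hypothetical mapping from department-specific labels to one canonical schema.
CANONICAL = {
    "cust_name": "customer_name",     # sales department's label
    "CustomerName": "customer_name",  # billing department's label
    "email_addr": "email",
    "Email": "email",
}

def harmonize(record: dict) -> dict:
    """Rename department-specific keys to the canonical schema."""
    return {CANONICAL.get(key, key): value for key, value in record.items()}

def dedupe(records: list) -> list:
    """Drop repeat records that differ only in labeling, keeping the first."""
    seen, unique = set(), []
    for record in records:
        fingerprint = tuple(sorted(harmonize(record).items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(harmonize(record))
    return unique

sales = {"cust_name": "Ada Lovelace", "email_addr": "ada@example.com"}
billing = {"CustomerName": "Ada Lovelace", "Email": "ada@example.com"}
print(dedupe([sales, billing]))  # one record, canonical keys
```

The two records are the same customer wearing different labels; once the keys are harmonized, the duplicate becomes visible and can be dropped.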
What’s the Obstacle?
So what’s the holdup getting your data into a manageable form for big data? You’ve got a few different things standing in the way of pivoting your data management. It’s not just about throwing the next shiny piece of tech at it. Your human teams are your most significant resource making the shift to a big data culture, but first, you’ve got to:
- Address concerns – You know exactly where your shoes are in your own messy house, but you’d never send someone else in to look for them. Data is the same way. Departmental data swamps are comfortable for their owners but could be embarrassing to share.
- Give the right people access – Those of you with sensitive information rightfully restrict access, but it’s a delicate balance. You have to find the line between protecting your data and letting those who need it reach it. It may be a matter of finding a better way to process sensitive data anonymously.
- Stop ignoring the issue – Proceeding without a data integration plan isn’t going to work either. It keeps you from using data in ways that might reveal insights you didn’t have before, and it builds up your enterprise data debt.
The Old Ways Won’t Work
Traditional models of data management won’t keep up with big data variety. Many businesses are still operating under one of two types of data management systems. These are bound up in legacy systems that are difficult to scale and human labor-intensive for very little return.
Master Data Management (MDM)
MDM merges incoming data to match an existing master entity, swapping a common nickname for a full name, for example. Once the master records are in place, MDM delivers a rule of thumb for your structured data, helping deal with duplicates and null fields.
It’s human-intensive, however, requiring long hours to set up the record definitions. At the scale most enterprises operate now, that’s prohibitively difficult, and that’s assuming the human team manages to define every field without error. It won’t touch unstructured data – there’s no way it can – and using it will cost you.
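The MDM pattern described above can be sketched in a few lines: resolve a record to its master entity, then fill null fields from the master. The master table, nickname map, and records here are made-up illustrations, not a real MDM system.

```python
# Hypothetical master records and nickname-to-full-name mappings.
MASTER = {
    "Robert Smith": {"name": "Robert Smith", "phone": "555-0100", "city": "Austin"},
}
NICKNAMES = {"Bob Smith": "Robert Smith", "Rob Smith": "Robert Smith"}

def to_master(record: dict) -> dict:
    """Resolve a record to its master entity; fill nulls from the master."""
    full_name = NICKNAMES.get(record["name"], record["name"])
    master = MASTER.get(full_name, {})
    # Master values first, then the record's non-null values on top.
    merged = {**master, **{k: v for k, v in record.items() if v is not None}}
    merged["name"] = full_name
    return merged

incoming = {"name": "Bob Smith", "phone": None, "city": "Dallas"}
print(to_master(incoming))
# name resolved to "Robert Smith", null phone filled from the master,
# the incoming city is kept
```

The hard part MDM charges for is not this merge logic; it’s defining and maintaining the master records and mappings, which is exactly the human labor that doesn’t scale.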
Extract, Transform, Load (ETL)
This method uses conversion routines written upfront with cleaning and updating over time for accuracy. It’s excellent for the structured data you have now, but very difficult to scale because it’s time-intensive and expensive.
It won’t give you a data standard either. For all that intensive effort, it just doesn’t deliver the consistency you need across your points of consumption. It’s suitable for handling something like data volume, but on variety it just isn’t going to deliver.
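A toy version of the upfront conversion routines ETL relies on might look like this. The pipe-delimited source format and field names are assumptions for illustration; real pipelines hold hundreds of such routines, each hand-written and hand-maintained, which is where the time and expense come from.

```python
# A minimal extract -> transform -> load sketch, assuming a hypothetical
# pipe-delimited export as the source format.

def extract(raw: str) -> list:
    """Pull rows out of a pipe-delimited export."""
    return [line.split("|") for line in raw.strip().splitlines()]

def transform(rows: list) -> list:
    """Apply the upfront conversion rules: trim, normalize casing, type-cast."""
    return [{"sku": sku.strip().upper(), "qty": int(qty)} for sku, qty in rows]

def load(records: list, warehouse: list) -> None:
    """Append cleaned records to the target store (a list stands in for a DB)."""
    warehouse.extend(records)

warehouse = []
load(transform(extract("ab-1 | 4\nab-2 | 9")), warehouse)
print(warehouse)  # [{'sku': 'AB-1', 'qty': 4}, {'sku': 'AB-2', 'qty': 9}]
```

Note that the transform step only works because the source structure is known in advance; hand it an image, a chat transcript, or a differently delimited file and the routine has to be rewritten, which is why ETL struggles with variety.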
Thinking Differently about Data Variety
Your problem isn’t necessarily the amount of data you have but integrating data from more sources than ever before. While a handful of Silicon Valley giants are managing this issue, most companies are still struggling. They’re running legacy systems against unstructured data those systems were never meant to process, and it slows everything down.
[Related Article: Missing Data in Supervised Machine Learning]
Some of the biggest promise comes from a concept called Human Guided Machine Learning. It uses domain-level experts to help guide and train machines to label data that doesn’t fit neatly into columns and rows. Whatever you do, throwing money at your data won’t solve your problem if you aren’t investing in the right thing. As you consider your data variety solutions, look to the combination of human input and machine learning, because that could be the most viable path forward.
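One common shape for that human–machine combination is a confidence-gated loop: the machine labels what it’s sure about and routes the rest to a domain expert, whose answers can later feed back into training. The classifier, threshold, and labels below are stand-ins invented for this sketch, not any particular product’s method.

```python
# Hypothetical human-in-the-loop labeling: confident machine guesses pass
# through; low-confidence items are escalated to a domain expert.

def machine_label(text: str):
    """Stand-in classifier: a keyword rule with a made-up confidence score."""
    if "invoice" in text.lower():
        return "billing", 0.9
    return "unknown", 0.2

def human_label(text: str) -> str:
    """Stand-in for the domain expert's judgment."""
    return "support"

def label(texts: list, threshold: float = 0.5) -> list:
    labels = []
    for text in texts:
        guess, confidence = machine_label(text)
        labels.append(guess if confidence >= threshold else human_label(text))
    return labels

print(label(["Invoice #42 overdue", "My app keeps crashing"]))
# the machine handles the first item; the expert handles the second
```

The design point is the threshold: raise it and experts see more items (higher cost, higher quality); lower it and the machine takes on more (cheaper, riskier), which is exactly the investment trade-off the paragraph above describes.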