“Data can either be useful or perfectly anonymous but never both.” – Paul Ohm
Privacy and data collection go together like peanut butter and jelly, but in the world of big data, it’s becoming increasingly difficult to work with anonymous data without crossing a privacy line. So what’s a business supposed to do? Steve Touw offers some insights into how organizations can honor privacy without rendering data useless in his talk Privacy and Machine Learning: Peanut Butter and Jelly.
Not So Anonymous
Gawker recently ran an article showing how much celebrities tipped their taxi drivers, using supposedly anonymous data. The problem is that linking datasets together often breaks privacy. In the taxi case, the media was able to link photographs to anonymized government data and build a picture that broke those celebrities’ privacy.
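To make the idea of a link attack concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the field names, the toy records, and the `link` helper are assumptions, not the actual taxi data or the method the media used. It only shows how two datasets that each look harmless can be joined on the details they share.

```python
# Hypothetical "anonymized" trip records: no names, but time and place remain.
trips = [
    {"pickup_time": "21:15", "pickup_place": "Greenwich St", "tip": 0.00},
    {"pickup_time": "21:40", "pickup_place": "W 4th St", "tip": 5.50},
]

# Hypothetical outside knowledge, such as a timestamped photo of someone
# getting into a cab.
sightings = [
    {"name": "Person A", "time": "21:15", "place": "Greenwich St"},
]


def link(trips, sightings):
    """Join the two datasets on the quasi-identifiers they share."""
    index = {(t["pickup_time"], t["pickup_place"]): t for t in trips}
    matches = []
    for s in sightings:
        trip = index.get((s["time"], s["place"]))
        if trip is not None:
            # The name comes from one dataset, the tip from the other.
            matches.append({"name": s["name"], "tip": trip["tip"]})
    return matches


print(link(trips, sightings))
```

Neither dataset contains a name next to a tip, yet the join produces exactly that pairing.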
So what do you do? The case for more privacy initiatives.
If you hide all the data from taxis (name, date, medallion, pickup time, etc.), the data becomes useless. You also don’t know what the other party already knows, so deciding which identifiers to hide is difficult. This is why privacy matters even for small businesses that don’t hold sensitive information: each new piece of information brings someone one step closer to revealing identities.
The amount of data out there, from supposedly anonymous sources such as New York taxi data to the data we willingly give away (on social media, for example), makes these link attacks easier and more common. Newer regulations such as GDPR have tightened restrictions on data that can be used to directly or indirectly identify people in the EU, and other regulations are emerging. Three questions are worth asking:
- Are you using personal data beyond the scope of what your customers expect?
- Are you confining your data crawl only to what’s necessary for decision making?
- Do you understand what data was used to train your models?
The regulations may seem scary, but baking these types of precautions into how you handle data can create a better environment all around, one that doesn’t put you in danger of violating regulations or your customers’ civil liberties.
The Right To Privacy
At the turn of the twentieth century, invasion of privacy centered on photographs. Today, we accept being observed for the most part, but being identified is a serious issue. Going further, the future of privacy enforcement will focus on how your data is used rather than how it is collected. The observation is already there, and we can’t do much about it. What we can do is direct how our data is used, including whether it can be used against us.
These three questions center on how you’re using your customers’ data, and considering them better preserves privacy as we understand it now. Beyond that, there are ways to mask data without removing its usefulness.
Ways To Mask Data
Masking your data without removing the utility involves working around common issues.
Differential Privacy: One way to handle these privacy issues is differential privacy. Adding noise to the data gives a person a way to deny information about themselves without that information actually being removed. You get both utility and some anonymity. If a query is too sensitive, you just don’t respond at all.
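As a rough illustration, here is a minimal Python sketch of one common way to do this, the Laplace mechanism applied to a counting query. The `PrivateCounter` class, its parameter names, and the choice to refuse queries once a privacy budget is spent are assumptions made for this example. That refusal rule is only one way to read "too sensitive", and none of this is necessarily what the talk presented.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


class PrivateCounter:
    """Answers counting queries with Laplace noise until a budget runs out."""

    def __init__(self, records, total_epsilon, rng=None):
        self.records = records
        self.remaining = total_epsilon  # hypothetical overall privacy budget
        self.rng = rng or random.Random()

    def count(self, predicate, epsilon):
        # Refuse rather than answer once the budget cannot cover the query.
        if epsilon <= 0 or epsilon > self.remaining:
            return None
        self.remaining -= epsilon
        true_count = sum(1 for r in self.records if predicate(r))
        # Adding or removing one person changes a count by at most 1, so the
        # sensitivity is 1 and the noise scale is 1 / epsilon.
        return true_count + laplace_noise(1 / epsilon, self.rng)


# Toy, made-up records for illustration only.
counter = PrivateCounter([{"tip": 0.0}, {"tip": 5.5}, {"tip": 2.0}], total_epsilon=1.0)
print(counter.count(lambda r: r["tip"] > 1, epsilon=0.5))  # noisy answer near 2
print(counter.count(lambda r: r["tip"] > 1, epsilon=0.5))  # a different noisy answer
print(counter.count(lambda r: r["tip"] > 1, epsilon=0.5))  # None: budget is spent
```

A smaller `epsilon` means more noise and stronger deniability; a larger one means more accurate answers and weaker protection.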
Private Aggregation of Teacher Ensembles (PATE): Instead of sampling data, you use an ensemble of models. Each model in the ensemble is trained on its own exclusive slice of the data, and a little noise is added to the ensemble’s combined answer to each query. If the models’ votes are too evenly split, then, just like the differential privacy approach, it doesn’t respond.
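The aggregation step can be sketched in a few lines of Python. This is a simplified illustration, not the published PATE algorithm: the `noisy_aggregate` function, the Laplace noise, and the `min_margin` rule for abstaining are assumptions chosen to mirror the description above. The teacher models themselves are left out, so `votes` simply stands in for the labels they would predict.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_aggregate(votes, labels, noise_scale, min_margin, rng):
    """Return the noisy plurality label, or None when teachers are too split.

    votes: one predicted label per teacher model.
    labels: all possible labels (at least two).
    """
    counts = Counter(votes)
    # Add independent noise to each label's vote count.
    noisy = {label: counts[label] + laplace_noise(noise_scale, rng) for label in labels}
    ranked = sorted(labels, key=lambda label: noisy[label], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    # Abstain when the noisy lead is small: the answer would depend too
    # heavily on a handful of teachers, and so on their private data.
    if noisy[top] - noisy[runner_up] < min_margin:
        return None
    return top


rng = random.Random()
print(noisy_aggregate(["spam"] * 48 + ["ham"] * 2, ["spam", "ham"], 1.0, 10.0, rng))
print(noisy_aggregate(["spam"] * 25 + ["ham"] * 25, ["spam", "ham"], 1.0, 10.0, rng))
```

With strong agreement, the first call almost always returns `"spam"`; with an even split, the second almost always abstains and returns `None`.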
Understanding Your Data Models
Two errors in risk make privacy a harder issue. At the input layer, regulating the data itself is part of privacy. Remember, the data can only be used in a way that the user consented to, and it must be relevant to the question. Once a model is in production, you need to understand the data going into it, because if policy changes, you’re on the hook whether you understand the data or not.
Your input level needs visibility so that you understand what consent covers each piece of data. Mechanisms should be in place to control who sees data, how they see it, and how that data is used. Data doesn’t have to be rendered completely useless, but the more your organization understands the type of data involved, the more compliant you’ll be as regulations, privacy initiatives, and our understanding of privacy evolve.
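One way to picture such a mechanism is a small policy table that decides, per purpose, which fields are visible and in what form. The Python sketch below is purely illustrative: the `POLICY` table, the purposes, the field names, and the `view` function are all hypothetical, and a real system would manage the secret key and consent records far more carefully.

```python
import hashlib
import hmac

# Hypothetical secret for keyed hashing; a real deployment would store this
# in a secrets manager, not in source code.
SECRET_KEY = b"replace-me"

# Hypothetical policy: for each purpose, which fields appear in the clear and
# which appear only as a keyed hash. Anything not listed is dropped.
POLICY = {
    "fraud_review": {"clear": {"medallion", "fare"}, "hashed": {"rider_name"}},
    "tip_analytics": {"clear": {"fare", "tip"}, "hashed": set()},
}


def view(record, purpose, consented_purposes):
    """Return the version of a record that a given purpose is allowed to see."""
    # No policy for the purpose, or no consent for it: show nothing.
    if purpose not in POLICY or purpose not in consented_purposes:
        return None
    rules = POLICY[purpose]
    out = {}
    for field, value in record.items():
        if field in rules["clear"]:
            out[field] = value
        elif field in rules["hashed"]:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:12]
    return out


# Toy, made-up record for illustration only.
record = {"rider_name": "Person A", "medallion": "X1", "fare": 12.0, "tip": 2.0}
print(view(record, "tip_analytics", {"tip_analytics"}))
print(view(record, "fraud_review", {"tip_analytics"}))  # None: no consent
```

The same record yields different views depending on purpose and consent, which covers all three controls: who sees the data, how they see it, and how it is used.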
Privacy and Your Data
The point of data is to find insight, but spending time upfront protecting privacy and setting controls on how data is used is a worthy first step. It lets you keep gleaning the kind of insights you need from big data without violating privacy.
We can’t do anything about where our data is collected. Generally, we all know our data is being harvested, but we’re demanding greater control over how our data is used, putting the onus on organizations collecting the data to begin initiatives for compliance.
At the end of the day, it’s a collaboration between data scientists, the IT folks, and the pressure for compliance. These three factors drive compliance and the adoption of new forms of privacy. Reducing the risk is paramount to moving forward with big data for businesses and organizations. It’s time for more innovative privacy initiatives!