Data Privacy & Data Science: The Next Generation of Data Experimentation
Collection of consumer data is mainstream and global. The generally accepted premise of the big data movement is to collect as much data as possible, since storage is in theory cheap, and then to use distributed computing and advanced analytics to sift through the mess and answer complex questions across your enterprise. While I still believe this isn't necessarily a great idea (there is such a thing as bad data), I digress. To the point of this article: as a society, we have reached the point where the global populace is asking for the IT industry to be held responsible for safeguarding individual data. If the cat is out of the bag, and the data collection will not stop, then the next logical question is: how do we protect the privacy of individuals and groups?
In this new world order, data collection must come with a corporate responsibility to protect data. Sometimes this is a legal responsibility. Other times it’s a social responsibility.
Social responsibility is quite complicated, and truly a grey area. It's all about what you feel is "right." As far as legal responsibility is concerned, some rigidly defined data privacy controls are starting to appear in the form of new regulations around the world. To name a few:
- General Data Protection Regulation (GDPR)
- Russian Federal Law on Personal Data
- German Bundesdatenschutzgesetz (BDSG)
Let's take the EU's GDPR, which comes into full effect in May 2018, as an example of enforceable data privacy legislation. The intent of the GDPR is to provide a single set of enforceable rules for data protection throughout the EU, thereby making it easier for non-European companies to comply with these regulations. The regulation applies if the data controller or processor (organization) or the data subject (person) is based in the EU. Furthermore (and unlike current EU privacy laws), the regulation also applies to organizations based outside the European Union if they process personal data of EU residents.
GDPR is not just a slap on the wrist. If you incur a breach or misuse the data, you may be fined up to 20,000,000 EUR or up to 4% of your annual worldwide turnover, whichever is greater.
So what does all this mean? Enterprises must begin to separate security and privacy. Encryption, defensive cyber controls, etc. are security policies.
Privacy is a data management problem with a business process wrapped around it, which culminates in a data governance strategy for an organization. This includes actual human roles such as a data protection officer, data controllers, and data processors. And it also includes audit/compliance reporting that contains data lineage/provenance attached to data.
Sounds boring, kind of. But a well-built governance strategy creates a workflow for the creation of advanced analytics, with data privacy at the core of the design. And that is important. Designing models/analytics and then going back to add data privacy controls is much, much more difficult, sometimes impossible, or at the very least incredibly risky.
The problem is that there is very little to help organizations get started; enforcement is left entirely to leadership. My time working with sensitive data within the US Government taught me that data privacy is an ongoing process, and tools must be put in place to abstract human decisions about legal policy away from the code. Otherwise, engineers either take their own approach to rules and regulations or, even worse, circumvent them. And don't forget that regulations change. If you bake your policy logic into your code in a custom way, your total O&M costs skyrocket with every change.
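To make the point concrete, here is a minimal sketch of what "abstracting the policy away from the code" can look like. The policy document, column names, and purpose labels are all hypothetical, not any particular product's format; the idea is simply that when the regulation changes, only the policy document changes, never the analytic that consumes it.

```python
import json

# Hypothetical policy definition, externalized as data rather than
# hardcoded in the analytic. A regulator-driven change (e.g. a new
# masked column) is an edit here, not a code deployment.
POLICY_JSON = """
{
  "masked_columns": ["ssn", "email"],
  "allowed_purposes": ["fraud_detection"]
}
"""

def apply_policy(record, policy, purpose):
    """Return a copy of `record` with the policy enforced, or refuse
    outright if the stated purpose of use is not permitted."""
    if purpose not in policy["allowed_purposes"]:
        raise PermissionError(f"purpose {purpose!r} not permitted")
    return {
        key: ("***" if key in policy["masked_columns"] else value)
        for key, value in record.items()
    }

policy = json.loads(POLICY_JSON)
row = {"name": "Ada", "ssn": "123-45-6789", "email": "ada@example.com"}
masked_row = apply_policy(row, policy, "fraud_detection")
print(masked_row)  # ssn and email are masked; name passes through
```

The analytic never mentions GDPR, BDSG, or any specific rule; it only calls `apply_policy`, so the same code survives a regulatory change untouched.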
We started a company (Immuta) to solve this problem. Any sane data scientist wants to inherit data lineage and data access controls instead of implementing their own custom solution. Our goal is to make the entire legal process around data completely transparent to the data scientist.
Data scientists face two problems when it comes to designing models in highly regulated environments:
- How do I design models on top of regulated data without risking a regulatory violation, compromising consumer privacy, or spending significant time writing custom controls into my code?
- How do I deploy models that run on top of data in which the policies on the data are constantly changing?
To mitigate these issues, the following must be implemented and enforced:
- Policies built into the data source that can be changed dynamically
- A common access layer to enforce policies and control data access
- An abstraction of existing identity management from the app or analytic, much like a single sign-on but for data
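The three requirements above can be sketched in a few lines. This is an illustrative toy, not Immuta's implementation; all class and attribute names are hypothetical. Policies live on the data source and can be changed at runtime, a single access layer mediates every read, and the user's identity is resolved to attributes outside the analytic, much like single sign-on.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    attributes: set  # resolved by identity management, not by the app

@dataclass
class DataSource:
    name: str
    rows: list
    policies: list = field(default_factory=list)  # mutable at runtime

class AccessLayer:
    """Single choke point: every read is filtered through the source's
    *current* policies against the caller's attributes."""
    def read(self, source: DataSource, user: User) -> list:
        rows = source.rows
        for policy in source.policies:
            rows = [row for row in rows if policy(row, user)]
        return rows

# Example policy: EU rows are visible only to users holding 'eu_analyst'
def eu_only(row, user):
    return row["region"] != "EU" or "eu_analyst" in user.attributes

sales = DataSource("sales", rows=[
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
], policies=[eu_only])

layer = AccessLayer()
print(layer.read(sales, User("ana", {"eu_analyst"})))  # both rows
print(layer.read(sales, User("bob", set())))           # US row only
```

Because the analytic only ever talks to `AccessLayer`, appending a new policy to `sales.policies` changes what every consumer sees without touching a single model.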
This concept starts with data access. You must empower data owners to expose their data in ways that let them control and monitor access, while providing a simple way to apply mission-unique policies. The following video walks through the Immuta approach to virtualizing data and applying GUI-driven policies to data without needing an engineer in the loop. Data owners and data consumers are able to expose data sources via databases, APIs, and file systems through a GUI-based approach.
Once a data source is exposed, and access is controlled, the next logical problem is how to execute code on top of that data while enforcing dynamic policies on the data, per each user and/or machine. The following video goes through the Immuta process of enforcing policies while data scientists query and analyze data based on changing policies on the data, as well as managing authorizations of the user:
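The key property here is that the policy is evaluated on every query, so a change to the policy or to a user's authorizations takes effect on the very next read, while the deployed model's code never changes. A tiny hypothetical sketch (the per-user masking table is invented for illustration):

```python
# Per-user masking rules, changeable at runtime by a policy author.
masked = {"analyst": {"ssn"}}

def query(rows, user):
    """Evaluate the *current* policy on every query."""
    return [{key: ("***" if key in masked.get(user, set()) else value)
             for key, value in row.items()} for row in rows]

rows = [{"name": "Ada", "ssn": "123-45-6789"}]
print(query(rows, "analyst"))   # ssn masked under the current policy
masked["analyst"] = set()       # policy relaxed at runtime...
print(query(rows, "analyst"))   # ...same code, ssn now visible
```

Contrast this with policy baked into the model: there, the relaxation (or tightening) would require a code change and a redeployment for every affected analytic.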
At Immuta, our goal is to help bring forth the next evolution of data experimentation. We believe privacy will be at its core. No longer is it acceptable to risk the exposure of sensitive data. And that means we need to help the enterprise build in privacy from the start.
Originally posted at www.immuta.com