The Role of DevSecOps in Ensuring Data Privacy and Security in Data Science Projects
Business + Management | Posted by ODSC Community, March 16, 2023
The process of collecting and organizing data is tedious, largely because data from each field comes with its own set of challenges. Data security is very important for organizations because their clients need to trust them with sensitive information. Organizations are solely responsible for protecting the entrusted data from unauthorized access and misuse.
The increasing number of data breaches and the new legislation related to data security and privacy have been the driving forces for organizations to treat data as the paramount entity in data-driven applications.
To address the data security issue, we must start securing data in the data collection phase. Guidelines must be standardized across the organization so that security best practices are built into the entire MLOps lifecycle. This shift in thinking has led to DevSecOps, a methodology that integrates security into the software development and MLOps process.
DevSecOps has emerged as a promising approach to address the above challenges. It includes all the characteristics of DevOps, such as faster deployment, automated build and deployment pipelines, and extensive testing. In addition to these capabilities, DevSecOps provides tools for automating security best practices. These checks belong at the far left of the pipeline, early in development, rather than only at the deployment stages. This enables developers to write code with security in mind and to catch issues before they become expensive to fix, which can reduce development time considerably.
Purpose of Using DevSecOps in Traditional and ML Applications
DevSecOps practices differ between traditional and ML applications because each comes with different challenges. In traditional applications, DevSecOps concentrates on identifying vulnerable third-party libraries, analyzing licensing issues, scanning for vulnerable Docker images, and identifying Kubernetes manifest misconfigurations. DevSecOps tools also identify secrets or keys exposed in the application and alert the developers and the security team.
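As a rough illustration of the exposed-secret check, the sketch below scans text for a few credential-like patterns. The rule names, the patterns, and the `find_secrets` function are illustrative assumptions; dedicated scanners use much larger rule sets and additional heuristics.

```python
import re

# Illustrative patterns only (an assumption for this sketch).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def find_secrets(text):
    """Return (line_number, rule_name) pairs for every suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

A check like this would typically run as a pre-commit hook or an early pipeline stage, so that a finding blocks the change before it reaches a shared repository.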
For ML-based applications, DevSecOps tools concentrate more on analyzing the provenance and quality of the data, introducing role-based access control (RBAC), promoting centralized storage, and securing the data for future audits. The practices described for traditional applications still apply to ML-based applications, which need these data-focused capabilities in addition.
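A minimal sketch of RBAC-based authorization for data assets might look like the following. The role names, the permission strings, and the `is_allowed` helper are hypothetical and chosen only for illustration; in practice the mapping would live in an identity or policy service rather than in code.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "data_engineer": {"raw_data:read", "raw_data:write", "features:write"},
    "data_scientist": {"features:read", "model:train"},
    "auditor": {"audit_log:read"},
}


def is_allowed(role, permission):
    """Deny by default: unknown roles and unlisted permissions are rejected."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

The deny-by-default lookup means that a new role has no access until permissions are granted to it explicitly.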
Where and Why is Data Security Required in the MLOps Lifecycle?
The MLOps lifecycle mainly involves the data collection, data transformation, model training, model deployment, and maintenance and post-deployment monitoring phases. DevSecOps needs to be incorporated into all these phases to support security, privacy, and timely audits.
The MLOps lifecycle starts with the data collection phase. Data security must begin by establishing whether the collected data complies with data protection regulations such as GDPR or HIPAA. At this stage, the provenance of the collected data is analyzed and the metadata is logged for future audits.
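The provenance logging step could be sketched as follows. The field names (`source`, `legal_basis`), both function names, and the JSON-lines log format are assumptions made for this example, not a prescribed schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(dataset_bytes, source, legal_basis):
    """Build an audit record describing where a dataset came from."""
    return {
        # The content hash lets an auditor confirm the data was not altered later.
        "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "size_bytes": len(dataset_bytes),
        "source": source,
        "legal_basis": legal_basis,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }


def append_to_audit_log(record, log_path):
    """Append one record as a JSON line; earlier entries are left untouched."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")
```

Recording a content hash alongside the source makes it possible to tie a trained model back to the exact dataset it was built from.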
In the data transformation phase (which includes feature engineering and feature extraction), sensitive data and any feature that uniquely identifies an individual are anonymized. This prevents the information from being misused against a particular individual or group. The transformed data needs to be encrypted, and only authorized users should have access to it.
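One common way to mask identifying fields is keyed hashing, sketched below with Python's standard `hmac` module. Note that this is pseudonymization rather than full anonymization: anyone holding the key can re-link records, and GDPR still treats pseudonymized data as personal data. The function names and the example field names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record, identifier_fields, secret_key):
    """Return a copy of the record with the named identifier fields replaced."""
    return {
        field: pseudonymize(str(value), secret_key) if field in identifier_fields else value
        for field, value in record.items()
    }
```

Because the hash is deterministic for a given key, the same person maps to the same token across tables, so joins still work. The key itself should be stored in a secrets manager, separately from the data.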
In the model training phase, ample precautions need to be taken to avoid bias against a particular group, so that the model doesn't exclude or unfairly discriminate against that group. Subpopulation analysis, partial dependence plots, and Shapley values can be used to detect bias so that it can be corrected.
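Subpopulation analysis can be as simple as computing the same metric separately for each group and comparing the results. The sketch below uses accuracy; the function names are illustrative, and the choice of metric and of an acceptable gap are assumptions that depend on the application.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each subpopulation."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


def largest_accuracy_gap(per_group_accuracy):
    """Difference between the best-served and worst-served group."""
    values = per_group_accuracy.values()
    return max(values) - min(values)
```

A large gap does not prove discrimination on its own, but it flags a group for which the model deserves a closer look before release.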
Another aspect to consider in the model training phase is resilience to adversarial attacks. Security breaches do not happen only during model inference. Small changes to the input data that a human would barely notice, or would dismiss as irrelevant, can cause a deep learning model to change its prediction. Deep learning is built largely on matrix multiplication, so even small perturbations in the inputs or coefficients can add up to a significant change in the output. If an attacker gets hold of the training data, they can use brute force to generate problematic inputs that make the model produce erroneous results.
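The way small perturbations add up can be shown on a toy linear scorer, which stands in here for a single layer of a network. This is a simplified sketch, not an attack on a real deep model; the function names and the numbers in the usage note are made up for illustration.

```python
def score(weights, bias, features):
    """Linear score: positive means class 1, negative means class 0."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def perturb_against(weights, features, epsilon):
    """Move each feature by at most epsilon in the direction that lowers the score."""
    return [x - epsilon * sign(w) for w, x in zip(weights, features)]
```

With 100 features, weights alternating between 0.5 and -0.5, and inputs alternating between 0.01 and -0.01, the score is positive. Shifting every feature by only 0.02 with `perturb_against` lowers each term slightly, and the sum of those small changes is enough to make the score negative and flip the predicted class.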
In the deployment phase, inference needs to be secured. The input data sent for inference needs to be encrypted, and the model endpoints must be deployed with strong security measures. Incorporating DevSecOps in the model deployment phase helps mitigate injection and poisoning attacks.
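One inexpensive defence at the model endpoint is to validate every inference request against an expected schema before it reaches the model. The sketch below assumes a hypothetical model with two numeric inputs; the field names, the ranges, and the `validate_inference_input` function are illustrative only, and validation complements rather than replaces transport encryption and authentication.

```python
# Hypothetical input schema: field name -> (minimum, maximum).
EXPECTED_SCHEMA = {"age": (0.0, 120.0), "income": (0.0, 1e7)}


def validate_inference_input(payload, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the payload is acceptable."""
    problems = []
    for name in sorted(set(payload) - set(schema)):
        problems.append(f"unexpected field: {name}")
    for name, (low, high) in schema.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
            continue
        value = payload[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"non-numeric field: {name}")
        elif not low <= value <= high:
            # NaN fails every comparison, so it is rejected here as well.
            problems.append(f"out-of-range field: {name}")
    return problems
```

Rejecting malformed requests at the boundary keeps unexpected strings, extra fields, and extreme values away from both the model and any downstream logging or storage.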
Maintenance and Post-deployment Monitoring
This is the final phase in the MLOps lifecycle. After the model is deployed, resource monitoring and health checks need to be performed periodically to detect breaches. In addition, the model has to be monitored continuously for data drift and model drift. All these steps have to be automated, because the complexity grows manyfold as the number of deployed models increases.
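Data drift monitoring can be automated with a simple statistic such as the population stability index (PSI), which compares the distribution of a feature in production against a training-time baseline. The sketch below is one possible implementation under stated assumptions: equal-width bins derived from the baseline, a small floor to avoid division by zero, and a hypothetical function name.

```python
import math


def population_stability_index(expected, actual, bins=10, floor=1e-6):
    """PSI between a baseline sample and a production sample of one feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_shares(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        return [max(count / len(values), floor) for count in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )
```

A scheduled job could compute this value per feature and raise an alert when it exceeds a threshold chosen by the team. The threshold is a tuning decision and is not fixed by the method.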
Role of GDPR and CCPA in Responsible AI and Governance
The General Data Protection Regulation (GDPR) took effect in the European Union in May 2018 and sets rules for the collection and processing of the personal data of individuals in the EU. Some of its salient features are correcting inaccurate data, deleting data once it is no longer required, restricting the processing of personal data, curbing the reuse of collected data elsewhere, and informing people about the data collected about them.
The California Consumer Privacy Act of 2018 (CCPA) gives consumers more control over the personal information that businesses collect about them. The salient features of CCPA are the right to know, the right to delete, the right to opt out, and the right to non-discrimination. A new wave of regulations and guidelines specifically targeting AI has started to emerge, promoting responsible AI and model governance.
Mark Treveil & the Dataiku Team, Introducing MLOps: How to Scale Machine Learning in the Enterprise (Shroff Publishers & Distributors Pvt. Ltd, 2020), p. 112
In conclusion, DevSecOps plays a crucial role in ensuring data privacy and security in data science projects. By promoting collaboration and automation, it helps maintain data privacy and security throughout the application's lifecycle. With regulations like GDPR and CCPA in force, the use of DevSecOps will become increasingly important in data science projects. It also helps organizations comply with data privacy regulations, reducing the risk of legal and financial penalties.