Many companies now use data science and machine learning, but there is still considerable room to improve the return on those investments. A 2021 VentureBeat analysis suggests that 87% of AI models never make it to a production environment, and an MIT Sloan Management Review article found that 70% of companies reported minimal impact from AI projects. Yet despite these difficulties, Gartner forecasts investment in artificial intelligence to reach an unprecedented $62.5 billion in 2022, an increase of 21.3% from 2021.
Nevertheless, we are still left with the question: how can we do machine learning better? To find out, we've drawn on some of the upcoming tutorials and workshops from ODSC East 2022 and let the experts' topics guide us toward building better machine learning.
#1 Build a Drift Detector
Although powerful, modern machine learning models can be sensitive: seemingly subtle changes in the data can destroy the performance of otherwise state-of-the-art models, which is especially problematic when those models are deployed in production. Drift detection is the discipline of detecting such changes and understanding how, and in what forms, drift occurs. Building a drift detector with a tool like Alibi Detect, an open-source Python library containing algorithms for adversarial, outlier, and drift detection, lets you detect drift in a principled manner.

Since data can take many forms, such as images, text, or tabular data, one often needs existing machine learning models to preprocess the data into a form suitable for drift detectors. To gain further insight into the causes of drift, state-of-the-art detectors can perform fine-grained attribution to individual instances and features. To assess whether model performance has actually been affected by drift, you can experiment with supervised and uncertainty-based detectors. And because data often arrives sequentially in production environments, drift can also be detected with online detectors.
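As a minimal sketch of the principle (not Alibi Detect's API — its `alibi_detect.cd` detectors wrap similar statistical tests), the example below flags univariate drift with a two-sample Kolmogorov–Smirnov statistic in pure NumPy. The `detect_drift` helper and its 0.1 threshold are hypothetical choices for this toy example:

```python
import numpy as np

def ks_statistic(ref, x):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    # the empirical CDFs of the reference sample and the new batch.
    ref, x = np.sort(ref), np.sort(x)
    grid = np.concatenate([ref, x])
    cdf_ref = np.searchsorted(ref, grid, side="right") / len(ref)
    cdf_new = np.searchsorted(x, grid, side="right") / len(x)
    return np.max(np.abs(cdf_ref - cdf_new))

def detect_drift(ref, x, threshold=0.1):
    # Flag drift when the statistic exceeds a threshold calibrated
    # on the reference data (0.1 is an arbitrary choice here).
    return ks_statistic(ref, x) > threshold

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=2000)     # training-time distribution
same = rng.normal(0.0, 1.0, size=500)     # same distribution: no drift
shifted = rng.normal(1.5, 1.0, size=500)  # mean shift: drift
print(detect_drift(ref, same), detect_drift(ref, shifted))
```

In practice a library detector also returns a p-value and handles multivariate data; this sketch only shows the core idea of comparing a production batch against a reference sample.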
Abstracted from: An Introduction to Drift Detection – Ashley Scillitoe, Research Engineer, Seldon, and Ed Shee, Head of Developer Relations, Seldon – April 19th, 2022
#2 Understand Machine Learning Safety
Machine learning systems are rapidly increasing in size, acquiring new capabilities, and being deployed more and more in high-stakes settings. As with other powerful technologies, safety should be a major priority for any researcher or practitioner. Some problems in machine learning safety are well known and addressable, but many unsolved problems remain, and recent large-scale models in particular pose an emerging safety challenge. The three pillars of machine learning safety are withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), and steering ML systems ("Alignment").
Abstracted from: Unsolved ML Safety Problems – Dan Hendrycks, Research Intern, DeepMind
#3 Learn To Mitigate Bias in Machine Learning
It's well established that bias is everywhere: in data, in algorithms, in humans. As data increases exponentially in nearly all dimensions (velocity, volume, veracity) and machine learning systems trained on these data sets become omnipresent, dealing with bias becomes both more complex and more important. Practitioners may underestimate the complexity that bias introduces into machine learning workflows. Thus, the first steps are awareness, understanding data quality, and proficiency in measuring and monitoring machine learning algorithms. Next comes an understanding of how to mitigate these problems.
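Measuring is the concrete starting point. One common fairness metric is the demographic parity gap: the difference in positive-prediction rates between groups. The sketch below is a hypothetical helper (not from the talk) computing it with NumPy:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    # Absolute difference in positive-prediction rates between
    # two groups encoded as 0/1 in `group`.
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return abs(rate_a - rate_b)

# Toy predictions: group 1 receives positive predictions far more often.
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])
group  = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
gap = demographic_parity_gap(y_pred, group)
print(round(gap, 2))  # → 0.6
```

A gap of 0 would mean both groups receive positive predictions at the same rate; monitoring a metric like this over time is one simple way to surface bias in a deployed model.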
Abstracted from: Dealing with Bias in Machine Learning – Thomas Kopinski, PhD, Professor for Data Science, University of South Westphalia
#4 Measure Technical Debt
Technical debt is a concept introduced by Ward Cunningham in 1992 to describe the long-term costs (debt) incurred by moving quickly in software engineering. The concept applies readily to machine learning, where debt can take many forms, such as feedback loops (direct or hidden), data dependencies, high-debt design patterns, poor abstraction, and more, as explained in the 2015 paper by Sculley et al. Seven years after that paper's publication, "MLOps" is now a hot topic. MLOps aims to deploy and maintain machine learning models in production quickly and reliably. Its widespread adoption, however, has the potential to substantially increase technical debt in production deployments. Understanding technical debt in machine learning workflows goes beyond just modeling libraries; it requires additional methods for monitoring, lineage, and deployment.
Abstracted from: MLOps: Relieving Technical Debt in ML with MLflow, Delta and Databricks – Sean Owen, Principal ML Solutions Architect & Yinxi Zhang, PhD, Senior Data Scientist, Databricks
#5 Build New Methods To Handle Missing Values
Missing values are ubiquitous in data analysis, and many methods to tackle the problem are well understood. Classical methods include single imputation, multiple imputation, and likelihood-based methods developed in an inferential framework. These methods aim to best estimate the parameters and their variance in the presence of missing data.
Recent results from efforts to use a supervised-learning setting to predict a target when missing values appear in both the training and test data look promising.
One striking result from these efforts is that naive imputation strategies (such as mean imputation) can be the best method to use, as the supervised-learning model does the hard work. That such a simple approach can be competitive may have important consequences in practice. Additionally, missing-value modeling can be readily incorporated into tree models, such as gradient-boosted trees, yielding a learner that has been shown to perform very well, including in difficult missing-not-at-random settings.
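The naive mean-imputation strategy mentioned above fits in a few lines of NumPy; the `mean_impute` helper here is a hypothetical illustration, not code from the talk:

```python
import numpy as np

def mean_impute(X):
    # Replace each NaN with the observed mean of its column.
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)       # per-column mean, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

X = np.array([[1.0,    2.0],
              [np.nan, 4.0],
              [3.0,    np.nan]])
print(mean_impute(X))
```

Here the NaN in column 0 becomes 2.0 (the mean of 1 and 3) and the NaN in column 1 becomes 3.0 (the mean of 2 and 4). For the tree-based route, scikit-learn's `HistGradientBoostingClassifier` handles NaN inputs natively, so no imputation step is needed at all.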
Abstracted from: Overview of Methods to Handle Missing Values – Julie Josse, PhD, Advanced Researcher, Inria and Gael Varoquaux, PhD Research Director | Director, Scikit-learn, Inria
#6 Build Defense Methods for Adversarial Attacks
Classifiers and other machine learning models can be easily tricked into making embarrassingly wrong predictions. When this is done systematically and intentionally, it is called an adversarial attack; when the attack perturbs a deployed model's inputs at prediction time, it is specifically an evasion attack. Defenses against such attacks include spatial smoothing preprocessing and adversarial training, and methods to verify that a model can withstand them include robustness evaluation and certification.
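To make the idea concrete, the sketch below runs the Fast Gradient Sign Method (FGSM), a classic evasion attack, against a toy logistic model. The weights, inputs, and `eps` are illustrative choices, not values from the workshop; a real attack would target a trained network via its framework's autodiff:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_attack(x, w, b, y, eps=0.5):
    # FGSM on a logistic model p(y=1|x) = sigmoid(w.x + b):
    # take a bounded step in the input direction that most
    # increases the cross-entropy loss for the true label y.
    grad = (sigmoid(w @ x + b) - y) * w   # d(loss)/dx for logistic loss
    return x + eps * np.sign(grad)

w = np.array([2.0, -1.0]); b = 0.0
x = np.array([0.4, -0.2])                 # score w.x + b = 1.0 -> class 1
y = 1.0
x_adv = fgsm_attack(x, w, b, y)
print(sigmoid(w @ x + b) > 0.5, sigmoid(w @ x_adv + b) > 0.5)  # → True False
```

A small, bounded perturbation (each feature moved by at most `eps`) flips the prediction. Adversarial training, one of the defenses mentioned above, works by generating perturbed examples like `x_adv` and including them in the training set.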
Abstracted from: Adversarial Robustness: How to Make Artificial Intelligence Models Attack-proof! – Serg Masís, Climate Data Scientist at Syngenta and author of the bestselling book "Interpretable Machine Learning with Python"
#7 Go Beyond Basic Model Evaluation
Machine learning has made rapid progress, but at the cost of becoming more complex and opaque. Despite widespread deployment, the practice of evaluating models remains limited to computing aggregate metrics on held-out test sets. This practice can fail to surface failure modes that would otherwise show up during real-world usage. A better approach is to ask: why did the model make this prediction? One way to answer that question is to attribute predictions to input features, a problem that has received a lot of attention in the last few years.
Integrated Gradients (ICML 2017) is an attribution method applicable to a variety of deep neural networks (object recognition, text categorization, machine translation, etc.). Evaluation workflows based on feature attributions have several applications, and attributions can also be used to monitor models in production.
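The sketch below approximates Integrated Gradients for a toy model F(x) = sigmoid(w·x), using a midpoint Riemann sum along the straight-line path from a baseline to the input; a real deep network would compute the path gradients with a framework's autodiff rather than this hand-written derivative. The method's completeness axiom (attributions sum to F(x) − F(baseline)) provides a built-in sanity check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(x, baseline, w, steps=200):
    # IG_i = (x_i - baseline_i) * average of dF/dx_i along the
    # straight-line path from baseline to x (midpoint Riemann sum).
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # (steps, n) path points
    p = sigmoid(path @ w)
    grads = (p * (1 - p))[:, None] * w                   # chain rule: dF/dx
    return (x - baseline) * grads.mean(axis=0)

w = np.array([1.0, -2.0, 0.5])
x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline, w)
# Completeness axiom: attributions sum to F(x) - F(baseline).
print(np.allclose(attr.sum(), sigmoid(w @ x) - sigmoid(w @ baseline), atol=1e-4))  # → True
```

Each entry of `attr` tells you how much that feature contributed, positively or negatively, to moving the prediction away from the baseline, which is exactly the kind of signal an attribution-based evaluation or monitoring workflow consumes.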
Abstracted from: Evaluating, Interpreting and Monitoring Machine Learning Models – Ankur Taly, PhD, Staff Research Scientist, Google
Learn Better Methods For Better Machine Learning at ODSC East 2022
To dive deeper into these topics, join us at ODSC East 2022 this April 19th-21st. The conference will also feature hands-on training sessions in focus areas, such as machine learning, deep learning, MLOps and data engineering, responsible AI, machine learning safety and security, and more. What’s more, you can extend your immersive training to 4 days with a Mini-Bootcamp Pass. Check out all of our free and paid passes here.
Perform better machine learning with some of these sessions:
- Need of Adaptive Ethical ML models in the post-pandemic era: Sharmistha Chatterjee | Senior Manager, Data Sciences & Juhi Pandey | Senior Data Scientist | Publicis Sapient
- AI Observability: How To Fix Issues With Your ML Model: Danny D. Leybzon | MLOps Architect | WhyLabs
- Data Science and Contextual Approaches to Palliative Care Need Prediction: Evie Fowler | Manager/Data Science Product Owner | Highmark Health
- Demystify the gap between Data Scientist and Business Users: Amir Meimand, PhD | Data Science/ML Solution Engineer | Salesforce
- Dealing with Bias in Machine Learning: Thomas Kopinski, PhD | Professor for Data Science | University of South Westphalia
- Mastering Gradient Boosting with CatBoost: Nikita Dmitriev | Member of CatBoost Team | Yandex
- Network Analysis Made Simple: Eric Ma, PhD | Author of nxviz Package
- End to End Machine Learning with XGBoost: Matt Harrison | Python & Data Science Corporate Trainer | Consultant, MetaSnake