Each ODSC conference is a unique opportunity to reconnect with your community and learn about trending topics and tools from incredibly talented individuals. We are gathering attendees from around the world and that includes some of the top minds in AI.
Here’s just a sample of 21 of over 110 machine learning talks from industry leaders you just shouldn’t miss this April 19th-21st at ODSC East 2022. Grab a free Bronze Pass and attend in-person or virtually to see these 21 free machine learning talks.
#1: Overconfidence in Machine Learning: Do Our Models Know What They Don’t Know? [Keynote]: Padhraic Smyth, PhD, Chancellor’s Professor, UC Irvine
The past few years have seen major improvements in the accuracy of machine learning models in areas such as computer vision, speech recognition, and natural language processing. These models are increasingly being deployed across a variety of commercial, medical, and scientific applications. While these models can be very accurate in their predictions, they can still make mistakes, particularly when used in environments different from those they were trained on. A natural question in this context is whether models are able to calibrate their predictions: can we trust the confidence of a model? Can models “self-assess” in terms of knowing what they don’t know? In this talk, I will discuss key ideas and recent research in this area, including work on prediction confidence and human-AI collaboration.
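To make the calibration question concrete, here is a minimal sketch of one common way to measure it, the expected calibration error (ECE). It is an illustration only, not the speaker's method: the function name, the choice of ten equal-width bins, and the toy inputs are all assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# A model that says "90% sure" four times but is right only three times
# is overconfident by roughly 0.15 on this toy sample.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A value near zero means stated confidence tracks observed accuracy; larger values mean the model's confidence should not be taken at face value.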
#2: AI in Finance: Examples and Discussion [Keynote]: Manuela Veloso, PhD, Head of AI Research, J.P. Morgan Chase AI Research, Former Head of the Machine Learning Department, Carnegie Mellon University
In this talk, I will present examples of our recent AI research and practice experience in the finance domain, addressing data, reasoning, and execution of AI approaches. Presented projects will be on AI for data discovery, data standardization, synthetic data, behavior understanding, and reasoning and explainability. We are driven by the goals of servicing and assisting humans in their complex tasks.
#3: Machine Learning for A/B Testing: Alex Peysakhovich, PhD, Senior Research Scientist, Facebook AI Research
In this talk, we will discuss how tools from machine learning can help practitioners move beyond the “run an A/B test, check significance, launch if the test group is better than the control group” workflow. We will focus on the case where hundreds of A/B tests are run in the course of daily practice.
#4: Learned optimizers: Luke Metz, Research Scientist, Google Brain
Much as replacing hand-designed features with learned functions has revolutionized how we solve perceptual tasks, learned algorithms also have the potential to transform how we train machine learning models. Learned optimizers are one such learned algorithm. Instead of writing down mathematical expressions to perform optimization, a learned optimizer learns the function to perform optimization. This talk will outline how these learned optimizers work, and discuss a number of difficulties that arise when training them. Finally, I will share some interesting behaviors which are starting to emerge.
#5: Data Science and AI at Moderna: Andrew Giessel, Director of Data Science and Artificial Intelligence, Moderna
In this talk, I will highlight the way that digital and platform mindsets have shaped the way Moderna carries out its mission to deliver on the promise of mRNA therapies to help patients. I will introduce my group, discuss our vision for data science and AI at Moderna, and highlight a recent example of applying machine learning to the challenge of protein sequence design.
#6: What I love and hate about Dask: Matthew Rocklin, PhD, CEO and founder, Coiled
Dask is a well-used framework for parallel and distributed computing in Python. It is used in many ways, including scalable versions of pandas, numpy, and other libraries. It’s also used as a general-purpose toolkit for lower-level task parallelism. Dask optimizes deployment, network communication, resilience, and load balancing so that you don’t have to.
However, like any well-used open-source framework (pandas, numpy, Python itself), Dask also has warts that get in the way of an optimal experience. What have we learned over the last several years of scaling Python, and what could we do better? In this talk, we will discuss Dask’s strengths, its weaknesses, and the developer community’s plans moving forward.
#7: Concepts and Conceptual Leaps: Ramakrishna Vedantam, Research Scientist | Associate Professor, Facebook AI Research
Ensuring that machine learning models truly understand the concepts they are meant to learn, and can make conceptual leaps to generalize to novel concepts, is a major problem on the path to human-level AI. In this talk, I will describe some of the work we have been doing at Facebook AI Research in pursuit of this problem. I will first describe a novel task, CURI, which tests the conceptual leap ability of AI models. I will then probe the question of concepts from a more foundational perspective: testing when models truly understand concepts and generalize out of distribution, and when they work in a manner that demonstrates an understanding of objects in a scene or important/noteworthy waypoints in reinforcement learning. I will conclude with thoughts on future work.
#8: The Wisdom of the Cloud: Allen Downey, Computer Science Professor, Olin College, and Author of Think Python, Think Bayes, Think Stats
What can we learn about data science by watching data science competitions? During a data science competition like the ones hosted by DrivenData and Kaggle, the leaderboard lists the teams that have submitted models and the scores the top models have achieved. As the competition proceeds, the scores often improve quickly as teams explore a variety of models and then more slowly as they approach the limits of what’s possible. Using 170,000 scores from more than 50 competitions hosted by DrivenData, we explore the aggregated behavior of the competing teams. What patterns can we see?
Based on early returns, can we predict the limits? What factors influence the time and the number of submissions it takes to reach the performance plateau? Do models tend to overfit the data as the contest progresses? And what guidance can we provide for deciding when to stop searching? In this talk, we will answer these questions and share other observations from the other side of the leaderboard.
#9: Telling stories with data: Gulrez Khan, Data Science Lead, PayPal
According to a recent study, the attention span of humans has dropped below that of a goldfish! How do you still make connections with the audience while presenting your work as a data scientist (or analyst)? In this talk, Gulrez Khan will walk through the entire journey of what it takes to communicate effectively with data. The first step is to begin with effective stories to engage the audience in a TED-like talk. Once you have the audience’s attention, you can use visualization principles to reduce their cognitive load and get the message through.
#9: Understanding and Optimizing Parallelism in NumPy-based Programs: Ralf Gommers, PhD, Co-Director, Quansight Labs
Most of us have been there: your code works, but it needs to run faster. You start thinking about making it run in parallel – now how to go about that? Or you discover it already does some things, but not everything, in parallel – how do you build on that without running into unexpected problems?
#10: When SQL is Not the Best Answer: Identifying “Graph-y” Problems and When Graphs Can Help: Dr. Clair J. Sullivan, Data Science Advocate, Neo4j, Inc.
[Abbreviated] According to the 2021 Kaggle Machine Learning & Data Science Survey, variants of SQL such as MySQL, PostgreSQL, and Microsoft SQL Server dominate the field for enterprise databases used by data scientists and machine learning engineers. The reasons for this are obvious: in addition to the ease of creating data collection pipelines, there is an abundance of tooling in the form of open-source software such as Python packages designed for interfacing with relational databases. However, the use of tabular data makes a key assumption: that all data points, represented as rows, are independent of each other, or at least independently drawn from the same statistical distribution. But what about when this is not the case? What about when there are relationships between the data points, or even several different relationships between them? This can be seen in SQL queries where multiple JOIN statements are required to create a suitable output.
In this talk, we will explore the advantages and disadvantages of using graph structures for data science problems. We will discuss how to identify “graph-y” problems, examples of problems that are easier to solve with graphs, how to model the data in a graphical representation, and the resulting output with a look at how these results can be used to further enhance data science and machine learning workflows.
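As a rough illustration of what a “graph-y” question looks like, the sketch below finds everything within a given number of hops of a starting record. In a relational table of pairs, each extra hop would typically mean another self-JOIN; on a graph it is a single traversal. The function name, the edge list, and the people in it are hypothetical, and this uses plain Python rather than any graph database API.

```python
from collections import deque


def within_hops(edges, start, max_hops):
    """Return {node: hop_count} for nodes reachable from start in <= max_hops.

    edges: iterable of (a, b) pairs, treated as undirected relationships.
    """
    # Build an adjacency map once so each lookup is a dictionary access.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search, recording the hop count at first visit.
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)

    del seen[start]  # the starting node is not its own neighbor
    return seen


# Hypothetical "knows" relationships between people.
knows = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("erin", "frank")]
friends_of_friends = within_hops(knows, "alice", 2)
```

Changing `max_hops` from 2 to 3 changes only an argument here, whereas the equivalent SQL query would usually grow by another JOIN.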
#11: From Experimentation to Products: The Production ML Journey: Robert Crowe, TensorFlow Developer Engineer, Google
An ML journey typically starts with trying to understand the world, and looking for data that describes it. This leads to an experimentation phase, where we try to use that data to model the parts of the world that we’re interested in, often because they directly affect our users or our business. Once we have one or more models that deliver good results, it’s time to move those models into production. Deploying advanced Machine Learning technology to serve customers and/or business needs requires a rigorous approach and production-ready systems.
We discuss the use of ML pipeline architectures for implementing production ML applications, and in particular, we review Google’s experience with TensorFlow Extended (TFX), as well as the advantages of containerizing pipeline architectures using platforms such as Kubeflow. Google uses TFX for large-scale ML applications and offers an open-source version to the community. TFX scales to very large training sets and very high request volumes, and enables strong software methodology including testability, hot versioning, and deep performance analysis.
#12: Unlocking the Value of Siloed Data with Multi-party ML: Roshanak Houmanfar, VP Machine Learning Products, Integrate.ai
Once you have built a model and optimized the code, the next step to better model performance is using more data. But what if the data you need sits in a silo that you cannot access, whether due to privacy regulations, trust barriers, or the costs of transferring the data? Organizations today are increasingly exploring data collaborations using multi-party ML to unlock the value of siloed data. In multi-party ML systems, multiple organizations form a network across which datasets are “shared” through various forms of data derivatives. In turn, the ML models trained on these additional data derivatives become more performant and robust. Layering on privacy and data governance controls enables data scientists to build new or better models while complying with partners’ privacy and security requirements. Multi-party ML use cases are popping up in multiple industries.
This talk introduces the evolving world of technologies that support multi-party ML, and then quickly gets into the hard questions. How do you determine which technology is right for your use case? What are the biggest challenges you will face trying to implement a multi-party ML system? And most importantly, how can you incorporate privacy and governance controls to build a system that will be trusted by all parties to keep their data and models secure?
#13: Preventing Stale Models in Production: Milecia McGregor, Senior Developer Advocate, Iterative.ai
Deploying a machine learning model to production is not the end of the project. You have to constantly monitor the model for model drift, and for the underlying data drift that causes it, so that it keeps producing accurate predictions for your use cases. That means you have to re-train your model on new datasets often and consistently deploy those new models to production. In this talk, we’ll cover how you can use DVC to version your dataset as it changes with production values.
The example project we’re working with uses two years’ worth of bicycling data and we’ll take a look at how a model can change over time and when you can identify that it’s time to train a new one. We’ll go through a few examples of experiments with different dataset versions, algorithms, and hyperparameter values. Using DVC, we’ll see how you can run experiments to get the optimum model while keeping the code bundled with the data. We’ll cover how to build a simple MLOps pipeline with a data pre-processing stage, a training stage, and an evaluation stage. You’ll also see how to reproduce experiments and how you can share those experiments and their results with others on your team through tables and plots. You’ll also see how you can take a model from any experiment you’ve run and get a file you can automatically deploy to production. By the end of the talk, you should feel comfortable using DVC for data versioning and experiment tracking to handle your model before you deploy it.
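The idea of spotting when it is time to train a new model can be sketched with a very simple data drift check. This is not DVC functionality and not the method from the talk: it is an assumed, minimal heuristic that compares the mean of one feature in a recent window against a reference window, and the function names, the threshold of 2.0, and the ride counts are all made up for illustration.

```python
from statistics import mean, stdev


def mean_shift_score(reference, recent):
    """Shift of the recent mean, in units of the reference standard deviation."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference window has no variance to compare against")
    return abs(mean(recent) - mean(reference)) / ref_std


def needs_retraining(reference, recent, threshold=2.0):
    """Flag drift when the recent mean moves more than `threshold` deviations.

    The threshold is an arbitrary illustrative default, not a recommendation.
    """
    return mean_shift_score(reference, recent) > threshold


# Hypothetical daily ride counts: the training period versus two later weeks.
training_period = [10, 12, 11, 13, 9]
quiet_week = [11, 12, 10]
busy_week = [20, 21, 19]

quiet_flag = needs_retraining(training_period, quiet_week)
busy_flag = needs_retraining(training_period, busy_week)
```

In practice a check like this would run on each new batch of production data, and a raised flag would trigger the retraining and evaluation stages of the pipeline.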
#14: Methods and Tools for Time Series Data Science Problems with InfluxDB, an Open-Source Time Series Database: Anais Dotis-Georgiou, Developer Advocate, InfluxData
In this talk, we’ll explore the ways a time series platform supports data scientists. We’ll learn how you can use Telegraf, an open-source collection agent, to perform forecasting at the edge. We’ll explore how you can use the Flux query language to prepare and clean your data and to perform some preliminary data analysis. Next, we’ll learn about integrations with Jupyter and Zeppelin notebooks. Finally, we’ll cover some statistical properties of time series and some general recommendations for selecting forecasting and anomaly detection algorithms.
#15: Best Practices for Data Annotation at Scale: Jai Natarajan, Vice President, Strategic Business Development, iMerit
The long-term success of machine learning relies on consistently labeled high-quality data. While most machine learning initiatives begin in the lab, they take on a life of their own and can create significant challenges once they scale. ML data ops practitioners can find themselves being consumed by the logistics of data annotation and management instead of focusing on the science. Wherever you are in your team’s machine learning journey, you must think about evolving towards large-scale production. Proactively planning a data management strategy can generate progressively better results, but it requires thought and stakeholder buy-in. A key ingredient of this journey is your data labeling and annotation framework.
A data pipeline designed for human judgment and incremental training on edge cases provides that last mile of acceptability, enabling the machine learning solution to go to production. This session will reveal the implications of a live data loop in a production environment and how it significantly impacts the customer experience. Attendees will also take away trends and challenges in combining humans with the machine learning pipeline. In this session, iMerit’s Jai Natarajan reveals best practices to build scalable and repeatable data labeling pipelines with a balance of tools and humans-in-the-loop. Through peer, manager, and machine-learning expert collaboration, data annotators refine their skills and master tasks well beyond the expertise of crowdsourcing. In a collaborative framework, annotators and ML experts negotiate and create meaning through an iterative feedback process as they identify new concepts and nuances in the data. Attendees will learn concepts like designing to break the ML, edge case knowledge management, and workflow management.
#16: Shadow AI, the Silent Killer of Deep Learning Productivity: Gijsbert Janssen van Doorn, Director of Technical Product Marketing, Run:AI
Meet Shadow IT’s younger sibling. Shadow AI is the result of one-off AI initiatives inside organizations, where siloed AI teams buy their own infrastructure or use cloud compute resources. While no organization wants to stifle the ambition of its data science teams, this decentralized approach results in many AI initiatives never making it to production. In this talk, learn from Gijsbert Janssen van Doorn at Run:AI how to centralize AI so that it is accessible and productive across an entire organization.
#17: Analyzing Dynamic Global Markets with Places Data: Fletcher Berryman, Product Manager, SafeGraph
The world is dynamically changing and so are businesses. By some estimates, 20% of new businesses close within one year of opening. Even before the pandemic accelerated business closures, the Small Business Administration reported in 2018 that since 1990, about 7-9% of all businesses close each year. At the same time, the number of new business applications has continued to increase year over year, adding more complexity to staying on top of what businesses are in operation and where.
So with all of these changes, how can data scientists accurately model the real world to keep analyses fresh? Many POI datasets are only updated annually or quarterly, creating stale data inputs for models and forecasts. Hear from SafeGraph Places product manager Fletcher Berryman about how data scientists can stay on top of a dynamically changing world and maintain a fresh, precise, and high-quality POI database for confidence in their analytics.
#18: The Power Of Hexagons: How H3 & Foursquare Are Transforming Spatial Analytics: Nick Rabinowitz, Senior Staff Engineer, Foursquare
Geospatial analysis can be challenging and time-consuming – from preparing data of different shapes, forms, and sizes to processing large and complex datasets at scale. Find out how Foursquare’s Unfolded platform is solving these problems with H3, a hexagon-based grid system that can simplify data unification and processing, opening your data to new kinds of geospatial analysis. In this session, Senior Staff Software Engineer Nick Rabinowitz will discuss how H3 is transforming geospatial analysis and introduce Hex Tiles, a new tiling system built on H3 for fast visualization and analysis of massive spatial datasets. Supporting parallel processing for datasets with millions of rows, the Hex Tiles system is designed to easily ingest and enrich spatial datasets of all types and sizes – going from data unification to visualization in minutes. By transferring spatial data over the web in a tiled, grid-based format, Hex Tiles make it easy to visualize and explore massive spatial datasets and conduct analytics on the fly.
#19: Z by HP Panel Discussion on the Diverse Role of Data Science in Education
In the 21st century, data science has revolutionized every industry, and education is no exception. Join this panel of industry experts as they discuss in depth the much-anticipated revolution of data science within schools, colleges, and universities: how these institutions are using data science to improve learning outcomes, measure student and teacher performance, and become more successful institutions.
Max Urbany, M.S. Data Science, Harvard University
Daniel Chaney, VP, Enterprise AI, Future Tech Enterprise
Kristin Hempstead, Business Development Manager, HP Data Science Ambassador
#20: A New Indexing Technique for Quickly Fuzzy-Matching Entire Dataset Records: Dan S. Camper, Thaumaturge, HPCC Systems Solutions Lab
One challenge of building and maintaining error-free datasets often involves searching for and removing duplicate records. The problem of detection and elimination of duplicate database records is one of the major problems in the broad area of data cleansing and data quality. A single real-world entity may be listed multiple times in a dataset under different records due to variations in spelling, field formats, etc.
One method for approximately (“fuzzy”) matching two field values is to compute the Levenshtein distance between the string representations of those values and accept a suitably low-valued result. One indexing technique that allows for this type of matching in a time-sensitive manner is called Deletion Neighborhoods.
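For readers unfamiliar with it, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version written for illustration; the `fuzzy_equal` helper and its default cutoff of 1 are assumptions for this example, not part of the talk.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


def fuzzy_equal(a, b, max_distance=1):
    """Treat two field values as matching if they are within max_distance edits."""
    return levenshtein(a, b) <= max_distance


same_person = fuzzy_equal("Jon Smith", "John Smith")
```

Computing this for every pair of values is quadratic in the number of records, which is why an index is needed at scale.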
Deletion Neighborhoods are a classic space-time trade-off: You pre-compute and create a large index structure so that your later search operations are fast. While Deletion Neighborhoods may help when determining if field values between records approximately match, the problem of approximately matching the records themselves remains. Doing this quickly is necessary when working at a big data scale.
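The space-time trade-off described above can be sketched in a few lines. This is a simplified, string-level illustration under assumed names (`deletion_neighborhood`, `build_index`, `candidates`), not the record-level technique the talk presents: every indexed word is stored under each variant obtained by deleting up to `max_deletes` characters, and a query is looked up by its own deletion variants.

```python
from collections import defaultdict
from itertools import combinations


def deletion_neighborhood(word, max_deletes=1):
    """All strings obtained by deleting up to max_deletes characters from word."""
    variants = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            removed = set(positions)
            variants.add(
                "".join(ch for i, ch in enumerate(word) if i not in removed)
            )
    return variants


def build_index(words, max_deletes=1):
    """Pre-compute variant -> original words (the large, up-front index)."""
    index = defaultdict(set)
    for word in words:
        for variant in deletion_neighborhood(word, max_deletes):
            index[variant].add(word)
    return index


def candidates(index, query, max_deletes=1):
    """Fast lookup: indexed words sharing at least one variant with the query."""
    found = set()
    for variant in deletion_neighborhood(query, max_deletes):
        found |= index.get(variant, set())
    return found


surnames = build_index(["smith", "smyth", "jones"])
matches = candidates(surnames, "smoth")
```

Sharing a variant is a necessary condition for being within `max_deletes` edits, not a sufficient one, so the returned candidates would normally be confirmed with an exact edit-distance check before being accepted as matches.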
Interestingly, the problems of fuzzy deduplication and fuzzy search are essentially the same. The former is approximate matching of records within a dataset, while the latter is approximate matching between a single record (constructed from a user’s webform entries, perhaps) and the dataset.
In this talk, we review string-oriented Deletion Neighborhoods and present a novel application of them where a similar technique may be applied to entire records within a dataset. Further, we will show that combining both string and record-oriented techniques allows for powerful searching and record deduplication capabilities.
#21: Managing Data Science Projects via an Agile Framework: Jeffrey Saltz, PhD, Associate Professor, Syracuse University
Data science managers (and senior leaders managing data science teams) need to think through many questions relating to how to best execute their data science efforts. For example, how should the team brainstorm ideas? How should the team prioritize those potential ideas? More generally, how can the team help ensure it delivers actionable insights? While these management challenges are very different from the technical machine learning challenges that most teams focus on trying to solve, they are equally important to address to ensure a successful data science project. In other words, the focus of this talk is not on which specific algorithm a team should use, but rather on how to ensure that the data science effort is progressing effectively and efficiently.
Register for ODSC East 2022 and see all of these free machine learning talks
We just listed quite a few interesting free machine learning talks coming to ODSC East 2022 this April 19th-21st, and everything above can be seen for free when you register for a Bronze Pass. You can still upgrade to a training pass for 30% off and get access to all of our machine learning training options. Sessions include:
- Tutorial: Building and Deploying Machine Learning Models with TensorFlow and Keras
- Tired of Cleaning your Data? Have Confidence in Data with Feature Types
- The Future of Software Development Using Machine Programming
- Telling stories with data
- Sculpting Data for ML: The first act of Machine Learning
- Overview of methods to handle missing values
- Overview of Geocomputing and GeoAI at Oak Ridge National Laboratory: Exploitation at Scale, Anytime, Anywhere
- Network Analysis Made Simple
- Mastering Gradient Boosting with CatBoost
- Machine Learning for Trading
- Machine Learning for Causal Inference
- Introduction to Scikit-learn: Machine Learning in Python
- Intermediate Machine Learning with Scikit-learn: Evaluation, Calibration, and Inspection
- Intermediate Machine Learning with Scikit-learn: Cross-validation, Parameter Tuning, Pandas Interoperability, and Missing Values
- End to End Machine Learning with XGBoost
- Beyond the Basics: Data Visualization in Python
- Automation for Data Professionals
- An Introduction to Drift Detection
- Advanced Machine Learning with Scikit-learn: Text Data, Imbalanced Data, and Poisson Regression