ODSC West is just a few months away and we couldn’t be more excited to announce our first 50 sessions! There’s not enough space in this blog to talk about them all, but we’ve highlighted a few below. You can find a full list of the first 50 sessions here.
A Semi-Supervised Anomaly Detection System Through Ensemble Stacking Algorithm
Chuying Ma | Senior Data Scientist | Walmart
To address the complex problem of detecting anomalies in customer activity to prevent inventory loss and shrinkage, this work proposes a systematic, flexible, extensible, and holistic anomaly detection architecture that augments existing labels and detects anomalies at low cost.
This session will explore how this new system can flexibly incorporate deep learning-based anomaly detection models, or any other traditional machine learning models, and generate a unified anomaly score by the ensemble stacking algorithm to address different types of anomalies simultaneously.
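The core idea of combining several detectors into one unified anomaly score can be sketched in a few lines. The detector names and weights below are purely illustrative (the session's actual meta-learner is trained, not hand-weighted):

```python
# Illustrative sketch: combine per-detector anomaly scores into a single
# unified score. A real stacking ensemble would learn these weights with a
# meta-model; fixed weights here are a hypothetical stand-in.
def unified_score(scores: dict, weights: dict) -> float:
    # Weighted combination of base detectors' anomaly scores.
    return sum(weights[name] * s for name, s in scores.items())

# Two hypothetical base detectors scoring the same customer activity:
score = unified_score(
    {"autoencoder": 0.8, "isolation_forest": 0.6},
    {"autoencoder": 0.5, "isolation_forest": 0.5},
)
```

The appeal of this design is that any detector, deep learning-based or traditional, can be dropped in as long as it emits a score.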
Personalizing LLMs with a Feature Store
Jim Dowling | CEO | Hopsworks
This session will show you how to personalize LLMs using a feature store and prompt engineering. You will walk through how to build a free, serverless, personalized LLM application using Hopsworks, an open-source feature store with a built-in vector database. You will also look at how to build prompt templates, how to fill in those templates with real-time context data, and how to incorporate documents from vector databases into prompts using a combination of user input and historical user data from the feature store.
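The prompt-template idea is simple to picture. Here is a minimal sketch of filling a template with per-user features; the feature names and template are hypothetical and not the Hopsworks API:

```python
# Hypothetical sketch: fill a prompt template with features that would be
# fetched from a feature store for the current user.
TEMPLATE = (
    "You are a helpful shopping assistant.\n"
    "User's recent purchases: {recent_purchases}\n"
    "Question: {question}\n"
)

def build_prompt(features: dict, question: str) -> str:
    # Inject the user's historical data into the prompt before calling the LLM.
    return TEMPLATE.format(
        recent_purchases=", ".join(features["recent_purchases"]),
        question=question,
    )

prompt = build_prompt(
    {"recent_purchases": ["running shoes", "socks"]},
    "What should I buy next?",
)
```

In the real application, the features dict would come from the feature store at request time, and retrieved documents from the vector database would be appended to the same template.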
What is a Time-series Database and Why do I Need One?
Jeff Tao | Founder & CEO | TDengine
With the advent of IoT and the cloud, the volume of time-series data has begun growing at an unprecedented exponential rate, posing a major challenge for general-purpose database management systems such as relational and NoSQL databases. Purpose-built time-series databases, on the other hand, are optimized for the special characteristics of time-series data, making them more efficient in terms of ingestion rate, query latency, and data compression.
Evaluation Techniques for Large Language Models
Rajiv Shah, PhD | Machine Learning Engineer | Hugging Face
Selecting the right LLM for your needs has become increasingly complex. During this tutorial, you’ll learn about the practical tools and best practices for evaluating and choosing LLMs.
You will explore the existing research on the capabilities of LLMs versus smaller traditional ML models, as well as several techniques, including evaluation suites like the EleutherAI Harness, head-to-head competition approaches, and using LLMs to evaluate other LLMs. Finally, you will touch on subtle factors that affect evaluation, including the role of prompts, tokenization, requirements for factual accuracy, and model bias and ethics.
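To make the head-to-head approach concrete, here is a minimal sketch of turning per-prompt judge verdicts into a win rate. The verdicts list is made up for illustration; in practice each verdict would come from a human rater or an LLM judge:

```python
# Hypothetical head-to-head evaluation: one judge verdict per prompt,
# comparing model A against model B ("tie" verdicts are excluded).
verdicts = ["A", "B", "A", "A", "tie"]

wins_a = verdicts.count("A")
wins_b = verdicts.count("B")

# Win rate for model A over the decided comparisons.
win_rate_a = wins_a / (wins_a + wins_b)
```

Subtleties the session covers, such as prompt wording and judge bias, show up precisely in how those verdicts get produced.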
Understanding the Landscape of Large Models
Lukas Biewald | CEO and Co-founder | Weights & Biases
Join this session to explore the current landscape of large models from GPT-3 to Stable Diffusion. You’ll also discuss how the teams behind some of the open source projects are using W&B to accelerate their work.
Scaling your Data Science Workflows by Changing a Single Line of Code
Doris Lee | CEO and Cofounder | Ponder
Tools like pandas and NumPy have enabled practitioners of all levels to work with data efficiently; however, as practitioners look to scale their workflows to production, these tools present some challenges. This session will explore the limitations of these tools and the pain points that data scientists encounter when working with data at scale. You will also cover how the open-source project Modin (10M+ downloads) addresses this issue by seamlessly scaling up your pandas code with just a one-line code change.
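The "single line of code" is the import statement: Modin exposes a drop-in pandas API, so existing code keeps working. A small sketch (assuming Modin is installed, the swap is the commented line):

```python
# Standard pandas workflow:
import pandas as pd
# To scale this same workflow with Modin, the one-line change would be:
# import modin.pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})
# Everything below is unchanged regardless of which import is active.
totals = df.groupby("store")["sales"].sum()
```

Because Modin mirrors the pandas API, the groupby, the aggregation, and the rest of the pipeline need no edits; only the import line changes.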
Troubleshooting and Measuring Embedding/Vector Drift for Production Deployments of Language Models
Amber Roberts | Data Scientist, Growth Lead | Arize AI
In this presentation, Amber Roberts of Arize AI will present findings from research on ways to measure vector/embedding drift for image and language models. With lessons learned from testing different approaches (including Euclidean and cosine distance) across billions of streams and use cases, Roberts will dive into how to detect whether two unstructured language datasets are different and, if so, how to understand that difference using techniques such as UMAP.
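One simple version of the distance-based approach is to compare the centroids of a baseline embedding set and a production embedding set. The 2-D vectors below are toy values for illustration; real embeddings would have hundreds of dimensions:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # 1 - cosine similarity: 0 means identical direction, 2 means opposite.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embedding sets: a baseline window and a production window.
baseline = np.array([[1.0, 0.0], [0.9, 0.1]])
production = np.array([[0.0, 1.0], [0.1, 0.9]])

# Drift score: distance between the two centroids.
drift = cosine_distance(baseline.mean(axis=0), production.mean(axis=0))
```

A drift score near zero suggests the distributions still point the same way; a large score, as in this toy example, flags a shift worth inspecting with tools like UMAP.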
Democratizing Fine-tuning of Open-Source Large Models with Joint Systems Optimization
Kabir Nagrecha | PhD Student | UC San Diego
This session will provide an overview of the core ideas behind Saturn, how it works at a technical level to reduce runtimes and costs, and the process of using Saturn for large-model fine-tuning. You'll explore how Saturn can accelerate and optimize large-model workloads in just a few lines of code, and cover some high-value real-world use cases from industry and academia.
Machine Learning Has Become Necromancy
Mark Saroufim | Engineer on PyTorch | Meta
Much has been said about how breakthroughs are made but not too much on how breakthroughs are lost. This talk explores the evolution and destruction of necromancy and draws parallels to recent proposed regulations in Machine Learning.
Sign up here
Join us at ODSC West this October 30th to November 2nd for the chance to attend these and many more hands-on training sessions, workshops, and talks. Plus, when you register now, you'll save 50% on any in-person or virtual pass.