We couldn’t be more excited to announce the first sessions for our second annual Data Engineering Summit, co-located with ODSC East this April. Join us for 2 days of talks and panels from leading experts and data engineering pioneers. In the meantime, check out our first group of sessions.
How to Practice Data-Centric AI and Have AI Improve its Own Dataset
Jonas Mueller | Chief Scientist and Co-Founder | Cleanlab
Data-centric AI is poised to be a game changer for machine learning projects. Manual labor is no longer the only option for improving data. Instead, data-centric AI introduces systematic techniques that use a baseline model to find and fix dataset issues, enabling you to improve your model’s performance without changing the code.
In this session, you’ll learn how to operationalize fundamental data-centric AI ideas across a wide range of datasets. With an exploration of real-world data, this session will equip you with the knowledge to immediately retrain better models.
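One core data-centric idea the session covers is using a baseline model to surface likely label errors. Below is a minimal, simplified sketch of that idea using scikit-learn on synthetic data; it is an illustration of the general approach, not Cleanlab's actual algorithm, and all dataset values and thresholds are made up:

```python
# Sketch: flag likely label errors using out-of-sample predicted probabilities
# from a baseline model. Simplified illustration, not Cleanlab's algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data with 25 deliberately flipped labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y_noisy = y.copy()
flip = np.random.RandomState(0).choice(len(y), size=25, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# Out-of-sample probabilities prevent the model from memorizing its own labels.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Self-confidence: probability the model assigns to each example's given label.
# Examples where the model confidently disagrees are likely mislabeled.
self_confidence = pred_probs[np.arange(len(y_noisy)), y_noisy]
suspect = np.argsort(self_confidence)[:25]  # 25 lowest-confidence examples
print(f"{np.isin(suspect, flip).sum()} of 25 flagged indices were injected errors")
```

Note that no modeling code changes here: the same baseline model both trains on the data and helps clean it, which is the data-centric loop in miniature.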
Tutorial: Introduction to Apache Arrow and Apache Parquet, Using Python and PyArrow
Andrew Lamb | Chair of the Apache Arrow Program Management Committee | Staff Software Engineer | InfluxData
Take a deep dive into the basics of Apache Arrow and Apache Parquet with Andrew Lamb. You’ll learn how to load data to and from PyArrow arrays, CSV files, and Parquet files, and how to use PyArrow to quickly perform analytic operations such as filtering, aggregation, joining, and sorting.
In completing these tasks, you’ll experience the benefits of the open Arrow ecosystem firsthand and see how Arrow enables fast, efficient interoperability with pandas, Polars, DataFusion, DuckDB, and other technologies that support the Arrow memory format.
Data Engineering in the Age of Data Regulations
Alex Gorelik | Distinguished Engineer | LinkedIn
As AI advances, so do data regulations like GDPR, CCPA, DMA, and many others. These regulations give users control over their data and place limits on what companies can do with it. In many cases, the ability to operate in a country depends on adhering to these restrictions.
This talk will illustrate a real-world example of how to convert these regulations into policy and subsequently, how to integrate policy enforcement in data engineering practices.
The 12 Factor App for Data
James Bowkett | Technical Delivery Director | OpenCredo
The 12-factor app methodology defines how to think about and design cloud-native applications; this session adapts it to an increasingly data-centric world. It will take you through 12 principles for designing data-centric applications, grouped into four categories: Architecture & Design, Quality & Validation (Observability), Audit & Explainability, and Consumption.
Engineering Knowledge Graph Data for a Semantic Recommendation AI System
Ethan Hamilton | Data Engineer | Enterprise Knowledge
This in-depth session will teach you how to design a semantic recommendation system. These systems represent data as knowledge graphs and use graph traversal algorithms to find content in massive datasets. They are not only useful across a wide range of industries but also fun for data engineers to work on.
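The traversal idea at the heart of such a system can be sketched in a few lines. Below is a toy breadth-first traversal over a hand-built knowledge graph; the node names, graph structure, and hop limit are all illustrative assumptions, not material from the session:

```python
from collections import deque

# Toy knowledge graph: nodes are content items/concepts, edges are semantic links.
# Structure and names are invented for illustration.
graph = {
    "intro-to-graphs": ["graph-traversal", "knowledge-graphs"],
    "graph-traversal": ["bfs", "dfs"],
    "knowledge-graphs": ["rdf", "recommendations"],
    "bfs": [], "dfs": [], "rdf": [], "recommendations": [],
}

def recommend(start, max_hops=2):
    """Breadth-first traversal: return content within max_hops of start."""
    seen, queue, results = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                results.append(neighbor)
                queue.append((neighbor, depth + 1))
    return results

print(recommend("intro-to-graphs"))
# ['graph-traversal', 'knowledge-graphs', 'bfs', 'dfs', 'rdf', 'recommendations']
```

Production systems replace the in-memory dict with a graph database and weight the edges, but the ranking-by-proximity intuition is the same.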
Data Pipeline Architecture – Stop Building Monoliths
Elliott Cordo | Founder, Architect, Builder | Datafutures
Although common, data monoliths present several challenges, especially for larger teams and organizations that want to enable federated data product development.
In this session, you’ll explore solutions drawn from microservices and event-based architecture, with a focus on multi-Airflow infrastructure, micro-DAG packaging and deployment, dbt multi-project implementation, rational use of containers, and data sharing and publication strategies.
Is Gen AI A Data Engineering or Software Engineering Problem?
Barr Moses | Co-Founder & CEO | Monte Carlo
At the start, Gen AI seemed like a software engineering and API integration project. However, as production experience and talent become more accessible, the teams that get a head start on finding ways to utilize Gen AI will be ahead of the game. Join this session with Barr Moses to get her take on whether Gen AI is a data engineering or software engineering problem.
Dive into Data: The Future of the Single Source of Truth is an Open Data Lake
Christina Taylor | Senior Staff Engineer | Catalyst Software
Join this session for an exploration of building a centralized data repository that ingests from a wide variety of sources, including service databases, SaaS applications, unstructured files, and conversational data. Using real-world examples, you’ll see how you can reduce costs and vendor lock-in by migrating from proprietary data warehouses to an open data lake.
With the insights gained during this session, you’ll be better equipped to choose the most appropriate technology to accommodate diverse analytics, machine learning, and product use cases.
A Tale of Apache Parquet Reaching the Pinnacle of Friendship with Data Engineers
Gokul Prabagaren | Engineering Manager | Capital One
Join this session to see how a 100% cloud-operated company runs its data processing pipeline and how Apache Parquet plays a pivotal role in each step of that processing. You’ll explore a variety of design patterns implemented using Parquet and Spark, as well as how the company’s resiliency has improved through its use of Apache Parquet.
At the Data Engineering Summit on April 24th, co-located with ODSC East 2024, you’ll be at the forefront of the major changes coming to the field before they hit. So get your pass today, and keep yourself ahead of the curve.