Editor’s note: Wes Madrigal is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “Using Graphs for Large Feature Engineering Pipelines,” there to learn more about GraphReduce and more!
For readers who work in ML/AI, it’s well understood that machine learning models prefer feature vectors of numerical information. Much work has been done on feature engineering automation, but most of it assumes the input is already a flat feature vector. However, the majority of enterprise data remains unleveraged from an analytics and machine learning perspective, and much of the most valuable information sits in relational database schemas such as OLAP systems. Tapping into these schemas and pulling out machine learning-ready features is nontrivial: one needs to know where the data entity of interest lives (e.g., customers), what its relations are and how they’re connected, and then write SQL, Python, or other code to join and aggregate to a granularity of interest.

To further complicate things, data leakage causes serious problems in machine learning models, so time must be handled carefully across entities. In plain English, “data leakage” is the equivalent of letting the model cheat by showing it the answers to the quiz. Finally, as features are ideated, proposed, and added, the feature engineering interface can become a Frankenstein of pipelines that is hard to extend, maintain, or reuse.

In this blog, we propose GraphReduce as an abstraction for these problems. We will demonstrate an example feature engineering process on an e-commerce schema and show how GraphReduce deals with the complexity of feature engineering on a relational schema.
Let’s suppose we are predicting if a customer will interact with a notification in order to drive outreach. Unfortunately, our data engineering and machine learning ops teams haven’t built a feature vector for us, so all of the relevant data lives in a relational schema in separate tables. The example schema is as follows:
- Customers
- Orders
- Order events
- Order products
- Notifications
- Notification interactions
Since we’re predicting something about the customer we’ll be modeling at the customer granularity.
GraphReduce doesn’t help with discovering these relationships, so you’ll need to profile the data, talk to a data guru, or use emerging technology. In this case the relationships are as follows:
- Customers -> orders
- id = orders.customer_id
- Orders -> order_events
- id = order_events.order_id
- Orders -> order_products
- id = order_products.order_id
- Customers -> notifications
- id = notifications.customer_id
- Notifications -> notification_interactions
- id = notification_interactions.notification_id
Since we’re modeling at the customer level of detail, we need a decent understanding of the cardinality of these relationships. Modeling at the customer level of detail requires joining all of the relevant relationships to the customer and reducing each relation to the customer granularity. For example, if the customers table has 10 rows and a child relation, orders, has 100, we need to reduce the orders entity to the customer granularity by issuing a GROUP BY aggregation on the orders entity and then joining back to the customer, for a total of 10 rows combining customer information with aggregated / reduced order information. Usually understanding granularity is fairly intuitive, but in this case we’ll provide the row counts so it is obvious up front.
The tables have the following row counts:
- Customers: 2 rows
- Orders: 4 rows
- Order products: 16 rows
- Order events: 26 rows
- Notifications: 10 rows
- Notification interactions: 15 rows
Data preparation and filtering:
Data preparation involves removing incorrect or outlier data. This involves things like filtering out or discarding rows, transforming or discarding column values that are anomalous or incorrect, and imputing missing values. Data preparation happens at the entity level first so errors and anomalies don’t make their way into the aggregated dataset. Here is an example of how this might look in Pandas on our orders entity:
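The original code sample isn’t reproduced here, so the following is a minimal sketch of entity-level preparation on a hypothetical orders table; the column names (`id`, `customer_id`, `ts`, `amount`) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical orders data; column names are illustrative.
orders = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "ts": pd.to_datetime(["2023-05-12", "2023-06-01", "2023-01-01", None]),
    "amount": [10.0, 11.5, -5.0, 12.0],  # -5.0 is an anomalous value
})

# Discard rows with missing timestamps -- we can't place them in time.
orders = orders[orders["ts"].notna()]

# Filter out anomalous values (e.g., negative order amounts).
orders = orders[orders["amount"] > 0]

# Impute any remaining missing amounts with the median.
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

print(orders)
```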
Joining and cardinality:
Assuming we’ve taken care of data preparation and quality, now we’re ready to join our data entities together and flatten them to a single dataframe at the customer granularity. The key to doing this right is ensuring a child table gets reduced to the parent’s granularity by issuing a GROUP BY so we don’t introduce data duplication. Here is an example of how this might look in Pandas:
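As a sketch (the column names are assumed from the schema above), reducing the orders relation to the customer grain and joining it back might look like this:

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})
orders = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "amount": [10.0, 11.5, 12.0, 13.0],
})

# Reduce the child relation to the parent granularity with a GROUP BY...
order_aggs = (
    orders.groupby("customer_id")
    .agg(ord_count=("id", "count"), ord_total=("amount", "sum"))
    .reset_index()
)

# ...then join back to the parent so row counts don't get duplicated.
flat = customers.merge(
    order_aggs, left_on="id", right_on="customer_id", how="left"
)
print(flat)
```

Note the result still has one row per customer, which is the invariant that keeps downstream joins from exploding.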
In order to avoid data leakage we need to make sure we’re using reliable date keys and filtering our data properly around the training period and label period. The training period contains information up until a certain date and the label period contains information after said date. This allows us to feed our model the training data and tell it what happened in the future period it is learning to predict. Since we’re not dealing with a single flat feature vector, but instead an arbitrary number of tables, we need to be careful that the handling of time is the same across every relation in the compute graph. To demonstrate how this works we show the orders dataset instantiated and filtered for features and labels using the graphreduce methods:
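The actual graphreduce method calls are in the linked example code; the underlying cut-date logic can be sketched in plain pandas, assuming a `ts` timestamp column and illustrative period lengths:

```python
import pandas as pd

cut_date = pd.Timestamp("2023-05-06")
compute_period = pd.Timedelta(days=365)   # history to include for features
label_period = pd.Timedelta(days=30)      # future window to label

orders = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "ts": pd.to_datetime(
        ["2022-08-01", "2023-05-01", "2023-05-20", "2023-07-01"]
    ),
})

# Feature data: strictly before the cut date, within the compute period.
feature_df = orders[
    (orders["ts"] < cut_date) & (orders["ts"] >= cut_date - compute_period)
]

# Label data: on/after the cut date, within the label period only.
label_df = orders[
    (orders["ts"] >= cut_date) & (orders["ts"] < cut_date + label_period)
]
```

Applying this same split uniformly to every relation in the graph is what prevents leakage from sneaking in through a single mishandled table.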
Problems with one-off approaches:
The steps above will be encountered every time one builds models on relational data that isn’t already in an ML-ready feature vector, which is most of the time. Feature stores emerged as an abstraction a few years ago to solve much of this, but they take a different architectural approach than GraphReduce. The particulars of the differences between feature stores such as Feast, Tecton, or Vertex AI Feature Store are left for another blog post, but they mostly boil down to the startup time, learning curve, and maintenance cost of feature stores being higher.
GraphReduce treats entities / tables as nodes and relationships between them as edges in a graph data structure. The library subclasses networkx to take advantage of graph algorithms and extend the interface for data ops. GraphReduce provides an abstraction to house top-level compute parameters, which are mapped to the entire graph:
- The parent node/entity to reduce the data to
- Prefixes so when all data is joined it is clear where each column originated (e.g., orders prefixed as “ord_id”, “ord_total”, etc.)
- Amount of data to include in the historical training dataset
- A cut date to split between training and label periods
- Pluggable compute backend parameter for using Spark, Pandas, or Dask
- Data format parameter
- Whether or not the compute graph is generating labels
The top-level GraphReduce object also centralizes the order of operations which are mapped, typically in a depth-first manner, to the nodes in the graph. This centralizes the orchestration of compute operations across the graph, unifying things like time travel, compute layers, training and label periods, file formats, and join operations between nodes. The diagram below outlines the order of operations.
The Node abstraction in GraphReduce houses the following operations:
- Getting data (each node is responsible for loading its own information)
- Prefixing columns
- Necessary for when multiple nodes’ data gets joined together so that each field indicates from where it originated (e.g., customer -> cust_, orders -> ord_)
- Filtering data
- Annotating data
- Normalizing data
- Reducing data
- Preparing data for features by slicing on the cut date
- Preparing data for labels by slicing on the cut date
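The real node interface lives in the graphreduce library; as a rough illustration of how the operations listed above compose, here is a hypothetical node class written in plain pandas (the class name, method names, and signatures are assumptions for this sketch, not the library’s API):

```python
import pandas as pd

class OrderNode:
    """Illustrative node: owns loading, prefixing, filtering, and
    reducing its own data -- mirroring the operations listed above."""

    prefix = "ord"

    def get_data(self) -> pd.DataFrame:
        # Each node is responsible for loading its own information.
        return pd.DataFrame({
            "id": [1, 2, 3, 4],
            "customer_id": [1, 1, 2, 2],
            "amount": [10.0, 11.5, 12.0, 13.0],
        })

    def prefix_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        # Prefix so joined columns indicate their origin (e.g., ord_amount).
        return df.rename(columns={c: f"{self.prefix}_{c}" for c in df.columns})

    def do_filters(self, df: pd.DataFrame) -> pd.DataFrame:
        # Entity-level filtering before any aggregation happens.
        return df[df["ord_amount"] > 0]

    def do_reduce(self, df: pd.DataFrame, parent_key: str) -> pd.DataFrame:
        # Reduce to the parent (customer) granularity.
        return (
            df.groupby(parent_key)
              .agg(ord_total=("ord_amount", "sum"),
                   ord_count=("ord_id", "count"))
              .reset_index()
        )

node = OrderNode()
df = node.prefix_columns(node.get_data())
df = node.do_filters(df)
reduced = node.do_reduce(df, "ord_customer_id")
print(reduced)
```

Because every node exposes the same operations, the top-level GraphReduce object can invoke them in a consistent order across the whole graph.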
Bringing it all together with this example:
Instantiate the GraphReduce object with specifications for your graph compute operation:
Instantiate the nodes we are computing over and add to GraphReduce:
Note, we’re only including one add_entity_edge operation for brevity, but the full example code is linked in the summary below.
Visualize the compute graph that will be executed (hat tip pyvis):
Run compute operations:
Use ML ready feature vector and train a model:
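The model-training code itself is in the linked example; as a generic sketch, fitting a model on a flattened customer-level frame could look like the following (the feature and label column names and the scikit-learn classifier are assumptions for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical flattened, customer-level feature frame with a label
# column derived from the label period.
flat = pd.DataFrame({
    "cust_id": [1, 2, 3, 4, 5, 6],
    "ord_count": [2, 0, 5, 1, 4, 0],
    "ord_total": [21.5, 0.0, 103.0, 9.9, 88.0, 0.0],
    "notif_interacted": [1, 0, 1, 0, 1, 0],  # did they interact?
})

X = flat[["ord_count", "ord_total"]]
y = flat["notif_interacted"]

model = LogisticRegression().fit(X, y)
preds = model.predict(X)
```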
Alternatively, you could use SQL and do something like this, but I don’t advise it 🙂
We’ve outlined common problems in aggregating ML-ready features from relational data, the abstractions GraphReduce provides, and how they work on an example e-commerce dataset. We’ve demonstrated that GraphReduce provides a unified interface for compute operations over arbitrarily large relational compute graphs, with a composable and stackable interface that allows teams to take advantage of the full enterprise data graph. I will go into more detail during my upcoming ODSC West talk.
About the Author:
Wes Madrigal is a machine learning expert with over a decade of experience delivering business value with AI. Wes’s experience spans multiple industries, but always with an MLOps focus. His recent areas of focus and interest are graphs, distributed computing, and scalable feature engineering pipelines.