Unlocking Tabular Data’s Hidden Potential
Generative AI, Modeling | Posted by ODSC Community, April 6, 2023
Experience shows that tabular data is a highly valuable data source for sales, marketing, churn management, operations, and risk management, among other business use cases. Yet tabular data isn’t getting the attention it deserves, even though it often comprises the bulk of an organization’s data and contains its competitive edge and unique intellectual property.
Tabular Data Has a Popularity Problem
So why the bias against tabular data? With text, voice, and images capturing the limelight in the media, tabular data often takes a back seat. Its popularity woes begin with the absence of readily available foundation models and are exacerbated by the underperformance of neural networks (seen as more cutting-edge and attention-worthy) when applied to tabular data.
Let’s face it – humans are hardwired to process visual and auditory stimuli at lightning speed. It’s an innate part of our evolutionary journey. We can identify an image seen for as little as 13 milliseconds, and an often-repeated (though poorly sourced) claim holds that we process images 60,000 times faster than text. Images are also widely reported to boost user engagement on social media, so it’s no surprise that text-to-image generators and ChatGPT are taking the world by storm.
Unfortunately, even the data science industry – which should recognize tabular data’s true value – often underestimates its relevance in AI. Many mistakenly equate tabular data with business intelligence rather than AI, leading to a dismissive attitude toward its sophistication. While it’s historically been simple to build minimally viable tabular data pipelines, optimizing them for maximum business value is a monumental challenge, which is why many use cases fail to meet expectations.
Standard data science practices could also be contributing to this issue. Feature engineering activities frequently focus on single-table data transformations, leading to the infamous “yawn factor.” Let’s be honest – one-hot encoding isn’t the most thrilling or challenging task on a data scientist’s to-do list.
However, turning to deep learning and unstructured data isn’t the answer to overcoming underperforming AI projects. Instead, we must reevaluate our approach to AI projects that harness the power of tabular data, giving it the recognition and respect it rightfully deserves.
Embrace Data-Centric AI
The key to unlocking value in AI lies in a data-centric approach, according to Andrew Ng. Data-centric AI, in his opinion, is based on the following principles:
- It’s time to focus on the data. After all the progress achieved in algorithms, more effort should now go into the data.
- Inconsistent data labels are common since reasonable, well-trained people can see things differently.
- Messy, error-ridden data is often fixed by ad hoc data engineering that relies on luck or individual data scientists’ skills.
- Making data engineering more systematic through principles and tools will be key to making AI algorithms work.
- Smaller amounts of high-quality data might be sufficient for industries without access to tons of data.
The examples that Ng provides to explain data-centric AI have one thing in common – they come from his experience in developing deep learning applications on unstructured data such as images. Although tabular data less often needs to be labeled, his other points apply: tabular data, more often than not, contains errors, is messy, and is limited in volume. Feature engineering of tabular data demands considerable manual effort, making tabular data preparation even more dependent on luck or the data scientist’s skill set.
One might say that tabular data modeling is the original data-centric AI!
Beat the Competition
In today’s fast-paced world, your competitors have the same access to off-the-shelf generative AI systems as you do, which means they’re making similar decisions and producing comparable content. By failing to embrace these technologies, you risk falling behind. However, simply adopting generic AI solutions will only level the playing field rather than giving you a true advantage.
This is where tabular data enters the picture, providing a potential competitive edge. This valuable data is often securely tucked away behind firewalls, remaining inaccessible to both your competitors and generic AI systems. Your proprietary data can be a goldmine of fascinating insights into your unique customer base, products, services, business processes, and overall strategy.
The choice is yours. Are you a data scientist who’s content with merely keeping up with the pack, or do you have the ambition and drive to create something exceptional that outshines the competition? If so, consider embracing the untapped potential of tabular data.
Tabular data holds immense potential, yet it often falls short of delivering the value many anticipate. Although it appears straightforward on the surface, mastering its intricacies requires exceptional skills to transform this data into meaningful insights and real-world value.
ChatGPT suggests that tabular data is “easier to use,” but the reality is more nuanced. If we consider a single, manageable, and flawless table, tabular data appears straightforward. However, focusing on simplicity doesn’t unlock its true potential—it’s the complexities of real-life data that hold the key.
In practice, tabular data is anything but clean and uncomplicated. Its inherent challenges include navigating one-to-many relationships across tables in a database and dealing with missing or incorrect values commonly found in real-life datasets.
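To make the one-to-many problem concrete, here is a minimal sketch in plain Python. The `customers` and `orders` tables, their column names, and the `aggregate_orders` helper are all hypothetical, invented purely for illustration. The sketch rolls a child table up to one row per parent while keeping missing values visible instead of silently dropping them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parent table: one row per customer.
customers = [
    {"customer_id": 1},
    {"customer_id": 2},
    {"customer_id": 3},
]

# Hypothetical child table: many rows per customer; None marks a missing amount.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": None},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def aggregate_orders(customers, orders):
    """Roll the child table up to one feature row per customer."""
    by_customer = defaultdict(list)
    for row in orders:
        by_customer[row["customer_id"]].append(row["amount"])

    features = {}
    for customer in customers:
        cid = customer["customer_id"]
        amounts = by_customer.get(cid, [])
        known = [a for a in amounts if a is not None]
        features[cid] = {
            "order_count": len(amounts),
            # Keep missingness as its own signal instead of hiding it.
            "missing_amount_count": len(amounts) - len(known),
            # Leave the mean undefined (None) rather than inventing a value.
            "mean_amount": mean(known) if known else None,
        }
    return features


features = aggregate_orders(customers, orders)
```

Customer 3 has no orders at all, so it still receives a feature row with a count of zero. Whether an undefined mean should later be imputed is a modeling decision this sketch deliberately leaves open.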
Time adds another layer of complexity, with tabular data often spanning periods marked by potential leakage and structural changes. Effectively interpreting and analyzing this data requires both contextual and domain knowledge from practitioners.
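One common guard against leakage is to compute every feature "as of" a cutoff date, so that nothing recorded after the prediction point can slip in. The sketch below is a simplified illustration in plain Python. The event log and the `spend_before` helper are hypothetical, and a real pipeline would also need to account for when each record became available, not just when the event occurred.

```python
from datetime import date

# Hypothetical event log: (customer_id, event_date, amount).
events = [
    (1, date(2023, 1, 5), 50.0),
    (1, date(2023, 2, 10), 30.0),
    (1, date(2023, 3, 20), 80.0),
]


def spend_before(events, customer_id, cutoff):
    """Total spend for one customer strictly before the cutoff date."""
    return sum(
        amount
        for cid, event_date, amount in events
        if cid == customer_id and event_date < cutoff
    )


# A prediction made on 1 March may only see the January and February events.
leak_free = spend_before(events, 1, date(2023, 3, 1))

# Summing every row would quietly include the March purchase as well.
leaky = sum(amount for cid, _, amount in events if cid == 1)
```

Here `leak_free` covers only the two earlier purchases, while `leaky` also counts a purchase that had not happened yet at prediction time.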
Preparing tabular data for AI applications demands feature engineering to make it AI-ready. Unlike unstructured data, it doesn’t benefit from pre-trained models or transformers, necessitating additional time and effort from data scientists.
The challenges continue, as tabular data can be sourced from a near-infinite set of schemas, often without well-defined AI-specific semantics. As a result, models may underperform without appropriate feature selection. Collinearities within the data can also destabilize some algorithms (linear regression coefficients, for instance, become unreliable when features are nearly redundant), further complicating the process.
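A simple first screen for collinearity is to flag pairs of numeric features whose Pearson correlation is close to 1 or -1. The sketch below is one way this could look in plain Python. The column names, the `collinear_pairs` helper, and the 0.95 threshold are illustrative assumptions, and a pairwise check will not catch redundancy that only appears across three or more features.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; report 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical feature columns: the first two carry the same information.
columns = {
    "spend_usd": [10.0, 20.0, 30.0, 40.0],
    "spend_cents": [1000.0, 2000.0, 3000.0, 4000.0],
    "visits": [3.0, 1.0, 4.0, 2.0],
}

flagged = collinear_pairs(columns)
```

In this toy data only the `spend_usd` and `spend_cents` pair is flagged, since one is just a rescaled copy of the other. Which of the two to keep remains a judgment call informed by domain knowledge.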
Tabular data is dynamic and ever-changing, requiring data scientists to remain vigilant and adapt to ongoing modifications to maintain AI model accuracy and effectiveness.
By embracing the challenges of tabular data’s complexity, we can unlock its hidden potential and uncover valuable insights that propel our organizations forward.
But that doesn’t mean that tabular data is nothing but boring hard work! When data scientists use their curiosity and imagination, they can find insights well beyond standard RFM (recency, frequency, monetary) signals. With event sequences, complex human and market behaviors, similarities of attributes, seasonality, and more at play, there’s plenty of opportunity to derive new signal types and uncover amazing insights.
So, if you’re working on tabular data, use your human creativity to make it more interesting. Look for more than the obvious. Here are a few feature engineering ideas to start you off:
- Rather than looking for customers that are similar to each other, create features that identify dissimilarities between customers within the persona groups your marketing team has been using. This can identify new opportunities for your business.
- Instead of assuming the past always predicts the future, create features that identify when a customer’s behavior changes. This not only identifies new opportunities – it is vital for identifying potential fraud.
- Don’t despair that some customers’ behaviors are more predictable than others. Create features, such as entropy, that measure how variable those behaviors have been. Maybe, in doing so, you will identify the early adopters amongst your customer base.
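Two of the ideas above, behavior change and entropy, can be sketched in a few lines of plain Python. The function names, the toy inputs, and the choice of a three-purchase "recent" window are assumptions made for illustration, not a prescribed method. The persona-dissimilarity idea is left out because it depends on how your marketing team defines its groups.

```python
from collections import Counter
from math import log2


def behavior_entropy(categories):
    """Shannon entropy (in bits) of a customer's category choices."""
    total = len(categories)
    if total == 0:
        return 0.0
    counts = Counter(categories)
    return sum((c / total) * log2(total / c) for c in counts.values())


def spend_change_ratio(amounts, recent_n=3):
    """Mean of the last recent_n amounts divided by the mean of earlier ones.

    Returns None when there is no earlier history or the baseline is zero.
    """
    earlier, recent = amounts[:-recent_n], amounts[-recent_n:]
    if not earlier or not recent:
        return None
    baseline = sum(earlier) / len(earlier)
    if baseline == 0:
        return None
    return (sum(recent) / len(recent)) / baseline


# A customer who always buys the same category is perfectly predictable.
steady = behavior_entropy(["grocery", "grocery", "grocery", "grocery"])

# A customer who spreads purchases evenly over four categories is not.
varied = behavior_entropy(["grocery", "travel", "fuel", "books"])

# Spending that suddenly jumps relative to the customer's own history.
ratio = spend_change_ratio([10.0, 10.0, 10.0, 10.0, 30.0, 30.0, 30.0])
```

Higher entropy means less predictable behavior, and a change ratio far from 1 means recent behavior differs from the customer's own baseline. Sensible thresholds for either signal depend on the business context.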
Your training data is a finite resource. Make sure to maximize its potential by focusing on uncovering new insights rather than having the algorithm learn the same old things. The best AI solutions use more than raw data from your database, supplementing data-based learning with valuable human inputs such as business goals, domain knowledge, and feature engineering.
Tabular data holds the key to unlocking untapped potential and driving competitive advantage in a world where AI solutions are becoming increasingly commonplace. Despite its apparent simplicity, navigating the complexities of tabular data requires skill, creativity, and perseverance. As data scientists, we must step up to the challenge and harness the power of tabular data to unveil meaningful insights and actionable information.
The time has come to give tabular data the recognition it deserves, unlocking its hidden potential and leading the charge toward a future where businesses thrive on the cutting edge of AI-driven success.
Article by Colin Priest, Chief Evangelist at FeatureByte
Learn more about tabular data in our recent webinar, “The Secret to Great AI: Great Data”!
With generative AI in the spotlight, tabular data isn’t getting the attention it deserves. But tabular data often makes up the bulk of an organization’s data and contains its competitive edge and unique intellectual property. Join us to learn about deriving valuable signals from tabular data. We will focus on how to add rigor to feature engineering and on best practices for deploying and maintaining features in production.