GenAI: How to Synthesize Data 1000x Faster with Better Results and Lower Costs
Editor’s note: Vincent Granville is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “GenAI Breakthrough: Fast, High Quality Tabular Data Synthetization,” there!

There are two aspects to the problem of synthesizing data. First, how do you evaluate results and compare synthesizers? Second, how do you essentially eliminate training, thus speeding up algorithms by several orders of magnitude? Eliminating training in turn yields substantial cost savings, since no GPU is needed and cloud time is significantly reduced. I focus first on evaluation, then on the fast architecture, and provide only a brief overview. The full details are in my new book “Statistical Optimization for Generative AI and Machine Learning”, available here. Both the new evaluation metric and the new data synthesizer are now available as open-source libraries, “GenAI Evaluation” and “NoGAN Synthesizer” respectively. The context is tabular data generation.

Evaluating synthetic data

Many metrics are available to evaluate the quality of tabular synthetic data. These metrics measure how similar the original and synthetic data are in terms of statistical distribution. The goal is to minimize the distance between the two joint empirical cumulative distribution functions (ECDF): one computed on the real data, the other on the generated data. Distances based on the ECDF offer benefits over distances based on the empirical probability density function (EPDF). In particular,

  • The ECDF always exists. 
  • Being an integral, it is less sensitive to errors.
  • It easily handles a mix of categorical, ordinal, and continuous features.

The distance between the joint (multivariate) ECDFs, referred to here as the Kolmogorov-Smirnov distance (KS), has been studied for some time in academia, with a focus on convergence issues. Yet, I haven’t seen a practical implementation tested on real data in dimensions higher than 3, combining both numerical and categorical features. My NoGAN algorithm, probably for the first time, comes with the full multivariate KS distance to evaluate results. It is adjusted for dimension. Also, it returns a value between 0 (best fit) and 1 (worst fit). Convergence of the approximate KS used here, while evident in all tests, remains an open theoretical question.
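To make the idea concrete, here is a minimal sketch of an approximate multivariate KS distance. It is not the implementation found in the “GenAI Evaluation” library: the function names, the choice of evaluating both ECDFs at points drawn from the pooled sample, and the absence of any adjustment for dimension are all assumptions made for illustration. It also assumes that every feature is already numerically encoded.

```python
import numpy as np

def multivariate_ecdf(data, points):
    """Joint ECDF of `data` evaluated at each row of `points`.

    For each point, returns the fraction of rows in `data` that are
    less than or equal to it in every coordinate.
    """
    # Shape (m, n, d): compare each of the m points to each of the n rows
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)

def ks_distance(real, synth, n_points=1000, seed=0):
    """Approximate sup-norm distance between two joint ECDFs (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([real, synth])
    # Assumption: evaluation points are sampled from the pooled observations
    idx = rng.integers(0, len(pooled), size=n_points)
    points = pooled[idx]
    diff = np.abs(multivariate_ecdf(real, points) - multivariate_ecdf(synth, points))
    return float(diff.max())

# Usage: a shifted copy of the data should score much worse than an identical copy
rng = np.random.default_rng(42)
real = rng.normal(size=(500, 3))
print(ks_distance(real, real))        # identical samples
print(ks_distance(real, real + 5.0))  # strongly shifted samples
```

Because the supremum is only taken over a finite set of sampled points, this sketch gives a lower bound on the true distance; more evaluation points tighten the approximation at a higher computing cost.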

The reason to implement this distance, despite its complexity, is to avoid false negatives. Metrics used by vendors frequently rate poor synthetizations as excellent, due to lack of depth. Unlike standard techniques, the multivariate ECDF captures all linear and non-linear feature dependencies spanning across multiple dimensions, thus eliminating this problem. In addition, all evaluations were performed using cross-validation: splitting the real data into training and validation sets, using the training data only for synthetization, and the validation set to assess performance. 

Generating synthetic data

NoGAN is the first algorithm in a series of high-performance, fast synthesizers not based on neural networks such as GANs. It makes a single pass over the input data, creating the minimum number of multivariate bins (hyperrectangles) needed to efficiently cover the sparse working area in the feature space. The shapes of these static bins are pre-determined based on feature quantiles. The total number of bins is at most equal to the number of observations. All categorical features are jointly encoded using an efficient scheme (“smart encoding”).

To produce the synthetic data, I sample bin counts using a multinomial distribution, to replicate the count distribution observed in the real data. Within each bin, synthetic observations are generated using a uniform or truncated Gaussian distribution, centered at the mean estimated on the real data.
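The binning and sampling steps described above can be sketched as follows. This is a simplified illustration, not the “NoGAN Synthesizer” library: the function name `synthesize` and its parameters are hypothetical, only continuous features are handled (no “smart encoding” of categorical features), and only the uniform option is implemented within each bin.

```python
import numpy as np

def synthesize(real, n_intervals, n_synth, seed=0):
    """Quantile-bin the real data, then sample bins and points within bins.

    real        : array of shape (n, d), continuous features only (assumption)
    n_intervals : number of quantile intervals per feature (the hyperparameter vector)
    n_synth     : number of synthetic observations to generate
    """
    rng = np.random.default_rng(seed)
    n, d = real.shape

    # Bin edges for each feature, based on its empirical quantiles
    edges = [np.quantile(real[:, j], np.linspace(0, 1, n_intervals[j] + 1))
             for j in range(d)]

    # Interval index of each observation, for each feature
    codes = np.column_stack([
        np.clip(np.searchsorted(edges[j], real[:, j], side="right") - 1,
                0, n_intervals[j] - 1)
        for j in range(d)
    ])

    # Keep only non-empty multivariate bins: at most n of them
    bins, counts = np.unique(codes, axis=0, return_counts=True)

    # Multinomial sampling replicates the bin count distribution of the real data
    synth_counts = rng.multinomial(n_synth, counts / counts.sum())

    # Lower and upper corner of each hyperrectangle
    lows = np.column_stack([edges[j][bins[:, j]] for j in range(d)])
    highs = np.column_stack([edges[j][bins[:, j] + 1] for j in range(d)])

    # Uniform sampling within each selected bin
    lows_rep = np.repeat(lows, synth_counts, axis=0)
    highs_rep = np.repeat(highs, synth_counts, axis=0)
    return rng.uniform(lows_rep, highs_rep)

# Usage on toy data with two correlated features
rng = np.random.default_rng(7)
x = rng.normal(size=1000)
real = np.column_stack([x, x + 0.3 * rng.normal(size=1000)])
synth = synthesize(real, n_intervals=[20, 20], n_synth=500)
print(synth.shape)
```

Since empty bins are never sampled, synthetic points stay inside regions where real observations exist, which is how this kind of scheme preserves dependencies between features without any training loop.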

Figure 1: Synthetic data (left) versus real (right), Telecom dataset

The main hyperparameter vector specifies the number of quantile intervals to use for each feature (one per feature). It is easy to fine-tune, allowing for auto-tuning. Indeed, the whole technique epitomizes explainable AI. For instance, if a categorical feature has one category that accounts for only 1% of the observations, the corresponding hyperparameter value must be at least 100 (the inverse of 1%) to make sure it won’t be missed in the synthetization. 
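The rule in the example above (the inverse of the rarest category's share) is simple arithmetic. A small sketch, using a hypothetical helper name, shows how such a lower bound could be computed for one categorical feature:

```python
import math
from collections import Counter

def min_quantile_intervals(categories):
    """Smallest hyperparameter value that keeps the rarest category visible.

    Computes the inverse of the rarest category's share, rounded up.
    """
    counts = Counter(categories)
    rarest_count = min(counts.values())
    return math.ceil(len(categories) / rarest_count)

# Usage: one category accounts for 1% of the observations
feature = ["common"] * 99 + ["rare"]
print(min_quantile_intervals(feature))  # → 100
```

This only gives a lower bound for one feature; the actual value still needs tuning against the quality observed on the validation set.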

Large hyperparameter values generally work well but can lead to overfitting and other issues, which become most visible when comparing synthesized data to the validation set. As a rule of thumb, it is best to use the smallest values that achieve the desired quality. Smaller values also lead to richer synthetic data; they are beneficial when using augmented data to increase the performance of predictive algorithms.

About the author:

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.

Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS). He published in “Journal of Number Theory”, “Journal of the Royal Statistical Society” (Series B), and “IEEE Transactions on Pattern Analysis and Machine Intelligence”. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise grade projects to participants.

