The Expanding Importance of Synthetic Data

Working with synthetic data has long been a benefit for data scientists. For example, R data scientists can use the rnorm() function in base R to generate random numbers that follow the normal (Gaussian) probability distribution. Python data scientists can do the same with np.random.normal() in NumPy. Both functions let you specify the number of values to generate, along with the mean and standard deviation. Of course, there are many other probability distributions (binomial, uniform, Poisson, etc.) that you can use to simulate data for different problem domains.
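As a quick illustration of the NumPy side, here is a minimal sketch. It uses NumPy's Generator interface (np.random.default_rng), which takes the same loc, scale, and size arguments as np.random.normal(). The seed, the sample size, and the parameter values are arbitrary choices made for this example.

```python
import numpy as np

# Seeded generator so the draws are reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)

# 1,000 draws from a normal distribution with mean 50 and standard deviation 10,
# the NumPy counterpart of rnorm(1000, mean = 50, sd = 10) in R.
values = rng.normal(loc=50.0, scale=10.0, size=1000)

# Other distributions follow the same pattern.
successes = rng.binomial(n=10, p=0.5, size=1000)        # successes in 10 trials
wait_times = rng.uniform(low=0.0, high=5.0, size=1000)  # values in [0, 5)
arrivals = rng.poisson(lam=3.0, size=1000)              # events per interval

# The sample statistics should land close to the parameters requested above.
print(round(values.mean(), 1), round(values.std(), 1))
```

Because you chose the parameters yourself, you know exactly what a model or a statistical test should recover from the data, which is what makes this kind of simulated data set useful for experiments.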

The idea is to create simulated data sets for use with data experiments where an intimate understanding of the characteristics of a particular distribution aids in understanding the use of the data set with a machine learning algorithm.

Synthetic data is a once fragmented field now beginning to coalesce around the intent to help mitigate frequent project delays and cancelations. Synthetic data is emerging as an essential element in building accurate and capable machine learning models, as it provides data scientists with vast amounts of perfectly labeled data on-demand. We’re seeing that working with synthetic data can be a very useful learning tool. It can help us better understand real-world data we collect by serving as a way to mimic the structure of data we hope to collect or data we’ve already collected. It can help us better understand the analyses and statistical models we wish to use.

The use of synthetic data has only accelerated in the past few years with the rise of deep learning. Successful deep learning applications require a large amount of labeled observations. Labeled data isn’t easy to come by, and that’s why synthetic data comes to the rescue. Many deep learning algorithms are hungry for labeled data for training purposes. Specifically, facial recognition algorithms yield more accurate results the more training data that are available, and autonomous vehicles are better able to avoid “long-tail” problems when trained on increasing amounts of video data that includes more real-life situations. The question is how to quench this accelerating thirst for data. 

This article reviews the state of the art for synthetic data and highlights some modern-day solutions provided by a number of movers and shakers in this field – start-up companies that have taken an important stake in the demand for synthetic data.

“AI is driven by the amount, quality, and speed of training data. Synthetic training data is already making waves in several industries including autonomous vehicles and robotics. There is a critical need for more education on the underlying technology and benefits to drive broader industry adoption,” said Yashar Behzadi, CEO and founder of Synthesis AI. “Building core synthetic data capability will be the key to whether or not some companies adapt or fall behind in the future. Synthetic data has the potential to deliver perfectly labeled data on-demand, potentially cutting millions of dollars and months of work related to the current process of collecting, preparing, and manually labeling training data.”

Additionally, there is a new (2021), seminal text on the subject, “Synthetic Data for Deep Learning,” by Sergey I. Nikolenko, which serves as an all-in-one learning resource for getting up to speed on the subject, along with a comprehensive historical perspective (including an amazing References section for citations to early papers).

You also might want to check out the synthpop package for R, which aims to produce a synthetic version of an observed data set that mimics the statistical characteristics of the original.

What is Synthetic Data and Why is it Needed?

Synthetic data is evolving as an advantageous solution for model development. The idea is to allow machine learning methodologies to learn the statistical information from a real data set and simulate it on a new simulated data set, without copying or transforming the original data. Synthetic data is artificially created and keeps the original data properties, safeguarding its business value while maintaining compliance. It’s important to preserve the data quality and structure, ensuring high-quality data for purposes such as training machine learning models. Moreover, using synthetic data, organizations can achieve data set balance, address issues such as bias, and certify more fairness within the data sets used to develop data science initiatives. 
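To make the "learn the statistics, then simulate" idea concrete, here is a deliberately simple sketch using only the Python standard library. It fits a two-column Gaussian (means, standard deviations, and correlation) to a stand-in "real" data set and samples brand-new rows from the fit. The function names, the (age, income) columns, and all numbers are hypothetical, and a plain Gaussian is an assumption made for brevity; it is not the method of any vendor mentioned in this article, whose generators are typically far more expressive.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Learn column means, standard deviations, and correlation from (x, y) rows."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted distribution; no original row is copied."""
    rho = params["rho"]
    spread = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        y = params["my"] + params["sy"] * (rho * z1 + spread * z2)
        rows.append((x, y))
    return rows

rng = random.Random(7)

# Stand-in for a "real" data set: (age, income) pairs with a positive relationship.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 1000 * age + rng.gauss(0, 8000)))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 2000, rng)
synth_params = fit_gaussian(synthetic)

# The correlation learned from the real data should reappear in the synthetic data.
print(round(params["rho"], 2), round(synth_params["rho"], 2))
```

The synthetic rows share the summary statistics of the original rows without being copies or transformations of any individual record, which is the property the paragraph above describes.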

“We believe that having quality data is truly a game-changer and that by creating high-quality data that resembles real-world data that was initially inaccessible, endless possibilities can be unlocked,” explains YData co-founder Gonçalo Martins Ribeiro. “In 2020 we conducted a study that found that the biggest problem faced by data scientists was the unavailability of high-quality data even though it is widely accepted that data is the most valuable resource. Not every company, researcher, or student has access to the most valuable data like some tech giants do. As machine learning algorithms and coding frameworks evolve rapidly, it’s safe to say the scarcest resource in AI is high-quality data at scale.”

Protecting Privacy

Synthetic data sets look as authentic as a company’s actual customer data, reflecting behaviors and patterns with a high degree of accuracy, but without the personal data points. This helps companies comply with privacy protection regulations such as GDPR and California’s CCPA, while at the same time uncovering important insights from the data. Many data scientists are now expressing a preference for working with synthetic data over real data because real data can represent a liability in terms of being at risk of a breach or a hack, and also because real data can’t always be used for many development tasks, including model training, due to regulatory restrictions.

Synthetic data, on the other hand, is not subject to such regulations because it contains no Personally Identifiable Information (PII). That means data scientists are free to use it any way they wish, with no regulatory issues and no risk of a breach or hack. With the right expertise, synthetic data sets can also be improved upon. For example, one can increase the incidence of rare events in the data set, making algorithms more efficient in learning these rare patterns (e.g. in fraud detection), or boost the representation of underrepresented groups (such as women or people of color) in order to remove bias from models and make them more accurate.
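Here is a rough sketch of what increasing the incidence of rare events can look like in code. It creates new rare-class records by interpolating between randomly paired existing ones, a simplification of SMOTE-style oversampling (which would pair each record with a near neighbor instead). The function name, the field names, and the toy transaction data are all hypothetical, and the sketch assumes every non-label field is numeric.

```python
import math
import random

def boost_rare_class(rows, label_key, rare_label, target_share, rng):
    """Add synthetic rare-class rows until they make up target_share of the data."""
    rare = [r for r in rows if r[label_key] == rare_label]
    if len(rare) < 2:
        raise ValueError("need at least two rare-class rows to interpolate")
    common_count = len(rows) - len(rare)
    # Solve rare_total / (rare_total + common_count) >= target_share for rare_total.
    needed_total = math.ceil(target_share * common_count / (1 - target_share))
    synthetic = []
    for _ in range(max(0, needed_total - len(rare))):
        a, b = rng.sample(rare, 2)
        t = rng.random()
        new_row = {label_key: rare_label}
        for key in a:
            if key != label_key:
                # A point on the line segment between two real rare records.
                new_row[key] = a[key] + t * (b[key] - a[key])
        synthetic.append(new_row)
    return rows + synthetic

rng = random.Random(0)

# Toy data: 990 ordinary transactions and 10 fraudulent ones (1% fraud).
transactions = [
    {"amount": rng.uniform(5, 200), "hour": rng.uniform(8, 22), "is_fraud": 0}
    for _ in range(990)
]
transactions += [
    {"amount": rng.uniform(900, 5000), "hour": rng.uniform(0, 5), "is_fraud": 1}
    for _ in range(10)
]

balanced = boost_rare_class(transactions, "is_fraud", 1, 0.2, rng)
fraud_share = sum(r["is_fraud"] for r in balanced) / len(balanced)
print(len(balanced), round(fraud_share, 3))
```

The same routine could be pointed at an underrepresented group instead of a fraud label. Note that interpolation only fills in the space between records you already have, so it cannot invent patterns that are absent from the original rare cases.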

De-biasing Data

According to Gartner, today, 85% of algorithms are erroneous due to bias. Most of the blame lies with the data sets used to train AI models, which often lack enough data representation for women, people of color, and other minority groups. Over time, this underrepresentation not only leads to bad business decisions, but also has a tangible impact on consumers – in circumstances like women being approved for credit at a lower rate than men. Synthetic data is emerging as a possible solution to bias in data. 

“Bias is a top concern for financial services companies right now,” said Alexandra Ebert, Chief Trust Officer at MOSTLY AI. “AI explainability is always an issue, for internal buy-in on models but also for regulatory compliance and customer buy-in. Synthetic data has proven to be an excellent approach to AI explainability, and a great way to train or retrain models to eliminate that bias and improve model accuracy and performance.”

Addressing the Needs of Big Data

If you are deploying algorithms with high dimensionality data sets, and critical quality and safety factors, then synthetic data generation provides a mechanism for cost-effectively creating large data sets. Synthetic data is often the only option since actual data is either not available or unusable. 

“The metaverse cannot be built without the use of synthetic data,” said Yashar Behzadi, CEO and Founder of Synthesis AI. “To recreate reality as a digital twin, it’s necessary to deeply understand humans, objects, 3D environments, and their interactions with one another. Creating these AI capabilities requires tremendous amounts of high-quality labeled 3D data, data that is impossible for humans to label. We are incapable of labeling distance in 3D space, inferring material properties, or labeling light sources needed to recreate spaces in high fidelity. Synthetic data built using a combination of generative AI models and visual effects (VFX) technologies will be a key enabler of the AI models required to power new metaverse applications.”

Predicted Uses of Synthetic Data

A recent Gartner study predicts that synthetic data will result in better privacy. Synthetic data is generated from machine learning models that capture the patterns and statistical properties of real data, but it lacks any one-to-one mapping back to an individual. The study builds on predictions previously made about synthetic data, when it was determined that:

  • by 2024, 60% of the data used for the development of AI and analytics solutions will be synthetically generated;
  • by 2024, use of synthetic data and transfer learning will halve the volume of real data needed for machine learning; and
  • by 2025, 10% of governments will use a synthetic population with realistic behavior patterns to train AI while avoiding privacy and security concerns.

Vienna-based MOSTLY AI pioneered the creation of synthetic data for AI model development and software testing. The company currently works with multiple Fortune 100 banks and insurers in North America and Europe. “Gartner predicts that by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated,” says Tobias Hann, CEO at MOSTLY AI. “AI-generated synthetic data is better than real data for many reasons. First, it is truly anonymous and privacy preserving. And second, it can be shaped and formed as needed, for example by up-sampling or mitigating biases. Ultimately, utilizing synthetic data results in better performing and safer AI models.”

How is Synthetic Data Used?

A number of major advancements expected in 2022 suggest a coming synthetic data revolution that will open the floodgates for new computer vision and AI applications across an array of industries. Synthetic data is set to amplify everything from automotive safety to the development of the metaverse. It may even help resolve the supply chain crisis. 

As AI makes its way into pervasive adoption by a growing collection of industries and applications, the demand for robust training data will expand accordingly. Yet, with manual data collection methods already at a breaking point, the fight for AI superiority will only serve to expand the gulf between the availability of training data and the demand for more. Fortunately, a number of new companies (see the Synthetic Data Vendors section below) are making it easier and more affordable to generate high-quality synthetic data sets to train computer vision models. The ability to generate multitudes of synthetic images, all customized to suit the unique constraints of each distinct application, makes synthetic data the apparent solution to the limitations of customary, manually-collected data.

Training data has become a significant stumbling block for computer vision professionals who encounter a number of data-related complications hindering their work: (i) wasted time and/or resources caused by a need to frequently retrain the system, (ii) poor annotation resulting in quality issues, (iii) poor data coverage of the intended application’s domain, and (iv) lack of sufficient amount of data. 

 “We’re approaching a major inflection point for the synthetic data field,” said Ofir Chakon, co-founder and CEO of Datagen. “In 2021, AI underwent a major paradigm shift, in which traditional, model-centric approaches to AI development were reconsidered in favor of data-centrism, which means data scientists are now placing more significance on the quality of their training data as a determinant of performance, rather than the quality of their model. This shift in the zeitgeist — combined with the ability to rapidly iterate one’s data set in a targeted, fine-tuned way — will make 2022 the year in which synthetic data becomes the most widely used training and testing solution in AI.” 

We’re already seeing a new job description surface: the “synthetic data engineer,” a data scientist who handles the creation, processing, and analysis of large synthetic data sets in an effort to support the automation of prescriptive decision-making through data visualization.

“By 2025, Gartner expects generative AI – which can generate synthetic data to train models or identify valuable products – to account for 10% of all data produced, up from less than 1% today,” commented Wilson Pang, CTO of Appen. “Generative AI is already being used to address key challenges. For example, it is being used to generate 3D worlds for AR/VR, as well as for training autonomous vehicles and in pharmaceuticals. In 2022, we expect to see a lot more models experimenting with generative AI as an ML method of data collection. These would be models that can learn from themselves and generate new data. As Gartner also forecasted that by 2024, use of synthetic data and transfer learning will halve the volume of real data needed for machine learning, synthetic data, in this regard, will become increasingly popular when initially training a model. As early implementations of generative AI technology let companies do things like identify marketing content with a higher success rate and leverage highly nuanced NLP capabilities to diagnose health cases through text and image data, we may see more use cases emerge over the next year as experimentation and adoption pick up.”

“Synthetic 3D Data for the Next Era of AI: The rate of innovation in AI has been accelerating for the better part of a decade, but AI cannot advance without large amounts of high-quality and diverse data. Today, data captured from the real world and labeled by humans is insufficient both in terms of quality and diversity to jump to the next level of artificial intelligence. In 2022, we will see an explosion in synthetic data generated from virtual worlds by physically accurate world simulators to train advanced neural networks.” – Rev Lebaredian, Vice President of Simulation Technology, Omniverse Engineering, NVIDIA.

Synthetic Data Use Case Examples

A few use case examples of where synthetic data is being used include: (i) creating synthetic data sets for an insurer for retraining algorithms whose performance had degraded, and were exhibiting bias; (ii) synthesizing home addresses and linking synthetic geodata to weather patterns for better risk prediction; and (iii) evaluating a crime/fraud prediction data set and then creating synthetic data that corrected a skew towards racial bias. 

City of Vienna

As a more detailed use case example, consider the Austrian capital city of Vienna and how officials use data to develop a series of more than 300 software applications, such as apps to help people with public transportation and parking, visualizations of complex statistics and geodata, and useful tools for tourists and everyday life. Demographic data tends to be the most valuable data source for powering the apps, but collecting this data was not straightforward due to privacy constraints under GDPR.

Using synthetic data from MOSTLY AI, Vienna has managed to unlock their existing data sets by creating a synthetic data mashup of demographic data. The synthetic data was generated conditionally to match the actual values of total population, and number of households at the district level. All other data points such as personal data (e.g. marital status, date of birth, etc.), as well as households were synthesized to protect privacy while keeping the distributions and correlations intact, and retaining the overall counts of the original data. These “synthetic twin data sets” shared all the characteristics of the original data sets while ensuring that the personal and sensitive data of the citizens of Vienna remained untouched and fully private. 
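The idea of generating data conditionally on known district totals can be sketched in a few lines. The version below is a toy illustration only, not MOSTLY AI's method: it holds each district's population count fixed and draws the personal attributes from that district's joint frequency table, so the relationship between the attributes carries over. The function name, the attribute names, and the sample records are made up for the example, and it conditions on population totals only, not on household counts.

```python
import random
from collections import Counter, defaultdict

def synthesize_by_district(records, rng):
    """Generate synthetic records whose per-district counts match the originals."""
    by_district = defaultdict(list)
    for rec in records:
        by_district[rec["district"]].append((rec["age_band"], rec["marital_status"]))

    synthetic = []
    for district, profiles in by_district.items():
        # Joint frequency table of the attributes, so their correlation is kept.
        table = Counter(profiles)
        combos = list(table.keys())
        weights = list(table.values())
        # Condition on the real district total: draw exactly that many records.
        drawn = rng.choices(combos, weights=weights, k=len(profiles))
        for age_band, marital_status in drawn:
            synthetic.append({
                "district": district,
                "age_band": age_band,
                "marital_status": marital_status,
            })
    return synthetic

rng = random.Random(1)

# Made-up "original" records for two hypothetical districts, A and B.
original = (
    [{"district": "A", "age_band": "18-39", "marital_status": "single"}] * 60
    + [{"district": "A", "age_band": "40-64", "marital_status": "married"}] * 40
    + [{"district": "B", "age_band": "18-39", "marital_status": "married"}] * 30
    + [{"district": "B", "age_band": "65+", "marital_status": "widowed"}] * 20
)

synthetic = synthesize_by_district(original, rng)
print(Counter(r["district"] for r in synthetic))
```

A production system would model many more attributes and would need safeguards for rare attribute combinations, which a simple frequency table does not provide.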

Synthetic Data Sets for Genomics

Another use case example involves the use of generative neural networks to recreate artificial versions of the highly complex genomic sequences used by life sciences researchers. Research demonstrates encouraging evidence that state of the art synthetic data models can produce artificial versions of even highly dimensional and complex genomic and phenotypic data.  

Synthetic Time Series Data for Global Financial Institutions

Another use case example is the creation of synthetic time-series data sets for large financial institutions. The temporal, ordered nature of time series data can help track and forecast future trends, which has significant utility for business planning and investing. But due to regulations and the inherent security risks that come with sharing data between individuals and organizations, much of the value that could be gleaned from it remains out of bounds. Synthetic data can help narrow this gap while maintaining privacy. By generating synthetic time-series data that are generalizable and shareable amongst diverse teams, it is possible to provide financial institutions a competitive edge alongside new opportunities.
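As a small, hedged illustration of the principle, the sketch below fits a first-order autoregressive model, AR(1), to a series and then simulates a fresh series with the same dynamics. AR(1) is chosen here only because it fits in a few lines; the source does not say which models vendors use, and real financial series usually need richer ones. The function names and the stand-in "real" series are assumptions of this example.

```python
import random
import statistics

def fit_ar1(series):
    """Estimate the mean, lag-1 coefficient, and noise scale of an AR(1) process."""
    mean = statistics.fmean(series)
    centered = [x - mean for x in series]
    pairs = list(zip(centered[:-1], centered[1:]))
    phi = sum(a * b for a, b in pairs) / sum(a * a for a, _ in pairs)
    residuals = [b - phi * a for a, b in pairs]
    return mean, phi, statistics.pstdev(residuals)

def simulate_ar1(mean, phi, sigma, n, rng):
    """Generate a new series with the given dynamics; no original value is reused."""
    value, out = 0.0, []
    for _ in range(n):
        # Each step keeps a fraction phi of the previous deviation plus fresh noise.
        value = phi * value + rng.gauss(0, sigma)
        out.append(mean + value)
    return out

rng = random.Random(3)

# Stand-in for a real, confidential series (simulated here for the example).
real = simulate_ar1(100.0, 0.8, 1.0, 3000, rng)

mean, phi, sigma = fit_ar1(real)
synthetic = simulate_ar1(mean, phi, sigma, 3000, rng)
_, phi_synth, _ = fit_ar1(synthetic)

# The temporal dependence in the synthetic series should mirror the original.
print(round(phi, 2), round(phi_synth, 2))
```

Only the three fitted parameters cross from the real series to the synthetic one, so the shareable series preserves the ordering structure that makes time series useful for forecasting without exposing the original observations.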

Synthetic Data Caveats

Many modern AI systems are trained on massive data sets of digital images and videos from various sources. But if the data fails to include a breadth of behavior, then correspondingly, the algorithm won’t learn all the required appropriate behaviors. For instance, in autonomous vehicles, if the algorithm hasn’t been trained to slow down for emergency vehicles, then bad things happen. Software’s role in a number of reported collisions with emergency vehicles is the subject of a federal agency probe. 

One approach to address this situation is to use synthetic data to avoid disastrous events caused by lack of complete training data. Technology is used to create synthetic training data, e.g. photorealistic images of emergency vehicles that never existed in real life. But using this technology (think algorithms used by Hollywood to assemble synthetic imagery in motion pictures) may not be robust. Synthetic training data represents a convenient shortcut when real-world data is too difficult or expensive to collect, but given the potential risks, caution should be used until the synthetic data generation services are able to deliver rigorous and measurably complete data that adequately depicts the full spectrum of reality. 

Real-world data remains central to developing well-performing models. Data science teams, however, encounter challenges with respect to relying on real-world data alone. Data privacy issues can prevent collected data from being used, while collected data sets also inherit the biases of conditions under which data were collected, and collecting enough examples of edge cases and rare scenarios can be time-consuming and costly.

Fortunately, the vendors on the short list below are working hard to fill this void. As data science teams continue to be limited by the sole utilization of real-world data sets, vendors in the synthetic data space are working to enhance machine learning performance and empower data science teams to complement the benefits of real-world data with diverse and realistic synthetic data to address challenges like data privacy, bias, and special edge cases.

Synthetic Data Vendors

There are a number of companies that have become players in this important new space. Many are only a couple of years old. Here is a short list of companies to consider, in alphabetical order:

  • Deep Vision Data
  • Replica Analytics
  • Scale AI
  • Synthesis AI


If you’re serious about the subject, consider becoming part of YData’s open-source Synthetic Data Community that aims to improve access to high-quality synthetic data, specifically tabular and time-series data, the most common formats for storing data. 


In this article, we’ve looked at the topic of synthetic data from a number of perspectives and the common thread is clear — the world’s need for good data is accelerating. But manual data collection and annotation (labeled data sets) won’t be able to satisfy the impending explosion of demand. Synthetic data, on the other hand, offers a fast, customizable, and cost-effective alternative that, in many cases, performs even better than its real-world counterpart. 

“One consistent theme in the history of deep learning is that larger and more diverse data sets lead to stronger models. Synthetic data is an incredibly promising way to increase data set size and diversity and allow us to build stronger models across all computer vision use cases.” Anthony Goldbloom, Founder and CEO, Kaggle.

Synthetic data offers a host of benefits to the data scientist by providing faster and less resource-intensive generation of high-quality, targeted data sets for machine learning model training. As a result, data science teams are able to adopt a “data-centric” approach to machine learning, iterating quickly with progressively more robust and refined data sets that are well-optimized for reliable performance. Further, synthetic data generation removes the need for human annotation, mitigates bias, and eliminates privacy concerns.

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.