From the west edition of the Open Data Science Conference (ODSC), one of the “buzziest” panels was on the topic of synthetic data. This article revisits that topic with a new look at how you can quickly spin up a new fictional data set with Scikit-Learn.
Across many practice areas in the landscape of data science, the value of fictional, yet realistic, data is too often under-appreciated and even more often understated. This article aims to shine a spotlight on a lesser-known corner of the popular Scikit-Learn library: the make_blobs function coupled with the MinMaxScaler. Together, these tools are a clever way to generate realistic fictional data, which is crucial for training, testing, education, and demonstration purposes in data science.
This is not the first time I’ve addressed the topic of fictional synthetic data. For example, years ago, I wrote How To Make Fictional Data which guides readers on generating their own datasets for various purposes like testing, training, or demonstration. It emphasizes the usefulness of creating fictional data, especially for data scientists and those learning data science. I presented a detailed example of generating data for two fictional bird species varieties — Western and Eastern — using Python and libraries such as Pandas, NumPy, and Seaborn.
Later, in Three More Ways To Make Fictional Data, I wrote again for anyone looking to learn even more about fictional data. The main takeaway from that article was that each tool has its strengths and weaknesses. I suggested that generating data manually, or using a combination of these tools, might be the best way to fully meet your specific fictional data requirements.
I’m also an advocate for asking data science learners to build their own data. Doing so builds skills in data wrangling and data visualization, and it deepens knowledge of distributions. In A Professional’s Tutorial to Python Making Fictional Data I provide a detailed tutorial.
Image Credit: Author’s illustration created in Canva. Robots in a factory that are manufacturing data.
The Importance of Fictional Data
Before diving into the technicalities, let’s address a fundamental question:
Why use fictional data?
The answer lies in its controlled, yet realistic nature. Fictional data allows us to simulate real-world scenarios without the constraints and sensitivities of real-world data. Whether it’s for training machine learning models, testing algorithms, educating new data scientists, or demonstrating a concept, fictional data can be tailored to fit specific needs while maintaining a realistic structure.
It is a mistake to underestimate the value and importance of fictional data. Here is a partial list of the benefits fictional data offers.
Bias Reduction: Fictional data can help address and reduce harmful or dangerous bias by ensuring a more representative mix of all data (including in the case of human demographics). Scientists can create data for underrepresented groups, improving the diversity of the training data and reducing the risk of misidentification or bias in AI systems.
Handling Rare Scenarios: Fictional data enables the creation and testing of rare scenarios that may not be readily available in so-called real data, or that may not be available in sufficient quantities. Including data that represents rare scenarios, whether created or augmented, helps prepare AI systems to anticipate a wide range of situations.
Performance Parity with Real Data: One recent finding was that in 70% of a study’s tests, there was no significant difference in the performance of predictive models developed using synthetic data compared to those using real data. In other words, for many tasks, well-constructed synthetic data can deliver results comparable to real data.
Image Credit: Author’s illustration created in Canva. Anthropomorphic robots that are manufacturing data.
The proposed hero of this article is
make_blobs, a function in Scikit-Learn designed to generate isotropic Gaussian blobs for clustering. At first glance, it may seem like a tool solely for algorithm demonstrations. With just a little creativity, however, we can push its utility further.
How It Works
make_blobs generates data points in a specified number of “blobs” – essentially clusters of points. Each cluster can be made to have different centers and standard deviations, allowing for a wide range of data distributions. This capability makes
make_blobs ideal for creating fictional datasets that mimic real-world data distributions, which is invaluable for testing clustering algorithms, visualizing data patterns, and more.
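To make this concrete, here is a minimal sketch of how the centers and spreads can be controlled. The specific centers and standard deviations below are arbitrary illustrations, not values from this article’s main example:

```python
from sklearn.datasets import make_blobs

# Three 2-D clusters with hand-picked centers and per-cluster spreads
X, y = make_blobs(
    n_samples=300,
    centers=[(0, 0), (5, 5), (0, 8)],  # one (x, y) center per blob
    cluster_std=[0.5, 1.0, 2.0],       # a different spread for each blob
    random_state=0,
)

print(X.shape)  # (300, 2) -> 300 points, 2 features
```

Passing a list of centers fixes where each blob sits, while a list of standard deviations lets some clusters be tight and others diffuse.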
Combining with MinMaxScaler
While make_blobs generates the data, the scale (i.e., the minimum and maximum values) of this data might not always align with real-world scenarios. This is where MinMaxScaler comes into play. This scaling technique adjusts the dataset so that all feature values fall within a given range, usually between 0 and 1 (though here we adjust that a bit). Scaling makes the fictional data generated by make_blobs more realistic and applicable to a broader range of scenarios.
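The key is the feature_range argument. A quick sketch of the idea, using an arbitrary toy array and the income range from this article’s example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw = np.array([[1.0], [2.0], [3.0], [4.0]])

# Default range is (0, 1); pass feature_range to target something realistic
scaler = MinMaxScaler(feature_range=(22000, 98000))
scaled = scaler.fit_transform(raw)

print(scaled.min(), scaled.max())  # 22000.0 98000.0
```

The smallest raw value maps to the bottom of the range and the largest to the top, with everything else interpolated linearly in between.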
A Practical Example
Let’s put theory into practice. Here’s a simple example of how to use
make_blobs combined with
MinMaxScaler to generate a fictional dataset.
In this example, we’ll create two columns of data. One column will represent the age of adults ranging from 25 to 55. Another column will be the annual income for those adults, ranging from 22,000 to 98,000.
# Standard imports
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_blobs(n_samples=10000, centers=3, cluster_std=5)
# Put data in a data frame
df = pd.concat([pd.DataFrame(X, columns=['age', 'income']),
                pd.DataFrame(y, columns=['group'])], axis=1)
# Instantiate appropriate scalers
inc_scale = MinMaxScaler((22000, 98000))
age_scale = MinMaxScaler((25, 55))
# Scale the data
df['income'] = inc_scale.fit_transform(df['income'].values.reshape(-1, 1))
df['age'] = age_scale.fit_transform(df['age'].values.reshape(-1, 1))
# Plotting the scaled data
sns.scatterplot(data=df, y='income', x='age', hue='group')
In this code, we generate a dataset with 10,000 samples distributed among 3 centers (or clusters). We ask make_blobs to give these features a relatively high standard deviation of 5, well above the default of 1.0; the higher standard deviation makes the clusters less distinct. Then, with inc_scale and age_scale, we rescale the two features to ranges (minimum and maximum values) that better match what we would expect for income and age data.
To make sure this data was “realistic” (in this case, essentially meaning that as age increased, income also increased), I experimented with random seeds until I found a result that produced this positive correlation. Here is the resulting scatter plot.
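If you prefer to check that relationship programmatically rather than by eye, you can compute the correlation between the scaled columns. A minimal, self-contained sketch, regenerating the data with an arbitrary illustrative seed (not necessarily the one behind the plot above):

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Regenerate the data; the seed here is illustrative, not the author's
X, y = make_blobs(n_samples=10000, centers=3, cluster_std=5, random_state=42)
df = pd.DataFrame(X, columns=['age', 'income'])
df['income'] = MinMaxScaler((22000, 98000)).fit_transform(df[['income']]).ravel()
df['age'] = MinMaxScaler((25, 55)).fit_transform(df[['age']]).ravel()

# Pearson correlation between age and income; re-run with new seeds
# until the sign and strength match the relationship you want
print(df['age'].corr(df['income']))
```

A value near +1 indicates the strong positive age-income relationship described above; a negative value means that seed is not the one you want.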
Image Credit: Author’s illustration. Generated with the code shown here.
The combination of make_blobs and
MinMaxScaler provides a swift and efficient way to generate realistic fictional data, offering a valuable asset to any data scientist’s toolkit. Whether you’re testing algorithms, training models, or teaching concepts, this approach opens up new avenues for exploration and innovation in your data science projects.
Remember, the beauty of fictional data lies in its versatility and harmlessness — it’s a sandbox where you can explore, experiment, and learn without the risks associated with real-world data. So, dive in and let
make_blobs breathe new life into your data science practice!
Thanks For Reading
Are you ready to learn more about careers in data science? I perform one-on-one career coaching and have a weekly email list that helps data professional job candidates. Contact me to learn more.
Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. Twitter: @adamrossnelson LinkedIn: Adam Ross Nelson.