MakeBlobs + Fictional Synthetic Data: A New(ish) Use Case

From the west edition of the Open Data Science Conference (ODSC), one of the “buzziest” panels was on the topic of synthetic data. This article revisits that topic with a new look at how you can quickly spin out a new fictional data set with make_blobs.

Image Credit: ODSC Conference. Four panelists speaking about synthetic data, including Ali Golshan, Jay Alammar, Sheamus McGovern, and Yashar Behzadi. Image used with permission.

Across many practice areas in the landscape of data science, the value of fictional, yet realistic, data is too often under-appreciated and even more often understated. This article aims to shine a spotlight on a lesser-known corner of the popular Scikit-Learn library: the make_blobs function, coupled with MinMaxScaler. Together, these tools can be a clever way to generate realistic fictional data, which is crucial for training, testing, education, and demonstration purposes in data science.

This is not the first time I’ve addressed the topic of fictional synthetic data. For example, years ago, I wrote How To Make Fictional Data, which guides readers on generating their own datasets for purposes like testing, training, or demonstration. It emphasizes the usefulness of creating fictional data, especially for data scientists and those learning data science. I presented a detailed example of generating data for two fictional bird varieties — Western and Eastern — using Python and libraries such as Pandas, NumPy, and Seaborn.

Later, in Three More Ways To Make Fictional Data, I wrote again for anyone looking to learn even more about fictional data. The main takeaway from that article was that each tool has its strengths and weaknesses. I suggested that generating data manually, or using a combination of these tools, might be the best way to fully meet your specific fictional data requirements.

I’m also an advocate for asking data science learners to build their own data. Doing so builds skills in data wrangling and data visualization, and it also builds knowledge of distributions. In A Professional’s Tutorial to Python Making Fictional Data I provide a detailed tutorial.

Image Credit: Author’s illustration created in Canva. Robots in a factory that are manufacturing data.

The Importance of Fictional Data

Before diving into the technicalities, let’s address a fundamental question:

Why use fictional data?

The answer lies in its controlled, yet realistic nature. Fictional data allows us to simulate non-fictional scenarios without the constraints and sensitivities of real-world data. Whether it’s for training machine learning models, testing algorithms, educating new data scientists, or demonstrating a concept, fictional data can be tailored to fit specific needs while maintaining a realistic structure.

It is a mistake to underestimate the value and importance of fictional data. Here is a partial list of the benefits it offers.

Bias Reduction: Fictional data can help address and reduce harmful or dangerous bias by ensuring a more representative mix of all data (including in the case of human demographics). Scientists can create data for underrepresented groups, improving the diversity of the training data and reducing the risk of misidentification or bias in AI systems.

Handling Rare Scenarios: Fictional data enables the creation and testing of rare scenarios that may not be readily available in so-called real data, or that may not be available in sufficient instances. Including and augmenting data that represents rare scenarios helps prepare AI systems to anticipate a wide range of situations.

Performance Parity with Real Data: One recent study found that in 70% of its tests, there was no significant difference in the performance of predictive models developed using synthetic data compared to those using real data. In other words, in many cases artificial data can give the same results as real data.

Image Credit: Author’s illustration created in Canva. Anthropomorphic robots that are manufacturing data.

Introducing MakeBlobs

The proposed hero of this article is make_blobs, a function in Scikit-Learn that is designed to generate isotropic Gaussian blobs for clustering. At first glance, it may seem like a tool solely for algorithm demonstrations, but with just a little creativity we can push its utility further.

How It Works

make_blobs generates data points in a specified number of “blobs” – essentially clusters of points. Each cluster can be made to have different centers and standard deviations, allowing for a wide range of data distributions. This capability makes make_blobs ideal for creating fictional datasets that mimic real-world data distributions, which is invaluable for testing clustering algorithms, visualizing data patterns, and more.
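To make this concrete, here is a minimal sketch of make_blobs on its own; the centers, spreads, and sample count below are arbitrary illustration values, not values from this article's main example.

from sklearn.datasets import make_blobs

# A minimal sketch: three two-dimensional clusters, each with its own
# center and spread (all values are arbitrary illustration choices)
X, y = make_blobs(
    n_samples=300,
    centers=[(0, 0), (5, 5), (10, 0)],  # one (x, y) center per cluster
    cluster_std=[0.5, 1.0, 2.0],        # per-cluster standard deviation
    random_state=42,
)

print(X.shape)  # (300, 2), the feature matrix
print(y[:10])   # cluster labels (0, 1, or 2) for the first ten points

Passing a list of centers and a list of standard deviations, rather than single values, is what lets you control the location and spread of each cluster independently.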

Combining with MinMaxScaler

While make_blobs generates data, the scale (i.e., the minimum and maximum values) of this data might not always align with real-world scenarios. This is where MinMaxScaler comes into play. This scaling technique adjusts the dataset so that all feature values fall within a given range, usually between 0 and 1 (except here, we adjust that a bit). This scaling makes the fictional data generated by make_blobs more realistic and applicable to a broader range of scenarios.
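Here is a quick sketch of that rescaling step in isolation; the raw values and the (25, 55) target range are made up for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# feature_range sets the target minimum and maximum; the raw values
# below are made-up stand-ins for make_blobs output
raw = np.array([[-4.2], [0.0], [3.1], [7.8]])

scaler = MinMaxScaler(feature_range=(25, 55))
scaled = scaler.fit_transform(raw)

print(scaled.ravel())  # [25.   35.5  43.25 55.  ]

The smallest raw value maps to the new minimum and the largest to the new maximum, with everything in between interpolated linearly.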

A Practical Example

Let’s put theory into practice. Here’s a simple example of how to use make_blobs combined with MinMaxScaler to generate a fictional dataset.

In this example, we’ll create two columns of data. One column will represent the age of adults ranging from 25 to 55. Another column will be the annual income for those adults, ranging from 22,000 to 98,000.

# Standard imports
import pandas as pd
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

import seaborn as sns
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_blobs(n_samples=10000, centers=3, cluster_std=5,
                  n_features=2, random_state=101)

# Put data in a data frame
df = pd.concat(
    [pd.DataFrame(X, columns=['income','age']),
     pd.DataFrame(y, columns=['group'])], axis=1)

# Instantiate appropriate scalers
inc_scale = MinMaxScaler((22000, 98000))
age_scale = MinMaxScaler((25, 55))

# Scale the data
df['income'] = inc_scale.fit_transform(df['income'].values.reshape(-1, 1))
df['age'] = age_scale.fit_transform(df['age'].values.reshape(-1, 1))

# Plotting the scaled data
sns.scatterplot(data=df, y='income', x='age', hue='group')
plt.show()

In this code, we generate a dataset with 10,000 samples distributed among 3 centers (or clusters). We ask make_blobs for a relatively high cluster standard deviation of 5, well above the default of 1.0. The higher standard deviation makes the clusters less distinct.

With inc_scale and age_scale we rescale the two features to ranges (min and max values) that better match what we would expect for income and age data.
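As a quick sanity check (assuming the df produced by the code above), you can confirm that the rescaled columns span the intended ranges.

# Verify the rescaled ranges of the df built above
print(df[['income', 'age']].agg(['min', 'max']))
# income should span 22000 to 98000; age should span 25 to 55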

To make sure this data was “realistic” (in this case, essentially meaning that as age increased, we also saw an increase in income), I experimented with random seeds until I saw a result that produced this positive correlation.
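Rather than eyeballing plots, you could automate that seed hunt. The sketch below is one way to do it, not my exact process; the seed range is an arbitrary choice, and because min-max scaling is linear it leaves the correlation unchanged, so we can check the raw blob output directly.

import pandas as pd
from sklearn.datasets import make_blobs

# Try a handful of seeds and report the age-income correlation each
# produces (the range 100-110 is an arbitrary illustration choice)
for seed in range(100, 111):
    X, _ = make_blobs(n_samples=10000, centers=3, cluster_std=5,
                      n_features=2, random_state=seed)
    df = pd.DataFrame(X, columns=['income', 'age'])
    corr = df['age'].corr(df['income'])
    print(f"seed={seed}  age-income correlation={corr:+.2f}")

With a suitable seed in hand, here is the resulting scatter plot.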

The scatter plot graphs income on the y-axis and age on the x-axis, with points color-coded into three groups (0, 1, and 2, per the legend at top right). Each group spans a range of ages and income levels, with a dense cluster around the middle of both ranges that spreads out more sparsely toward the extremes.

Image Credit: Author’s illustration. Generated with the code shown here.

Conclusion

The combination of make_blobs and MinMaxScaler provides a swift and efficient way to generate realistic fictional data, offering a valuable asset to any data scientist’s toolkit. Whether you’re testing algorithms, training models, or teaching concepts, this approach opens up new avenues for exploration and innovation in your data science projects.

Remember, the beauty of fictional data lies in its versatility and harmlessness — it’s a sandbox where you can explore, experiment, and learn without the risks associated with real-world data. So, dive in and let make_blobs breathe new life into your data science practice!

Thanks For Reading

Are you ready to learn more about careers in data science? I perform one-on-one career coaching and have a weekly email list that helps data professional job candidates. Contact me to learn more.

Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. Twitter: @adamrossnelson. LinkedIn: Adam Ross Nelson.

Article originally posted here. Reposted with permission.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
