# Generating data with random Gaussian noise

Modeling · Statistics · Posted by Nikolay Manchev, March 28, 2017

I recently needed to generate some data for $y$ as a function of $x$, with some added Gaussian noise. This comes in handy when you want to generate data with an underlying regularity that you want to discover, for example when testing different machine learning algorithms.

What I wanted was a mechanism that lets me specify a range for $x$ and then generate data using

$$y = f(x) + \epsilon$$

with the ability to control the function $f(x)$ and the parameters of the Gaussian noise $\epsilon$.
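Writing the noise model out explicitly makes the role of each parameter easier to see. Assuming the noise terms are drawn independently for each point $x_i$ (an assumption, though a natural one for this kind of synthetic data), the model is:

$$
\epsilon_i \sim \mathcal{N}(\mu, \sigma^2), \qquad
y_i = f(x_i) + \epsilon_i, \qquad
\mathbb{E}[y_i] = f(x_i) + \mu, \qquad
\operatorname{Var}(y_i) = \sigma^2
$$

So a non-zero $\mu$ shifts the whole data set up or down relative to $f(x)$, while $\sigma$ controls how widely the points scatter around the curve.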

I came up with this simple function, which allows me to specify $f(x)$, the $x$ interval and step, and the Gaussian distribution parameters ($\mu$ and $\sigma$).

```python
import numpy as np


def corr_vars(start=-10, stop=10, step=0.5, mu=0, sigma=3, func=lambda x: x):
    # Generate x
    x = np.arange(start, stop, step)

    # Generate random noise
    e = np.random.normal(mu, sigma, x.size)

    # Generate y values as y = func(x) + e
    y = np.zeros(x.size)

    for ind in range(x.size):
        y[ind] = func(x[ind]) + e[ind]

    return (x, y)
```
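The explicit loop is easy to follow and works even when `func` only accepts scalars. If `func` operates element-wise on NumPy arrays (an assumption; the default `lambda x: x` and `np.sin` both do), the same data can be produced in a single vectorized expression. The sketch below is one possible variant, not the version from the original post: the name `corr_vars_vectorized` and the `rng` parameter are hypothetical additions, and it uses NumPy's `default_rng` generator instead of the global `np.random.seed` state.

```python
import numpy as np


def corr_vars_vectorized(start=-10, stop=10, step=0.5, mu=0, sigma=3,
                         func=lambda x: x, rng=None):
    """Return (x, y) with y = func(x) + e, where e ~ N(mu, sigma^2).

    Assumes func works element-wise on a NumPy array.
    """
    # Use the caller's generator if one is given, otherwise create a fresh one
    if rng is None:
        rng = np.random.default_rng()

    # Generate x, the noise, and y in one pass each (no Python-level loop)
    x = np.arange(start, stop, step)
    e = rng.normal(mu, sigma, x.size)
    y = func(x) + e

    return x, y


# Passing a seeded generator makes the output reproducible
x, y = corr_vars_vectorized(func=np.sin, rng=np.random.default_rng(2))
print(x.shape, y.shape)
```

Passing the generator in explicitly keeps the randomness local to the call, which can make it easier to reproduce a single data set without reseeding global state.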

Here are two examples of using the function to generate two data sets: one using $y = x + \epsilon$, the other using $y = 2\pi\sin(x) + \epsilon$.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(2)

(x0, y0) = corr_vars(sigma=3)
(x1, y1) = corr_vars(sigma=3, func=lambda x: 2 * np.pi * np.sin(x))

f, axarr = plt.subplots(2, sharex=True, figsize=(7, 7))

# Top panel: linear data, with the noiseless line in red
axarr[0].scatter(x0, y0)
axarr[0].plot(x0, x0, color='r')
axarr[0].set_title('y = x + e')
axarr[0].grid(True)

# Bottom panel: sine data, with the noiseless curve in red
axarr[1].scatter(x1, y1)
axarr[1].plot(x1, 2 * np.pi * np.sin(x1), color='r')
axarr[1].set_title('y = 2*π*sin(x) + e')
axarr[1].grid(True)

plt.show()
```

The snippet above plots the resulting data sets, together with the noiseless function (in red) for comparison.
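Besides the visual comparison, one quick sanity check is to subtract the noiseless function from the generated data and look at what remains: the residuals should have a sample mean near $\mu$ and a sample standard deviation near $\sigma$. The sketch below is self-contained and regenerates the sine data inline rather than calling `corr_vars`; the seed, the finer step (chosen only to get more samples) and the variable names are illustrative assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
mu, sigma = 0, 3

# A finer step than the default 0.5 gives roughly 2000 samples,
# so the estimates below are reasonably stable
x = np.arange(-10, 10, 0.01)
noiseless = 2 * np.pi * np.sin(x)
y = noiseless + rng.normal(mu, sigma, x.size)

# Removing the known function leaves only the noise term
residuals = y - noiseless

print("sample mean:", residuals.mean())
print("sample std: ", residuals.std(ddof=1))
```

With only 40 points, as in the default settings, the estimates will fluctuate noticeably more from one seed to the next.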

The full source code is available on GitHub.
The original post is located at cleverowl.uk

## Nikolay Manchev

I have over 10 years of database experience and have been involved in large scale migration, consolidation, and data warehouse deployment projects in the UK and abroad. I am a speaker, blogger, author of numerous articles and a book on advanced database topics. I've been working exclusively in the big data (Hadoop) space since 2015, with focus on Spark and machine learning. I have an M.Sc. in Software Technologies and an M.Sc. in Data Science (City University London).
