The Importance of Preprocessing Data the Right Way

There are so many different aspects of training a neural network that affect its performance. Many data scientists spend too much time thinking about learning rates, neuron structures, and epochs before making sure their data is correctly prepared. Without properly formatted data, your neural network will be useless, regardless of the hours you may spend optimizing its hyperparameters. Preprocessing data tends to revolve around the following tasks:

  • Data Cleaning
  • Outlier Removal
  • Transformation
  • Normalization

I am going to demonstrate what these actually involve and how they are applied.

Data cleaning

Cleaning data and removing outliers are tedious but compulsory tasks. Whether datasets come from the internet, arrive in conflicting file formats, or are created by non-technical people, you will nearly always find issues with syntax, row delimiters, and so on.

For example, sometimes a dataset might have quote marks around numeric values ('4', '5', '6'), which would be interpreted as strings. We need a way to strip irrelevant surrounding characters from each value. Another example is row delimiters. Some datasets separate their rows with line breaks ('\n'), some with spaces, and some with unusual characters like a pipe (|). Some don't separate them at all and assume you know the field count, so you can split rows by counting commas. That can be dangerous, because aligning the split incorrectly could shuffle your data and render it completely useless (repeated fields in the same row, and so on). You can also get empty values, which in a text file might look like two commas side by side.

To prevent errors later on, you may need a place-holding value for these. Datasets rarely do that kind of thing for you; to them, it's just a blank value, but to you, it's a potential breaking point.
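As a rough sketch of this kind of cleanup (assuming a comma-delimited text file and a placeholder of "0" for empty fields, both of which are illustrative choices rather than anything from the original article), the stripping and placeholder steps might look like this in Python:

import csv

PLACEHOLDER = "0"  # illustrative stand-in for empty fields

def clean_value(raw):
    """Strip whitespace and surrounding quote marks from a single field."""
    value = raw.strip().strip("'\"")
    return value if value else PLACEHOLDER

def load_rows(path, delimiter=","):
    """Read a delimited text file and return rows of cleaned field values."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.reader(f, delimiter=delimiter):
            rows.append([clean_value(field) for field in record])
    return rows

# Usage (hypothetical file name): rows = load_rows("diamonds.csv")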

Occasionally, you can even run into unicode character issues. Different datasets from around the world may use different comma characters and won’t recognize the one you specify to be the delimiter. In the following untouched example dataset, you can see a few of these issues:

[Image: a raw, untouched example dataset showing some of these issues]

Outlier removal

Removing outliers is also very important with this example. Few datasets ever come without data that is irrelevant to the problem at hand.

In this case, we can see that the first row and the first column both contain data that we don't want to feed through our neural network. The first column is an ID column. Our algorithm has no interest in this data, as it tells the neural network nothing about the dataset's subject. This column should be removed. Likewise, the first row gives the field labels. This, too, should not be considered by the neural network. Once we have a dataset that consists only of meaningful data, the real preprocessing begins.
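A minimal sketch of those two removals, assuming the pandas library and a hypothetical file name:

import pandas as pd

# header=0 tells pandas that the first row holds field labels, so it is used
# for column names rather than being treated as data.
df = pd.read_csv("diamonds.csv", header=0)  # hypothetical file name

# Drop the first column, assumed here to be the ID column, since it tells the
# network nothing about the dataset's subject.
df = df.drop(columns=df.columns[0])

print(df.head())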

Transformation

The dataset above is also a good example of when transformation is needed, specifically alpha classification. It is obvious that we need to do something with columns that contain non-numeric values, such as the "Good" and "Premium" descriptions, as well as values like "VVS2," "SI2," "VS2," and so on.

Generally, transformation refers to any conversion between two formats of data, where both still represent the same thing. A really great example of this is how alphabetic values that represent a classification are converted to numeric binary vectors. The process of this conversion is simple. First, find all the different classes in an alphabetic column. So, from just what we can see in the example above, the "cut" column would list all of these:

  • “Good”
  • “Fair”
  • “Very Good”
  • “Ideal”
  • “Premium”

So, we have five classes. We now need to replace the single "cut" column with five columns, one named after each class. This transformation would look like the following, where the resulting data would be incorporated into the original dataset.

[Image: the "cut" column transformed into five binary columns]

Each row is a binary vector, where [0, 1, 0, 0, 0] represents the class "Ideal." Not only does this actually use less memory (although it may not look like it), but it is now fully interpretable by a machine learning algorithm. The meaning behind the data remains, as long as all rows use the binary values consistently.
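A minimal sketch of this class-to-binary-vector conversion in plain Python (the alphabetical class ordering here is my own choice, so the exact vector layout may differ from the screenshot above):

def one_hot_column(values):
    """Replace a column of class labels with binary vectors, one slot per class."""
    classes = sorted(set(values))    # e.g. ["Fair", "Good", "Ideal", "Premium", "Very Good"]
    index = {c: i for i, c in enumerate(classes)}
    vectors = []
    for value in values:
        vec = [0] * len(classes)
        vec[index[value]] = 1        # set the slot for this row's class "hot"
        vectors.append(vec)
    return classes, vectors

cut = ["Ideal", "Premium", "Good", "Premium", "Fair", "Very Good"]
classes, vectors = one_hot_column(cut)
print(classes)     # the five columns that replace the single "cut" column
print(vectors[0])  # binary vector for the first row: [0, 0, 1, 0, 0]

In practice, a library call like pandas.get_dummies would do the same job in one line, but the loop above shows the mechanics.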

Here is a real-life demonstration of my software, Perceptron, transforming an alpha-heavy dataset, starting with sample rows from the original dataset (all alphabetic).

Every field is a classification, represented by the first letter of the classification name.

For example, the first field has only two classes, "p" and "e," meaning poisonous or edible. It is worth mentioning that this dataset is for predicting whether mushrooms are edible or poisonous, based mainly on visual attributes.

[Image: Perceptron transforming the mushroom dataset]

Here is our resulting transformation. On the left, we can see all the classes that were found per field. Just above, we can see how each field has become multiple fields based on all the classes of that field. Every 0 and 1 (on a given row) will be one neuron on an input layer to a neural network (bar target values).

Let’s have a closer look at the first row, which went from

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u

to this…

[1, 0],[1, 0, 0, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0],[1, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1],[1, 0, 0, 0],[1, 0, 0],[1, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0]

The first classes of the fields are p, x, s, n, and so on, which happens to match up with our first-row example. So, if we look at the result, the first item of each vector is hot. The number of classes in each field also matches the length of that field's binary vector.
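As a small illustration of that mapping, here is how a row could be encoded field by field, given per-field class lists; the lists below cover only the first three fields and are toy examples, not the dataset's real, complete class lists:

def encode_row(row, classes_per_field):
    """Turn one raw row into a list of per-field binary vectors."""
    encoded = []
    for value, classes in zip(row, classes_per_field):
        encoded.append([1 if value == c else 0 for c in classes])
    return encoded

# Toy class lists for the first three fields only (illustrative, not complete).
classes_per_field = [["p", "e"], ["x", "b", "s", "f", "k", "c"], ["s", "y", "f", "g"]]
row = ["p", "x", "s"]
print(encode_row(row, classes_per_field))  # [[1, 0], [1, 0, 0, 0, 0, 0], [1, 0, 0, 0]]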

Normalization

Classification fields never really require normalization. However, numeric values that reflect a specific amount of something (instead of what could be a binary structure or classification identity) nearly always need to be normalized. Normalization is all about scale. If we look at the dataset from earlier

[Image: the diamond dataset from earlier]

fields such as depth, price, x, y, and z all have very different scales, because they are measured in entirely different units. To us, this makes sense. However, feeding these different scales into a neural network can cause a massive imbalance in the weight values and the learning process. Instead, all values need to be on the same scale, while still representing the varying quantities being described. Most commonly, we do this by scaling values to between 0 and 1, or at least close to that range. Most simply, we could use a divider on each field:

Price of Diamond    Normalised Price of Diamond
326                 0.326
334                 0.334
423                 0.423
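A minimal sketch of this divide-through approach in Python:

prices = [326, 334, 423]

# Divide every value in the field by a fixed constant that pulls the whole
# column into the 0-1 range.
normalised = [price / 1000 for price in prices]
print(normalised)  # [0.326, 0.334, 0.423]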

Because the prices are mostly three digits, we can divide them all by 1,000. For other fields, we would use whatever divider consistently gets the values between 0 and 1. Using the Iris dataset, I am going to demonstrate how a badly scaled dataset performs. Here are some sample rows:

[Image: sample rows from the Iris dataset]
You can see the first field has values as large as 50, the second field has values around 3, and finally some values are under 1. Training a neural network with this data produced the following results:

[Image: training results on the unnormalized Iris data]
Although there is a decrease in error, the performance is awful at 60%. Now, if we normalize the data so that the dataset looks like this:

[Image: the normalized Iris dataset]

We then get much better results:

[Image: training results on the normalized Iris data]
There are a few issues with this trivial method of normalization. If we look at the last numeric column, all the values are 0.02, 0.03, etc. Although these are now between 0 and 1, they are still not very well scaled, and still out of proportion to the other fields. To solve this, we can use a much better method of normalization that actually takes a field’s highest and lowest value into account and then calculates what all the other values should be based on this range. The equation to do this is:

x = R1 + ((i − min) × (R2 − R1)) / (max − min)

Where x is your normalized value, i is your unnormalized value, min and max are the lowest and highest values in the field, and R1 and R2 are your desired bounds for the normalized value (0 and 1). This results in every value being scaled consistently within each field.
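A minimal sketch of this min-max scaling in Python (the sample values are illustrative, not taken from the actual Iris file):

def min_max_normalise(values, r1=0.0, r2=1.0):
    """Scale a field's values into [r1, r2] using that field's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:                 # guard against a constant column
        return [r1 for _ in values]
    return [r1 + (v - lo) * (r2 - r1) / (hi - lo) for v in values]

petal_width = [0.02, 0.03, 0.2, 0.25, 0.18]   # illustrative values
print(min_max_normalise(petal_width))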


Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!

Caspar Wylie, ODSC

My name is Caspar Wylie, and I have been passionate about computer programming for as long as I can remember. I am currently 17, and I taught myself to write code with initial help from an employee at Google in Mountain View, California, who truly motivated me. I program every day and am always putting new ideas into perspective. I try to keep a good balance between jobs and personal projects in order to advance my research and understanding. My interest in computers started with very basic electronic engineering when I was only 6, before I moved on to software development at the age of about 8. Since then, I have experimented with many different areas of computing, from web security to computer vision.
