The Importance of PreProcessing Data the Right Way
Data Wranglingposted by Caspar Wylie, ODSC October 16, 2018 Caspar Wylie, ODSC
There are so many different aspects of training a neural network that affect its performance. Many data scientists spend too much time thinking about learning rates, neuron structures, and epochs before actually using correctly optimized data. Without properly formatting data, your neural network will be useless, regardless of the hours you may spend optimizing its hyperparameters. Preprocessing data tends to revolve around the following tasks:
- Data Cleaning
- Outlier Removal
I am going to demonstrate what these actually involve and how they are applied.
Cleaning data and removing outliers are tedious and compulsory tasks. Between datasets that are from the internet, have conflicting file formats, or are created by non-technical people, you will nearly always find issues with syntax, row delimiters, etc.
For example, sometimes a dataset might have quote marks around numeric values (‘4’,’5’,’6’), which would be interpreted as a string. We need a way to strip irrelevant surrounding characters for each value. Another example might be row delimiters. Some datasets separate their rows by line breaks (‘/n’), some by spaces, some by unusual characters like a pipe (|). Some don’t separate them at all, and assume you know the field count and can just split rows by counting commas. That can be dangerous, because aligning it incorrectly could shuffle your data and render it completely useless (repeated fields in same row, etc.). You can also get empty values, which in a text file might look like two commas side by side.
To prevent errors later on, you may need a place-holding value for this. It is very normal that datasets don’t do that kind of thing automatically — to them, it’s just a blank value. But to you, it’s a potential breakpoint.
Occasionally, you can even run into unicode character issues. Different datasets from around the world may use different comma characters and won’t recognize the one you specify to be the delimiter. In the following untouched example dataset, you can see a few of these issues:
Outlier removing is also very important with this example. Few datasets ever come without data that is irrelevant to the problem at hand.
In this case, we can see the first row and the first column both have data that we don’t want to feed through our neural network. The first column is an ID column. Our algorithm has no interest in this data, as it does not inform the neural network on the dataset’s subject. This column should be removed. Likewise, the first row gives the field labels. This, too, should not but considered by the neural network. Once we have a dataset that consists of only meaningful data, the real preprocessing begins.
The dataset above is also a good example of when transformation is needed, specifically alpha classification. It is obvious that we need to do something with columns that contain values that are non-numeric, such as the “Good” and “Premium” descriptions, as well as values like “VVS2,” “SI2,” “VS2,” and so on.
Generally, transformation refers to any conversion between two formats of data, even though both will still attempt to represent the same thing. A really great example of this is how alphabetic values in datasets which represent a classification are converted to numeric binary vectors. The process of this conversion is simple. Firstly, find all the different classes in an alphabetic column. So from just what we can see in the example above, the “cut” column would list all of these:
- “Very Good”
So, we have five classes. We now need to replace the single “cut” column, with five columns, named by each class. This transformation would look like the following, where the resulting data would be incorporated into the original dataset.
Each row is a binary vector, where [0,1,0,0,0] represents the class ideal. Not only does this actually use less memory (although it may not look like it), but it is now fully interpretable by a machine learning algorithm. The meaning behind the data remains, for as long as all rows use the binary values consistently.
Here is a real-life demonstration, showing my software Perceptron transforming an alpha heavy dataset, with sample rows from the original dataset (all alphabetic).
Every field is a classification, clearly shown by the first letter of the classification name.
For example, the first field has only two classes: “p” and “e,” meaning poisonous or edible. It is worth mentioning that this dataset is for predicting whether mushrooms are edible or not based on mainly visual attributes.
Here is our resulting transformation. On the left, we can see all the classes that were found per field. Just above, we can see how each field has become multiple fields based on all the classes of that field. Every 0 and 1 (on a given row) will be one neuron on an input layer to a neural network (bar target values).
Let’s have a closer look at the first row, which went from
[1, 0],[1, 0, 0, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0],[1, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],,[1, 0, 0, 0],[1, 0, 0],[1, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0]
The first class of the fields are p, x ,s, n, etc. Which happens to match up with our first-row example. Then, if we look at the result, the first item of each vector is hot. The amount of classes for each field will also match that field’s sample binary vector length.
Classification fields never really require normalization. However, numeric values that reflect a specific amount of something (instead of what could be a binary structure or classification identity) nearly always need to be normalized. Normalization is all about scale. If we look at the dataset from earlier
fields such as depth, price, x, y, and z all have very different scales, because they are measured in entirely different units. To us, this makes sense. However, having these different scales inputted through a neural network can cause a massive imbalance to weight values and the learning process. Instead, all values need to be of the same scale, while still representing the varying quantities being described. Most commonly, we do this by making values between 0 and 1, or at least close to this range. Most simply, we could use a divider on each field:
|Price of Diamond||Normalised Price of Diamond|
Because the prices are mainly three digits, we can divide them all by 1,000. For other fields, we would use any divider that consistently gets the values between 0 and 1. Using an Iris dataset, I am going to demonstrate how a badly scaled dataset performs. Here are some sample rows
Where x is your normalized value, i is your unnormalized value, min and max are the highest and lowest values in the field, and R1 to R2 are your desired bounds for the normalized value (0 and 1). This will result in every value being perfectly scaled by each field.
Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!