You cannot know how well a model performs until you evaluate it. A data scientist's goal is to build a robust model, and robustness is judged by how the model scores on validation and test metrics. A model that performs well on validation and test data tends to perform well at inference time in production.
Many techniques can improve model performance, and a sound feature engineering strategy is one of them. In this article, we will discuss and implement feature extraction using autoencoders.
What are AutoEncoders?
An autoencoder is an unsupervised neural network that learns an efficient coding of unlabelled input data. It is trained to reconstruct its input by minimizing the reconstruction loss.
A standard autoencoder architecture has an encoder and a decoder:
- Encoder: maps the input space to a lower-dimensional space
- Decoder: reconstructs the output space from the lower-dimensional space
(Source), Autoencoder architecture
An autoencoder encodes the input data (X) into another representation (Z), and then reconstructs the output data (X’) using a decoder network. The encoded embedding is preferably lower in dimension than the input layer, so it has to retain the information needed for reconstruction in a compact form.
The idea is to learn the encoder weights by training an autoencoder on the training sample. Training minimizes the reconstruction error and yields the weights of both the encoder and the decoder networks.
Afterwards, the decoder network can be dropped and the encoder network alone used to generate embeddings. These embeddings serve as extracted features for supervised tasks.
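The reconstruction error can be written out explicitly. The notation here is an assumption for illustration: f is the encoder, g is the decoder, n is the number of training samples, and d is the number of input features. X, Z, and X’ are the quantities named above. With the mean squared error loss used later in this article, the objective is roughly:

```latex
z_i = f(x_i), \qquad x'_i = g(z_i), \qquad
\mathcal{L} = \frac{1}{n\,d} \sum_{i=1}^{n} \lVert x_i - x'_i \rVert_2^2
```

The factor 1/d reflects that the loss is averaged over features as well as samples. It does not change which weights minimize the loss.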
Now, let’s dive deep into the step-by-step implementation of the above-discussed idea.
Step 1 — Data:
I am using a sample dataset generated with the make_classification function, with 10k instances and 500 features. Split the dataset into train, validation, and test samples, then normalize them. To avoid data leakage, the scaler is fitted on the training sample only and then applied to the validation and test samples.
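from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# generate a synthetic dataset: 10k instances, 500 features
X, y = make_classification(n_samples=10000, n_features=500, n_informative=100, n_redundant=400, random_state=42)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# split the remaining train set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
# scale data: fit on the train set only, then apply to every split
t = MinMaxScaler().fit(X_train)
X_train = t.transform(X_train)
X_val = t.transform(X_val)
X_test = t.transform(X_test)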
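One consequence of fitting the scaler on the training sample only is that validation and test values can fall slightly outside the [0, 1] range. That is expected and does not indicate a bug. The toy arrays below are made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data: the "test" values lie outside the train range.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[0.0], [4.0]])

# Fit on train only, so the min and max come from the training sample.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)  # [[0.0], [0.5], [1.0]]
test_scaled = scaler.transform(test)    # [[-0.5], [1.5]]

print(train_scaled.ravel(), test_scaled.ravel())
```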
Step 2 — Define AutoEncoder:
I have defined an autoencoder with a two-layer encoder and a two-layer decoder. Each encoder and decoder layer is followed by batch normalization and a LeakyReLU activation.
The dimension of the encoded (bottleneck) layer needs to be chosen experimentally.
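# imports assume the Keras API bundled with TensorFlow
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras.models import Model

# number of input features
n_inputs = X_train.shape[1]
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck (bottleneck_features must be set beforehand to the chosen embedding size)
n_bottleneck = int(bottleneck_features)
bottleneck = Dense(n_bottleneck)(e)
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)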
Step 3 — AutoEncoder Training:
Compile and train the autoencoder using Keras with the Adam optimizer and mean squared error loss, so that it learns to reconstruct the input data.
I will train it for 50 epochs with a batch size of 64.
# define autoencoder model
model = Model(inputs=visible, outputs=output)
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
# fit the autoencoder model to reconstruct input
# (the validation target is X_val itself, not y_val, because the model reconstructs its input)
history = model.fit(X_train, X_train, epochs=50, batch_size=64, verbose=2, validation_data=(X_val, X_val))
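Step 2 notes that the bottleneck dimension has to be chosen experimentally. One possible way to do this, sketched below, is to train the same architecture for several candidate sizes and compare the final validation reconstruction loss. The helper names build_autoencoder and sweep_bottleneck, the candidate sizes, and the short epoch count are assumptions for illustration, not part of the original pipeline. The imports assume the Keras API bundled with TensorFlow.

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras.models import Model


def build_autoencoder(n_inputs, n_bottleneck):
    """Build a compiled autoencoder with the layout described in Step 2 (hypothetical helper)."""
    visible = Input(shape=(n_inputs,))
    # encoder: two Dense -> BatchNormalization -> LeakyReLU blocks
    e = visible
    for units in (n_inputs * 2, n_inputs):
        e = Dense(units)(e)
        e = BatchNormalization()(e)
        e = LeakyReLU()(e)
    bottleneck = Dense(n_bottleneck)(e)
    # decoder: mirrors the encoder
    d = bottleneck
    for units in (n_inputs, n_inputs * 2):
        d = Dense(units)(d)
        d = BatchNormalization()(d)
        d = LeakyReLU()(d)
    output = Dense(n_inputs, activation="linear")(d)
    autoencoder = Model(inputs=visible, outputs=output)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder


def sweep_bottleneck(X_train, X_val, candidates, epochs=5, batch_size=64):
    """Return {bottleneck size: final validation MSE} for each candidate size."""
    results = {}
    for k in candidates:
        autoencoder = build_autoencoder(X_train.shape[1], k)
        history = autoencoder.fit(
            X_train, X_train,
            epochs=epochs, batch_size=batch_size, verbose=0,
            validation_data=(X_val, X_val),
        )
        results[k] = float(history.history["val_loss"][-1])
    return results
```

A call such as sweep_bottleneck(X_train, X_val, candidates=[50, 100, 250]) would return one validation loss per size. Reconstruction loss usually falls as the bottleneck grows, so it is also worth checking the downstream classifier's validation score before settling on a size.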
Step 4 — Plot the reconstruction error:
Visualize how the train and validation reconstruction loss change over the epochs.
(Image by Author), Train and Validation MSE with Epochs
from matplotlib import pyplot

# plot train and validation reconstruction loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.xlabel("Epochs")
pyplot.ylabel("MSE")
pyplot.show()
Step 5 — Define Encoder and Save the Weights:
Once the autoencoder's weights are optimized, we can drop the decoder network and use only the encoder network to compute embeddings of the input data. These embeddings can then be used as features for supervised machine learning tasks.
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
# save the encoder to file
encoder.save('encoder.h5')
Step 6 — Train a Supervised task:
The embeddings from the encoder network can be used as features for classification or regression tasks.
from tensorflow.keras.models import load_model

# load the model from file
encoder = load_model('encoder.h5')
# encode the train data
X_train_encode = encoder.predict(X_train)
# encode the test data
X_test_encode = encoder.predict(X_test)
Now, let’s compare model performance when the raw input features are replaced by the encoded features. We will train a Logistic Regression estimator with default hyperparameters in both cases.
(Image by Author), Benchmarking performances metrics
The first column reports the metrics for the raw input data with 500 features. With the encoded features produced by the autoencoder's encoder network, we observe an improvement in the performance metrics.
Feature extraction using autoencoders captures an efficient coding of the input data and projects it into a space of the same or lower dimension. For input data with many features, it is worth projecting the data into a lower dimension and using the result as features for supervised learning tasks.
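The comparison can be sketched with a small helper that fits a default Logistic Regression on a given feature matrix and reports test metrics. The helper name evaluate_features and the choice of metrics (accuracy, precision, recall, F1) are assumptions, since the article does not list the exact metrics used in the benchmark.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate_features(X_train, y_train, X_test, y_test):
    """Fit a default LogisticRegression and return test-set metrics (hypothetical helper)."""
    clf = LogisticRegression()  # default hyperparameters, as stated in the text
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # binary-classification metrics; the metric selection is an assumption
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```

Calling evaluate_features(X_train, y_train, X_test, y_test) and then evaluate_features(X_train_encode, y_train, X_test_encode, y_test) gives the two sets of metrics to compare side by side.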
Article originally posted here by Satyam Kumar. Reposted with permission.