How would you represent a categorical feature with hundreds of levels in a model? A first approach may be to create a one-hot encoded matrix with one column for each level of the feature. The result is a large, sparse matrix in which the majority of the values are zero. This is a reasonable method. However, as the number of levels grows, the size of the matrix grows quickly, and eventually either R runs out of memory for the data frame or the algorithm receiving the one-hot encoded matrix slows to a painful crawl.
In this post, we will discuss using an embedding matrix as an alternative to one-hot encoding categorical features for modeling. References to embedding matrices usually appear in natural language processing (NLP) applications, but they may also be used on tabular data. An embedding matrix replaces the sparse one-hot encoded matrix with an array of vectors, where each vector represents one level of the feature. Using an embedding matrix can greatly reduce the memory needed to handle the categorical features.
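To get a feel for the size difference, here is a small base-R sketch. The row count, level count and embedding width are arbitrary assumptions chosen for illustration, and the embedding values are random placeholders rather than trained weights.

```r
# Hypothetical sizes, chosen only for illustration
n_rows   <- 2000
n_levels <- 300
emb_dim  <- 50

set.seed(1)
f <- factor(sample(seq_len(n_levels), n_rows, replace = TRUE),
            levels = seq_len(n_levels))

# One-hot encoding: one column per level, mostly zeros
one_hot <- model.matrix(~ f - 1)

# Embedding representation: an integer index per row plus
# a small lookup table with one row per level
idx <- as.integer(f)
emb <- matrix(rnorm(n_levels * emb_dim), nrow = n_levels, ncol = emb_dim)

dim(one_hot)        # n_rows x n_levels
dim(emb)            # n_levels x emb_dim
mean(one_hot == 0)  # share of zero cells in the one-hot matrix

# Bytes used by each representation
size_one_hot   <- as.numeric(object.size(one_hot))
size_embedding <- as.numeric(object.size(idx)) + as.numeric(object.size(emb))
c(one_hot = size_one_hot, embedding = size_embedding)
```

The gap widens as rows are added, because the one-hot matrix grows by one full row of columns per observation while the embedding representation grows by a single integer.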
If we were applying this to an NLP problem, we would simplify our corpus of text by first identifying all the primary words in the text (we assume stop words have been removed and other cleanup work is done). Then, for each word we identify from the corpus, we create a vector of weights that represents the word. For example, suppose we are analyzing The Lord of the Rings book series and want to understand the relationship between different parts of the world. We could represent The Shire with one vector of weights and Gondor with another. We can then use a neural net to train the embedding and discover words that are related. The key point is that the vectors representing The Shire and Gondor need only contain a handful of dimensions (a rule of thumb attributed to Jeremy Howard puts the number of dimensions at the minimum of n-levels/2 and 50). If the text were represented by a one-hot encoded matrix, the resulting matrix would be thousands of columns wide and contain mostly zeros (i.e., a sparse matrix).
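The sizing rule mentioned above can be written as a one-line helper. This is only a sketch: the function name embedding_dim is hypothetical, and rounding down (with a floor of one dimension) is an assumption, since the rule as stated does not say how to handle odd level counts.

```r
# Rule of thumb from the text: min(n_levels / 2, 50).
# Rounding down and using at least 1 dimension are assumptions made here.
embedding_dim <- function(n_levels) {
  max(1L, min(50L, as.integer(floor(n_levels / 2))))
}

embedding_dim(7)     # a feature with 7 levels -> 3
embedding_dim(5000)  # a large vocabulary      -> 50
```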
We can take this same idea and apply it to tabular data. For example, maybe we are trying to understand how the days of the week relate to one another when estimating daily sales totals. We can build a matrix in which each day of the week is represented by a vector of weights (in this case, a 7×3 matrix: seven days, each with a vector of three weights). We can then identify vectors that are similar to each other and potentially draw an inference as to which days of the week have similar impacts on daily sales.
To understand why this works, we must consider how a one-hot encoded matrix and an embedding matrix result in the same output. With a neural net, to find the impact of a categorical feature, the one-hot encoded representation of the feature is multiplied by a matrix of weights. Since each one-hot encoded row consists of 0’s and a single 1, the product simply picks out the row of weights at the position of the 1. That selected row is exactly what the embedding matrix stores for that level. With an embedding matrix we can skip the matrix multiplication and look up the weights directly. We train the weights in the embedding and use these weights as the vector of features in the model.
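The equivalence can be checked directly in base R. The sketch below uses random stand-in weights and a handful of made-up level indices; it only demonstrates that multiplying a one-hot matrix by a weight matrix gives the same result as looking up rows of that weight matrix.

```r
set.seed(42)
n_levels <- 7
emb_dim  <- 3

# Weight matrix: one row of weights per level (random stand-in values)
W <- matrix(rnorm(n_levels * emb_dim), nrow = n_levels)

# A few observations, stored as integer level indices
idx <- c(1L, 6L, 7L, 3L)

# One-hot encode the indices: each row has a single 1
one_hot <- diag(n_levels)[idx, , drop = FALSE]

via_matmul <- one_hot %*% W            # dense matrix multiplication
via_lookup <- W[idx, , drop = FALSE]   # plain row lookup

all.equal(via_matmul, via_lookup)      # TRUE
```

The lookup touches only the selected rows, while the multiplication also does arithmetic on every zero in the one-hot matrix.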
Building an Embedding Matrix in R with TensorFlow
We will use an embedding to determine if there is a relationship between the days of the week and sales. To do this in R, we will install a Python environment and drive it from RStudio (there are other packages in R that can also generate these embeddings). We first must ensure Python is installed on our machine; one way to do this is to install Anaconda Navigator. We can then launch R and install the keras package. Running install_keras() then creates a Python environment named r-reticulate, which can be viewed in Anaconda Navigator under the Environments tab.
#install keras and tensorflow packages (one-time setup)
install.packages('keras')
library(keras)
install_keras()
install.packages('tensorflow')
library(tensorflow)

#launch python environment using reticulate
reticulate::use_condaenv("r-reticulate")

#load the remaining packages
# install.packages('pacman')
pacman::p_load('ggplot2','lubridate','readr','dplyr','data.table')
We now create a generic data frame of sales data and add seasonality to the weekends. This is done to manufacture a relationship between the days of the week that we can hopefully identify through the embedding. With many machine learning models, it may be useful to scale the features. This reduces the impact that large differences in scale between features can have on the outcome of a model. Here we standardize the sales values, which serve as the target.
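As a quick illustration of what scaling does, the sketch below standardizes a few made-up sales values by hand and compares the result with base R's scale(). The numbers are arbitrary and are not taken from the data set built in this post.

```r
x <- c(120, 450, 980, 300, 2700)   # made-up sales values

# Standardize by hand: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() does the same and returns a one-column matrix
scaled <- as.numeric(scale(x))

all.equal(manual, scaled)          # TRUE
c(mean = mean(scaled), sd = sd(scaled))
```

After scaling, the values have a mean of approximately zero and a standard deviation of one, regardless of the original units.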
#create generic data frame
dates <- seq.Date(as.Date('2019-01-01'), as.Date('2019-06-01'), by = 'day')
df <- data.frame(date = dates,
                 sales = sample(100:1000, length(dates), replace = TRUE))

#day of week as an integer: 1 = Sunday, ..., 7 = Saturday
df$weekday <- lubridate::wday(df$date)

#add seasonality to sales (Friday, Saturday and Sunday are boosted)
df <- df %>% mutate(sales = ifelse(weekday %in% c(1,6,7), sales*3, sales*.75))

#scale sales values (as.numeric drops the matrix attributes scale() adds)
df$scaled_sales <- as.numeric(scale(df$sales))
We can now build our neural net and set up the embedding layer. We use common TensorFlow (Keras) syntax, with the primary function being layer_embedding, which tells TensorFlow to generate the embedding matrix. We use our rule for determining the embedding size (7 levels / 2, rounded down to 3), and in this case some arbitrary parameters are used in the dense layers.
#generate embeddings
embedding_size <- 3

model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = 7+1, #weekday runs 1-7; index 0 is unused
                  output_dim = embedding_size,
                  input_length = 1,
                  name = "embedding") %>%
  layer_flatten() %>%
  layer_dense(units = 40, activation = "relu") %>%
  layer_dense(units = 10, activation = "relu") %>%
  layer_dense(units = 1)

#regression target, so track mean absolute error rather than accuracy
model %>% compile(loss = "mse", optimizer = "sgd", metrics = "mae")

hist <- model %>% fit(x = as.matrix(df$weekday),
                      y = as.matrix(df$scaled_sales),
                      epochs = 50,
                      batch_size = 2)

#extract the trained weights: one row per index (0-7), one column per dimension
layer <- get_layer(model, "embedding")
embeddings <- data.frame(layer$get_weights()[[1]])
embeddings$name <- c("none", levels(lubridate::wday(df$date, label = TRUE)))
We can now plot the first two dimensions of the embedding matrix to get a sense of how the days of the week are related in our data set. If the embedding worked, we should see Friday, Saturday and Sunday cluster together.
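A plot of this kind could be drawn along the following lines. This is a sketch using base graphics rather than any particular plotting package, and the embeddings data frame here is a random stand-in with the same shape and column names (X1, X2, X3, name) as the trained one, so no clustering should be expected from it. With a trained model, the stand-in would be replaced by the real embeddings object.

```r
# Stand-in for the trained `embeddings` data frame (random values,
# NOT real results); replace with the object built from the model
set.seed(7)
embeddings <- data.frame(matrix(rnorm(8 * 3), nrow = 8))
embeddings$name <- c("none", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Drop the unused index-0 row before plotting
days <- embeddings[embeddings$name != "none", ]

# Scatter the first two embedding dimensions and label each day
plot(days$X1, days$X2, type = "n",
     xlab = "Embedding dimension 1", ylab = "Embedding dimension 2",
     main = "Day-of-week embedding")
text(days$X1, days$X2, labels = days$name)

# Pairwise Euclidean distances across all three dimensions; small
# values suggest days that the model treats as similar
day_dist <- as.matrix(dist(days[, c("X1", "X2", "X3")]))
dimnames(day_dist) <- list(days$name, days$name)
round(day_dist, 2)
```

The distance table complements the plot, since two days can look close in the first two dimensions while differing in the third.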
We do see Friday, Saturday and Sunday grouping together on the bottom right of the plot. We can use this output to infer relationships among the levels of a categorical feature, in this case the days of the week. We can also use the embedding matrix as independent features in another model. This essentially means that once the embedding matrix is developed, it can be carried over to other models and used across projects.
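One way to reuse an embedding in another model is to look up each row's vector and attach it as ordinary numeric columns. The sketch below assumes a random stand-in embedding table and toy sales data (neither is the trained output from this post), and the column names emb1 to emb3 are hypothetical.

```r
set.seed(123)

# Stand-in embedding table (random values, not trained weights):
# row k + 1 holds the vector for weekday index k; row 1 is the unused index 0
emb <- matrix(rnorm(8 * 3), nrow = 8,
              dimnames = list(NULL, c("emb1", "emb2", "emb3")))

# Toy data: weekday index (1 = Sunday ... 7 = Saturday) and a sales value
toy <- data.frame(weekday = rep(1:7, times = 10))
toy$sales <- ifelse(toy$weekday %in% c(1, 6, 7), 3, 0.75) *
  runif(nrow(toy), 100, 1000)

# Look up each row's embedding vector and attach it as numeric features
features <- cbind(toy, emb[toy$weekday + 1, ])

# Use the embedding columns as predictors in a different model
fit <- lm(sales ~ emb1 + emb2 + emb3, data = features)
coef(fit)
```

A linear model is used here only because it is available in base R; the same feature columns could be passed to any other algorithm that accepts numeric inputs.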
In this article, we developed an understanding of how embeddings, which have traditionally been used in NLP, can be applied to tabular data. They can be useful for handling categorical data with many levels and for understanding how the levels within a categorical feature behave in relation to some target variable.
Jacey Heuer is a data scientist currently working in the retail and e-commerce industry. He holds master’s degrees from Iowa State University in business and data science. He has analytics experience across many industries including energy, financial services and real estate. Jacey also is a data science author publishing educational content for the PluralSight.com platform. He enjoys reading, writing and learning about all data science topics. His particular interests are in probability theory, machine learning, classical statistics and the practical application of it in business.