Notes on Representation Learning

TL;DR: Representation learning can eliminate the need for large labeled data sets to train deep neural networks, opening up new domains to machine learning and transforming the practice of Data Science.

Check out “Notes on Representation Learning” in these three parts.

  1. Notes on Representation Learning
  2. Notes on Representation Learning Continued
  3. Representation Learning Bonus Material

Deep Learning and Labeled Datasets

The greatest strength of Deep Learning (DL) is also one of its biggest weaknesses. DL models frequently have many millions of parameters. The extreme number of parameters—compared to other sorts of machine learning models—gives DL models tremendous flexibility to learn arbitrarily complex functions that simpler models cannot learn. But this flexibility makes it very easy to “overfit” on a training set (essentially, memorize specific examples instead of learning underlying patterns that allow generalization to examples not in the training set).

The conceptually simplest way to prevent overfitting is to train on very large datasets.  If the dataset is big in relation to the number of parameters, then the network will not have enough capacity to memorize examples and will be “forced” to instead learn underlying patterns when optimizing a loss function.  But creating large, labeled datasets for every task we want to perform is cost prohibitive (and may even be impossible if the goal is general purpose intelligent agents).

This need for large training sets is often the biggest obstacle to applying DL to real-world problems. On small datasets, other types of models can outperform DL to the extent that the constraints of those models match the task at hand. For instance, if there is a simple linear relationship in the data, a linear regression can greatly outperform a DL model trained on a small dataset because the linear constraint of the model corresponds to the data.

Figure 1. Neural nets have a tendency to overfit when datasets are too small. Here the true relationship between the height and weight of an animal and whether it is a dog or a cat is essentially linear. A linear classifier assumes this relationship and uses the data merely to determine the slope and intercept. A large neural network will require much more data to learn a straight-line partition. With a dataset that is small relative to the size of the neural network, it will overfit on unusual examples, reducing predictive performance. (Source:


That correspondence allows the model to learn from a small dataset much more efficiently than a DL model, because a DL model needs to learn the linear relationship whereas a linear regression simply assumes it. Simple linear classifiers are sufficient for a simple problem like the one illustrated above, but more complex problems require models capable of capturing complex relationships within the data. Much of the work in applying machine learning involves choosing models with constraints and power that match the dataset. While DL has dramatically outperformed all other models on many tasks, to a large extent it has only done so for complex problems where big labeled datasets are available for training.
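To make the small-data argument concrete, here is a minimal sketch in plain Python. It is a toy, not a neural network: a polynomial that passes through every training point stands in for a model with enough capacity to memorize its data, and the true relationship (`true_fn`), the noise level, and the dataset size are all assumptions chosen for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (it memorizes the data)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def true_fn(x):
    # Assumed ground truth: a simple linear relationship.
    return 2.0 * x + 1.0

def mse(model, xs):
    """Mean squared error against the noise-free relationship."""
    return sum((model(x) - true_fn(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
train_x = [i / 9 for i in range(10)]                      # only 10 training points
train_y = [true_fn(x) + rng.gauss(0.0, 0.3) for x in train_x]
test_x = [i / 199 for i in range(200)]                    # dense grid of unseen inputs

linear_error = mse(fit_line(train_x, train_y), test_x)
flexible_error = mse(fit_interpolant(train_x, train_y), test_x)
print(f"linear model test MSE:   {linear_error:.4f}")
print(f"flexible model test MSE: {flexible_error:.4f}")
```

The flexible model reproduces every training point exactly, noise included, so its error on unseen inputs should come out larger than that of the line, which only has a slope and an intercept to estimate.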


Representation Learning

This blog post describes how the need for large, labeled datasets to train DL models is coming to an end. Over the last year there have been many research results demonstrating how DL models can learn much more efficiently than other models, outperforming alternatives even with very small labeled training sets. Indeed, in some remarkable cases, described below, DL models can learn to perform complex tasks with only a single labeled example (“one-shot learning”) or even without any labeled data at all (“zero-shot learning”). Over the next few years, these research results will be rolled out to production systems, and further innovations will continue to improve data efficiency even more.


The key to this progress is what DL researchers call “representation learning”, a topic considered so important that prominent researchers named the premier DL conference the International Conference on Learning Representations. Part of the enthusiasm for learning representations is that rather than training DL models on labeled data specific to a target task, you can train them on labeled data for a different problem, or more importantly, on unlabeled data. In the process of training on unlabeled data, the model builds up a reusable internal representation of the data.

For instance, in an image classification example (further described below), a network first learns to generate bedroom scenes. To do this convincingly it must develop an internal representation of the world: its 3-dimensional structure, visual perspective, interior design, typical bedroom furniture, etc. In other words, using unsupervised learning (on unlabeled data) the model builds an understanding of how the world of bedrooms actually works in order to produce pictures of bedrooms. Once a network has an internal representation like this, it can learn to recognize objects in images much more easily. Learning to recognize a “bed” could become almost as simple as learning to associate the word “bed” with an object that the network already knows a lot about: its 3-dimensional shape, colors, location in rooms, typical surrounding furniture, etc. As a result, instead of needing hundreds or thousands of labeled examples of objects, the model could learn from just a handful of examples.

Figure 2.  Bedroom scene generated by a DL model.  No information about bedrooms, bedroom furniture, lighting, visual perspective, etc. was programmed into the network but it learns enough about those things to produce realistic looking images and plausible bedroom arrangements purely by training on bedroom images.  (Source:

Breakthroughs in representation learning herald a sea change in machine learning that will help unlock the insights of the big-data era. Today data scientists work by carefully choosing machine learning models with constraints that match their problem domain and then painstakingly tuning those models to squeeze out every last drop of learning available from small labeled datasets. Over the coming years that workflow will move to selecting DL models pre-trained on enormous unlabeled datasets to build up internal representations, and then training on just a handful of labeled examples to solve the task at hand. Instead of just choosing the right model, machine learning practitioners will choose a model and a prepackaged representation already trained on related data. This workflow is already common in image recognition, where deep learning has been dominant for some time, and in certain NLP domains, like parsing, and is spreading to other domains.
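The two-step workflow (build a representation from unlabeled data, then attach labels using a handful of examples) can be sketched with a deliberately tiny stand-in. Here k-means clustering plays the role of the pre-trained model, which is an assumption made purely for illustration; the data, the class names, and the helper functions are all hypothetical.

```python
import random

def distance_sq(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: distance_sq(point, centroids[k]))

def kmeans(points, init_centroids, steps=10):
    """Plain k-means: the 'pre-training' step, which never sees a label."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(steps):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for k, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids

rng = random.Random(0)
# Unlabeled data: two blobs, with no indication of which blob is which.
unlabeled = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
unlabeled += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(100)]

# Step 1: learn structure from unlabeled data only.
centroids = kmeans(unlabeled, init_centroids=[unlabeled[0], unlabeled[-1]])

# Step 2: a single labeled example per class attaches names to that structure.
labeled = [((0.2, -0.1), "cat"), ((4.8, 5.3), "dog")]
cluster_name = {nearest(x, centroids): name for x, name in labeled}

def classify(point):
    return cluster_name[nearest(point, centroids)]

print(classify((0.5, 0.4)), classify((5.5, 4.6)))
```

The labeled examples do almost no work here: the structure was already found without them, and the labels only name it. That is the sense in which a good representation lets a model learn from very few labeled examples.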


As we continue to transition to this new paradigm, the number of problems we can solve with machine learning will explode.  Right now, we are bumping up against the limits of what simple, highly constrained models with no learned internal representation of the world can accomplish. We can’t squeeze more blood from that stone; big jumps in capability will instead come from models that have some understanding of the world and that can thus interpret data within a larger, more meaningful context. The way forward is not magic new machine learning models that can squeeze more accuracy out of labeled datasets without any understanding of the world those datasets come from, but rather pre-trained models that bring an understanding of the world in which they are operating.


Examples of Recent Representation Learning Progress

Here I describe some of the remarkable advances being made in representation learning and how they are increasing the data efficiency of DL models.  In all of these examples, DL models were able to learn with much less labeled data than simpler alternatives require.  Though this is a small sample, I’ve tried to select examples across different problem domains (natural language processing, image classification, and intelligent agents) and learning types (supervised, semi-supervised, unsupervised, and reinforcement) to illustrate the variety of approaches which are seeing impressive success.


Transfer Learning with Progressive Neural Networks

Progressive Neural Networks (“PNNs”) are DL models specially modified to be able to (1) learn multiple tasks from different datasets in sequence without forgetting tasks learned earlier in the sequence and (2) reuse the representations learned from earlier tasks to accelerate the learning of subsequent tasks. Reusing representations like this from one task to another in order to accelerate subsequent learning is called “transfer learning”, a recurring theme in the examples below.  Progressive Neural Networks employ transfer learning to improve data efficiency when learning new tasks—i.e. new tasks are learned with much less labeled data.


To understand the value of transfer learning, you could imagine, for example, a convolutional network learning low level features like edges that are aggregated up to parts of faces like ears and mouths, and ultimately to whole faces.  Later, you may retrain that network on a different visual recognition task, perhaps recognition of cars. In that case, the network may be able to mostly reuse the low-level edge features while overwriting the higher-level features to aggregate edges into car parts instead of into face parts.

Figure 3. (Source:

Retraining a normal convolutional network like this, and, in the process, overwriting some previously learned features, is called fine-tuning. While transfer learning by fine-tuning has been applied very successfully, it has some important drawbacks. Most importantly, we may want to transfer knowledge from multiple tasks to a new task, but during fine-tuning the ability to do the first task can be catastrophically forgotten when learning the second one. Imagine that after training a model on faces and then cars, as in the example above, you subsequently want to train the model to recognize people while in their cars (perhaps for a traffic enforcement application). Many of the features that aggregate edges into human facial features like eyes, ears, and mouths could have been reused, but unfortunately they were destroyed when learning car features. This means that when learning a third task the network may not be able to draw on useful features learned in the first task. The goal behind PNNs is to have a network that can continue learning from diverse datasets, continually expanding its knowledge as it goes.


If you already understand simple Feedforward Neural Networks, PNNs are not hard to conceptualize. Each task the network learns is allocated a “column” of the network, which is a full multi-layered feedforward network. After learning a task, the associated column is frozen so that it cannot be affected by training on future tasks, and a new column is added for the next task. Each layer in the new column gets input not only from lower layers within the new column, but also from lower layers within the frozen columns previously trained on other tasks. This allows the network to take advantage of features it has learned for other tasks, and repurpose them for new tasks without losing knowledge about the previous tasks.
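The column idea can be sketched in plain Python. This is a minimal two-column toy under assumptions of my own (layer sizes, `tanh` activations, random weights, and a placeholder weight update instead of real training); it is not the architecture from the PNN paper, only an illustration of freezing one column and feeding its hidden layer sideways into the next.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

class Column:
    """One feedforward column: input -> hidden -> output."""

    def __init__(self, n_in, n_hidden, n_out, rng, n_lateral=0):
        self.w_hidden = random_matrix(n_hidden, n_in, rng)
        self.w_out = random_matrix(n_out, n_hidden, rng)
        # Lateral weights read the hidden layer of an earlier, frozen column.
        self.u_out = random_matrix(n_out, n_lateral, rng) if n_lateral else None

    def forward(self, x, lateral_hidden=None):
        hidden = [math.tanh(v) for v in matvec(self.w_hidden, x)]
        out = matvec(self.w_out, hidden)
        if self.u_out is not None:
            out = add(out, matvec(self.u_out, lateral_hidden))
        return hidden, out

rng = random.Random(0)
column1 = Column(n_in=3, n_hidden=4, n_out=2, rng=rng)               # task 1, frozen
column2 = Column(n_in=3, n_hidden=4, n_out=2, rng=rng, n_lateral=4)  # task 2

x = [0.5, -1.0, 2.0]
hidden1, out1_before = column1.forward(x)
_, out2_before = column2.forward(x, lateral_hidden=hidden1)

# Placeholder for training on task 2: shrink each of column 2's weights by 10%.
# This is not a real gradient step; the point is that column 1 is never touched.
for matrix in (column2.w_hidden, column2.w_out, column2.u_out):
    for row in matrix:
        for j in range(len(row)):
            row[j] *= 0.9

hidden1, out1_after = column1.forward(x)
_, out2_after = column2.forward(x, lateral_hidden=hidden1)

print("column 1 unchanged:", out1_before == out1_after)
print("column 2 changed:  ", out2_before != out2_after)
```

Because updates only ever touch the newest column, the first task's behavior cannot be overwritten, while the lateral weights (`u_out`) let the second column reuse whatever the first column's hidden layer already computes.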


Figure 4.  (Source:


The result is a network architecture that can often learn new skills with much less training data than a network learning from scratch, or even than a network pre-trained on one previous task and then fine-tuned. The authors of the PNN paper demonstrated dramatic improvements in the data efficiency of their AI agents:

Figure 5. Tests of Progressive Neural Networks on variations of the Atari game Pong, illustrating how they learn more efficiently than two baselines: Base1, a single column trained on the target task, and Base3, a single column pre-trained on a source task and then fine-tuned on the target task. (Source:

Zero-Shot Natural Language Translation

Recently there have also been some great examples of transfer learning in the NLP space. A couple of months ago Google announced that it is rolling out DL models for machine translation, called Google Neural Machine Translation (GNMT), to replace the phrase-based models that used to be state of the art. GNMT models use a pair of recurrent neural networks: (1) an encoder that reads in words one at a time and produces a series of vectors representing all words read to that point, and (2) a decoder that reads the encoded vectors and outputs the translation (with an attention mechanism allowing the decoder to focus on the most important encoded vectors for each word it outputs). This method resulted in dramatic translation improvements for all language pairs, in some cases approaching human level.
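The attention step can be sketched in a few lines. Note the assumptions: this toy scores each encoded vector with a plain dot product against the decoder state, which is a simplification I am substituting for whatever scoring network GNMT actually uses, and the vectors are made-up numbers rather than real encoder outputs.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_vectors):
    """Weight each encoder vector by its relevance to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, vec)) for vec in encoder_vectors]
    weights = softmax(scores)
    size = len(encoder_vectors[0])
    # The context vector is the weighted average of the encoder vectors.
    context = [sum(w * vec[i] for w, vec in zip(weights, encoder_vectors)) for i in range(size)]
    return weights, context

# Hypothetical encoder outputs, one vector per source word.
encoder_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_vectors)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:   ", [round(c, 3) for c in context])
```

The first encoder vector is orthogonal to the decoder state, so it receives the smallest weight; the decoder "focuses" on the other two when producing its next word.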

Figure 6. According to their paper, GNMT “reduced translation errors by an average of 60% compared to Google’s phrase-based production system.” (Source:

A few weeks ago, Google researchers published an impressive paper describing how a trivial modification of their GNMT architecture allowed them to use a single network to translate all language pairs, instead of training a separate network for each language pair. To accomplish this, they simply added a token to the input indicating the target language, and then trained on multiple language pairs at once. This token provides the additional information that the decoder network needs in order to output the appropriate language. Not only were they able to train a single network to translate between many different languages, but they used the same size network as they would normally use for a single language pair, thereby dramatically reducing the number of parameters used for the entire collection of languages.
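The token trick is simple enough to sketch directly. In this sketch the token names only the language to translate into, and the `<2es>` spelling follows my recollection of the example in the paper, so treat the exact format as an assumption. The function name is hypothetical.

```python
def add_target_token(source_sentence, target_language):
    """Prefix the source sentence with a token naming the desired output language."""
    return f"<2{target_language}> {source_sentence}"

# The same source sentence, routed to two different target languages.
examples = [
    add_target_token("Hello, how are you?", "es"),
    add_target_token("Hello, how are you?", "ja"),
]
for line in examples:
    print(line)
```

Nothing else about the network changes: the token is just one more input symbol, and the model learns during training what it means.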


The really interesting part is that after training on many language pairs, the network was able to translate between language pairs it had never seen or been trained on. In other words, it achieved “zero-shot” translation.  The implication is that after training on a number of language pairs, the network develops its own “universal interlingua representation” of the meaning of source sentences independent of the source language.  Once it has this representation of the meaning of the sentence it can translate it to any target language it knows about, regardless of whether it has ever seen the source-target combination.


To verify that the neural network actually creates this interlingua representation, the authors used t-SNE to plot a 2-dimensional representation of the intermediate vectors connecting the encoding and decoding networks.  Below, in figure (7a) each color represents the intermediate vectors produced when translating semantically identical sentences in English, Korean and Japanese.  (Each vector is a dot, and vectors produced in a series as part of translating a single sentence from one language are connected by a line.)  The fact that similarly colored (and thus semantically identical) sentences are clustered near each other illustrates that the neural network has understood them to have similar meanings and therefore produces similar intermediate vectors (in its interlingua representation).  Figure (7b) zooms in on one example, and figure (7c) re-colors that example to distinguish between the semantically identical sentences in the three different languages.

Figure 7.  (Source:


The takeaway here is that the network developed an internal representation of the problem domain—of the meaning (semantics) of the sentences represented independently of the particular vocabulary or grammar of a language.  That representation turned out to be so rich that it enabled the network to translate between language pairs with no labeled training data.  The network transferred its learning from language pairs it had seen to pairs that it had never seen before.


The second in the series is available here, and some bonus material here.

