Notes on Representation Learning Continued

This post is part of a three-part series.

  1. Notes on Representation Learning
  2. Notes on Representation Learning Continued
  3. Representation Learning Bonus Material

Ten Shot Learning with Generative Adversarial Networks

A very exciting approach to representation learning (but one that sadly does not work on discrete values like text, at least not without some modification) is the Generative Adversarial Network (GAN).  GANs learn to generate images through unsupervised training (on unlabeled data).  Below, the images on the left are original training images, and the ones on the right are generated images.  You’ll notice that some of the images on the right are actually just natural-looking blotches of color, arranged in such a way that, at a glance, they might resemble actual photographs.  But you’ll also notice that it’s possible to make out real objects (cars and horses).  That’s a dramatic improvement over natural image generation from a couple of years ago.

Figure 8.  (Source:

The key to this improvement is a novel method for unsupervised training (on unlabeled data) involving two separate neural networks that work together to learn about the structure of images.  One network, the generator, takes a random noise vector as a seed and outputs an image.  The other network, the discriminator, is trained to distinguish between real images and images output by the generator.  At the beginning of training, with randomly assigned parameters, the images output by the generator are random noise.  As training progresses, the networks improve in parallel, each shoring up its own weaknesses and exploiting the other’s (thus the term “adversarial”).  One of the tricks that makes this work so well is that while training the generator, it can be wired directly to the discriminator, allowing gradients to flow through the discriminator to the generator, and thereby giving the generator clear information about how to improve.  The important point is just that GANs are a clever way to define a loss function that results in dramatic improvements to models trained on unlabeled data (without supervision).  Other unsupervised loss functions, such as those used in auto-encoders, disincentivize taking firm guesses and thus produce blurry images.  A blurry image is easy for a discriminator network to recognize as unnatural, and thus GANs produce much sharper, more convincing images.
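To make the two-network training loop concrete, here is a minimal sketch in plain Python.  It is a toy under loud assumptions, not the setup from any of the papers: the "images" are single numbers drawn from a Gaussian, the generator is a linear map G(z) = a*z + b, the discriminator is a logistic unit D(x) = sigmoid(w*x + c), and all names (train_toy_gan, real_mean, and so on) are made up for illustration.  The part worth noticing is the generator step, where the gradient of the generator's loss is computed by passing through the discriminator's weights.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_toy_gan(real_mean=3.0, steps=4000, batch=32, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a*z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (d - 1.0) * x
            grad_c += d - 1.0
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w += d * x
            grad_c += d
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: minimize -log D(G(z)).
        # The gradient reaches a and b by flowing through the discriminator.
        grad_a = grad_b = 0.0
        for z in noise:
            x = a * z + b
            d = sigmoid(w * x + c)
            dloss_dx = (d - 1.0) * w  # d(-log D(x)) / dx
            grad_a += dloss_dx * z
            grad_b += dloss_dx
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch

    return a, b, w, c


if __name__ == "__main__":
    a, b, w, c = train_toy_gan()
    print("generator offset b:", b)
```

Because this discriminator is linear in x, it can only compare the average of real and generated samples, so the sketch should pull the generator's offset b toward real_mean but will not learn the spread of the data.  Real GANs use deep networks on both sides for exactly that reason.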


Below is another example trained on bedroom images instead of outdoor images.  Here, the GAN learns to generate scenes that are more consistently realistic and recognizable than the outdoor images above.  In many of the pictures it’s easy to make out clearly defined beds, windows, doors, pictures, and lamps.  In almost all the scenes the network has remarkably learned to produce realistic shading, coloring, positioning and three-dimensional perspective.  Beds are never on the ceiling, windows cast light, distant objects are smaller, and lines of perspective converge.  Anyone who has played a first-person shooter video game is used to computers being able to do those things, but the important thing to keep in mind here is that none of this has been explicitly programmed.  It was all learned simply from looking at the raw pixels of bedroom images.  The model has learned a rich representation of the physical world of bedrooms, all on its own, with no supervision (no labeled data).

Figure 11. (Source:

GANs have been a hot area of research for a couple of years, and other papers that have tweaked the original model have provided further striking demonstrations that GANs are truly capturing knowledge about the structure of our three-dimensional world.  By putting various constraints on the parameters, it’s possible to force the model to associate positions in the seed vector with various attributes of the image.  For instance, here certain positions in the seed vector were associated with pose, elevation, lighting and width of generated faces so that modifying each associated number predictably changes those attributes in generated images.
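The idea of changing one position in the seed vector while holding the rest fixed can be sketched in a few lines of Python.  Everything here is hypothetical: toy_generator is a stand-in for a trained generator and simply reports which "attributes" it would render, and the mapping of position 0 to pose and position 1 to lighting is assumed purely for illustration.

```python
def traverse_latent(generator, seed, index, values):
    """Return one generated output per value, changing only seed[index]."""
    outputs = []
    for v in values:
        z = list(seed)  # copy, so the original seed vector is left untouched
        z[index] = v
        outputs.append(generator(z))
    return outputs


def toy_generator(z):
    # Hypothetical stand-in for a trained generator: instead of pixels it
    # returns the attributes that a constrained model might tie to each slot.
    return {"pose_deg": 90.0 * z[0], "lighting": 0.5 + 0.5 * z[1]}


if __name__ == "__main__":
    seed = [0.0, 0.0]
    for out in traverse_latent(toy_generator, seed, index=0, values=[-1.0, 0.0, 1.0]):
        print(out)
```

With a real generator the outputs would be images, and sweeping one slot would produce the kind of row of gradually rotating faces shown in the figure.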

Figure 12. (Source:


This is actually quite amazing.  The ability to rotate and morph objects suggests that the model has actually learned some kind of three-dimensional representation of objects it has only seen (and produced) in two-dimensional images.  Again, it’s important to remember that none of this was explicitly programmed.  The model has learned this entirely from viewing the pixels of thousands of individual, unlabeled images.  Here is another example that demonstrates that the model has learned sophisticated three-dimensional representations of the images it is generating:

Figure 13. (Source:


Once again, the model is able to rotate objects, demonstrating that it has some sort of internal three-dimensional representation of the structure of the objects it is producing pictures of.  This is all pretty neat, but what are the implications for the typical kinds of supervised machine learning that make up the day-to-day activities of most Data Scientists?  Here is the punchline on GANs (from a recent paper by the folks at OpenAI):


In addition to generating pretty pictures, we introduce an approach for semi-supervised learning with GANs that involves the discriminator producing an additional output indicating the label of the input. This approach allows us to obtain state of the art results on MNIST, SVHN, and CIFAR-10 in settings with very few labeled examples. On MNIST, for example, we achieve 99.14% accuracy with only 10 labeled examples per class with a fully connected neural network — a result that’s very close to the best known results with fully supervised approaches using all 60,000 labeled examples. (


In other words, the internal representations developed during unsupervised training provide GANs with such a rich understanding of the data that they can learn with just a handful of labeled examples (instead of the many thousands of examples typically required for deep learning).  Deep learning with tiny labeled datasets.  This is a game changer.
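For readers who want to see how a discriminator with "an additional output indicating the label" might be scored, here is a small sketch in plain Python.  It assumes one formulation I believe is in the spirit of the quoted approach: the discriminator emits K class logits, the supervised loss is ordinary cross-entropy over those K classes, and the probability that an input is real is taken to be D(x) = Z(x) / (Z(x) + 1) with Z(x) = sum(exp(logit_k)).  The function names are my own, and a real implementation would operate on batches inside a deep learning framework.

```python
import math


def logsumexp(values):
    # log(sum(exp(v))) computed without overflow
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softplus(x):
    # log(1 + exp(x)) computed without overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def supervised_loss(logits, label):
    """Cross-entropy over the K real classes for one labeled example."""
    return logsumexp(logits) - logits[label]


def unsupervised_loss(real_logits, fake_logits):
    """Real-versus-generated loss for one unlabeled and one generated example.

    With D(x) = Z / (Z + 1) and log Z = logsumexp(logits):
      -log D(real)     = softplus(-logsumexp(real_logits))
      -log(1 - D(fake)) = softplus(logsumexp(fake_logits))
    """
    return softplus(-logsumexp(real_logits)) + softplus(logsumexp(fake_logits))


if __name__ == "__main__":
    labeled = supervised_loss([2.0, 0.1, -1.0], label=0)
    unlabeled = unsupervised_loss([2.0, 0.1, -1.0], [-3.0, -2.5, -4.0])
    print("total discriminator loss:", labeled + unlabeled)
```

The design point is that the large pool of unlabeled images only ever feeds unsupervised_loss, while the handful of labeled examples feeds supervised_loss, and both losses shape the same internal representation.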


Representation Learning and Increased Data Efficiency in Reinforcement Learning

A couple of weeks ago DeepMind released another paper building on their amazing reinforcement learning work.  The overall idea is that by supplementing the main learning task (maximizing reward) with auxiliary learning tasks, such as predicting how game actions result in changes to screen pixels or replaying moments of high reward, it’s possible to encourage the development of richer internal representations that allow the agent to maximize reward much more efficiently.


Figure 14.  (Source:


They describe it like this:


Consider a baby that learns to maximize the cumulative amount of red that it observes. To correctly predict the optimal value, the baby must understand how to increase “redness” by various means, including manipulation (bringing a red object closer to the eyes); locomotion (moving in front of a red object); and communication (crying until the parents bring a red object). These behaviors are likely to recur for many other goals that the baby may subsequently encounter.  (


By augmenting the reinforcement learning objective with objectives related to predicting and controlling features of the sensorimotor stream, they reduced the amount of data needed to train their state-of-the-art reinforcement learning agents by a factor of ten:


Perhaps of equal importance, aside from final performance on the games, UNREAL is significantly faster at learning and therefore more data efficient, achieving a mean speedup of the number of steps to reach A3C best performance … across all levels… This translates in a drastic improvement in the data efficiency of UNREAL over A3C, requiring less than 10% of the data to reach the final performance of A3C. (
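Mechanically, augmenting the main objective with auxiliary objectives can be as simple as a weighted sum of losses that all update the same shared network.  The sketch below assumes exactly that; the task names and weights are placeholders I picked for illustration, not values from the paper.

```python
def combined_loss(main_loss, auxiliary_losses, weights):
    """Add weighted auxiliary losses to the main reinforcement learning loss."""
    total = main_loss
    for name, loss in auxiliary_losses.items():
        total += weights[name] * loss
    return total


if __name__ == "__main__":
    # Placeholder numbers standing in for per-batch loss values.
    aux = {"pixel_control": 2.0, "reward_prediction": 0.5}
    weights = {"pixel_control": 0.1, "reward_prediction": 1.0}
    print(combined_loss(1.0, aux, weights))
```

Because every term is backpropagated into the same shared layers, the auxiliary tasks shape the representation even when the main reward signal is sparse.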


The common theme of all the examples above is that innovations in the structure of neural networks allow them to develop rich internal representations (learned from unlabeled data, or transferred from related datasets) that dramatically improve data efficiency when subsequently learning on labeled datasets.  The above examples are only a small sample of the progress in representation learning.  However, I’ve tried to choose examples with diverse learning types and problem domains to illustrate how the progress is transforming every area of machine learning.  In all of these examples, DL models need much less labeled data than alternatives and end up performing much better.  The conventional wisdom that DL models are limited to situations where large labeled training sets are available is now being reversed.


Some of these breakthroughs are very new.  Many of the papers referenced above are only a few months old.  The implications are only beginning to ripple through the machine learning world.  Nevertheless, they have produced real state-of-the-art results, and some are already in the process of being production-ized.


Fortunately, for the work Data Scientists do at Thomson Reuters Labs, where I work, this trend plays to our strengths.  While we have vast quantities of unlabeled data, the bottleneck is often in procuring sufficient labeled data.  As the need for large labeled datasets disappears, we will be better positioned to unlock the knowledge and insight hidden in our petabytes of legal, financial and news data.  More generally, I believe that the firms that survive the next decade will be the ones that understand these implications now, and that have already begun focusing on how to adapt these (and other) DL breakthroughs to their particular problem domains.  Firms that fail to do that will find themselves rapidly unable to compete in markets where deep learning startups, and existing behemoths, offer superior products at lower costs.

Jump back to the first post in the series here, or move ahead to some bonus material here.