# An Overview of Multi-Task Learning in Deep Neural Networks

Posted by Sebastian Ruder, July 4, 2017

Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.


# Introduction

In Machine Learning (ML), we typically care about optimizing for a particular metric, whether this is a score on a certain benchmark or a business KPI. In order to do this, we generally train a single model or an ensemble of models to perform our desired task. We then fine-tune and tweak these models until their performance no longer increases. While we can generally achieve acceptable performance this way, by being laser-focused on our single task, we ignore information that might help us do even better on the metric we care about. Specifically, this information comes from the training signals of related tasks. By sharing representations between related tasks, we can enable our model to generalize better on our original task. This approach is called Multi-Task Learning (MTL) and will be the topic of this blog post.

Multi-task learning has been used successfully across all applications of machine learning, from natural language processing [^{1}] and speech recognition [^{2}] to computer vision [^{3}] and drug discovery [^{4}]. MTL comes in many guises: joint learning, learning to learn, and learning with auxiliary tasks are only some names that have been used to refer to it. Generally, as soon as you find yourself optimizing more than one loss function, you are effectively doing multi-task learning (in contrast to single-task learning). In those scenarios, it helps to think about what you are trying to do explicitly in terms of MTL and to draw insights from it.

Even if you’re only optimizing one loss as is the typical case, chances are there is an auxiliary task that will help you improve upon your main task. Rich Caruana [^{5}] summarizes the goal of MTL succinctly: “MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks”.

Over the course of this blog post, I will try to give a general overview of the current state of multi-task learning, in particular when it comes to MTL with deep neural networks. I will first motivate MTL from different perspectives. I will then introduce the two most frequently employed methods for MTL in Deep Learning. Subsequently, I will describe mechanisms that together illustrate why MTL works in practice. Before looking at more advanced neural network-based MTL methods, I will provide some context by discussing the literature in MTL. I will then introduce some more powerful recently proposed methods for MTL in deep neural networks. Finally, I will talk about commonly used types of auxiliary tasks and discuss what makes a good auxiliary task for MTL.

# Motivation

We can motivate multi-task learning in different ways: Biologically, we can see multi-task learning as being inspired by human learning. For learning new tasks, we often apply the knowledge we have acquired by learning related tasks. For instance, a baby first learns to recognize faces and can then apply this knowledge to recognize other objects.

From a pedagogical perspective, we often first learn tasks that provide us with the necessary skills to master more complex techniques. This is as true for learning the proper way of falling in martial arts, e.g. Judo, as it is for learning to program.

Taking an example out of pop culture, we can also consider *The Karate Kid* (1984) (thanks to Margaret Mitchell and Adrian Benton for the inspiration). In the movie, *sensei* Mr Miyagi teaches the karate kid seemingly unrelated tasks such as sanding the floor and waxing a car. In hindsight, however, these turn out to equip him with invaluable skills that are relevant for learning karate.

Finally, we can motivate multi-task learning from a machine learning point of view: We can view multi-task learning as a form of inductive transfer. Inductive transfer can help improve a model by introducing an inductive bias, which causes a model to prefer some hypotheses over others. For instance, a common form of inductive bias is \(\ell_1\) regularization, which leads to a preference for sparse solutions. In the case of MTL, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. As we will see shortly, this generally leads to solutions that generalize better.

# Two MTL methods for Deep Learning

So far, we have focused on theoretical motivations for MTL. To make the ideas of MTL more concrete, we will now look at the two most commonly used ways to perform multi-task learning in deep neural networks. In the context of Deep Learning, multi-task learning is typically done with either *hard* or *soft parameter sharing* of hidden layers.

## Hard parameter sharing

Hard parameter sharing is the most commonly used approach to MTL in neural networks and goes back to [^{6}]. It is generally applied by sharing the hidden layers between all tasks, while keeping several task-specific output layers.
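To make the architecture concrete, here is a minimal NumPy sketch of a forward pass with one shared hidden layer and one output layer per task. The layer sizes, the `tanh` activation, and the function names are illustrative assumptions rather than anything prescribed by the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_hard_sharing(n_in, n_hidden, task_out_dims):
    """Create one shared hidden layer and one output layer (head) per task."""
    shared = {
        "W": rng.normal(scale=0.1, size=(n_in, n_hidden)),
        "b": np.zeros(n_hidden),
    }
    heads = [
        {"W": rng.normal(scale=0.1, size=(n_hidden, n_out)), "b": np.zeros(n_out)}
        for n_out in task_out_dims
    ]
    return shared, heads

def forward(x, shared, heads):
    """Compute the shared representation once, then apply every task head to it."""
    h = np.tanh(x @ shared["W"] + shared["b"])  # hidden layer shared by all tasks
    return [h @ head["W"] + head["b"] for head in heads]

# Hypothetical setup: two tasks with 3 and 1 outputs, a batch of 4 examples.
shared, heads = init_hard_sharing(n_in=8, n_hidden=16, task_out_dims=[3, 1])
x = rng.normal(size=(4, 8))
outputs = forward(x, shared, heads)
```

During training, the gradients of every task's loss would flow into `shared`, while each entry of `heads` is updated only by its own task.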

Hard parameter sharing greatly reduces the risk of overfitting. In fact, [^{7}] showed that the risk of overfitting the shared parameters is an order N smaller than the risk of overfitting the task-specific parameters, i.e. the output layers, where N is the number of tasks. This makes sense intuitively: The more tasks we are learning simultaneously, the more our model has to find a representation that captures all of the tasks and the smaller our chance of overfitting on our original task.

## Soft parameter sharing

In soft parameter sharing, on the other hand, each task has its own model with its own parameters. The distance between the parameters of the models is then regularized in order to encourage the parameters to be similar. [^{8}] for instance use the \(\ell_2\) norm for regularization, while [^{9}] use the trace norm.
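As a rough illustration, the regularizer can be sketched as a penalty on the distance between corresponding parameter arrays of two task models. The squared \(\ell_2\) distance, the `strength` weight, and the function names below are assumptions made for this sketch, not the exact formulation of the cited papers.

```python
import numpy as np

def soft_sharing_penalty(params_a, params_b):
    """Sum of squared l2 distances between corresponding parameter arrays."""
    return sum(float(np.sum((pa - pb) ** 2)) for pa, pb in zip(params_a, params_b))

def total_loss(loss_a, loss_b, params_a, params_b, strength=0.1):
    """Both task losses plus the weighted distance between the two models."""
    return loss_a + loss_b + strength * soft_sharing_penalty(params_a, params_b)

# Two hypothetical models, each with a weight matrix and a bias vector.
params_a = [np.ones((2, 2)), np.zeros(2)]
params_b = [np.zeros((2, 2)), np.zeros(2)]
penalty = soft_sharing_penalty(params_a, params_b)
loss = total_loss(1.0, 2.0, params_a, params_b, strength=0.5)
```

The penalty is zero only when both models have identical parameters, so `strength` interpolates between fully independent models and (in the limit) fully tied ones.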

The constraints used for soft parameter sharing in deep neural networks have been greatly inspired by regularization techniques for MTL that have been developed for other models, which we will soon discuss.

# Why does MTL work?

Even though an inductive bias obtained through multi-task learning seems intuitively plausible, in order to understand MTL better, we need to look at the mechanisms that underlie it. Most of these were first proposed by Caruana (1998). For all examples, we will assume that we have two related tasks \(A\) and \(B\), which rely on a common hidden layer representation \(F\).

## Implicit data augmentation

MTL effectively increases the sample size that we are using for training our model. As all tasks are at least somewhat noisy, when training a model on some task \(A\), our aim is to learn a good representation for task \(A\) that ideally ignores the data-dependent noise and generalizes well. As different tasks have different noise patterns, a model that learns two tasks simultaneously is able to learn a more general representation. Learning just task \(A\) bears the risk of overfitting to task \(A\), while learning \(A\) and \(B\) jointly enables the model to obtain a better representation \(F\) through averaging the noise patterns.

## Attention focusing

If a task is very noisy or data is limited and high-dimensional, it can be difficult for a model to differentiate between relevant and irrelevant features. MTL can help the model focus its attention on those features that actually matter as other tasks will provide additional evidence for the relevance or irrelevance of those features.

## Eavesdropping

Some features \(G\) are easy to learn for some task \(B\), while being difficult to learn for another task \(A\). This might either be because \(A\) interacts with the features in a more complex way or because other features are impeding the model’s ability to learn \(G\). Through MTL, we can allow the model to *eavesdrop*, i.e. learn \(G\) through task \(B\). The easiest way to do this is through *hints* [^{10}], i.e. directly training the model to predict the most important features.

## Representation bias

MTL biases the model to prefer representations that other tasks also prefer. This will also help the model to generalize to new tasks in the future as a hypothesis space that performs well for a sufficiently large number of training tasks will also perform well for learning novel tasks as long as they are from the same environment [^{11}].

## Regularization

Finally, MTL acts as a regularizer by introducing an inductive bias. As such, it reduces the risk of overfitting as well as the Rademacher complexity of the model, i.e. its ability to fit random noise.

# MTL in non-neural models

In order to better understand MTL in deep neural networks, we will now look to the existing literature on MTL for linear models, kernel methods, and Bayesian algorithms. In particular, we will discuss two main ideas that have been pervasive throughout the history of multi-task learning: enforcing sparsity across tasks through norm regularization; and modelling the relationships between tasks.

Note that many approaches to MTL in the literature deal with a homogeneous setting: They assume that each task is associated with a single output, e.g. the multi-class MNIST dataset is typically cast as 10 binary classification tasks. More recent approaches deal with a more realistic, heterogeneous setting where each task corresponds to a unique set of outputs.

## Block-sparse regularization

In order to better connect the following approaches, let us first introduce some notation. We have \(T\) tasks. For each task \(t\), we have a model \(m_t\) with parameters \(a_t\) of dimensionality \(d\). We can write the parameters as a column vector \(a_t = \begin{bmatrix} a_{1, t} & \ldots & a_{d, t} \end{bmatrix}^\top\). We now stack these column vectors \(a_1, \ldots, a_T\) column by column to form a matrix \(A \in \mathbb{R}^{d \times T}\). The \(i\)-th row of \(A\) then contains the parameter \(a_{i, \cdot}\) corresponding to the \(i\)-th feature of the model for every task, while the \(j\)-th column of \(A\) contains the parameters \(a_{\cdot,j}\) corresponding to the \(j\)-th model.

Many existing methods make some sparsity assumption with regard to the parameters of our models. [^{12}] assume that all models share a small set of features. In terms of our task parameter matrix \(A\), this means that all but a few rows are \(0\), which corresponds to only a few features being used across *all* tasks. In order to enforce this, they generalize the \(\ell_1\) norm to the MTL setting. Recall that the \(\ell_1\) norm is a constraint on the sum of the absolute values of the parameters, which forces all but a few parameters to be exactly \(0\). It is also known as lasso (**l**east **a**bsolute **s**hrinkage and **s**election **o**perator).

While in the single-task setting, the \(\ell_1\) norm is computed based on the parameter vector \(a_t\) of the respective task \(t\), for MTL we compute it over our task parameter matrix \(A\). In order to do this, we first compute an \(\ell_q\) norm across each row \(a_i\) containing the parameter corresponding to the \(i\)-th feature across all tasks, which yields a vector \(b = \begin{bmatrix} \|a_1\|_q & \ldots & \|a_d\|_q \end{bmatrix} \in \mathbb{R}^d\). We then compute the \(\ell_1\) norm of this vector, which forces all but a few entries of \(b\), i.e. rows in \(A\), to be \(0\).

As we can see, depending on what constraint we would like to place on each row, we can use a different \(\ell_q\). In general, we refer to these mixed-norm constraints as \(\ell_1/\ell_q\) norms. They are also known as block-sparse regularization, as they lead to entire rows of \(A\) being set to \(0\). [^{13}] use \(\ell_1/\ell_\infty\) regularization, while Argyriou et al. (2007) use a mixed \(\ell_1/\ell_2\) norm. The latter is also known as group lasso and was first proposed by [^{14}].
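The two-step computation (a row-wise norm, then a sum over rows) can be sketched in a few lines of NumPy. The function name `mixed_norm` and the example matrix are illustrative assumptions.

```python
import numpy as np

def mixed_norm(A, q=2):
    """l1/lq mixed norm of a (features x tasks) matrix A.

    Step 1: take the lq norm of each row (one row per feature, across tasks).
    Step 2: take the l1 norm of the resulting vector, i.e. sum the row norms.
    """
    row_norms = np.linalg.norm(A, ord=q, axis=1)
    return float(np.sum(row_norms))

# Three features, two tasks; the second feature is unused by both tasks.
A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, -1.0]])

l1_l2 = mixed_norm(A, q=2)         # group lasso penalty
l1_linf = mixed_norm(A, q=np.inf)  # l1/l_inf penalty
l1_l1 = mixed_norm(A, q=1)         # reduces to the element-wise l1 norm
```

A row contributes nothing to the penalty only if it is entirely zero, which is why minimizing this quantity tends to switch off whole features for all tasks at once.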

Argyriou et al. (2007) also show that the problem of optimizing the non-convex group lasso can be made convex by penalizing the trace norm of \(A\), which forces \(A\) to be low-rank and thereby constrains the column parameter vectors \(a_{\cdot, 1}, \ldots, a_{\cdot, T}\) to live in a low-dimensional subspace. [^{15}] furthermore establish upper bounds for using the group lasso in multi-task learning.
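For reference, the trace norm (also called the nuclear norm) of a matrix is the sum of its singular values. A small sketch, with an illustrative rank-1 parameter matrix in which every task's parameter vector is a multiple of the same direction:

```python
import numpy as np

def trace_norm(A):
    """Trace (nuclear) norm: the sum of the singular values of A."""
    return float(np.sum(np.linalg.svd(A, compute_uv=False)))

# Rank-1 example: both task columns lie on the same line in parameter space.
A_low_rank = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
rank = int(np.linalg.matrix_rank(A_low_rank))
tn_low_rank = trace_norm(A_low_rank)
tn_identity = trace_norm(np.eye(3))
```

For a rank-1 matrix there is a single non-zero singular value, so the trace norm coincides with the Frobenius norm; matrices whose columns spread over more directions pay a larger penalty relative to their Frobenius norm.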

As much as this block-sparse regularization is intuitively plausible, it is very dependent on the extent to which the features are shared across tasks. [^{16}] show that if features do not overlap by much, \(\ell_1/\ell_q\) regularization might actually be worse than element-wise \(\ell_1\) regularization.

For this reason, [^{17}] improve upon block-sparse models by proposing a method that combines block-sparse and element-wise sparse regularization. They decompose the task parameter matrix \(A\) into two matrices \(B\) and \(S\) where \(A = B + S\). \(B\) is then enforced to be block-sparse using \(\ell_1/\ell_\infty\) regularization, while \(S\) is made element-wise sparse using lasso. Recently, [^{18}] propose a distributed version of group-sparse regularization.
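The combined penalty on the two components can be sketched as follows. The weights `lam_b` and `lam_s` and the function names are hypothetical, introduced only for this illustration.

```python
import numpy as np

def block_sparse_norm(B):
    """l1/l_inf norm: largest absolute entry of each row, summed over rows."""
    return float(np.sum(np.max(np.abs(B), axis=1)))

def elementwise_l1(S):
    """Plain lasso penalty: sum of absolute values of all entries."""
    return float(np.sum(np.abs(S)))

def decomposed_penalty(B, S, lam_b=1.0, lam_s=1.0):
    """Penalty on the decomposition A = B + S (weights are assumptions)."""
    return lam_b * block_sparse_norm(B) + lam_s * elementwise_l1(S)

# B holds a feature shared by both tasks; S holds a feature used by one task only.
B = np.array([[1.0, -2.0], [0.0, 0.0], [0.0, 0.0]])
S = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 3.0]])
A = B + S
penalty = decomposed_penalty(B, S, lam_b=1.0, lam_s=0.5)
```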

## Learning task relationships

While the group-sparsity constraint forces our model to only consider a few features, these features are largely used across all tasks. All of the previous approaches thus assume that the tasks used in multi-task learning are closely related. However, each task might not be closely related to all of the available tasks. In those cases, sharing information with an unrelated task might actually hurt performance, a phenomenon known as negative transfer.

Rather than sparsity, we would thus like to leverage prior knowledge indicating that some tasks are related while others are not. In this scenario, a constraint that enforces a clustering of tasks might be more appropriate. [^{19}] suggest imposing a clustering constraint by penalizing both the norms of our task column vectors \(a_{\cdot, 1}, \ldots, a_{\cdot, T}\) as well as their variance with the following constraint:

\(\Omega = \|\bar{a}\|^2 + \dfrac{\lambda}{T} \sum^T_{t=1} \| a_{\cdot, t} - \bar{a} \|^2\)

where \(\bar{a} = (\sum^T_{t=1} a_{\cdot, t})/T\) is the mean parameter vector. This penalty enforces a clustering of the task parameter vectors \(a_{\cdot, 1}, \ldots, a_{\cdot, T}\) towards their mean that is controlled by \(\lambda\). They apply this constraint to kernel methods, but it is equally applicable to linear models.
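A direct NumPy transcription of this penalty, for a parameter matrix whose columns are the task parameter vectors, might look as follows. The function name and the example values are assumptions for illustration.

```python
import numpy as np

def mean_regularized_penalty(A, lam):
    """Squared norm of the mean column plus lam/T times the summed
    squared distances of every task column to that mean."""
    T = A.shape[1]
    a_bar = A.mean(axis=1, keepdims=True)  # mean parameter vector, shape (d, 1)
    spread = np.sum((A - a_bar) ** 2)      # sum over tasks of squared distances
    return float(np.sum(a_bar ** 2) + lam / T * spread)

# Two tasks with parameter vectors (1, 2) and (3, 2); their mean is (2, 2).
A = np.array([[1.0, 3.0],
              [2.0, 2.0]])
omega = mean_regularized_penalty(A, lam=4.0)

# If all tasks share the same parameters, only the mean term remains.
A_tied = np.array([[2.0, 2.0],
                   [2.0, 2.0]])
omega_tied = mean_regularized_penalty(A_tied, lam=4.0)
```

Increasing `lam` makes deviations from the mean more expensive, pulling the task models closer together.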

A similar constraint for SVMs was also proposed by [^{20}]. Their constraint is inspired by Bayesian methods and seeks to make all models close to some mean model. In SVMs, the loss thus trades off having a large margin for each SVM with being close to the mean model.

[^{21}] make the assumptions underlying cluster regularization more explicit by formalizing a cluster constraint on \(A\) under the assumption that the number of clusters \(C\) is known in advance. They then decompose the penalty into three separate norms:

- A global penalty which measures how large our column parameter vectors are on average: \(\Omega_{mean}(A) = \|\bar{a}\|^2\).
- A measure of between-cluster variance that measures how close to each other the clusters are: \(\Omega_{between}(A) = \sum^C_{c=1} T_c \| \bar{a}_c - \bar{a} \|^2\) where \(T_c\) is the number of tasks in the \(c\)-th cluster and \(\bar{a}_c\) is the mean vector of the task parameter vectors in the \(c\)-th cluster.
- A measure of within-cluster variance that gauges how compact each cluster is: \(\Omega_{within}(A) = \sum^C_{c=1} \sum_{t \in J(c)} \| a_{\cdot, t} - \bar{a}_c \|^2\) where \(J(c)\) is the set of tasks in the \(c\)-th cluster.

The final constraint then is the weighted sum of the three norms:

\(\Omega(A) = \lambda_1 \Omega_{mean}(A) + \lambda_2 \Omega_{between}(A) + \lambda_3 \Omega_{within}(A)\).

As this constraint assumes clusters are known in advance, they introduce a convex relaxation of the above penalty that allows the clusters to be learned at the same time.
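For the simpler case in which the cluster assignment is fixed, the three terms can be computed directly. The sketch below assumes squared Euclidean norms for all three terms (including the within-cluster term) and represents the clusters as lists of column indices; both choices, and the function names, are assumptions of this illustration rather than the authors' implementation.

```python
import numpy as np

def cluster_terms(A, clusters):
    """Return (mean, between, within) penalty terms for task columns of A.

    clusters is a list of lists of column indices, one list per cluster.
    """
    a_bar = A.mean(axis=1)                   # global mean parameter vector
    omega_mean = float(np.sum(a_bar ** 2))
    omega_between, omega_within = 0.0, 0.0
    for members in clusters:
        cols = A[:, members]                 # columns of the tasks in this cluster
        a_bar_c = cols.mean(axis=1)          # cluster mean
        omega_between += len(members) * float(np.sum((a_bar_c - a_bar) ** 2))
        omega_within += float(np.sum((cols - a_bar_c[:, None]) ** 2))
    return omega_mean, omega_between, omega_within

def cluster_penalty(A, clusters, lam1, lam2, lam3):
    """Weighted sum of the three terms."""
    m, b, w = cluster_terms(A, clusters)
    return lam1 * m + lam2 * b + lam3 * w

# One-dimensional parameters for four tasks forming two tight clusters.
A = np.array([[0.0, 2.0, 10.0, 12.0]])
clusters = [[0, 1], [2, 3]]
terms = cluster_terms(A, clusters)
omega = cluster_penalty(A, clusters, lam1=1.0, lam2=0.1, lam3=2.0)
```

In this example the within-cluster term is small compared with the between-cluster term, which is what a good clustering of the tasks should look like.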

In another scenario, tasks might not occur in clusters but have an inherent structure. [^{22}] extend the group lasso to deal with tasks that occur in a tree structure, while [^{23}] apply it to tasks with graph structures.

While the previous approaches to modelling the relationship between tasks employ norm regularization, other approaches do so without regularization: [^{24}] were the first to present a task clustering algorithm using k-nearest neighbour, while [^{25}] learn a common structure from multiple related tasks with an application to semi-supervised learning.

Much other work on learning task relationships for multi-task learning uses Bayesian methods:

[^{26}] propose a Bayesian neural network for multi-task learning by placing a prior on the model parameters to encourage similar parameters across tasks. [^{27}] extend Gaussian processes (GP) to MTL by inferring parameters for a shared covariance matrix. As this is computationally very expensive, they adopt a sparse approximation scheme that greedily selects the most informative examples. [^{28}] also use GP for MTL by assuming that all models are sampled from a common prior.

[^{29}] place a Gaussian as a prior distribution on each task-specific layer. In order to encourage similarity between different tasks, they propose to make the mean task-dependent and introduce a clustering of the tasks using a mixture distribution. Importantly, they require task characteristics that define the clusters and the number of mixtures to be specified in advance.

Building on this, [^{30}] draw the distribution from a Dirichlet process and enable the model to learn the similarity between tasks as well as the number of clusters. They then share the same model among all tasks in the same cluster. [^{31}] propose a hierarchical Bayesian model, which learns a latent task hierarchy, while [^{32}] use a GP-based regularization for MTL and extend a previous GP-based approach to be more computationally feasible in larger settings.

Other approaches focus on the online multi-task learning setting: [^{33}] adapt some existing methods such as the approach by Evgeniou et al. (2005) to the online setting. They also propose an MTL extension of the regularized Perceptron, which encodes task relatedness in a matrix. They use different forms of regularization to bias this task relatedness matrix, e.g. the closeness of the task characteristic vectors or the dimension of the spanned subspace. Importantly, similar to some earlier approaches, they require the task characteristics that make up this matrix to be provided in advance. [^{34}] then extend the previous approach by learning the task relationship matrix.

[^{35}] assume that tasks form disjoint groups and that the tasks within each group lie in a low-dimensional subspace. Within each group, tasks share the same feature representation whose parameters are learned jointly together with the group assignment matrix using an alternating minimization scheme. However, a total disjointness between groups might not be the ideal way, as the tasks might still share some features that are helpful for prediction.

[^{36}] in turn allow two tasks from different groups to overlap by assuming that there exist a small number of latent basis tasks. They then model the parameter vector \(a_t\) of every actual task \(t\) as a linear combination of these: \(a_t = Ls_t\) where \(L \in \mathbb{R}^{d \times k}\) is a matrix whose columns are the parameter vectors of \(k\) latent tasks, while \(s_t \in \mathbb{R}^k\) is a vector containing the coefficients of the linear combination. In addition, they constrain the linear combination to be sparse in the latent tasks; the overlap in the sparsity patterns between two tasks then controls the amount of sharing between these. Finally, [^{37}] learn a small pool of shared hypotheses and then map each task to a single hypothesis.
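The latent basis idea can be illustrated with a small NumPy sketch. Here the basis matrix is stored with one latent task per column (shape `d x k`) so that the product with a length-`k` coefficient vector yields a length-`d` parameter vector; the sizes, the coefficient values, and the helper name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3                      # parameter dimensionality, number of latent tasks

L = rng.normal(size=(d, k))      # columns are the latent basis tasks

# Sparse coefficients, one column s_t per actual task (k x T with T = 4).
S = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 2.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.5]])

A = L @ S                        # column t is a_t = L s_t, so A has shape (d, T)

def shared_basis_tasks(S, t1, t2):
    """Indices of latent tasks used (non-zero coefficient) by both tasks."""
    return np.flatnonzero((S[:, t1] != 0) & (S[:, t2] != 0))

overlap_0_2 = shared_basis_tasks(S, 0, 2).tolist()  # both use latent task 0
overlap_0_3 = shared_basis_tasks(S, 0, 3).tolist()  # no latent task in common
```

Tasks whose sparsity patterns overlap share parameters through the common latent tasks, while tasks with disjoint patterns are effectively learned independently.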

# Recent work on MTL for Deep Learning

While many recent Deep Learning approaches have used multi-task learning — either explicitly or implicitly — as part of their model (prominent examples will be featured in the next section), they all employ the two approaches we introduced earlier, hard and soft parameter sharing. In contrast, only a few papers have looked at developing better mechanisms for MTL in deep neural networks.

## Deep Relationship Networks

In MTL for computer vision, approaches often share the convolutional layers, while learning task-specific fully-connected layers. [^{38}] improve upon these models by proposing Deep Relationship Networks. In addition to the structure of shared and task-specific layers, which can be seen in Figure 3, they place matrix priors on the fully connected layers, which allow the model to learn the relationship between tasks, similar to some of the Bayesian models we have looked at before. This approach, however, still relies on a pre-defined structure for sharing, which may be adequate for well-studied computer vision problems, but may prove error-prone for novel tasks.

## Fully-Adaptive Feature Sharing

Starting at the other extreme, [^{39}] propose a bottom-up approach that starts with a thin network and dynamically widens it greedily during training using a criterion that promotes grouping of similar tasks. The widening procedure, which dynamically creates branches, can be seen in Figure 4. However, the greedy method might not be able to discover a model that is globally optimal, while assigning each branch to exactly one task does not allow the model to learn more complex interactions between tasks.

## Cross-stitch Networks

[^{40}] start out with two separate model architectures just as in soft parameter sharing. They then use what they refer to as cross-stitch units to allow the model to determine in what way the task-specific networks leverage the knowledge of the other task by learning a linear combination of the output of the previous layers. Their architecture can be seen in Figure 5, in which they only place cross-stitch units