Prediction rules in deep learning are based on a forward, recursive computation through several layers. Implicit deep learning rules go much further: they rely on the solution of an implicit (or “fixed-point”) equation that has to be solved numerically in order to make the prediction. For a given input vector u, the predicted vector y is of the form
y = Cx + Du, x = ϕ(Ax + Bu),
where A,B,C,D are matrices containing the model weights, and ϕ is a given (nonlinear) activation function, such as the ReLU. Here, the so-called “state” n-vector x, which contains the hidden features of the model, is not expressed explicitly; rather it is implicitly defined via the “fixed-point” (or, equilibrium) equation x=ϕ(Ax+Bu).
At first glance, the above models seem very specific. Perhaps surprisingly, they include as special cases most known neural network architectures, including standard feedforward networks, CNNs, RNNs, and many more. We can specify such architectures with a proper definition of the activation ϕ and by imposing adequate linear structure on the model matrices A,B,C,D. For example, constraining the matrix A to be strictly upper block-diagonal corresponds to the class of feedforward networks.
The picture on the left illustrates the structure of the model matrices for a 6-layer network. The matrix A has a strictly upper block-diagonal structure, with the size of each block corresponding to the dimensions of each layer.
Further specifying structure in each block, such as equal elements along diagonals, allows one to encode convolutional layers.
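As a minimal sketch of the feedforward case described above, the NumPy snippet below embeds a small ReLU network with two hidden layers into the implicit form y = Cx + Du, x = ϕ(Ax + Bu), and checks that both give the same prediction. The layer sizes, the random weights W0, W1, W2, the stacking order of the state x = (x2, x1), and the helper names are illustrative assumptions, not taken from the original model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical network: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
rng = np.random.default_rng(0)
W0 = rng.standard_normal((3, 2))  # input   -> layer 1
W1 = rng.standard_normal((2, 3))  # layer 1 -> layer 2
W2 = rng.standard_normal((1, 2))  # layer 2 -> output

def forward(u):
    """Standard explicit forward pass through the layers."""
    x1 = relu(W0 @ u)
    x2 = relu(W1 @ x1)
    return W2 @ x2

# Implicit form with state x = (x2, x1): the only nonzero block of A
# sits just above the block diagonal.
A = np.block([
    [np.zeros((2, 2)), W1],
    [np.zeros((3, 2)), np.zeros((3, 3))],
])
B = np.vstack([np.zeros((2, 2)), W0])
C = np.hstack([W2, np.zeros((1, 3))])
D = np.zeros((1, 2))

def implicit_predict(u, n_iter=2):
    """Solve x = relu(Ax + Bu) by iteration, then return y = Cx + Du.

    A is nilpotent here, so as many iterations as there are hidden
    layers (two) reach the exact fixed point.
    """
    x = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = relu(A @ x + B @ u)
    return C @ x + D @ u

u = np.array([0.5, -1.0])
print(forward(u), implicit_predict(u))  # the two predictions coincide
```

The first iteration fills in the layer-1 features and the second fills in the layer-2 features, which is exactly the forward pass written as a fixed-point computation.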
Implicit rules allow for much wider classes of models: measured by the number of parameters available for a given dimension of the hidden features, they have considerably more expressive power than standard networks.
Recent work on implicit rules has demonstrated their potential. Kolter and collaborators [1,5] showcased the success of their implicit framework, termed Deep Equilibrium Models, for the task of sequence modeling. Chen et al. [2] used implicit methods to construct a related class of models, known as neural ordinary differential equations. Building on earlier work [4], the paper [3] provides some theoretical and algorithmic foundations for implicit learning.
Well-posedness and tractability. One of the thorny issues with implicit rules is well-posedness and numerical tractability: how can we guarantee that there exists a unique solution x, and if so, how can we solve for x efficiently? In standard networks, the issue does not arise, since one can always express the hidden state variable in explicit form by recursive elimination, that is, via a forward pass through the layers. As seen in [3], there are simple conditions on the matrix A that guarantee both well-posedness and tractability, for example
‖A‖∞ < 1 (that is, the largest absolute row sum of A is less than one),
in which case the recursion x(t+1) = ϕ(Ax(t) + Bu), t = 0,1,2,…, converges quickly to the unique solution. The constraint above tends to encourage sparsity of A, which in turn brings about many benefits: speedup at test time, architecture simplification, and reduced memory requirements.
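The recursion can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a ReLU activation and the sufficient condition that the largest absolute row sum of A is below one; the function name solve_state, the tolerance, the dimensions, and the rescaling of a random A to norm 0.9 are choices made for the demo only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def solve_state(A, B, u, tol=1e-10, max_iter=1000):
    """Iterate x <- relu(Ax + Bu) until successive iterates agree.

    Returns the state x and the number of iterations used.
    """
    # For a matrix, ord=np.inf gives the largest absolute row sum.
    if np.linalg.norm(A, np.inf) >= 1.0:
        raise ValueError("sufficient condition ||A||_inf < 1 is violated")
    x = np.zeros(A.shape[0])
    for t in range(max_iter):
        x_next = relu(A @ x + B @ u)
        if np.max(np.abs(x_next - x)) <= tol:
            return x_next, t + 1
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical model: n = 20 hidden features, p = 50 inputs.
rng = np.random.default_rng(0)
n, p = 20, 50
A = rng.standard_normal((n, n))
A *= 0.9 / np.linalg.norm(A, np.inf)  # rescale so that ||A||_inf = 0.9
B = rng.standard_normal((n, p))
u = rng.standard_normal(p)

x, iters = solve_state(A, B, u)
residual = np.max(np.abs(x - relu(A @ x + B @ u)))
print(iters, residual)
```

Because the ReLU is 1-Lipschitz componentwise, each iteration shrinks the error in the maximum norm by at least the factor ‖A‖∞, which is why the loop terminates.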
The training problem for implicit learning can be addressed via standard unconstrained optimization methods that are popular in the deep learning community, such as stochastic gradient descent (SGD). However, computing gradients through a fixed-point equation is challenging. In addition, SGD does not guarantee well-posedness of the prediction rule; properly handling the corresponding constraint requires constrained optimization, for example block-coordinate descent (BCD) methods, which tend to converge very fast.
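To make the gradient difficulty concrete, here is a hedged sketch of one common workaround, implicit differentiation, which is not necessarily the approach used in the work discussed here. Differentiating x = ϕ(Ax + Bu) at the solution gives the linear system (I − GA) dx = GB du, with G the diagonal matrix of activation derivatives. The sketch assumes a smooth tanh activation (so that a finite-difference check is meaningful) and ‖A‖∞ = 0.5; all names and sizes are illustrative.

```python
import numpy as np

def solve_state(A, B, u, n_iter=200):
    """Fixed-point iteration for x = tanh(Ax + Bu)."""
    x = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = np.tanh(A @ x + B @ u)
    return x

def state_jacobian(A, B, u):
    """Jacobian dx/du at the fixed point, via implicit differentiation.

    Solves (I - G A) J = G B, where G = diag(phi'(Ax + Bu)).
    """
    x = solve_state(A, B, u)
    g = 1.0 - np.tanh(A @ x + B @ u) ** 2  # tanh'(z) = 1 - tanh(z)^2
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - g[:, None] * A, g[:, None] * B)

# Hypothetical small model, rescaled so the iteration is a contraction.
rng = np.random.default_rng(0)
n, p = 5, 3
A = rng.standard_normal((n, n))
A *= 0.5 / np.linalg.norm(A, np.inf)
B = rng.standard_normal((n, p))
u = rng.standard_normal(p)

J = state_jacobian(A, B, u)

# Check against central finite differences, one input coordinate at a time.
eps = 1e-6
J_fd = np.column_stack([
    (solve_state(A, B, u + eps * e) - solve_state(A, B, u - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
max_gap = np.max(np.abs(J - J_fd))
print(max_gap)
```

The point of the sketch is that the derivative requires one extra linear solve at the fixed point rather than backpropagation through every iteration of the solver.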
A nice aspect of BCD methods is their ability to handle interesting constraints or penalties. In the implicit rule above, a constraint on the input matrix B of the form
where k is a small positive hyper-parameter, will encourage B to be “column-sparse”, that is, entire columns of B are zero; in turn, the resulting model will select important inputs and discard the others, effectively accomplishing feature selection via deep learning.
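The link between column sparsity and feature selection can be illustrated as follows. The exact constraint is not reproduced in the text, so this sketch assumes a penalty on the sum of the Euclidean norms of the columns of B and applies a single group soft-thresholding step (the proximal operator of that penalty). This is a stand-in for illustration, not the BCD algorithm itself; the threshold tau and the synthetic matrix are hypothetical.

```python
import numpy as np

def column_soft_threshold(B, tau):
    """Shrink every column of B toward zero by tau in Euclidean norm.

    Columns whose norm is at most tau become exactly zero.
    """
    norms = np.linalg.norm(B, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return B * scale  # scale[j] multiplies column j

# Hypothetical input matrix: 20 hidden features, 50 inputs,
# of which only three inputs carry large weights.
rng = np.random.default_rng(0)
B = 0.05 * rng.standard_normal((20, 50))
important = [3, 17, 42]
B[:, important] += 2.0 * rng.standard_normal((20, 3))

B_sparse = column_soft_threshold(B, tau=1.0)

# Inputs whose column survives are the selected features.
selected = np.flatnonzero(np.linalg.norm(B_sparse, axis=0) > 0)
print(selected)
```

An input whose column of B is zero cannot influence the state x, so the surviving columns identify the inputs the model actually uses.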
We generated a synthetic data set of 400 points, using a given implicit model with n = 20 hidden features, 50 inputs and 100 outputs, and with B a column-sparse matrix. Using a training model with an incorrect guess of n = 10 hidden features, the BCD algorithm nevertheless recovers the correct sparsity pattern of the generative model, as evidenced by the fact that the vector of column norms of the learned B matrix (left column) matches that of the generative model (right column).
There are many other potential benefits of implicit models. In the upcoming ODSC talk, I will provide an overview of implicit learning and detail some exciting developments towards robustness, interpretability, and architecture learning.
[1] Bai, S., Kolter, J. Z., and Koltun, V. (2019). Deep equilibrium models. Preprint submitted.
[2] Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In NeurIPS 2018, pages 6571–6583.
[3] El Ghaoui, L., Gu, F., Travacca, B., and Askari, A. (2019). Deep implicit learning. In preparation.
[4] Gu, F., Askari, A., and El Ghaoui, L. (2018). Fenchel lifted networks: A Lagrange relaxation of neural network training. Preprint arXiv:1811.08039.
[5] Kolter, J. Z. (2019). Deep equilibrium models: one (implicit) layer is all you need. Presentation at the Simons Institute, August 2019.
Editor’s note: Laurent is a speaker for ODSC West in California later this year! Be sure to attend his upcoming talk.