At work, I’ve been rolling my own deep learning package to experiment with graph convolutional neural networks. I did this because graph-centric deep learning, an idea I picked up from this paper, is still under active development: the inputs, convolution kernels, and much more keep changing, and the standard APIs don’t fit this kind of data.
Here are the lessons I learned (and reinforced) while doing this.
autograd is an amazing package
I am using autograd to write my neural networks. autograd provides a way to automatically differentiate numpy code. As long as I write the forward computation up to the loss function, autograd will be able to differentiate the loss function w.r.t. all of the parameters used. The resulting gradient points in the direction of steepest increase, so moving the parameters against it reduces the loss.
Deep learning is nothing more than chaining elementary differentiable functions
Linear regression is nothing more than a dot product of features with weights and adding bias terms. Logistic regression just chains the logistic function on top of that. Anything deeper than that is what we might call a neural network.
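To make the chaining concrete, here is a small sketch in plain numpy. The function names and the choice of `tanh` for the hidden layer are assumptions for illustration, not something prescribed by the text above.

```python
import numpy as np


def linear(x, w, b):
    """Dot product of features with weights, plus bias terms."""
    return np.dot(x, w) + b


def logistic(z):
    """Elementwise logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))


def linear_regression(x, w, b):
    return linear(x, w, b)


def logistic_regression(x, w, b):
    # Same linear step, with the logistic function chained on top.
    return logistic(linear(x, w, b))


def tiny_network(x, w1, b1, w2, b2):
    # One more link in the chain: a hidden layer before the output.
    hidden = np.tanh(linear(x, w1, b1))
    return logistic(linear(hidden, w2, b2))


rng = np.random.RandomState(42)
x = rng.normal(size=(5, 3))           # 5 samples, 3 features
w, b = rng.normal(size=(3, 1)), 0.1
w1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 1)), 0.0

y_lin = linear_regression(x, w, b)
y_log = logistic_regression(x, w, b)
y_net = tiny_network(x, w1, b1, w2, b2)
```

Every function in the chain is differentiable, which is what lets a tool like autograd compute gradients through the whole thing.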
One interesting thing that I’ve begun to ponder is the shape of the loss function, and how it changes when I change model architecture, activation functions, and more. I can’t speak intelligently about it right now, but from observing the training performance live (I update a plot of predictions vs. actual values every x training epochs), different combinations of activation functions seem to cause different behaviours of the outputs, and there’s no first-principles reason why that I can think of. All told, pretty interesting :).
Defining a good API is hard work
A lot of design choices go into an API. First off, I wanted to build something familiar, so I chose to emulate the functional API of Keras, PyTorch, and Chainer. I also wanted composability, so that I can define modules of layers and chain them together. I opted to use Python objects and to take advantage of their __call__ method to achieve both goals. At the same time, autograd imposes a constraint: the functions I differentiate need to be differentiable with respect to their first argument, which holds the parameters. Thus, I had to make sure the weights and biases are made transparently available for autograd to differentiate. As a positive side effect, it means I can inspect the parameters dictionary directly.
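One way such a design could look is sketched below. This is a hypothetical reconstruction, not the actual package: the class names `Dense` and `Chain`, the `init_params` method, and the nested parameters dictionary are all assumptions. It uses plain numpy so it runs on its own; swapping the import for `autograd.numpy` would make `loss` differentiable with `autograd.grad`, which differentiates with respect to the first argument by default.

```python
import numpy as np  # swap in `import autograd.numpy as np` to use autograd


def identity(z):
    return z


class Dense:
    """Hypothetical layer. It stores its name and shape, never its weights."""

    def __init__(self, name, n_in, n_out, activation=np.tanh):
        self.name = name
        self.shape = (n_in, n_out)
        self.activation = activation

    def init_params(self, params, rng):
        # Register this layer's weights and biases in the shared dict.
        n_in, n_out = self.shape
        params[self.name] = {
            "w": rng.normal(scale=0.1, size=(n_in, n_out)),
            "b": np.zeros(n_out),
        }
        return params

    def __call__(self, params, x):
        p = params[self.name]
        return self.activation(np.dot(x, p["w"]) + p["b"])


class Chain:
    """Hypothetical module: calls its members in order. Chains can nest."""

    def __init__(self, *layers):
        self.layers = layers

    def init_params(self, params, rng):
        for layer in self.layers:
            layer.init_params(params, rng)
        return params

    def __call__(self, params, x):
        for layer in self.layers:
            x = layer(params, x)
        return x


model = Chain(Dense("hidden", 3, 4), Dense("out", 4, 1, activation=identity))
rng = np.random.RandomState(0)
params = model.init_params({}, rng)


def loss(params, x, y):
    # Parameters come first, matching what autograd differentiates by default.
    return np.mean((model(params, x) - y) ** 2)


x = rng.normal(size=(10, 3))
y = rng.normal(size=(10, 1))
predictions = model(params, x)
current_loss = loss(params, x, y)
```

Because the layers hold no state of their own, everything the model has learned lives in `params`, so inspecting a layer is just `params["hidden"]["w"]`.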
Optimizing for speed is a very hard thing
Even though I’m doing my best already with matrix math (and hopefully getting better at mastering 3-dimensional and higher matrix algebra), in order to keep my API clean and compatible with autograd (meaning no sparse arrays), I have opted to use lists of
Graph convolutions have a connection to network propagation
I will probably explore this a bit more deeply in another blog post, but yes, as I explore the math involved in doing graph convolutions, I’m noticing that there’s a deep connection there. The short story is basically “convolutions propagate information across nodes” in almost exactly the same way as “network propagation methods share information across nodes”, through the use of a kernel defined by the adjacency matrix of a graph.
Ok, that’s a lot of jargon, but I promise I will explore this topic at a later time.
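In the meantime, here is a rough sketch of the shared idea in plain numpy. The toy path graph, the row-normalised kernel with self-loops, and the function names `propagate` and `graph_conv` are assumptions chosen for illustration; there are other ways to build the kernel from the adjacency matrix.

```python
import numpy as np

# Toy path graph: 0 - 1 - 2 - 3
A = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

# Kernel defined by the adjacency matrix: add self-loops, then normalise
# each row so every node averages over itself and its neighbours.
A_hat = A + np.eye(4)
kernel = A_hat / A_hat.sum(axis=1, keepdims=True)


def propagate(kernel, signal, steps=1):
    """Network propagation: repeatedly share a signal across edges."""
    for _ in range(steps):
        signal = kernel @ signal
    return signal


def graph_conv(kernel, features, weights):
    """One graph convolution: propagate features, then mix them with weights."""
    return np.tanh(kernel @ features @ weights)


# Start with all of the signal on node 0.
signal = np.array([1.0, 0.0, 0.0, 0.0])
one_step = propagate(kernel, signal, steps=1)
three_steps = propagate(kernel, signal, steps=3)

# The same kernel drives a convolution over per-node feature vectors.
rng = np.random.RandomState(0)
features = np.eye(4)                  # one-hot feature per node
weights = rng.normal(size=(4, 2))
conv_out = graph_conv(kernel, features, weights)
```

After one step the signal has only reached node 0's immediate neighbourhood; it takes three steps to reach node 3. Stacking graph convolution layers widens the neighbourhood in the same way, with learned weights and a nonlinearity added between steps.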
I’m an avid open source fan. Lots of my work builds on it. However, because this “neural networks on graphs” work is developed on company time and for company use, this will very likely be the first software project that I send to Legal to evaluate whether I can open source/publish it or not. I’ll naturally have to make my strongest case for open sourcing the code base (e.g. ensuring no competitive intelligence is leaked), but will eventually still have to defer to them for a final decision.