# Faster deep learning with GPUs and Theano

Posted by Domino Data Lab, April 8, 2017

*Originally posted by Manojit Nandi, Data Scientist at STEALTHbits Technologies on the Domino data science blog*

Domino recently added support for GPU instances. To celebrate this release, I will show you how to:

- Configure the Python library Theano to use the GPU for computation.
- Build and train neural networks in Python.

Using the GPU, I’ll show that we can train deep belief networks up to 15x faster than using just the CPU, cutting training time down from hours to minutes. You can see my code, experiments, and results on Domino.

# Why are GPUs useful?

When you think of high-performance graphics cards, data science may not be the first thing that comes to mind. However, computer graphics and data science have one important thing in common: matrices!

Images, videos, and other graphics are represented as matrices, and when you perform an operation such as a camera rotation or a zoom-in effect, all you are doing is applying a mathematical transformation to a matrix.

What this means is that GPUs, compared to CPUs (central processing units), are more specialized for performing matrix operations and other advanced mathematical transformations. In some cases, we see a 10x speedup in an algorithm when it runs on the GPU.
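To make the matrix point concrete, here is a minimal NumPy sketch that rotates a few 2D points with a single matrix multiplication. The points and the angle are made up for illustration; the idea is only that a graphics-style transformation reduces to the kind of matrix product a GPU handles well.

```python
import numpy as np


def rotation_matrix(theta):
    """Return the 2x2 matrix that rotates points counter-clockwise by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s, c]])


# Three illustrative 2D points, one per row.
points = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Rotating every point by 90 degrees is one matrix product
# (points are row vectors, so we multiply by the transpose).
rotated = points.dot(rotation_matrix(np.pi / 2).T)
```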

GPU-based computation has been employed in a wide variety of scientific applications, from genomics to epidemiology.

Recently, there has been a rise in GPU-accelerated algorithms in machine learning thanks to the rising popularity of deep learning algorithms. Deep Learning is a collection of algorithms for training neural network-based models for various problems in machine learning. Deep Learning algorithms involve computationally intensive methods, such as convolutions, Fourier Transforms, and other matrix-based operations which GPUs are well-suited for computing. The computationally intensive functions, which make up about 5% of the code, are run on the GPU, and the remaining code is run on the CPU.

*Source: http://www.nvidia.com/docs/IO/143716/how-gpu-acceleration-works.png*

With the recent advances in GPU performance and support for GPUs in common libraries, I recommend anyone interested in deep learning get ahold of a GPU.

Now that I have thoroughly motivated the use of GPUs, let’s see how they can be used to train neural networks in Python.

# Deep Learning in Python

The most popular library in Python to implement neural networks is Theano. However, Theano is not strictly a neural network library, but rather a Python library that makes it possible to implement a wide variety of mathematical abstractions. Because of this, Theano has a steep learning curve, so I will be using two neural network libraries built on top of Theano that have a gentler learning curve.

The first library is Lasagne. This library provides a nice abstraction that allows you to construct each layer of the neural network and then stack the layers on top of each other to build the full model. While this is nicer than raw Theano, constructing each layer and then appending them on top of one another becomes tedious, so we’ll be using the Nolearn library, which provides a Scikit-Learn style API over Lasagne to easily construct neural networks with multiple layers.

Because these libraries are not installed by default on Domino’s hardware, you need to create a *requirements.txt* with the following text:

```
-r https://raw.githubusercontent.com/dnouri/nolearn/master/requirements.txt
git+https://github.com/dnouri/nolearn.git@master#egg=nolearn==0.7.git
```

### Setting up Theano

Now, before we can import Lasagne and Nolearn, we need to configure Theano, so that it can utilize the GPU hardware. To do this, we create a *.theanorc* file in our project directory with the following contents:

```ini
[global]
device = gpu
floatX = float32

[nvcc]
fastmath = True
```
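As a quick sanity check, the file can be parsed with Python's standard config parser before Theano ever reads it. This is only a sketch: the helper name `read_theanorc` and the temporary file are illustrative, and it checks the file's INI syntax, not whether Theano accepts the values.

```python
import os
import tempfile

try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2

THEANORC_TEXT = """[global]
device = gpu
floatX = float32

[nvcc]
fastmath = True
"""


def read_theanorc(path):
    """Parse an INI-style file into {section: {option: value}} (hypothetical helper)."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names such as floatX case-sensitive
    parser.read(path)
    return dict((s, dict(parser.items(s))) for s in parser.sections())


# Write the settings to a throwaway file and read them back.
tmpdir = tempfile.mkdtemp()
rc_path = os.path.join(tmpdir, ".theanorc")
with open(rc_path, "w") as f:
    f.write(THEANORC_TEXT)

settings = read_theanorc(rc_path)
print(settings["global"]["device"])
```

All values come back as strings, so `fastmath` is the text `"True"` rather than a boolean.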

The *.theanorc* file must be placed in the home directory. On your local machine this could be done manually, but we cannot access the home directory of Domino’s machine, so we will move the file to the home directory using the following code:

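```python
import os
import shutil

# Location of the config file in the home directory on Domino's machine.
destfile = "/home/ubuntu/.theanorc"
# Create the file if it does not exist yet.
open(destfile, 'a').close()
# Overwrite it with the .theanorc from the project directory.
shutil.copyfile(".theanorc", destfile)
```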

The above code creates an empty *.theanorc* file in the home directory and then copies the contents of the *.theanorc* file in our project directory into it.
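The hard-coded `/home/ubuntu` path is specific to Domino's machines. A slightly more portable sketch, assuming the home directory is whatever `os.path.expanduser("~")` returns, could look like the following. The function name `install_theanorc` and its `home` parameter are illustrative. Because `shutil.copyfile` creates the destination file itself, the sketch skips the separate step that creates an empty file.

```python
import os
import shutil
import tempfile


def install_theanorc(src=".theanorc", home=None):
    """Copy a project-level .theanorc into the home directory and return its path."""
    if home is None:
        home = os.path.expanduser("~")  # e.g. /home/ubuntu on Domino
    dest = os.path.join(home, ".theanorc")
    shutil.copyfile(src, dest)  # creates or overwrites dest
    return dest


# Demo with throwaway directories standing in for the project and home directories.
project_dir = tempfile.mkdtemp()
fake_home = tempfile.mkdtemp()
src_path = os.path.join(project_dir, ".theanorc")
with open(src_path, "w") as f:
    f.write("[global]\ndevice = gpu\nfloatX = float32\n")

dest_path = install_theanorc(src_path, home=fake_home)
```

In a real project you would call `install_theanorc()` with no arguments from the project directory, before importing Theano.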

After changing the hardware tier to GPU, we can test to see if Theano detects the GPU using the test code provided in Theano’s documentation.

```python
# Python 2 code (print statements, xrange).
import os
import shutil

# Copy the project's .theanorc into the home directory before importing Theano.
destfile = "/home/ubuntu/.theanorc"
open(destfile, 'a').close()
shutil.copyfile(".theanorc", destfile)

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()

# Time 1000 evaluations of exp(x).
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r

# A plain Elemwise op in the compiled graph means the computation ran on the CPU.
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
```

If Theano detects the GPU, the above code should take about 0.7 seconds to run and will print ‘Used the gpu’. Otherwise, it will take about 2.6 seconds and print ‘Used the cpu’. If you see ‘Used the cpu’, the most likely cause is that you forgot to change the hardware tier to GPU.

### The Dataset

For this project, we’ll be using the CIFAR-10 image dataset, which contains 60,000 32×32 color images from 10 different classes.

Fortunately, the data come in a pickled format, so we can load the data using helper functions to load each file into NumPy arrays to produce a training set (Xtr), training labels (Ytr), testing set (Xte), and testing labels (Yte). Credit for the following code goes to Stanford’s CS231n course staff.

```python
import cPickle as pickle
import numpy as np
import os


def load_CIFAR_file(filename):
    '''Load a single file of CIFAR'''
    with open(filename, 'rb') as f:
        datadict = pickle.load(f)
        X = datadict['data']
        Y = datadict['labels']
        # Each row holds one image: reshape to (N, channel, row, col),
        # then move the channel axis last to get (N, 32, 32, 3).
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype('float32')
        Y = np.array(Y).astype('int32')
        return X, Y


def load_CIFAR10(directory):
    '''Load all of CIFAR'''
    xs = []
    ys = []
    # The training set is split across data_batch_1 ... data_batch_5.
    for k in range(1, 6):
        f = os.path.join(directory, "data_batch_%d" % k)
        X, Y = load_CIFAR_file(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)
    Ytr = np.concatenate(ys)
    Xte, Yte = load_CIFAR_file(os.path.join(directory, 'test_batch'))
    return Xtr, Ytr, Xte, Yte
```
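The `reshape(...).transpose(0, 2, 3, 1)` step is easy to misread. Each CIFAR-10 row stores 3,072 values: the 1,024 red values first, then green, then blue. The sketch below uses synthetic data (not the real dataset) to show how the reshape and transpose turn such rows into height x width x channel images.

```python
import numpy as np

n_images = 2  # synthetic stand-in for the 10,000 images in a real batch

# Build fake rows laid out like CIFAR-10: red plane, then green, then blue.
red = np.full((n_images, 1024), 0, dtype=np.uint8)
green = np.full((n_images, 1024), 1, dtype=np.uint8)
blue = np.full((n_images, 1024), 2, dtype=np.uint8)
rows = np.concatenate([red, green, blue], axis=1)  # shape (2, 3072)

# Same transformation as in load_CIFAR_file, with n_images in place of 10000.
images = rows.reshape(n_images, 3, 32, 32).transpose(0, 2, 3, 1).astype('float32')

print(images.shape)     # (images, rows, columns, channels)
print(images[0, 5, 7])  # the (R, G, B) values of one pixel
```

After the transpose, indexing a single pixel returns its three channel values together, which is the layout the grayscale conversion later relies on.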

### Multi-Layered Perceptron

A multi-layered perceptron is one of the simplest neural network models. The model consists of an input layer for the data, a hidden layer to apply some mathematical transformation, and an output layer to produce a label (either categorical for classification or continuous for regression).

*Source: http://dms.irb.hr/tutorial/tut_nnets_short.php*
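As a rough illustration of what those three layers compute, here is a NumPy sketch of a single forward pass for classification. The layer sizes, the random weights, and the choice of `tanh` as the hidden nonlinearity are assumptions made for the example, not the configuration used later in this post.

```python
import numpy as np

rng = np.random.RandomState(0)

n_inputs, n_hidden, n_classes = 8, 4, 3  # illustrative sizes

# Randomly initialized parameters; training would adjust these.
W1 = rng.randn(n_inputs, n_hidden) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.randn(n_hidden, n_classes) * 0.1
b2 = np.zeros(n_classes)


def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(X):
    """Input layer -> hidden layer -> output layer for a batch X of shape (n, n_inputs)."""
    hidden = np.tanh(X.dot(W1) + b1)
    return softmax(hidden.dot(W2) + b2)


X = rng.rand(5, n_inputs) - 0.5   # five fake examples
probs = forward(X)                # shape (5, n_classes)
predicted = probs.argmax(axis=1)  # most probable class per example
```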

Before we can use the training data, we need to grayscale it and flatten it into a two-dimensional matrix. In addition, we will divide each value by 255 and subtract 0.5. When we grayscale the image, we convert each (R,G,B) tuple into a float value between 0 and 255. Dividing by 255 normalizes the grayscale value to the interval [0, 1], and subtracting 0.5 then maps it to the interval [-0.5, 0.5]. Each image is now represented by a 1024-dimensional array where each value is between -0.5 and 0.5. It’s common practice to standardize your input features to the interval [-1, 1] when training classification networks.

```python
# X_train and X_test are assumed to be the image arrays returned by
# load_CIFAR10 (named Xtr and Xte above).
# Weighted sum of the R, G, B channels gives grayscale; then flatten each image to one row.
X_train_flat = np.dot(X_train[..., :3], [0.299, 0.587, 0.114]).reshape(X_train.shape[0], -1).astype(np.float32)
# Scale to [0, 1], then shift to [-0.5, 0.5].
X_train_flat = (X_train_flat / 255.0) - 0.5
X_test_flat = np.dot(X_test[..., :3], [0.299, 0.587, 0.114]).reshape(X_test.shape[0], -1).astype(np.float32)
X_test_flat = (X_test_flat / 255.0) - 0.5
```
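To check the arithmetic, the sketch below applies the same steps to a tiny synthetic batch: one all-black and one all-white image. The helper name `to_gray_flat` is illustrative. The three weights sum to 1, so black maps to -0.5 and white to 0.5.

```python
import numpy as np


def to_gray_flat(X):
    """Convert (n, 32, 32, 3) RGB images to (n, 1024) grayscale rows in [-0.5, 0.5]."""
    gray = np.dot(X[..., :3], [0.299, 0.587, 0.114])        # weighted sum over the color axis
    flat = gray.reshape(X.shape[0], -1).astype(np.float32)  # one row per image
    return flat / 255.0 - 0.5


black = np.zeros((1, 32, 32, 3), dtype=np.float32)
white = np.full((1, 32, 32, 3), 255.0, dtype=np.float32)
batch = np.concatenate([black, white])

flat = to_gray_flat(batch)  # shape (2, 1024)
```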

Using Nolearn’s API, we can easily create a multi-layered perceptron with an input, hidden, and output layer. Setting *hidden_num_units = 100* gives the hidden layer 100 neurons, and *output_num_units = 10* gives the output layer 10 neurons, one for each label. Before outputting, the network applies a softmax function to determine the most probable label. The network is trained for 50 epochs, and with *verbose = 1* the model prints the result of each training epoch and how long the epoch took.

`01` |
`net1 ` `=` `NeuralNet(` |

`02` |
` ` `layers ` `=` `[` |

`03` |
` ` `(` `'input'` `, layers.InputLayer),` |

`04` |
` ` `(` `'hidden'` `, layers.DenseLayer),` |

`05` |
` ` `(` `'output'` `, layers.DenseLayer),` |