# Random-Walk Bayesian Deep Networks: Dealing with Non-Stationary Data

Posted by Thomas Wiecki, PhD, April 27, 2017

*Thomas originally posted this article at http://twiecki.github.io*

Most problems solved by Deep Learning are stationary. A cat is always a cat. The rules of Go have remained stable for 2,500 years, and will likely stay that way. However, what if the world around you *is* changing? This is common, for example, when applying Machine Learning in Quantitative Finance. Markets are constantly evolving, so features that are predictive in one time period might lose their edge while other patterns emerge. Usually, quants just retrain their classifiers every once in a while. This approach of re-estimating the same model on more recent data is very common. I find that to be a pretty unsatisfying way of modeling, as it has certain shortfalls:

- The estimation window should be long so as to incorporate as much training data as possible.
- The estimation window should be short so as to incorporate only the most recent data, as old data might be obsolete.
- When you have no estimate of how fast the world around you is changing, there is no principled way of setting the window length to balance these two objectives.

Certainly there is something to be learned even from past data; we just need to instill our models with a sense of time and recency.

Enter random-walk processes. Ever since I learned about them in the stochastic volatility model they have become one of my favorite modeling tricks. Basically, they allow you to turn every static model into a time-sensitive one.

You can read more about the details of random-walk priors here, but the central idea is that, in any time-series model, rather than assuming a parameter to be constant over time, we allow it to change gradually, following a random walk. For example, take a logistic regression:

\(Y_i = f(\beta X_i)\)

Here \(f\) is the logistic function and \(\beta\) is our learnable parameter. Now assume that our data is not iid and that \(\beta\) is changing over time. We then need a different \(\beta\) for every \(i\):

\(Y_i = f(\beta_i X_i)\)

Of course, this will just overfit, so we need to constrain our \(\beta_i\) somehow. We will assume that while \(\beta_i\) is changing over time, it does so rather gradually, by placing a random-walk prior on it (writing \(t\) for the time index):

\(\beta_t \sim \mathcal{N}(\beta_{t-1}, s^2)\)

So \(\beta_t\) is allowed to deviate only a little bit (determined by the step-width \(s\)) from its previous value \(\beta_{t-1}\). \(s\) can be thought of as a stability parameter: how fast is the world around you changing?
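To get a feel for the step-width, here is a minimal NumPy sketch (separate from the model built later) that simulates draws from this random-walk prior for two illustrative values of \(s\). The function name `simulate_random_walk`, the seed, the starting value of zero, and the chosen step-widths are assumptions made for this illustration.

```python
import numpy as np

def simulate_random_walk(n_steps, s, n_paths=2000, seed=0):
    """Draw paths from beta_t ~ N(beta_{t-1}, s^2), starting from beta_0 = 0."""
    rng = np.random.default_rng(seed)
    # Each increment is an independent N(0, s^2) draw; the walk is their running sum.
    increments = rng.normal(loc=0.0, scale=s, size=(n_paths, n_steps))
    return np.cumsum(increments, axis=1)

slow = simulate_random_walk(n_steps=100, s=0.1)  # a fairly stable world
fast = simulate_random_walk(n_steps=100, s=1.0)  # a rapidly changing world

# After t steps the spread across paths is roughly s * sqrt(t).
print("s=0.1, std after 100 steps:", slow[:, -1].std())
print("s=1.0, std after 100 steps:", fast[:, -1].std())
```

A small step-width keeps consecutive values close together, which is what makes a separate parameter per time step estimable at all.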

Let’s first generate some toy data and then implement this model in `PyMC3`. We will then use this same trick in a Neural Network with hidden layers.

If you would like a more complete introduction to Bayesian Deep Learning, see my recent ODSC London talk. This blog post takes things one step further so definitely read further below.

```
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
import warnings
from scipy import VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=VisibleDeprecationWarning)
sns.set_context('notebook')
```

### Generating data

First, let’s generate some toy data: a simple binary classification problem that’s linearly separable. To introduce the non-stationarity, we will rotate this data around the center across time. You can safely skip over the next few code cells.

```
X, Y = sklearn.datasets.make_blobs(n_samples=1000, centers=2, random_state=1)
X = scale(X)
# Map class labels to plot colors
colors = Y.astype(str)
colors[Y == 0] = 'r'
colors[Y == 1] = 'b'

interval = 20  # number of time chunks
subsample = X.shape[0] // interval  # data points per chunk
chunk = np.arange(0, X.shape[0]+1, subsample)
degs = np.linspace(0, 360, len(chunk))

sep_lines = []

# Rotate each consecutive chunk a bit further around the origin
for ii, (i, j, deg) in enumerate(list(zip(np.roll(chunk, 1), chunk, degs))[1:]):
    theta = np.radians(deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.matrix([[c, -s], [s, c]])
    X[i:j, :] = X[i:j, :].dot(R)
```
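As a quick sanity check on the rotation step, the sketch below applies the same kind of matrix to a single point. Because the data rows are multiplied from the left (`X.dot(R)`), a positive angle rotates the points clockwise: the point (1, 0) ends up at (0, -1) for 90 degrees. The point and the angle are chosen here purely for illustration.

```python
import numpy as np

theta = np.radians(90)
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])  # same layout as R in the data-generation loop

point = np.array([[1.0, 0.0]])   # one row vector on the positive x-axis
rotated = point.dot(R)           # row-vector convention, as in X[i:j, :].dot(R)

print(np.round(rotated, 3))
```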

```
import base64
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""

def anim_to_html(anim):
    # Encode the animation once and cache the result on the object
    if not hasattr(anim, '_encoded_video'):
        anim.save("test.mp4", fps=20, extra_args=['-vcodec', 'libx264'])
        with open("test.mp4", "rb") as f:
            video = f.read()
        anim._encoded_video = base64.b64encode(video).decode('utf-8')
    return VIDEO_TAG.format(anim._encoded_video)

from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))

from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig, ax = plt.subplots()
ims = []
# Each frame shows all data points observed up to index i
for i in np.arange(0, len(X), 10):
    ims.append([(ax.scatter(X[:i, 0], X[:i, 1], color=colors[:i]))])

ax.set(xlabel='X1', ylabel='X2')
# call the animator. blit=True means only re-draw the parts that have changed.
anim = animation.ArtistAnimation(fig, ims,
                                 interval=500,
                                 blit=True);

display_animation(anim)
```

The last frame of the video, where all the data is plotted, is what a classifier with no sense of time would see. Thus, the problem we set up is impossible to solve when ignoring time, but trivial once you take it into account.

How would we classically solve this? You could just train a different classifier on each subset. But as I wrote above, you need to get the frequency right and you use less data overall.
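For reference, here is a rough sketch of that classical baseline: one time-agnostic classifier versus one classifier per time chunk. To stay self-contained it does not reuse the data from this post; it builds a comparable rotating two-blob data set with NumPy and fits an intercept-free logistic regression by plain gradient descent. The helper names (`make_rotating_blobs`, `fit_logreg`, `accuracy`), the blob positions, the learning rate, and the step count are all assumptions made for illustration.

```python
import numpy as np

def make_rotating_blobs(n_samples=1000, n_chunks=20, seed=0):
    """Two Gaussian blobs, rotated a bit further in every time chunk."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    # Class 1 is centered at (1, 1), class 0 at (-1, -1), before rotation.
    centers = np.where(y[:, None] == 1, 1.0, -1.0) * np.array([1.0, 1.0])
    X = centers + rng.normal(scale=0.3, size=(n_samples, 2))
    chunk_size = n_samples // n_chunks
    degs = np.linspace(0, 360, n_chunks + 1)[1:]
    for k, deg in enumerate(degs):
        theta = np.radians(deg)
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        rows = slice(k * chunk_size, (k + 1) * chunk_size)
        X[rows] = X[rows] @ R
    return X, y, chunk_size

def fit_logreg(X, y, lr=0.5, n_steps=500):
    """Intercept-free logistic regression via gradient descent on the mean log loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(X, y, w):
    """Fraction of points on the correct side of the boundary X @ w = 0."""
    return np.mean((X @ w > 0) == (y == 1))

X_demo, y_demo, chunk_size = make_rotating_blobs()

# One classifier for all data: it has no sense of time.
acc_global = accuracy(X_demo, y_demo, fit_logreg(X_demo, y_demo))

# One classifier per chunk: each one only sees its own slice of the data.
n_correct = 0
for start in range(0, len(X_demo), chunk_size):
    rows = slice(start, start + chunk_size)
    w_chunk = fit_logreg(X_demo[rows], y_demo[rows])
    n_correct += np.sum((X_demo[rows] @ w_chunk > 0) == (y_demo[rows] == 1))
acc_chunked = n_correct / len(X_demo)

print(f"in-sample accuracy, global: {acc_global:.2f}, per chunk: {acc_chunked:.2f}")
```

In-sample, the per-chunk models should do far better than the global one, but each is fit on only a fraction of the data and the chunk length has to be picked by hand.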

## Random-Walk Logistic Regression in PyMC3

```
from pymc3 import HalfNormal, GaussianRandomWalk, Bernoulli
from pymc3.math import sigmoid
import theano.tensor as tt

X_shared = theano.shared(X)
Y_shared = theano.shared(Y)

n_dim = X.shape[1] # 2

with pm.Model() as random_walk_perceptron:
    step_size = pm.HalfNormal('step_size', sd=np.ones(n_dim),
                              shape=n_dim)

    # This is the central trick, PyMC3 already comes with this distribution
    w = pm.GaussianRandomWalk('w', sd=step_size,
                              shape=(interval, 2))

    weights = tt.repeat(w, X_shared.shape[0] // interval, axis=0)

    class_prob = sigmoid(tt.batched_dot(X_shared, weights))

    # Binary classification -> Bernoulli likelihood
    pm.Bernoulli('out', class_prob, observed=Y_shared)
```

OK, if you understand the stochastic volatility model, the first two lines of the model should look fairly familiar. We are creating 2 random-walk processes. As allowing the weights to change on every new data point is overkill, we subsample. The `repeat` turns the vector `[t, t+1, t+2]` into `[t, t, t, t+1, t+1, ...]` so that it matches the number of data points.
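The same expansion can be sketched with NumPy's `np.repeat`, assuming `tt.repeat` behaves like its NumPy namesake. The tiny array `w_demo` and the repeat count of 3 are made up for illustration.

```python
import numpy as np

# Hypothetical weights: 3 time steps, 2 input dimensions.
w_demo = np.array([[0.0, 1.0],
                   [0.5, 0.5],
                   [1.0, 0.0]])

# Repeat every row 3 times along the time axis,
# so that there is one weight row per data point.
weights_demo = np.repeat(w_demo, 3, axis=0)

print(weights_demo.shape)
print(weights_demo[:4])
```

The first three rows of `weights_demo` are copies of the first time step, the next three are copies of the second, and so on.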

Next, we would usually just apply a single dot-product, but here we have many weights we’re applying to the input data, so we need to call `dot` in a loop. That is what `tt.batched_dot` does. In the end, we just get probabilities (predictions) for our Bernoulli likelihood.
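In NumPy terms, this row-wise product can be written as an explicit loop or as a single `einsum` call. This is only a sketch of the arithmetic for 2-D inputs, not of Theano's implementation; the arrays `X_demo` and `weights_demo` are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 2))        # 6 data points, 2 features
weights_demo = rng.normal(size=(6, 2))  # one weight vector per data point

# Explicit loop: one dot product per (data point, weight vector) pair.
looped = np.array([np.dot(x, w) for x, w in zip(X_demo, weights_demo)])

# Vectorized equivalent of the loop.
vectorized = np.einsum('ij,ij->i', X_demo, weights_demo)

# Squash to probabilities with the logistic function, as in the model.
class_prob_demo = 1.0 / (1.0 + np.exp(-vectorized))

print(np.allclose(looped, vectorized))
```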

On to the inference. In `PyMC3` we recently improved NUTS in many different places. One of those is automatic initialization. If you just call `pm.sample(n_iter)`, we will first run ADVI to estimate the diagonal mass matrix and find a starting point. This usually makes NUTS run quite robustly.

```
with random_walk_perceptron:
    trace_perceptron = pm.sample(2000)
```

Let’s look at the learned weights over time:

```
plt.plot(trace_perceptron['w'][:, :, 0].T, alpha=.05, color='r');
plt.plot(trace_perceptron['w'][:, :, 1].T, alpha=.05, color='b');
plt.xlabel('time'); plt.ylabel('weights'); plt.title('Optimal weights change over time'); sns.despine();
```