# Bayesian Surprise

Posted by Thomas Lumley, December 28, 2017

For reasons not entirely unconnected with NZ election polling, I’ve been thinking about surprise in Bayesian inference again: what happens when you get a result that’s a long way from what you expected in advance? Yes, your prior is badly calibrated and you should feel bad, but what should you **believe**?

A toy version of the problem is inference for a location parameter. We have a prior $p_\theta(\theta)$ for the parameter, and a model $p_X(x|\theta)$. Consider two extremes:

- $\theta\sim N(0,1)$ and $X\sim\textrm{Cauchy}(\theta)$
- $\theta\sim\textrm{Cauchy}(0)$ and $X\sim N(\theta, 1)$

Suppose we take a single observation $x$ of $X$ and it’s very large. What do we end up believing about $\theta$ in each case?

Heuristically, the first case says the data can sometimes be a long way from $\theta$, but $\theta$ has to be not that far from 0. The second case says $\theta$ can sometimes be a long way from 0 but $X$ can’t be that far from $\theta$. So in the first case the posterior for $\theta$ should be concentrated fairly near zero and in the second it should be concentrated fairly near $x$. That’s exactly what happens when you do the maths.

Under the first model, the posterior density is proportional to

$$e^{-\theta^2/2}\frac{1}{1+(x-\theta)^2}$$ and the posterior mode solves

$$\tilde\theta =\frac{2(x-\tilde\theta)}{1+(x-\tilde\theta)^2}.$$

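As $x\to\infty$ we can’t have $x-\tilde\theta$ bounded, because the right-hand side of the mode equation is bounded and so $\tilde\theta$ is too. That in turn means $\tilde\theta=O((x-\tilde\theta)^{-1})$, giving $\tilde\theta\to 0$.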

Under the second model, the posterior density is proportional to $$e^{-(x-\theta)^2/2}\frac{1}{1+\theta^2}$$ and the posterior mode solves

$$x-\tilde\theta=\frac{2\tilde\theta}{1+\tilde\theta^2}.$$

If $x\to\infty$, the solution to this equation must have $x-\tilde\theta$ bounded, which implies $\tilde\theta\to\infty$, which implies $x-\tilde\theta\to 0$.
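As a quick numerical check on the two limits, here is a minimal sketch that finds each posterior mode by brute-force grid search over the unnormalised log posterior. The function names, the observation $x=20$, and the grid range are my own choices for illustration, not anything from the original derivation.

```python
import math

def log_post_normal_prior_cauchy_data(theta, x):
    # N(0,1) prior on theta, Cauchy(theta) model for x; log density up to a constant
    return -0.5 * theta**2 - math.log1p((x - theta) ** 2)

def log_post_cauchy_prior_normal_data(theta, x):
    # Cauchy(0) prior on theta, N(theta,1) model for x; log density up to a constant
    return -0.5 * (x - theta) ** 2 - math.log1p(theta**2)

def grid_mode(log_post, x, lo, hi, n=200_001):
    # Evaluate the log posterior on an evenly spaced grid and return the argmax
    step = (hi - lo) / (n - 1)
    grid = (lo + i * step for i in range(n))
    return max(grid, key=lambda t: log_post(t, x))

x = 20.0  # an assumed "surprisingly large" observation
mode_1 = grid_mode(log_post_normal_prior_cauchy_data, x, -5.0, 25.0)
mode_2 = grid_mode(log_post_cauchy_prior_normal_data, x, -5.0, 25.0)

print(f"Normal prior, Cauchy data: mode = {mode_1:.3f}")
print(f"Cauchy prior, Normal data: mode = {mode_2:.3f}")
```

With $x=20$ the first mode should land close to 0 and the second close to $x$, and increasing `x` should push each one further towards its limit. A grid search is crude, but it avoids any assumptions about which root a numerical solver would pick.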

If the two distributions are both Normal the posterior mode will be about halfway between $x$ and 0. If they’re both Cauchy, the posterior will be bimodal, with one mode near $x$ and another near 0.
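The same kind of grid check works for these two matched cases. The sketch below assumes unit scale for both distributions and an observation of $x=20$ (both my choices), and looks for every local maximum of the log posterior on the grid rather than just the global one, so that bimodality shows up.

```python
import math

def log_post_normal_normal(theta, x):
    # N(0,1) prior and N(theta,1) model; log density up to a constant
    return -0.5 * theta**2 - 0.5 * (x - theta) ** 2

def log_post_cauchy_cauchy(theta, x):
    # Cauchy(0) prior and Cauchy(theta) model; log density up to a constant
    return -math.log1p(theta**2) - math.log1p((x - theta) ** 2)

def local_modes(log_post, x, lo, hi, n=60_001):
    # Return grid points that are higher than their left neighbour
    # and at least as high as their right neighbour
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    vals = [log_post(t, x) for t in grid]
    return [
        grid[i]
        for i in range(1, n - 1)
        if vals[i] > vals[i - 1] and vals[i] >= vals[i + 1]
    ]

x = 20.0  # an assumed "surprisingly large" observation
normal_modes = local_modes(log_post_normal_normal, x, -5.0, 25.0)
cauchy_modes = local_modes(log_post_cauchy_cauchy, x, -5.0, 25.0)

print("Normal/Normal modes:", [round(m, 3) for m in normal_modes])
print("Cauchy/Cauchy modes:", [round(m, 3) for m in cauchy_modes])
```

The Cauchy case can also be checked by hand: setting the derivative of the log posterior to zero gives either $\theta = x/2$ (a local minimum when $x$ is large) or $\theta(x-\theta)=1$, whose two roots sit just inside 0 and $x$.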

The basic observation here goes back a long way, with a relatively recent summary by O’Hagan in JASA, 1990: given a surprising observation, Bayesian inference can (sensibly) end up just ‘rejecting’ whichever of the prior and model has heavier tails.

Working it out for simple cases makes a nice straightforward stats theory question. It’s also a good low-dimensional example of a difficulty common in high-dimensional problems: it’s quite hard to be sure which features of your model and prior are going to matter for inference.
