

Bayesian Surprise
Posted by Thomas Lumley, December 28, 2017

For reasons not entirely unconnected with NZ election polling, I’ve been thinking about surprise in Bayesian inference again: what happens when you get a result that’s a long way from what you expected in advance? Yes, your prior is badly calibrated and you should feel bad, but what should you believe?
A toy version of the problem is inference for a location parameter. We have a prior $p_\theta(\theta)$ for the parameter, and a model $p_X(x|\theta)$. Consider two extremes:
- $\theta\sim N(0,1)$ and $X\sim\textrm{Cauchy}(\theta)$
- $\theta\sim\textrm{Cauchy}(0)$ and $X\sim N(\theta, 1)$
Suppose we take a single observation $x$ of $X$ and it’s very large. What do we end up believing about $\theta$ in each case?
Heuristically, the first case says the data can sometimes be a long way from $\theta$, but $\theta$ has to be not that far from 0. The second case says $\theta$ can sometimes be a long way from 0, but $X$ can’t be that far from $\theta$. So in the first case the posterior for $\theta$ should be concentrated fairly near zero, and in the second it should be concentrated fairly near $x$. That’s exactly what happens when you do the maths.
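Under the first model, the posterior density is proportional to
$$\frac{e^{-\theta^2/2}}{1+(x-\theta)^2}$$
and the posterior mode $\tilde\theta$ solves
$$\tilde\theta = \frac{2(x-\tilde\theta)}{1+(x-\tilde\theta)^2}.$$
For $x\to\infty$ we can’t have $x-\tilde\theta$ bounded (that would need $\tilde\theta\to\infty$, while the right-hand side never exceeds 1), which in turn means $\tilde\theta=O((x-\tilde\theta)^{-1})$, giving $\tilde\theta\to 0$.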
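A quick numerical check of this limit, as a minimal sketch rather than anything from the original analysis: it takes the first model's unnormalised log posterior to be $-\theta^2/2-\log(1+(x-\theta)^2)$ and finds its maximum by a repeated grid search. The function names, the search window and the grid sizes are all illustrative assumptions. If the argument above is right, the mode should shrink roughly like $2/x$.

```python
import math

def log_post_model1(theta, x):
    """Unnormalised log posterior: N(0, 1) prior on theta, Cauchy(theta) model for x."""
    return -0.5 * theta**2 - math.log1p((x - theta) ** 2)

def posterior_mode_model1(x, n=20001, refinements=3):
    """Approximate the posterior mode by repeated grid search (an illustrative choice)."""
    lo, hi = -5.0, x + 5.0  # assumed search window; covers both 0 and x when x > 0
    best = lo
    for _ in range(refinements):
        step = (hi - lo) / (n - 1)
        best = max((lo + i * step for i in range(n)),
                   key=lambda t: log_post_model1(t, x))
        lo, hi = best - step, best + step  # zoom in around the current best point
    return best

for x in (10.0, 100.0, 1000.0):
    print(f"x = {x:6.1f}   mode = {posterior_mode_model1(x):.5f}   2/x = {2 / x:.5f}")
```

The grid search is deliberately crude; it avoids assuming anything about where the mode is, at the cost of a few thousand function evaluations.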
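Under the second model, the posterior is proportional to
$$\frac{e^{-(x-\theta)^2/2}}{1+\theta^2}$$
and the posterior mode $\tilde\theta$ solves
$$x-\tilde\theta = \frac{2\tilde\theta}{1+\tilde\theta^2}.$$
If $x\to\infty$, the solution to this equation must have $x-\tilde\theta$ bounded (the right-hand side is at most 1 in absolute value), which implies $\tilde\theta\to\infty$, which implies $x-\tilde\theta\to 0$.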
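The same kind of check for the second model can lean on the boundedness argument directly. This sketch assumes the log posterior is $-(x-\theta)^2/2-\log(1+\theta^2)$, so its derivative is $(x-\theta)-2\theta/(1+\theta^2)$. Because $|2\theta/(1+\theta^2)|\le 1$, any root has to lie in $[x-1, x]$ when $x>0$, which gives a bracket for plain bisection. The function names and the iteration count are hypothetical choices.

```python
def score_model2(theta, x):
    """Derivative of the log posterior: Cauchy(0) prior on theta, N(theta, 1) model for x."""
    return (x - theta) - 2.0 * theta / (1.0 + theta**2)

def posterior_mode_model2(x, iterations=60):
    """Bisection for the root of the score on [x - 1, x]; assumes x > 0."""
    lo, hi = x - 1.0, x  # score is >= 0 at x - 1 and < 0 at x
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score_model2(mid, x) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return 0.5 * (lo + hi)

for x in (10.0, 100.0, 1000.0):
    mode = posterior_mode_model2(x)
    print(f"x = {x:6.1f}   mode = {mode:.5f}   x - mode = {x - mode:.5f}")
```

The gap `x - mode` should shrink as $x$ grows, which is the "posterior follows the data" behaviour described above.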
If the two distributions are both Normal the posterior mode will be about halfway between $x$ and 0. If they’re both Cauchy, the posterior will be bimodal, with one mode near $x$ and another near 0.
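The two matched-tail cases can be checked the same way. The sketch below assumes unit scales throughout (an $N(0,1)$ prior with an $N(\theta,1)$ model, and a $\textrm{Cauchy}(0)$ prior with a $\textrm{Cauchy}(\theta)$ model), picks a hypothetical observation $x=10$, and lists every grid point where the log posterior beats both of its neighbours. The names `X_OBS` and `local_maxima` and the grid settings are illustrative assumptions.

```python
import math

X_OBS = 10.0  # hypothetical "surprising" observation

def log_post_normal_normal(theta):
    """N(0, 1) prior with an N(theta, 1) model, up to an additive constant."""
    return -0.5 * theta**2 - 0.5 * (X_OBS - theta) ** 2

def log_post_cauchy_cauchy(theta):
    """Cauchy(0) prior with a Cauchy(theta) model, up to an additive constant."""
    return -math.log1p(theta**2) - math.log1p((X_OBS - theta) ** 2)

def local_maxima(log_post, lo, hi, n=20001):
    """Interior grid points where log_post is strictly larger than at both neighbours."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    vals = [log_post(t) for t in grid]
    return [grid[i] for i in range(1, n - 1)
            if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]]

modes_nn = local_maxima(log_post_normal_normal, -5.0, X_OBS + 5.0)
modes_cc = local_maxima(log_post_cauchy_cauchy, -5.0, X_OBS + 5.0)
print("Normal/Normal modes:", [round(m, 3) for m in modes_nn])
print("Cauchy/Cauchy modes:", [round(m, 3) for m in modes_cc])
```

With equal unit variances the Normal case should report a single mode at $x/2$, and the Cauchy case should report two modes, one a little above 0 and one a little below $x$.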
The basic observation here goes back a long way, with a relatively recent summary by O’Hagan in JASA, 1990: given a surprising observation, Bayesian inference can (sensibly) end up just ‘rejecting’ whichever of the prior and model has heavier tails.
Working it out for simple cases makes a nice, straightforward stats theory question. It’s also a good low-dimensional example of a problem that’s common in high dimensions: it’s quite hard to be sure which features of your model and prior are going to matter for inference.