# Understanding the Hoeffding Inequality

Posted by Spencer Norris, ODSC, September 20, 2018

If you read my last post on mathematically defining machine learning problems, then you’ll be familiar with the terminology here. Otherwise, I recommend you read that and then circle back here. We’ll work our way up to understanding the Hoeffding Bound over a few posts. However, it’s important to understand the little theorem that makes it possible in the first place: the Hoeffding Inequality.

The Hoeffding Bound is one of the most important results in machine learning theory, so you’d do well to understand it. It’s what allows us to make probabilistic guesses about observations outside of our training data, and thus it’s why machine learning works at all.

## Your Data Isn’t Helpful!

Say you have a friend that carries a lucky coin. He insists on using it every time you need to break a tie, and you groan whenever he drags it out because you really *believe* it’s lucky.

Maybe you suspect something is up. You begin recording his tosses and whether each one comes up heads or tails. After a certain number of ‘unfortunate’ outcomes for you and your other friends, you’re feeling pretty confident that the coin is biased. You develop a simple rule of thumb: you think the coin comes up heads with probability 𝜇 and tails with probability (1 – 𝜇).

But here’s the problem: you obviously don’t *know* that the coin is biased. You’ve never taken it into a lab and tested to see whether it’s weighted one way or the other, and your friend probably isn’t keen on parting with it. Yeah, maybe you guessed a few cases right – but why would you think that you’ll keep being right over and over again? Just because you observed 99 heads doesn’t mean the next one *can’t* come up as tails. These observations on their own don’t tell you anything about future outcomes.

We’ve seen this before: our coin is some target function f that falls somewhere in the infinite hypothesis space H. We want to find some g ≈ f that we can use to make guesses in the future.

So how can we make those observations useful?

## Making Data Helpful

For a human, the answer is very straightforward: if I have all these observations, I’m just going to guess whichever one is more common, either heads or tails. That way, you’re just maximizing your expected value, right? But that isn’t immediately obvious in the math, so we have to formalize it a little more.

Say you pick out N random observations, where the coin came up either heads or tails. How do we use these observations to make any sense of what will happen on the N+1 toss?

(An important subtlety: the observations *must* be random. Randomness allows us to believe that our observations are representative of the larger sample. It’s like a parent showing up at their college-aged kid’s apartment: what are they going to find if they tell them they’re coming, versus if they waltz in unannounced? Chances are the latter is closer to reality.)

Look at the fraction of tosses that came up heads. We’ll call this fraction v. The flip side of that is the fraction of tosses that came up tails, which is 1 – v.
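As a quick sketch, computing v is just counting (the record of tosses here is hypothetical, purely for illustration):

```python
# A hypothetical record of observed tosses: 'H' for heads, 'T' for tails.
tosses = ['H', 'T', 'H', 'H', 'T', 'H', 'H', 'H']

n = len(tosses)
v = tosses.count('H') / n   # fraction of heads observed
print(v)        # 0.75
print(1 - v)    # 0.25 -- the fraction of tails is 1 - v
```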

The idea is that we want to somehow peg our percentage of observed heads v to the true probability of observing a heads, 𝜇. Incidentally, we have a very handy tool for this: the Hoeffding Inequality.

The Hoeffding Inequality is as follows:

ℙ[ |v − 𝜇| > 𝜀 ] ≤ 2e^(−2𝜀²N)

What the Hoeffding Inequality gives us is a probabilistic guarantee that v doesn’t stray too far from 𝜇. 𝜀 is some small value which we use to measure the deviation of v from 𝜇. We claim that the probability of v being more than 𝜀 away from 𝜇 is less than or equal to some bound which shrinks exponentially as 𝜀 and/or our sample size N increases. In other words, the larger your sample size and the wider your margin of error, the less likely you are to step over that margin of error with your best guess.
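The right-hand side of the inequality is easy to compute directly. A minimal sketch (the function name is mine, not from the post):

```python
import math

def hoeffding_bound(eps, n):
    """Upper bound on P(|v - mu| > eps) after n independent tosses."""
    return 2 * math.exp(-2 * eps**2 * n)

# The bound shrinks exponentially as eps or n grows:
print(hoeffding_bound(0.05, 100))    # ~1.21 (vacuous: probabilities never exceed 1)
print(hoeffding_bound(0.05, 1000))   # ~0.0135
print(hoeffding_bound(0.10, 1000))   # ~4.1e-09
```

Note that for small samples the bound can exceed 1, in which case it tells us nothing; the guarantee only becomes interesting once N is large enough relative to 𝜀.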

## Calling Out Your Friend

So, you’ve watched your friend flip his ‘lucky’ coin N times, and you see that it comes up heads a fraction v of the time, and you want to be right within a margin of 𝜀. Say N, v and 𝜀 are 1,000, .75 and .05, respectively. You still don’t know what 𝜇 is, but after you watch him crush your mutual friends and steal their cash multiple times, you figure you should confront him.

You tell him you think he’s cheating, and he says there’s no way. So you tell him, “I think I’m right about this. I’ve watched you flip your coin 1,000 times and it comes up as heads 75 percent of the time. The chance of my guess being off the mark by more than 5 percent is less than or equal to 2e^(−2(.05)²(1000)), or about 1.3 percent. In other words, I am 98.7 percent confident that your coin is biased.” You further add that he’s kind of a jerk about insisting that he gets to call heads, which about seals the deal.
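We can sanity-check this claim with a hypothetical simulation: repeatedly toss a coin whose true bias 𝜇 we choose ourselves, and count how often the sample mean v lands more than 𝜀 away from 𝜇. The observed frequency should stay at or below the Hoeffding bound (the setup and numbers below are my own illustration):

```python
import math
import random

def deviation_rate(mu, n, eps, trials=2000, seed=0):
    """Fraction of simulated experiments where the sample mean v
    strays more than eps from the true heads probability mu."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        heads = sum(rng.random() < mu for _ in range(n))
        v = heads / n
        if abs(v - mu) > eps:
            bad += 1
    return bad / trials

mu, n, eps = 0.75, 1000, 0.05
rate = deviation_rate(mu, n, eps)
bound = 2 * math.exp(-2 * eps**2 * n)
print(rate <= bound)  # True: the observed rate sits below the bound
```

In practice the bound is quite loose here: with 1,000 tosses, deviations beyond .05 are far rarer than 1.3 percent. That looseness is the price the inequality pays for holding no matter what 𝜇 is.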

He sheepishly relinquishes his coin.

We’ve just demonstrated the power of the Hoeffding Bound in tying an unobserved function to our expectations. Next time, we’ll lay the foundation for why this allows us to say that one hypothesis h is *probably* better than another one.