# Eight-bit Floating Point

ModelingStatisticsposted by John Cook April 25, 2018 John Cook

Researchers have discovered that for some problems, deep neural networks (DNNs) can get by with low precision weights. Using fewer bits to represent weights means that more weights can fit in memory at once. This, as well as embedded systems, has renewed interest in low-precision floating point.

Microsoft mentioned its proprietary floating point formats **ms-fp8** and **ms-fp9** in connection with its Brainwave Project [1]. I haven’t been able to find any details about these formats, other than that they use two- and three-bit exponents (respectively?).

This post will look at what an 8-bit floating point number would look like if it followed the pattern of IEEE floats or posit numbers. In the notation of the previous post, we’ll look at ieee<8,2> and posit<8,0> numbers. (**Update**: Added a brief discussion of ieee<8,3>, ieee<8,4>, and posit<8,1> at the end.)

## Eight-bit IEEE-like float

IEEE floating point reserves exponents of all 0’s and all 1’s for special purposes. That’s not as much of a high price with large exponents, but with only four possible exponents, it seems very wasteful to devote half of them for special purposes. Maybe this is where Microsoft does something clever. But for this post, we’ll forge ahead with the analogy to larger IEEE floating point numbers.

There would be 191 representable finite numbers, counting the two representations of 0 as one number. There would be two infinities, positive and negative, and 62 ways to represent NaN.

The smallest non-zero number would be

2^{-5} = 1/32 = 0.03125.

The largest value would be 01011111 and have value

4(1 – 2^{-5}) = 31/8 = 3.3875.

This makes the dynamic range just over two decades.

## Eight-bit posit

A posit<8, 0> has no significand, just a sign bit, regime, and exponent. But in this case the useed value is 2, and so the range acts like an exponent.

There are 255 representable finite numbers and one value corresponding to ±∞.

The smallest non-zero number would be 1/64 and the largest finite number would be 64. The dynamic range is 3.6 decades.

**Update**: Here is a list of all possible posit<8,0> numbers.

## Distribution of values

The graphs below give the distribution of 8-bit IEEE-like numbers and 8-bit posits on a log scale.

The distribution of IEEE-like numbers is asymmetric because much of the dynamic range comes from denormalized numbers.

The distributions of posits is approximately symmetrical. If a power of 2 is representable as a posit, so is its reciprocal. But you don’t have perfect symmetry because, for example, 3/2 is representable while 2/3 is not.

## Other eight-bit formats

I had originally considered a 2-bit significand because Microsoft’s ms-fp8 format has a two-bit significand. After this post was first published it was suggested in the comments that an ieee<8, 4> float might be better than ieee<8, 2>, so let’s look at that. Let’s look at ieee<8, 3> too while we’re at it. And a posit<8, 1> too.

An ieee<8, 3> floating point number would have a maximum value of 7 and a minimum value of 2^{-6} = 1/64, a dynamic range of 2.7 decades. It would have 223 finite values, including two zeros, as well as 2 infinities as 30 NaNs.

An ieee<8, 4> floating point number would have a maximum value of 120 and a minimum value of 2^{-9} = 1/512, a dynamic range of 4.7 decades. It would have 239 finite values, including two zeros, as well as 2 infinities and 14 NaNs.

A posit<8, 1> would have a maximum value of 2^{12} = 4096 and a minimum value of 1/4096, a dynamic range of 7.2 decades. Any 8-bit posit, regardless of the maximum number of exponent bits, will have 255 finite values and one infinity.

Near 1, an ieee<8, 4> has 3 significand bits, an ieee<8, 3> has 4, and a posit<8,1> has 4.

***

[1] Chung et al. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. Available here.

*Original Source*