# How Quickly Can You Compute the Dot Product Between Two Large Vectors?

Data WranglingModelingposted by Daniel Lemire July 10, 2018 Daniel Lemire

A dot (or scalar) product is a fairly simple operation that simply sums the many products:

float sum = 0; for (size_t i = 0; i < len; i++) { sum += x1[i] * x2[i]; } return sum;

It is nevertheless tremendously important. You know these fancy machine learning algorithms we keep hearing about? If you dig deep under all the sophistication, you often find lots and lots of dot products.

Xavier Arteaga had a few interesting comments on a previous post of mine:

- “I would say that CPU frequency does not really make a difference when computing large amounts of data.”
- “I guess for intensive computations which require little memory and lots of operations AVX512 [new fancy instruction sets provided by Intel] provides a better performance.”

I disagree with Xavier, but I feel the need to provide some analysis. So let us evaluate Xavier’s observations with the dot product. Before I begin, let me point out that I am not a linear algebra expert, there has been lots and lots of fancy engineering done on these problems. If you need fast dot product software, find a good library, don’t grab it out of my blog.

I implemented a simple benchmark that computes dot products over increasingly larger vectors. I use three modes: “standard” math which is what you get when you simply compile a dot-product loop, a version with 128-bit vectors (SSE) and a version with 256-bit vectors (AVX2).

total input size | cycles per pair of floats | bytes per cycle | mode |
---|---|---|---|

8 kB | 4 | 2 | standard math |

8 kB | 1.0 | 7.8 | fast math (SSE) |

8 kB | 0.55 | 15 | fast math (AVX2) |

16 MB | 4 | 2 | standard math |

16 MB | 1.0 | 7 | fast math (SSE) |

16 MB | 0.9 | 9 | fast math (AVX2) |

256 MB | 4 | 2 | standard math |

256 MB | 1.3 | 6.0 | fast math (SSE) |

256 MB | 1.2 | 6.7 | fast math (AVX2) |

Can we reason about the AVX2 speed? So you have two loads (uses as little as one cycle), one cheap vectorized multiplication and one cheap vectorized addition. You can multiply two vectors of eight floating-point values per cycle, easily on a recent Intel processor. Then there is overhead due to the loop. Generously, say that it takes 5 cycles to multiply to pair of vectors and add them to a set of accumulators… that is about 0.6 cycles per multiplication. As expected, my system does it faster (0.55 cycles per pair) which means that it takes fewer than 5 cycles per pair of eight values.

What is apparent is that we are quickly hitting a wall… for large inputs (e.g., 256 MB and more). This wall has to do with how quickly our single core can grab data from the memory subsystem. I suspect that Xavier is correct: this wall has probably little dependency on the CPU frequency. Furthermore, having fancier instructions (e.g., those from the AVX-512 instruction sets) will not help you.

What can we conclude?

- If you have to process lots of data, and do dirt cheap operations (e.g., a vectorized dot product), then your single processor core is easily starved for data. That’s the part where Xavier is right.
- However, it is important to qualify what we mean by “cheap tasks”. Even just computing the dot product in a standard-compliant manner is entirely compute-bound in my experiments. Lots and lots of streaming operations like parsing a document, compressing data, and so forth, are likely to be compute bound.

To put it otherwise, I would say that while it is possible to design functions that stream through data in such a way that they are memory-bound over large inputs (e.g., copying data), these functions need to be able to eat through several bytes of data per cycle. Because our processors are vectorized and superscalar, it is certainly in the realm of possibilities, but harder than you might think.

*Original Source*