How Quickly Can You Compute the Dot Product Between Two Large Vectors? How Quickly Can You Compute the Dot Product Between Two Large Vectors?
A dot (or scalar) product is a fairly simple operation that simply sums the many products: float sum = 0; for... How Quickly Can You Compute the Dot Product Between Two Large Vectors?

A dot (or scalar) product is a fairly simple operation that simply sums the many products:

float sum = 0;
for (size_t i = 0; i < len; i++) {
    sum += x1[i] * x2[i];
return sum;

It is nevertheless tremendously important. You know these fancy machine learning algorithms we keep hearing about? If you dig deep under all the sophistication, you often find lots and lots of dot products.

Xavier Arteaga had a few interesting comments on a previous post of mine:

  1. “I would say that CPU frequency does not really make a difference when computing large amounts of data.”
  2. “I guess for intensive computations which require little memory and lots of operations AVX512 [new fancy instruction sets provided by Intel] provides a better performance.”

I disagree with Xavier, but I feel the need to provide some analysis. So let us evaluate Xavier’s observations with the dot product. Before I begin, let me point out that I am not a linear algebra expert, there has been lots and lots of fancy engineering done on these problems. If you need fast dot product software, find a good library, don’t grab it out of my blog.

I implemented a simple benchmark that computes dot products over increasingly larger vectors. I use three modes: “standard” math which is what you get when you simply compile a dot-product loop, a version with 128-bit vectors (SSE) and a version with 256-bit vectors (AVX2).

total input size cycles per pair of floats bytes per cycle mode
8 kB 4 2 standard math
8 kB 1.0 7.8 fast math (SSE)
8 kB 0.55 15 fast math (AVX2)
16 MB 4 2 standard math
16 MB 1.0 7 fast math (SSE)
16 MB 0.9 9 fast math (AVX2)
256 MB 4 2 standard math
256 MB 1.3 6.0 fast math (SSE)
256 MB 1.2 6.7 fast math (AVX2)

Can we reason about the AVX2 speed? So you have two loads (uses as little as one cycle), one cheap vectorized multiplication and one cheap vectorized addition. You can multiply two vectors of eight floating-point values per cycle, easily on a recent Intel processor. Then there is overhead due to the loop. Generously, say that it takes 5 cycles to multiply to pair of vectors and add them to a set of accumulators… that is about 0.6 cycles per multiplication. As expected, my system does it faster (0.55 cycles per pair) which means that it takes fewer than 5 cycles per pair of eight values.

What is apparent is that we are quickly hitting a wall… for large inputs (e.g., 256 MB and more). This wall has to do with how quickly our single core can grab data from the memory subsystem. I suspect that Xavier is correct: this wall has probably little dependency on the CPU frequency. Furthermore, having fancier instructions (e.g., those from the AVX-512 instruction sets) will not help you.

What can we conclude?

  1. If you have to process lots of data, and do dirt cheap operations (e.g., a vectorized dot product), then your single processor core is easily starved for data. That’s the part where Xavier is right.
  2. However, it is important to qualify what we mean by “cheap tasks”. Even just computing the dot product in a standard-compliant manner is entirely compute-bound in my experiments. Lots and lots of streaming operations like parsing a document, compressing data, and so forth, are likely to be compute bound.

To put it otherwise, I would say that while it is possible to design functions that stream through data in such a way that they are memory-bound over large inputs (e.g., copying data), these functions need to be able to eat through several bytes of data per cycle. Because our processors are vectorized and superscalar, it is certainly in the realm of possibilities, but harder than you might think.


Original Source

Daniel Lemire

Daniel Lemire

Daniel Lemire is a full professor in computer science at the University of Quebec (TELUQ). His research is focused on data indexing techniques. For example, he worked on bitmap indexes, column-oriented databases and integer compression. He is also interested in database design and probabilistic algorithms (e.g., universal hashing). His work on bitmap indexes is used by companies such as eBay, LinkedIn, Facebook and Netflix in their data warehousing, within big-data platforms such as Apache Hive, Druid, Apache Spark, Netflix Atlas, LinkedIn Pinot and Apache Kylin. The version control system Git is also accelerated by the same compressed bitmaps. Some of his techniques were adopted by Apache Lucene, the search engine behind sites such as Wikipedia or platforms such as Solr and Elastic. One of his hashing techniques has been adopted by Google TensorFlow. His Slope One recommender algorithm is a standard reference in the field of recommender systems. He is a beneficiary of the Google Open Source Peer Bonus Program. He has written over 50 peer-reviewed publications, including more than 30 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).