fbpx
Performance of ranged accesses into arrays: modulo, multiply-shift and masks Performance of ranged accesses into arrays: modulo, multiply-shift and masks
Suppose that you wish to access values in an array of size n, but instead of having indexes in [0,n), you have arbitrary non-negative... Performance of ranged accesses into arrays: modulo, multiply-shift and masks

Suppose that you wish to access values in an array of size n, but instead of having indexes in [0,n), you have arbitrary non-negative integers. This sort of problems happens when you build a hash table or other array-backed data structure.

The naive approach to this problem is to use the remainder of the division by n (sometimes called a modulo reduction):

uint32_t access(uint32_t * array, size_t n, size_t index) {
  return array[index % n];
}

However, if the compiler cannot inline this call and determine the value of n, then this code is likely to compile to a division instruction. Division instructions are among the slowest instructions on modern-day processors.

To avoid division, many people assume that n is a power of two. Then they use the mask trick: i & (n-1) = i % n.

uint32_t access(uint32_t * array, size_t n, size_t index) {
  return array[index & ( n - 1 )];
}

That’s typically much faster. You do pay a price, however. Your array lengths must all be a power of two. It is a small price to pay, but a price nonetheless.

Another approach I like is the multiply-shift fast alternative to the modulo reduction (see the fastrange library). It involves a multiplication followed by a shift:

uint32_t access(uint32_t * array, uint64_t n, size_t index) {
  return array[(index * n)>>32];
}

Undeniably, the masked approach ought to be faster. You cannot get much faster than a bitwise AND. It is nearly a free instruction on modern processors. Multiplications and shifts are typically more expensive.

But let us measure the throughput of these operations. One thing to take into account is that if you have to do more than one such access, the processor can vectorize it. That is, it can use fast instructions that do several multiplications at once. On x64 processors, the vpmuludq instruction can do four full 32-bit by 32-bit multiplications at once.

Let us try it out with the GCC compiler (5.5):

no-AVX2AVX2
modulo8 cycles8 cycles
multiply-shift2.2 cycles1.5 cycles
mask1.7 cycles1.7 cycles

Clearly, my particular compiler does a poor job at optimizing the masked approach. It should be able to beat the multiply-shift approach. Yet what I think should be evident is that the approach with a mask is not massively more efficient than the multiply-shift approach, and might even be slower (depending on your compiler).

In effect, it may not be warranted to force your array lengths to be a power of two. With careful engineering, you might get much of the same performance with any array length. It is especially likely to be true if you often access several values at once in your array, because you can rely on vectorization.

Relying on the compiler to do some vectorization magic is fine in most instances, but what if you want more control? My original code looks like this…

uint32_t fastsum(uint32_t * z, uint32_t N, uint32_t * accesses, 
   uint32_t nmbr) {
  uint32_t sum = 0;
  uint64_t N64 = (uint64_t) N;
  for(uint32_t j = 0; j < nmbr ; ++j ) {
    sum += z[(accesses[j] * N64)>> 32] ;
  }
  return sum;
}

Here is a version with Intel intrinsics:

uint32_t vectorsum(uint32_t * z, uint32_t N, uint32_t * accesses, 
     uint32_t nmbr) {
  __m256i Nvec = _mm256_set1_epi32(N);
  __m128i sum = _mm_setzero_si128();
  for(uint32_t j = 0; j < nmbr ; j+=4) {
     __m256i fourints = _mm256_loadu_si256(accesses + j;
     __m256i f4 =  _mm256_mul_epu32(fourints, Nvec);
     __m256i ft = _mm256_srli_epi64(f4,32);
     __m128i fi = _mm256_i64gather_epi32 (z,ft , 4);
     sum = _mm_add_epi32(sum, fi);
  }
  uint32_t buffer[4];
  _mm_storeu_si128((__m128i *)buffer,sum);
  return buffer[0] + buffer[1] + buffer[2] + buffer[3];
}

The catch is that you have to process values four at a time.

My source code is available.

Original post.

Daniel Lemire

Daniel Lemire

Daniel Lemire is a full professor in computer science at the University of Quebec (TELUQ). His research is focused on data indexing techniques. For example, he worked on bitmap indexes, column-oriented databases and integer compression. He is also interested in database design and probabilistic algorithms (e.g., universal hashing). His work on bitmap indexes is used by companies such as eBay, LinkedIn, Facebook and Netflix in their data warehousing, within big-data platforms such as Apache Hive, Druid, Apache Spark, Netflix Atlas, LinkedIn Pinot and Apache Kylin. The version control system Git is also accelerated by the same compressed bitmaps. Some of his techniques were adopted by Apache Lucene, the search engine behind sites such as Wikipedia or platforms such as Solr and Elastic. One of his hashing techniques has been adopted by Google TensorFlow. His Slope One recommender algorithm is a standard reference in the field of recommender systems. He is a beneficiary of the Google Open Source Peer Bonus Program. He has written over 50 peer-reviewed publications, including more than 30 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).

1