Vectorizing Random Number Generators for Greater Speed: PCG and xorshift128+ (AVX-512 edition) Vectorizing Random Number Generators for Greater Speed: PCG and xorshift128+ (AVX-512 edition)
Most people designing random number generators program using regular code. If they are aiming for speed, they probably write functions in C. However, our... Vectorizing Random Number Generators for Greater Speed: PCG and xorshift128+ (AVX-512 edition)

Most people designing random number generators program using regular code. If they are aiming for speed, they probably write functions in C. However, our processors have fast “vectorized” (or SIMD) instructions that can allow you to go faster.

These instructions do several operations at once. For example, recent Skylake-X processors from Intel can do eight 64-bit multiplications with a single instruction.

There is a vectorized version of the Mersenne Twister generator used in some C++ standard libraries, but the Mersenne Twister is not particularly fast to begin with.

What happens if we vectorize really fast generators?

I had good luck vectorizing Vigna’s xorshift128+ random number generator. A generator like it is part of some JavaScript engines. The xorshift128+ generator produces 64-bit values, but you can consider them as two 32-bit values. On my Skylake-X processor, I can generate 32-bit random integers at a rate of 2 cycles per integer using xorshift128+. That’s almost twice as fast as when using the default, scalar implementation in C.

scalar xorshift128+ 3.6 cycles per 32-bit word
SIMD xorshift128+ 1.0 cycles per 32-bit word

PCG is a family of random number generators invented by O’Neill. Can we do the same with PCG? I took a first stab at it with my simdpcg library. My vectorized PCG ends up using about 1 cycle per integer, so it has the same speed as the vectorized xorshift128+.

scalar PCG 4.3 cycles per 32-bit word
SIMD PCG 1.0 cycles per 32-bit word

In these tests, I simply write out the random number to a small array in cache. I only measure raw throughput. To get these good results, I “cheat” a bit by interleaving several generators in the vectorized versions. Indeed, without this interleave trick, I find that the processor is almost running idle due to data dependencies.

My C code is available:

Sadly, I expect that most of my readers do not yet have processors with support for AVX-512 instructions. They are available as part of the Skylake-X and Cannonlake processors. Intel has not been doing a great job at selling these new processors in great quantities. You may be able to have access to such processors using the cloud.

Update: In my initial version, I reported worse performance on xorshift128+. Samuel Neves pointed out to me that this is due to the lack of inlining, because I compile the xorshift128+ functions in their own object files. We can solve this particular problem using link time optimization (LTO), effectively by passing the -fltoflag as part of the compile command line. As usual, results will vary depending on your compiler and processor.

Further reading: See Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ fail statistical tests for linearity, Journal of Computational and Applied Mathematics.

Originally posted here.

 

Daniel Lemire

Daniel Lemire

Daniel Lemire is a full professor in computer science at the University of Quebec (TELUQ). His research is focused on data indexing techniques. For example, he worked on bitmap indexes, column-oriented databases and integer compression. He is also interested in database design and probabilistic algorithms (e.g., universal hashing). His work on bitmap indexes is used by companies such as eBay, LinkedIn, Facebook and Netflix in their data warehousing, within big-data platforms such as Apache Hive, Druid, Apache Spark, Netflix Atlas, LinkedIn Pinot and Apache Kylin. The version control system Git is also accelerated by the same compressed bitmaps. Some of his techniques were adopted by Apache Lucene, the search engine behind sites such as Wikipedia or platforms such as Solr and Elastic. One of his hashing techniques has been adopted by Google TensorFlow. His Slope One recommender algorithm is a standard reference in the field of recommender systems. He is a beneficiary of the Google Open Source Peer Bonus Program. He has written over 50 peer-reviewed publications, including more than 30 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).

Open Data Science - Your News Source for AI, Machine Learning & more