Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once....

Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces. Intel does not want your CPU to burn out. So it throttles the CPU (makes it run slower). Your CPU stays warm but not too hot. When the processor does not have to use AVX-512 instructions, some of the silicon remains dark, and thus less heat is generated.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel has per-core voltage and frequencies. And Intel processors can switch frequency very fast, certainly in a millisecond.

Vlad Krasnov from Cloudfare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure that it is the case that AVX-512 can cause problems for some use cases. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but found no effect whatsoever. It could be that this server does not have AVX-512 throttling or maybe I did not make use of sufficiently expensive instructions.

Vlad offered me test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on some bubble-sort routine. It can run in both AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant to be a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors have 14 cores, so we have 28 cores in total. Because of hyperthreading, it supports up to 56 physical threads (running two threads per core).

threads AVX-512 disabled with AVX-512
20 7.6 s 7.5 s
40 6 s 5.8 s
80 8.4 s 10.7 s

As you can see, as long as the number of threads does not exceed the number of physically supported threads (56), the AVX-512 version is slightly faster. If I crank up the number of threads beyond the limit (to 80), both the non-AVX-512 and the AVX-512 code suffers, but the AVX-512 code suffers more. I am not sure why that would be but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

Thus all that my experiment reveals is that if you are going to use too many physical threads in C, don’t use AVX-512. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.

My code is available along with all the scripts and the outputs for your inspection. You should be able to reproduce my results. It is not like Xeon Gold processors are magical faeries: anyone can grab an instance. For the record, the bill I got from Packet was $2.

Note: I have no conflict of interest to disclose. I do not own Intel stock.


Original Source

Daniel Lemire

Daniel Lemire

Daniel Lemire is a full professor in computer science at the University of Quebec (TELUQ). His research is focused on data indexing techniques. For example, he worked on bitmap indexes, column-oriented databases and integer compression. He is also interested in database design and probabilistic algorithms (e.g., universal hashing). His work on bitmap indexes is used by companies such as eBay, LinkedIn, Facebook and Netflix in their data warehousing, within big-data platforms such as Apache Hive, Druid, Apache Spark, Netflix Atlas, LinkedIn Pinot and Apache Kylin. The version control system Git is also accelerated by the same compressed bitmaps. Some of his techniques were adopted by Apache Lucene, the search engine behind sites such as Wikipedia or platforms such as Solr and Elastic. One of his hashing techniques has been adopted by Google TensorFlow. His Slope One recommender algorithm is a standard reference in the field of recommender systems. He is a beneficiary of the Google Open Source Peer Bonus Program. He has written over 50 peer-reviewed publications, including more than 30 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).