Greater Speed in Memory-Bound Graph Algorithms with Just Straight C Code Greater Speed in Memory-Bound Graph Algorithms with Just Straight C Code
Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in... Greater Speed in Memory-Bound Graph Algorithms with Just Straight C Code

Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in memory.

In an earlier post, I showed how we could accelerate memory-bound graph algorithms by using software prefetches. We were able to trim a third of the running time just by adding a line of code. But I do not like nor recommend software prefetching.

In my test case, I am trying to find the distance between two nodes by doing a breadth-first search. I use a random graph containing 10 million nodes and where each node has degree 16 (meaning that each node has 16 neighbours). I start from one node, visit all its neighbours, then visit the neighbours of the neighbours, until I arrive at my destination. The pseudocode is as follows:

insert my starting node in a queue
for each node x in my queue {
  for each node y connected to x {
    // option: prefetch the node y' that follows y
    if (y is my target) { 
      // success
    } 
    if (y is new) {
       add y to next queue
    } 
  }
}

I can rewrite this code where I visit the nodes in chunks. So I grab 8 nodes, I look at the first neighbour of each of the 8 nodes, then I look at the second neighbour of the 8 nodes and so forth. We interleave neighbours. The pseudocode looks as follows:

insert my starting node in a queue
// divide the queue into chunks of up to 8 nodes 
for each chunk of nodes x1, x2, ..., x8 in my queue {
  for index i in 1, 2..., degree {
     for x in x1, x2, ..., x8 {
       let y be neighbour i of x
       if (y is my target) { 
         // success
       } 
       if (y is new) {
         add y to next queue
       } 
     }
  }
}

From a purely computer science perspective, this new pseudocode should not be faster. You will not find it in any textbook I know. Yet it is considerably faster when working over large graphs.

How fast is this new approach? About twice as fast as the naive approach, and faster than the one relying on software prefetches. My source code is available.

naive breadth-first search 23 cycles per edge visited
prefetched breadth-first search 16 cycles per edge visited
new approach 12 cycles per edge visited

I like this example because it illustrates quite well that even if a problem is memory-bound, you can still code a faster solution. There is no need for specialized instructions. There is no need to move anything around in memory. How you access the data matters.

I should point out that this is just something I threw together during a lazy Monday. It is probably nowhere near optimal. It is totally lacking in sophistication. Yet it works. Moreover, it is going to be effective on almost all modern computer architectures.


 

Original Source

Daniel Lemire

Daniel Lemire

Daniel Lemire is a full professor in computer science at the University of Quebec (TELUQ). His research is focused on data indexing techniques. For example, he worked on bitmap indexes, column-oriented databases and integer compression. He is also interested in database design and probabilistic algorithms (e.g., universal hashing). His work on bitmap indexes is used by companies such as eBay, LinkedIn, Facebook and Netflix in their data warehousing, within big-data platforms such as Apache Hive, Druid, Apache Spark, Netflix Atlas, LinkedIn Pinot and Apache Kylin. The version control system Git is also accelerated by the same compressed bitmaps. Some of his techniques were adopted by Apache Lucene, the search engine behind sites such as Wikipedia or platforms such as Solr and Elastic. One of his hashing techniques has been adopted by Google TensorFlow. His Slope One recommender algorithm is a standard reference in the field of recommender systems. He is a beneficiary of the Google Open Source Peer Bonus Program. He has written over 50 peer-reviewed publications, including more than 30 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).