Iterating over hash sets quickly in Java

ToolsTools & Languagesposted by Daniel Lemire March 21, 2018

There are many ways in software to represent a set. The most common approach is to use a hash table. We...

There are many ways in software to represent a set. The most common approach is to use a hash table. We define a “hash function” that takes as an input our elements and produces as an output an integer that “looks random”. Then your element is stored at the location indicated by the value of the hash function. To check whether the value is in the hash function, you compute the hash function and look at the appropriate location memory. Your elements will thus appear in “random” order.

This means that unless you do extra work, iterating through the elements in your hash set will be done in random order. Let me pull out my Swift console to demonstrate:

```\$ var x = Set<Int>()
\$ x.insert(1)
\$ x.insert(2)
\$ x.insert(3)
\$ for i in x { print(i) }
2
3
1
```

That is right: you insert the numbers 1, 2, 3 and the numbers 2, 3, 1 come out.

You get the same kind of result in Python:

```>>> x = set()
>>> x
set([-2, -1, -3])
```

That is, the hash set is visited starting with the elements having the smallest hash function value. The hash function is often designed to appear random, so the elements come out randomly.

This randomness can take programmers by surprise, so some programming languages like JavaScript and PHP preserve the “insertion order”. If you pull out your JavaScript console, you get that the set keeps track of the order in which you inserted the element and gives it back to you:

```> var x = new Set()
Set { -3 }
Set { -3, -2 }
Set { -3, -2, -1 }
Set { -3, -2, -1, 3 }
Set { -3, -2, -1, 3, 2 }
Set { -3, -2, -1, 3, 2, 1 }
```

This can be implemented as a linked list working on top of the hash table.

Java supports both approaches through HashSet and LinkedHashSet.

The LinkedHashSet will use more memory, but it gives back the elements in insertion order. The HashSet gives back the element in an order determined in large part by the hash function. The LinkedHashSet may allow you to iterate over the elements faster because you are essentially bypassing the hash table entirely and just following the linked list. Linked lists are not great in the sense that each node being visited can incur a cache miss. However, Java’s HashSet is implemented using a fancy chaining approach, so you will be chasing pointers in memory and possibly also having cache misses.

So it would seem like LinkedHashSet is a good choice in Java if you are not memory bound.

To explore this problem, I took a set made of 1 million integers generated randomly. I insert them into both a HashTable and a LinkedHashTable. Then I sum the values. I run my benchmark on a Skylake processor with Java 8:

class sum values
HashSet 50 ops/s

My numbers are clear: in my tests, it is three times faster to sum up the values in a LinkedHashSet. You can reproduce my benchmark.

Original Source

Daniel Lemire

Daniel Lemire is a full professor in computer science at the University of Quebec (TELUQ). His research is focused on data indexing techniques. For example, he worked on bitmap indexes, column-oriented databases and integer compression. He is also interested in database design and probabilistic algorithms (e.g., universal hashing). His work on bitmap indexes is used by companies such as eBay, LinkedIn, Facebook and Netflix in their data warehousing, within big-data platforms such as Apache Hive, Druid, Apache Spark, Netflix Atlas, LinkedIn Pinot and Apache Kylin. The version control system Git is also accelerated by the same compressed bitmaps. Some of his techniques were adopted by Apache Lucene, the search engine behind sites such as Wikipedia or platforms such as Solr and Elastic. One of his hashing techniques has been adopted by Google TensorFlow. His Slope One recommender algorithm is a standard reference in the field of recommender systems. He is a beneficiary of the Google Open Source Peer Bonus Program. He has written over 50 peer-reviewed publications, including more than 30 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).

1