As it turns out, teaching a new class forces you to refresh your memory on key software topics. A topic I recently covered in class was hashing, and I’m fairly excited to cover it here on the blog. Let’s get into it!
Table of Contents
Implementing a Set Data Structure
A set is a data structure that attempts to model a mathematical set. The rules we generally impose on this data structure is that there can be no duplicate entries nor an order of those entries. As a consequence, sets cannot be sorted.
While these constraints might seem restrictive, there are a lot of benefits. For example, knowing that there are no duplicates, we can use a set to get a list of unique elements of some data set (e.g., a set of all customers from a list of purchasing data from the last year). Likewise, sets lend themselves well to all the mathematical set features, like unions and differences between two sets (e.g., a set of customers who shop at two different stores).
From the implementer side of things, the lack of ordering makes for a ton of flexibility in implementation. Under the hood, we might implement a set as a list-like data structure, such as an array, a linked list, a queue, or even a stack. For instance, if we use a queue to implement our set, we can add an item to the set by adding the item to the end of the queue.
With that said, are list-like data structures really the best tool for implementation? I would argue they aren’t because all of the list-like data structures are slow to traverse. As a result, if we ever want to lookup an element in the set, it might take a while to traverse the list to find the right element, and we will always have to traverse the entire list if our set does not contain our element.
Introducing Hashing
One way around the issue of traversal is to break down our set into subsets or “buckets.” We can do this by using a list-like data structure that allows for indexing, such as an array. The size of this array then determines our number of subsets. For example, an array of size seven would contain seven subsets.
Unfortunately, we have a new problem: how do we know which subset contains our element? That’s where hashing comes in! For a set of integers, we can find any element by computing element mod size
(though, be careful how you compute mod). By using the size of our array, we guarantee that the result of the computation will be a valid index in the array. And the beauty of this is that we can compute this hash for both storing values in our set and looking them up.
So, what’s the payoff? Going back to our set example, we can swap out our queue for an array of lists. The job of the array is to break up the set into subsets which we assign through hashing. Unsurprisingly, we call this new data structure a hash table. Suddenly, looking up a value in our set becomes a much, much faster computation on average. All we have to do is compute the hash for that function and search through the corresponding subset: the more subsets we have, the less items we have to search through and the faster our lookup and removal processes.
Storing More Complex Data
In our set example up to this point, we’ve only really concerned ourself with storing integers. Of course, we probably want to be able to store other types of data like strings. But, there’s yet another new problem: how do we hash data that isn’t an integer?
As it turns out, all we have to do is convert that data into an integer. This is often called a hash code and can be computed for any data we like. Of course, creating a hash code is easier said than done.
In the example of storing names, we run into the issue of how to convert names into integers. There are a ton of options, and there is perhaps no perfect option. For example, we could use the length of the names as our integer. Alternatively, we could convert each character of the name to a number and sum them.
What we’ll find is that some of these hash code algorithms are better than others for each particular data set. With names, length is probably not a great choice because there will be a lot of names with similar lengths. This will cause a significant number of collisions in the hash table. Collisions are normal since the length of our array is probably smaller than the number of names we intend to store, but having many collisions for only some of the subsets is a problem. Ideally, we want roughly equally sized subsets, or we won’t reap the performance benefits. Just imagine a scenario where all of our data collides for the same hash. Suddenly, we’ll be traversing a list with all of our items all over again.
That said, even in the worst cases, a hash table is still just as good as our original implementation. Therefore, we can’t really go wrong.
Hashing Pitfalls
In the course I teach, the concepts above are about all we expect our students to understand. However, there are some pitfalls with hashing that I think are worth mentioning.
First, in modern programming, hashing is ubiquitous. In fact, it’s so prominent that most programming languages have hashing built-in to their objects by default. In Java, there is the hashCode()
method, which we can override to specify a hash code algorithm for our data type. In Python, there is the __hash__()
dunder method, which we can override for custom data types just like in Java.
When implementing these methods, it is incredibly important that our algorithm is congruent with the respective “equals” method. In other words, two objects that are considered equal should also have the same hash code. If not, a data structure which relies on the hash code will consider two otherwise equivalent values as unique values. This can cause nasty bugs in sets, maps, and dictionaries. But don’t just take it from me, this is straight from Python’s docs:
If a class does not define an
Python 3 Documentation, 2023-09-13__eq__()
method it should not define a
__hash__()
operation either; if it defines
__eq__()
but not
__hash__()
, its instances will not be usable as items in hashable collections.
And here is the same demand made in Java’s documentation:
If two objects are equal according to the
Java 8 Documentation, 2023-09-13equals(Object)
method, then calling thehashCode
method on each of the two objects must produce the same integer result.
Second, it is generally bad practice to implement a hash code for a mutable data type. The reason should be somewhat obvious: if we can change the value of an object while it’s stored somewhere that relies on its hash code (like a set), the modified object is no longer guaranteed to be in the correct “bucket.” This can make lookup and removal a nightmare. Unsurprisingly, the Python docs give a similar warning:
If a class defines mutable objects and implements an
Python 3 Documentation, 2023-09-13__eq__()
method, it should not implement
__hash__()
, since the implementation of hashable
collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
As far as I can tell, Java doesn’t give the same level of concern around mutability. However, the community seems to argue that it’s bad practice and easily avoidable—though possible with enough discipline.
At any rate, I think that’s enough discussion on hashing for the day. If you found this interesting, there’s plenty more like it here:
- Java Lambda Expressions Are a Scam
- Java Has A Remainder Operator—Not a Mod Operator
- Unpacking CS Jargon: What Makes Data Mutable?
And you can take your support even further by heading over to my list of ways to grow the site. Otherwise, that’s it for today. See you next Friday!
Recent Posts
As of March of this year, I'm off Twitter for good! In fact, at the time of writing, my old account should be deleted. Let's talk about that!
Recently, I was thinking about how there are so many ways to approach software design. While some of these approaches have fancy names, I'm not sure if anyone has really thought about them...