Speed Up Your Data Structures With Hashing

Speed Up Your Data Structures With Hashing Featured Image

As it turns out, teaching a new class forces you to refresh your memory on key software topics. A topic I recently covered in class was hashing, and I’m fairly excited to cover it here on the blog. Let’s get into it!

Table of Contents

Implementing a Set Data Structure

A set is a data structure that attempts to model a mathematical set. The rules we generally impose on this data structure is that there can be no duplicate entries nor an order of those entries. As a consequence, sets cannot be sorted.

While these constraints might seem restrictive, there are a lot of benefits. For example, knowing that there are no duplicates, we can use a set to get a list of unique elements of some data set (e.g., a set of all customers from a list of purchasing data from the last year). Likewise, sets lend themselves well to all the mathematical set features, like unions and differences between two sets (e.g., a set of customers who shop at two different stores).

From the implementer side of things, the lack of ordering makes for a ton of flexibility in implementation. Under the hood, we might implement a set as a list-like data structure, such as an array, a linked list, a queue, or even a stack. For instance, if we use a queue to implement our set, we can add an item to the set by adding the item to the end of the queue.

With that said, are list-like data structures really the best tool for implementation? I would argue they aren’t because all of the list-like data structures are slow to traverse. As a result, if we ever want to lookup an element in the set, it might take a while to traverse the list to find the right element, and we will always have to traverse the entire list if our set does not contain our element.

Introducing Hashing

One way around the issue of traversal is to break down our set into subsets or “buckets.” We can do this by using a list-like data structure that allows for indexing, such as an array. The size of this array then determines our number of subsets. For example, an array of size seven would contain seven subsets.

Unfortunately, we have a new problem: how do we know which subset contains our element? That’s where hashing comes in! For a set of integers, we can find any element by computing element mod size (though, be careful how you compute mod). By using the size of our array, we guarantee that the result of the computation will be a valid index in the array. And the beauty of this is that we can compute this hash for both storing values in our set and looking them up.

So, what’s the payoff? Going back to our set example, we can swap out our queue for an array of lists. The job of the array is to break up the set into subsets which we assign through hashing. Unsurprisingly, we call this new data structure a hash table. Suddenly, looking up a value in our set becomes a much, much faster computation on average. All we have to do is compute the hash for that function and search through the corresponding subset: the more subsets we have, the less items we have to search through and the faster our lookup and removal processes.

Storing More Complex Data

In our set example up to this point, we’ve only really concerned ourself with storing integers. Of course, we probably want to be able to store other types of data like strings. But, there’s yet another new problem: how do we hash data that isn’t an integer?

As it turns out, all we have to do is convert that data into an integer. This is often called a hash code and can be computed for any data we like. Of course, creating a hash code is easier said than done.

In the example of storing names, we run into the issue of how to convert names into integers. There are a ton of options, and there is perhaps no perfect option. For example, we could use the length of the names as our integer. Alternatively, we could convert each character of the name to a number and sum them.

What we’ll find is that some of these hash code algorithms are better than others for each particular data set. With names, length is probably not a great choice because there will be a lot of names with similar lengths. This will cause a significant number of collisions in the hash table. Collisions are normal since the length of our array is probably smaller than the number of names we intend to store, but having many collisions for only some of the subsets is a problem. Ideally, we want roughly equally sized subsets, or we won’t reap the performance benefits. Just imagine a scenario where all of our data collides for the same hash. Suddenly, we’ll be traversing a list with all of our items all over again.

That said, even in the worst cases, a hash table is still just as good as our original implementation. Therefore, we can’t really go wrong.

Hashing Pitfalls

In the course I teach, the concepts above are about all we expect our students to understand. However, there are some pitfalls with hashing that I think are worth mentioning.

First, in modern programming, hashing is ubiquitous. In fact, it’s so prominent that most programming languages have hashing built-in to their objects by default. In Java, there is the hashCode() method, which we can override to specify a hash code algorithm for our data type. In Python, there is the __hash__() dunder method, which we can override for custom data types just like in Java.

When implementing these methods, it is incredibly important that our algorithm is congruent with the respective “equals” method. In other words, two objects that are considered equal should also have the same hash code. If not, a data structure which relies on the hash code will consider two otherwise equivalent values as unique values. This can cause nasty bugs in sets, maps, and dictionaries. But don’t just take it from me, this is straight from Python’s docs:

If a class does not define an __eq__()Opens in a new tab. method it should not define a __hash__()Opens in a new tab. operation either; if it defines __eq__()Opens in a new tab. but not __hash__()Opens in a new tab., its instances will not be usable as items in hashable collections.

Python 3 Documentation, 2023-09-13Opens in a new tab.

And here is the same demand made in Java’s documentation:

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

Java 8 Documentation, 2023-09-13Opens in a new tab.

Second, it is generally bad practice to implement a hash code for a mutable data type. The reason should be somewhat obvious: if we can change the value of an object while it’s stored somewhere that relies on its hash code (like a set), the modified object is no longer guaranteed to be in the correct “bucket.” This can make lookup and removal a nightmare. Unsurprisingly, the Python docs give a similar warning:

If a class defines mutable objects and implements an __eq__()Opens in a new tab. method, it should not implement __hash__()Opens in a new tab., since the implementation of hashableOpens in a new tab. collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

Python 3 Documentation, 2023-09-13Opens in a new tab.

As far as I can tell, Java doesn’t give the same level of concern around mutability. However, the community seems to argue that it’s bad practice and easily avoidable—though possible with enough discipline.

At any rate, I think that’s enough discussion on hashing for the day. If you found this interesting, there’s plenty more like it here:

And you can take your support even further by heading over to my list of ways to grow the site. Otherwise, that’s it for today. See you next Friday!

Coding Tangents (43 Articles)—Series Navigation

As a lifelong learner and aspiring teacher, I find that not all subjects carry the same weight. As a result, some topics can fall through the cracks due to time constraints or other commitments. Personally, I find these lost artifacts to be quite fun to discuss. That’s why I’ve decided to launch a whole series to do just that. Welcome to Coding Tangents, a collection of articles that tackle the edge case topics of software development.

In this series, I’ll be tackling topics that I feel many of my own students have been curious about but never really got the chance to explore. In many cases, these are subjects that I think deserve more exposure in the classroom. For instance, did you ever receive a formal explanation of access modifiers? How about package management? Version control?

In some cases, students are forced to learn these subjects on their own. Naturally, this forms a breeding ground for misconceptions which are made popular in online forums like Stack Overflow and Reddit. With this series, I’m hoping to get back to the basics where these subjects can be tackled in their entirety.

Jeremy Grifski

Jeremy grew up in a small town where he enjoyed playing soccer and video games, practicing taekwondo, and trading Pokémon cards. Once out of the nest, he pursued a Bachelors in Computer Engineering with a minor in Game Design. After college, he spent about two years writing software for a major engineering company. Then, he earned a master's in Computer Science and Engineering. Today, he pursues a PhD in Engineering Education in order to ultimately land a teaching gig. In his spare time, Jeremy enjoys spending time with his wife, playing Overwatch and Phantasy Star Online 2, practicing trombone, watching Penguins hockey, and traveling the world.

Recent Posts