Welcome back to yet another How to Python article. Today, we’re going to talk about how to remove duplicates from a list in Python.
Perhaps the quickest way to remove duplicates from a list is to take advantage of the set data structure: list(set(my_list))
. However, this sort of solution won’t maintain order. As a result, it might be a better idea to take advantage of the dictionary data structure (Python 3.7+): list(dict.fromkeys(my_list))
. In either case, the output will be a list with all duplicates removed.
If that’s all you need to solve your issue, help me create more content like this by browsing my list of ways to support the site. Otherwise, keep reading for more details.
Table of Contents
Problem Description
When it comes to managing data, certain problems tend to arise. For example, if we’re working with a few lists, we might be interested in summing them together. Or, maybe we’re working with only one list, and we find ourselves needing to retrieve the last item in that list.
One thing I’ve found myself needing to do most recently was removing duplicates from a list. As a bit of background, I had a matrix that contained several rows of data. One of the columns included information about sequences. Specifically, this column labeled data points with an index to help organize the data into sequences:
matrix = [ ["x", "y", "index"], [2, 3, 0], [2, 3, 0], [5, 2, 1], ... ]
Of course, what I wanted to know were how many unique indices we had. On one hand, I could have searched the column for the largest index. Alternatively, I decided to take the column and remove all the duplicates. That way, I would have a list of indices I could use for other purposes.
To do that, I needed to extract the column of indices which looked as follows:
indices = [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7]
Then, it was just a matter of removing the duplicates, so we could end up with a list like the following:
indices = [1, 2, 3, 4, 5, 6, 7]
Of course, how do we actually go about removing the duplicates from a list? That’s the topic of today’s article!
Solutions
With the problem description out of the way, let’s talk about how we’d actually go about removing duplicates from a list. In this section, we’ll look at a few solutions—not all of them practical. That said, I find it helpful to have a few options.
Note: there are a few times throughout this section that I mention some of the challenges associated with removing duplicates. I figure it’s probably worth mentioning them here as well. First, many of the following solutions don’t respect the order of the original list. Second, many of the solutions mention possible issues related to mutable data and objects in general. In other words, it’s unlikely that these solutions are universal for any situation. Keep this in mind.
Removing List Duplicates by Hand
As with every article in this series, I like to take a moment to try to write my own solution. I do this for a couple of reasons:
First, I think it’s important to understand the complexity that goes into solving these kinds of problem. After all, solving problems by hand is great way of checking your understanding.
Second, seeing a solution like this really makes you appreciate some of the tooling provided by Python. For instance, if it weren’t for list comprehensions and negative indexing, working with lists in Python would be a lot more painful.
At any rate, to remove duplicates from a list, we need to be able to detect duplicates. There are a lot of algorithms for this, but I’m going to keep it simple and use a dictionary:
dupes = [1, 3, 8, 3, 5] counts = dict() for num in dupes: if num in counts: counts[num] += 1 else: counts[num] = 1
Now, this counting algorithm doesn’t remove duplicates on its own. However, it does tell us if there are any duplicates.
From here, we’re going to modify the loop above to create a new list containing only the unique values:
dupes = [1, 3, 8, 3, 5] counts = dict() unique = list() for num in dupes: if num not in counts: counts[num] = 1 unique.append(num)
If we execute this, we should get a list that contains unique values only:
>>> dupes = [1, 3, 8, 3, 5] >>> counts = dict() >>> unique = list() >>> for num in dupes: if num not in counts: counts[num] = 1 unique.append(num) >>> unique [1, 3, 8, 5]
In this case, we had to create a new list because it’s bad practice to modify a list we’re iterating over. That said, it’s possible to edit the list in place, but we won’t chat about that now. Instead, I’ll ask you to think about that for today’s challenge!
In the meantime, let’s keep talking solutions. In particular, let’s look at some solutions that don’t require us to write our own loops.
Removing List Duplicates Using Set
Another way to remove duplicates from a list is to take advantage of the set data structure in Python. Unlike lists, sets cannot contain duplicates, so transforming a list into a set should remove all duplicates. Fortunately, the set constructor can do the work for us:
dupes = [1, 3, 8, 3, 5] unique = list(set(dupes))
Unfortunately, the downside of using a set is that sets are unordered. In other words, it’s possible that the list we get back is in a different order than before:
>>> dupes = [1, 3, 8, 3, 5] >>> unique = list(set(dupes)) >>> unique [8, 1, 3, 5]
Another possible snag is that sets are not meant to store mutable data. As a result, this transformation may run into issues if the list stores mutable objects like lists or dictionaries.
That said, if neither of these concerns are an issue for you, this is the way to go. Keep reading otherwise.
Removing List Duplicates Using Dict
If sets aren’t the way to go, we can always try to use a dictionary transformation. Specifically, there’s a function, fromkeys()
, that will generate a dictionary from a list of keys:
>>> dupes = [1, 3, 8, 3, 5] >>> dict.fromkeys(dupes) {1: None, 3: None, 8: None, 5: None}
Since keys have to be unique, this dictionary transformation will remove all duplicates. Then, it’s just a matter of converting the dictionary back into a list:
>>> list(dict.fromkeys(dupes)) [1, 3, 8, 5]
If we use a sufficiently recent version of Python (3.7+), we’ll even be able to guarantee the original order. Otherwise, we may end up with a solution very similar to the previous set solution. In that case, we might opt for OrderedDict
:
>>> from collections import OrderedDict >>> list(OrderedDict.fromkeys(dupes)) [1, 3, 8, 5]
Regardless, either solution should get the job done. Here’s the first dictionary solution in its entirety:
dupes = [1, 3, 8, 3, 5] unique = list(dict.fromkeys(dupes))
Again, I’ll warn that this solution only reliably maintains order in Python 3.7+. If order doesn’t matter, we should probably stick with the set solution.
That said, this transformation has the same immutable datatype concern as sets. After all, dictionary keys should not be mutable, so converting a list of mutable datatypes to a dictionary would be considered bad practice.
At any rate, we’re not quite done exploring solutions. In the next section, we’ll leverage the numpy library.
Removing List Duplicates Using a Library
If for some reason none of these solutions are appealing, there is another option. After all, these sort of list manipulations are quite common in certain areas of data science, so it’s no surprise that there are already libraries that can help us out. In particular, numpy has a function called unique()
that will do exactly what we want:
import numpy as np dupes = [1, 3, 8, 3, 5] unique = np.unique(dupes) # returns [1 3, 5, 8]
Now, there are basically two main issues with this solution. First, numpy isn’t exactly a small library. There’s definitely a cost associated with adding it as a dependency, so I probably wouldn’t reach for it unless it was already being used.
Second, this function will sort the list which may not be ideal. Previously, we discussed maintaining order, and this function definitely won’t.
That said, I find this function fairly handy, and I suspect you will too. At any rate, we’ve covered just about every way I can think of to remove duplicates from a list. Now, let’s compare their performance.
Performance
As always, I like to take some time to naively compare the performance of the solutions above. To do that, I use the timeit
library which allows us to test the speed of each solution. If you’re interested in learning more about this process, check out my article on performance testing.
Otherwise, let’s go ahead and store all of our solutions in strings:
setup = """ import numpy as np dupes = [1, 3, 8, 3, 5] """ by_hand = """ counts = dict() unique = list() for num in dupes: if num not in counts: counts[num] = 1 unique.append(num) """ sets = """ unique = list(set(dupes)) """ dicts = """ unique = list(dict.fromkeys(dupes)) """ lib = """ unique = np.unique(dupes) """
Now that we have all our strings, it’s just a matter of running them through timeit
:
>>> import timeit >>> min(timeit.repeat(setup=setup, stmt=by_hand)) 0.7825387999999975 >>> min(timeit.repeat(setup=setup, stmt=sets)) 0.43202079999999654 >>> min(timeit.repeat(setup=setup, stmt=dicts)) 0.4831847999999894 >>> min(timeit.repeat(setup=setup, stmt=lib)) 7.4180329
First impressions seem to be that the two data transformation solutions (sets and dicts) are about the same. What I’m most surprised by is how slow the numpy solution is. How is that so slow?! Perhaps this is because numpy performs a sort?
In the next round of tests, I decided to generate a much larger list using a list comprehension:
setup = """ import numpy as np dupes = [x // 3 for x in range(1000)] """
This resulted in the following times:
>>> min(timeit.repeat(setup=setup, stmt=by_hand)) 65.90517239999997 >>> min(timeit.repeat(setup=setup, stmt=sets)) 23.18903429999955 >>> min(timeit.repeat(setup=setup, stmt=dicts)) 26.943748899999264 >>> min(timeit.repeat(setup=setup, stmt=lib)) 67.39827859999968
Here, I think I’m most surprised by how well the numpy solution scaled. If I had the time, I’d try some longer tests, but I think this is enough to illustrate the differences between each solution.
For reference, I ran all four solutions in IDLE using Python 3.8.2 on a Windows 10 PC. Your mileage may vary.
Challenge
As I mentioned previously in this article, all of these solutions make copies of our underlying list instead of modifying it in place. As a result, we end up with two lists of possibly similar sizes. If we had a really large list, this kind of operation could be costly.
As a result, I’m interested to see if you could come up with a function that would remove duplicates from a list in-place. In other words, write code that could do the following:
dupes = [1, 3, 8, 3, 5] remove_dupes(dupes) print(dupes) # prints [1, 3, 8, 5]
Once you think you have a solution, feel free to head on over to Twitter to give it a share under #RenegadePython.
While you’re there, I recommend trying to make the post as accessible as possible. For instance, if you use an image, make sure to include a link to the code (e.g. GitHub, Gist, JDoodle, etc.) and a copy of the code in the ALT tag. If you’re looking for a place to store your solution, feel free to use our GitHub repo.
To kick things off, here’s my solution:
I’m interested to see what you come up with, so don’t be afraid to play around.
A Little Recap
At this point, we’re all finished! Here’s each way you can remove duplicates from a list in Python:
import numpy as np dupes = [1, 3, 8, 3, 5] # Remove duplicates by hand counts = dict() unique = list() for num in dupes: if num not in counts: counts[num] = 1 unique.append(num) # Remove duplicates using a set unique = list(set(dupes)) # Remove duplicates using a dictionary unique = list(dict.fromkeys(dupes)) # Remove duplicates using numpy unique = np.unique(dupes)
If you liked this sort of content, I’d appreciate it if you could take some time to check out my list of ways to grow the site. In there, you’ll find links to my newsletter, Patreon, and YouTube channel.
In addition, here are a few related posts:
Likewise, you might get some value out of the following books from Amazon (ad):
- Effective Python: 90 Specific Ways to Write Better Python
- Python Tricks: A Buffet of Awesome Python Features
- Python Programming: An Introduction to Computer Science
Otherwise, thanks for checking out the website! I appreciate it, and I hope you’ll stick around.
Recent Code Posts
In the world of programming languages, expressions are an interesting concept that folks tend to implicitly understand but might not be able to define. As a result, I figured I'd take a crack at...
It might seem like a straightforward concept, but variables are more interesting than you think. In an effort to expand our concept map, we're here to cover one of the most basic programming...