Once again, I’m back with another look at some ways to solve a common Python problem. This time, we’ll be looking at how to split a string by whitespace (and other separators) in Python.
If you’re in a rush, here’s how to split a string by whitespace in Python: use the builtin split() method. It works for any string as follows: "What a Wonderful World".split(). If done correctly, you’ll get a nice list of substrings without all that whitespace (e.g. ["What", "a", "Wonderful", "World"]).
In the remainder of this article, we’ll look at the solution described above in more detail. In addition, we’ll try writing our own solution. Then, we’ll compare them all by performance. At the end, I’ll ask you to tackle a little challenge.
Let’s get started!
Problem Description
When we talk about splitting a string, what we’re really talking about is the process of breaking a string up into parts. As it turns out, there are a lot of ways to split a string. For the purposes of this article, we’ll just be looking at splitting a string by whitespace.
Of course, what does it mean to split a string by whitespace? Well, let’s look at an example:
"How are you?"
Here, the only two whitespace characters are the two spaces. As a result, splitting this string by whitespace would result in a list of three strings:
["How", "are", "you?"]
Of course, there are a ton of different types of whitespace characters. Unfortunately, which characters count as whitespace depends entirely on the character set being used. As a result, we’ll simplify this problem by only concerning ourselves with Unicode characters (as of the publish date).
In the Unicode character set, there are 17 “separator, space” characters. In addition, there are another 8 whitespace characters which include things like line separators. As a result, the following string is a bit more interesting:
"Hi, Ben!\nHow are you?"
With the addition of the line break, we would expect that splitting by whitespace would result in the following list:
["Hi,", "Ben!", "How", "are", "you?"]
In this article, we’ll take a look at a few ways to actually write some code that will split a string by whitespace and store the result in a list.
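To make the "separator, space" category a bit more concrete, here is a minimal sketch that asks Python's own unicodedata module which code points fall into the Zs category. The variable name space_separators is just an illustration, and the exact list depends on the Unicode version bundled with your interpreter.

```python
import sys
import unicodedata

# Collect every code point whose general category is "Zs" (separator, space).
space_separators = [
    chr(code_point)
    for code_point in range(sys.maxunicode + 1)
    if unicodedata.category(chr(code_point)) == "Zs"
]

# Print each one with its code point and official name.
for character in space_separators:
    print(f"U+{ord(character):04X} {unicodedata.name(character)}")

# A line break is not in the Zs category, but Python still treats it as whitespace.
print("\n" in space_separators)
print("\n".isspace())
```

Note that the line break from the example above lives outside this category, which is why the problem talks about "whitespace" rather than just "spaces".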
Solutions
As always, there are a lot of different ways to split a string by whitespace. To kick things off, we’ll try to write our own solution. Then, we’ll look at a few more practical solutions.
Split a String by Whitespace Using Brute Force
If I were given the problem description above and asked to solve it without using any libraries, here’s what I would do:
items = []
my_string = "Hi, how are you?"
whitespace_chars = [" ", ..., "\n"]
start_index = 0
end_index = 0
for character in my_string:
    # A whitespace character closes off the current substring
    if character in whitespace_chars:
        items.append(my_string[start_index: end_index])
        start_index = end_index + 1
    end_index += 1
# Grab the last word once the loop is done
items.append(my_string[start_index: end_index])
Here, I decided to build up a few variables. First, we need to track the end result which is items in this case. Then, we need some sort of string to work with (e.g. my_string).
To perform the splitting, we’ll need to track a couple indices: one for the front of each substring (e.g. start_index) and one for the back of the substring (e.g. end_index).
On top of all that, we need some way to verify that a character is in fact a whitespace. To do that, we created a list of whitespace characters called whitespace_chars. Rather than listing all of the whitespace characters, I cheated and showed two examples with a little ellipsis. Make sure to replace the ellipsis with real characters before relying on this code. Python gives those three dots meaning (they evaluate to the builtin Ellipsis object), so leaving them in won’t raise an error, and it likely won’t cause any harm either.
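As a quick sanity check on that ellipsis, here is a small sketch showing what the list actually contains. It also shows one possible stand-in for the elided characters, string.whitespace from the standard library, which only covers the ASCII whitespace characters; treating it as a replacement is my assumption, not something the list above spells out.

```python
import string

whitespace_chars = [" ", ..., "\n"]

# The three dots are a real object, so the list has three elements.
print(... is Ellipsis)
print(len(whitespace_chars))

# Membership tests still work, but only for the characters we actually listed.
print(" " in whitespace_chars)
print("\t" in whitespace_chars)

# One possible stand-in for the elided characters: ASCII whitespace only.
ascii_whitespace = list(string.whitespace)
print("\t" in ascii_whitespace)
```

A tab slips right past the two-character list, which is a good reminder that the brute force solution is only as good as the separators we give it.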
Using these variables, we’re able to loop over our string and construct our substrings. We do that by checking if each character is a whitespace. If it is, we know we need to construct a substring and update start_index to begin tracking the next word. Then, when we’re done, we can grab the last word and store it.
Now, there’s a lot of messiness here. To make life a bit easier, I decided to move the code into a function which we could modify as we go along:
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index: end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index: end_index])
    return items
Now, this solution is extremely error prone. To prove that, try running this function as follows:
split_string("Hello  World")  # returns ['Hello', '', 'World']
Notice how having two spaces in a row causes us to store empty strings? Yeah, that’s not ideal. In the next section, we’ll look at a way to improve this code.
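Before moving on, it's worth noting that a small guard is enough to patch this particular bug. The following is only a sketch: the name split_string_skip_empty and the default separator tuple are made up for illustration, and the idea is simply to skip any slice that would come out empty.

```python
def split_string_skip_empty(my_string: str, whitespace_chars=(" ", "\t", "\n")):
    items = []
    start_index = 0
    for end_index, character in enumerate(my_string):
        if character in whitespace_chars:
            # Only store the slice if it contains at least one character
            if start_index < end_index:
                items.append(my_string[start_index:end_index])
            start_index = end_index + 1
    # Store the final word, unless the string ended with a separator
    if start_index < len(my_string):
        items.append(my_string[start_index:])
    return items


print(split_string_skip_empty("Hello  World"))  # ['Hello', 'World']
```

This keeps the single loop, at the cost of a couple extra comparisons per separator.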
Split a String by Whitespace Using State
Now, I borrowed this solution from a method that we ask students to write for a lab in one of the courses I teach. The method is called “nextWordOrSeparator”, and its skeleton looks like this:
/**
 * Returns the first "word" (maximal length string of characters not in
 * {@code separators}) or "separator string" (maximal length string of
 * characters in {@code separators}) in the given {@code text} starting at
 * the given {@code position}.
 */
private static String nextWordOrSeparator(String text, int position,
        Set<Character> separators) {
    assert text != null : "Violation of: text is not null";
    assert separators != null : "Violation of: separators is not null";
    assert 0 <= position : "Violation of: 0 <= position";
    assert position < text.length() : "Violation of: position < |text|";

    // TODO - fill in body

    /*
     * This line added just to make the program compilable. Should be
     * replaced with appropriate return statement.
     */
    return "";
}
One way to implement this method is to check whether or not the first character is a separator. If it is, loop until it’s not. If it’s not, loop until it is.
Typically, this is done by writing two separate loops. One loop continually checks characters until a character is in the separator set. Meanwhile, the other loop does the opposite.
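For reference, the two-loop approach described above might look something like the following in Python. This is a sketch rather than the lab's official answer, and the name next_word_or_separator_two_loops is made up to keep it distinct from the other versions in this article.

```python
def next_word_or_separator_two_loops(text: str, position: int, separators: set) -> str:
    end_index = position
    if text[position] in separators:
        # We started on a separator: advance until we hit a non-separator
        while end_index < len(text) and text[end_index] in separators:
            end_index += 1
    else:
        # We started on a word: advance until we hit a separator
        while end_index < len(text) and text[end_index] not in separators:
            end_index += 1
    return text[position:end_index]


print(repr(next_word_or_separator_two_loops("Hi,  Ben", 0, {" "})))  # 'Hi,'
print(repr(next_word_or_separator_two_loops("Hi,  Ben", 3, {" "})))  # '  '
```

The two loops differ only by a single "not", which is where the redundancy comes from.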
Of course, I think that’s a little redundant, so I wrote my solution using a single loop (this time in Python):
def next_word_or_separator(text: str, position: int, separators: list):
    end_index = position
    is_separator = text[position] in separators
    while end_index < len(text) and is_separator == (text[end_index] in separators):
        end_index += 1
    return text[position: end_index]
Here, we track a couple variables. First, we need an end_index, so we know where to split our string. In addition, we need to determine if we’re dealing with a word or separator. To do that, we check if the character at the current position in text is in separators. Then, we store the result in is_separator.
With is_separator, all there is left to do is loop over the string until we find a character that is different. To do that, we repeatedly run the same computation we ran for is_separator. To make that more obvious, I’ve stored that expression in a lambda function:
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position: end_index]
At any rate, this loop will run until either we run out of string or our test_separator function gives us a value that differs from is_separator. For example, if is_separator is True then we won’t break until test_separator is False.
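To see that behavior in action, here is a short sketch that calls the function at a few hand-picked positions. The sample text and positions are my own; the function body is the lambda version from above.

```python
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position: end_index]


text = "Hi,  Ben!"
separators = [" "]

# Starting on a word returns the rest of that word
print(repr(next_word_or_separator(text, 0, separators)))  # 'Hi,'
print(repr(next_word_or_separator(text, 1, separators)))  # 'i,'

# Starting on a separator returns the rest of that run of separators
print(repr(next_word_or_separator(text, 3, separators)))  # '  '
print(repr(next_word_or_separator(text, 4, separators)))  # ' '
```

Notice that the function never looks backward: whatever it returns always begins at the position we hand it.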
Now, we can use this function to make our first solution a bit more robust:
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, whitespace_chars)
        items.append(sub)
        i += len(sub)
    return items
Unfortunately, this code is still wrong because we don’t bother to check if what is returned is a word or a separator. To do that, we’ll need to run a quick test:
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, whitespace_chars)
        if sub[0] not in whitespace_chars:
            items.append(sub)
        i += len(sub)
    return items
Now, we have a solution that is slightly more robust! Also, it gets the job done for anything we consider separators; they don’t even have to be whitespace. Let’s go ahead and adapt this one last time to let the user enter any separators they like:
def split_string(my_string: str, seps: list):
    items = []
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, seps)
        if sub[0] not in seps:
            items.append(sub)
        i += len(sub)
    return items
Then, when we run this, we’ll see that we can split by whatever we like:
>>> split_string("Hello, World", [" "])
['Hello,', 'World']
>>> split_string("Hello, World", ["l"])
['He', 'o, Wor', 'd']
>>> split_string("Hello, World", ["l", "o"])
['He', ', W', 'r', 'd']
>>> split_string("Hello, World", ["l", "o", " "])
['He', ',', 'W', 'r', 'd']
>>> split_string("Hello, World", [",", " "])
['Hello', 'World']
How cool is that?! In the next section, we’ll look at some builtin tools that do exactly this.
Split a String by Whitespace Using split()
While we spent all this time trying to write our own split method, Python had one built in all along. It’s called split(), and we can call it on strings directly:
my_string = "Hello, World!"
my_string.split()  # returns ["Hello,", "World!"]
In addition, we can provide our own separators to split the string:
my_string = "Hello, World!"
my_string.split(",")  # returns ['Hello', ' World!']
However, this method doesn’t work quite like the method we provided. If we input multiple separators, the method will only match the combined string:
my_string = "Hello, World!"
my_string.split("el")  # returns ['H', 'lo, World!']
In the documentation, this is described as a “different algorithm” from the default behavior. In other words, the whitespace algorithm will treat consecutive whitespace characters as a single entity. Meanwhile, if a separator is provided, the method splits at every occurrence of that separator:
my_string = "Hello, World!"
my_string.split("l")  # returns ['He', '', 'o, Wor', 'd!']
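To make the difference between the two algorithms easier to see, here is a sketch that runs both on the same messy string. It also shows one possible way to get the "any of these separators" behavior from our handwritten function using re.split from the standard library. The helper name split_on_any is made up, and it assumes every separator is a single character.

```python
import re

messy = "  Hello,   World!  "

# Default algorithm: runs of whitespace collapse, and the ends are trimmed
print(messy.split())     # ['Hello,', 'World!']

# Explicit separator: every single space produces a split
print(messy.split(" "))  # ['', '', 'Hello,', '', '', 'World!', '', '']


def split_on_any(text: str, seps: list) -> list:
    # Build a character class such as [,\ ]+ out of the single-character separators
    pattern = "[" + re.escape("".join(seps)) + "]+"
    # Drop the empty strings left behind by leading or trailing separators
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_any("Hello, World", [",", " "]))       # ['Hello', 'World']
print(split_on_any("Hello, World", ["l", "o", " "]))  # ['He', ',', 'W', 'r', 'd']
```

The last two results match what our own split_string() produced earlier for the same inputs.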
But, that’s not all! This method can also limit the number of splits using an additional parameter, maxsplit:
my_string = "Hello, World! Nice to meet you."
my_string.split(maxsplit=2)  # returns ['Hello,', 'World!', 'Nice to meet you.']
How cool is that? In the next section, we’ll see how this solution stacks up against the solutions we wrote ourselves.
Performance
To test performance, we’ll be using the timeit library. Essentially, it allows us to compute the runtime of our code snippets for comparison. If you’d like to learn more about this process, I’ve documented my approach in an article on performance testing in Python.
Otherwise, let’s go ahead and convert our solutions into strings:
setup = """
zero_spaces = 'Jeremy'
one_space = 'Hello, World!'
many_spaces = 'I need to get many times stronger than everyone else!'
first_space = ' Well, what do we have here?'
last_space = 'Is this the Krusty Krab? '
long_string = 'Spread love everywhere you go: first of all in your own house. Give love to your children, to your wife or husband, to a next door neighbor. Let no one ever come to you without leaving better and happier. Be the living expression of God’s kindness; kindness in your face, kindness in your eyes, kindness in your smile, kindness in your warm greeting.'

def split_string_bug(my_string: str):
    items = []
    whitespace_chars = [' ']
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index: end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index: end_index])
    return items

def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position: end_index]

def split_string(my_string: str, seps: list):
    items = []
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, seps)
        if sub[0] not in seps:
            items.append(sub)
        i += len(sub)
    return items
"""

split_string_bug = """
split_string_bug(zero_spaces)
"""

split_string = """
split_string(zero_spaces, [" "])
"""

split_python = """
zero_spaces.split()
"""
For this first set of tests, I decided to start with a string that has no spaces:
>>> import timeit
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
0.7218914000000041
>>> min(timeit.repeat(setup=setup, stmt=split_string))
2.867278899999974
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.0969244999998864
Looks like our next_word_or_separator() solution is very slow. Meanwhile, the builtin split() is extremely fast. Let’s see if that trend continues. Here are the results when we look at one space:
>>> split_string_bug = """
split_string_bug(one_space)
"""
>>> split_string = """
split_string(one_space, [" "])
"""
>>> split_python = """
one_space.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
1.4134186999999656
>>> min(timeit.repeat(setup=setup, stmt=split_string))
6.758952300000146
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.1601205999998001
Again, Python’s split() method is pretty quick. Meanwhile, our robust method is terribly slow. I can’t imagine how much worse our performance is going to get with a larger string. Let’s try the many_spaces string next:
>>> split_string_bug = """
split_string_bug(many_spaces)
"""
>>> split_string = """
split_string(many_spaces, [" "])
"""
>>> split_python = """
many_spaces.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
5.328358900000012
>>> min(timeit.repeat(setup=setup, stmt=split_string))
34.19867759999988
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.4214780000002065
This very quickly became painful to wait out. I’m a bit afraid to try the long_string test to be honest. At any rate, let’s check out the performance for the first_space string (and recall that the bugged solution doesn’t work as expected):
>>> split_string_bug = """
split_string_bug(first_space)
"""
>>> split_string = """
split_string(first_space, [" "])
"""
>>> split_python = """
first_space.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
3.8263317999999344
>>> min(timeit.repeat(setup=setup, stmt=split_string))
20.963715100000172
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.2931996000002073
At this point, I’m not seeing much difference in the results, so I figured I’d spare you the data dump and instead provide a table of the results:
Test | split_string_bug | split_string | split_python |
---|---|---|---|
zero_spaces | 0.7218914000000041 | 2.867278899999974 | 0.0969244999998864 |
one_space | 1.4134186999999656 | 6.758952300000146 | 0.1601205999998001 |
many_spaces | 5.328358900000012 | 34.19867759999988 | 0.4214780000002065 |
first_space | 3.8263317999999344 | 20.963715100000172 | 0.2931996000002073 |
last_space | 3.560071500000049 | 17.976437099999657 | 0.2646626999999171 |
long_string | 35.38718729999982 | 233.59029310000005 | 3.002933099999609 |
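Performance results from the timeit library for three separate split solutions. Each value is the best total time, in seconds, for timeit’s default of 1,000,000 executions per run.

Clearly, the builtin method should be the go-to method for splitting strings.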
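If waiting on a million executions per snippet sounds painful, a lighter version of the same experiment is possible. The sketch below is not how the numbers above were produced: it passes functions through timeit's globals parameter instead of a setup string, lowers number and repeat, and uses a tiny stand-in splitter named split_on_spaces that I made up for comparison.

```python
import timeit

sample = "Is this the Krusty Krab? "


def split_on_spaces(text: str) -> list:
    """Tiny handwritten splitter used only as a comparison point."""
    items, word = [], ""
    for character in text:
        if character == " ":
            if word:
                items.append(word)
            word = ""
        else:
            word += character
    if word:
        items.append(word)
    return items


# Names in this dictionary are visible to the timed statements
namespace = {"sample": sample, "split_on_spaces": split_on_spaces}

handwritten = min(timeit.repeat("split_on_spaces(sample)", globals=namespace, repeat=3, number=10_000))
builtin = min(timeit.repeat("sample.split()", globals=namespace, repeat=3, number=10_000))

print(f"handwritten: {handwritten:.4f}s, builtin: {builtin:.4f}s")
```

The absolute numbers will vary from machine to machine, so only the relative gap between the two is worth reading into.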
Challenge
At this point, we’ve covered just about everything I want to talk about today. As a result, I’ll leave you with this challenge.
We’ve written a function which can be used to split any string we like by any separator. How could we go about writing something similar for numbers? For example, what if I wanted to split a number every time the number 256 appears?
This could be a cool way to create a fun coding scheme where ASCII codes could be embedded in a large number:
secret_key = 72256101256108256108256111
We could then delineate each code by some separator code (in this case 256, because it’s outside of ASCII range). Using our method, we could split our coded string by the separator and then make sense of the result using chr():
arr = split_nums(secret_key, 256)  # [72, 101, 108, 108, 111]
print("".join([chr(x) for x in arr]))
If you read my article on obfuscation, you already know why this might be desirable. We could essentially write up an enormous number and use it to generate strings of text. Anyone trying to reverse engineer our solution would have to make sense of our coded string.
Also, I think something like this is a fun thought experiment; I don’t expect it to be entirely useful. That said, feel free to share your solutions with me on Twitter using #RenegadePython. For instance, here’s my solution:
As you can see, I used modular arithmetic to split the string. Certainly, it would be easier to convert the key to a string and split it using one of our solutions, right? That said, I like how this solution turned out, and I’m glad it works (as far as I can tell).
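For anyone who wants a starting point, here is one way the modular arithmetic idea could be sketched. To be clear, this is not the solution referenced above, just an illustration under a few assumptions: the number is non-negative, the individual codes never contain the separator's digits themselves, and the helper borrows len(str(separator)) to count digits for brevity.

```python
def split_nums(number: int, separator: int) -> list:
    """Hypothetical helper: split a non-negative integer on a separator value."""
    modulus = 10 ** len(str(separator))  # e.g. 256 -> 1000
    items = []
    current, place, has_digits = 0, 1, False
    while number > 0:
        if number % modulus == separator:
            # The trailing digits are the separator: store what we have so far
            if has_digits:
                items.append(current)
            current, place, has_digits = 0, 1, False
            number //= modulus
        else:
            # Peel off the last digit and add it to the front of the current code
            number, digit = divmod(number, 10)
            current += digit * place
            place *= 10
            has_digits = True
    if has_digits:
        items.append(current)
    # Digits were read right to left, so the codes come out backward
    items.reverse()
    return items


secret_key = 72256101256108256108256111
arr = split_nums(secret_key, 256)
print(arr)                             # [72, 101, 108, 108, 111]
print("".join(chr(x) for x in arr))    # Hello
```

Reading from the right keeps everything in integer land: the remainder tells us whether a separator is sitting at the end of the number, and integer division throws it away.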
A Little Recap
And with that, we’re done! As always, here are all the solutions from this article in one convenient location:
my_string = "Hi, fam!"

# Split that only works when there are no consecutive separators
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index: end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index: end_index])
    return items

split_string(my_string)  # ["Hi,", "fam!"]

# A more robust, albeit much slower, implementation of split
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position: end_index]

def split_string(my_string: str, seps: list):
    items = []
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, seps)
        if sub[0] not in seps:
            items.append(sub)
        i += len(sub)
    return items

split_string(my_string, [" "])  # ["Hi,", "fam!"]

# The builtin split solution **preferred**
my_string.split()  # ["Hi,", "fam!"]
If you liked this article, and you’d like to read more like it, check out the following list of related articles:
- How to Convert a String to Lowercase in Python
- How to Compare Strings in Python
- How to Check If a String Contains a Substring in Python
If you’d like to go the extra mile, check out my article on ways you can help grow The Renegade Coder. This list includes ways to get involved like hopping on my mailing list or joining me on Patreon.
Once again, thanks for stopping by. Hopefully, you found value in this article and you’ll swing by again later! I’d appreciate it.