How to Parse a Spreadsheet in Python

How to Parse a Spreadsheet in Python Featured Image

Last Updated on

Every once in awhile, I’ll have the need to load data from a spreadsheet into a Python program, but one question always comes up: what’s the best way to parse a spreadsheet in Python? The goal of today’s article is to find out!

Table of Contents

Problem Introduction

Recently, I was learning a visualization library in Python called VTK, and I needed to find a way to visualize some data from a spreadsheet. Unfortunately, I had two problems:

  1. I didn’t know how to read a spreadsheet
  2. I didn’t know how to parse the data that I read

In other words, what are some good ways to read spreadsheet data? And, what are some good ways to model that data in a Python program?

Parsing CSV Files

That first question will be the target of this article. In particular, we’ll be looking at various ways to read data from a CSV file. For example, does it make sense to try to write our own parsing implementation? After all, a CSV is one of the easier file formats to parse (as seen below), and Python is great for working with strings:

Name,Age,Favorite Color
Jeremy,25,Blue
Ally,41,Magenta
Jasmine,29,Aqua

That said, we may prefer to use some of the utilities provided by Python like the csv package. What’s the point of reinventing the wheel when there’s a battle tested library built right into the language?

As usual, we’ll tackle a handful of solutions and discuss their pros and cons. By the end of this article, you should feel comfortable parsing CSV files yourself.

Internal Representation

As for the second question, we have a couple of options. For starters, we could interpret the spreadsheet as a giant list of lists. In this example, we could give each row of data its own list and store these rows in a list:

# Heading: Name, Age, Favorite Color
csv_matrix = [
  ["Jeremy", 25, "Blue"],
  ["Ally", 41, "Magenta"],
  ["Jasmine", 29, "Aqua"]
]

Alternatively, we could give each column of data their own list and store those lists in a list:

# Heading: Name, Age, Favorite Color
csv_matrix = [
  ["Jeremy", "Ally", "Jasmine"],
  [25, 41, 29],
  ["Blue", "Magenta", "Aqua"]
]

In either case, we’d have a giant matrix of data that would closely resemble the original CSV file.

Of course, I’m a bit partial to dictionaries, so I might like to use one of those. For instance, what’s stopping us from creating a dictionary where each key provides us with an entire column of data?

csv_dict = {
  "Name": ["Jeremy", "Ally", "Jasmine"],
  "Age": [25, 41, 29],
  "Favorite Color": ["Blue", "Magenta", "Aqua"]
}

Alternatively, we could flip the relationship so we’re storing a list of dictionaries. That way, the data rows are mappings:

csv_mapping_list = [
  {
    "Name": "Jeremy",
    "Age": 25,
    "Favorite Color": "Blue"
  },
  {
     "Name": "Ally",
     "Age": 41,
     "Favorite Color": "Magenta"
  },
  {
    "Name": "Jasmine",
    "Age": 29,
    "Favorite Color": "Aqua"
  }
]

Personally, I prefer this last representation because data samples stick together. In other words, there’s no risk of distorting the original data set during actions like sorting. Meanwhile, the other three representations have independent lists which have to be sorted together. That’s enough to give this last representation the win.

In the next section, we’ll start digging into some parsing solutions which will all leverage this last representation. If you have any questions about the other three, feel free to drop them in the comments below.

Solutions

As always, we’ll cover a handful of common solutions to today’s problem. Feel free to grab what you need and run, but I recommend reading through the pros and cons of each solution. After all, you’ll never know when some of this might be useful.

Parse a Spreadsheet with Brute Force

At its core, a spreadsheet is a comma separated file. If we want to parse one by hand, we need to be aware of a few things:

  • The header row
  • The line endings
  • The delimiter (in this case a comma)

In our original example, we had a spreadsheet that looked something like the following:

Name,Age,Favorite Color
Jeremy,25,Blue
Ally,41,Magenta
Jasmine,29,Aqua

To parse it by hand, we’d want to read each line and split it by comma. After that, we’d need to do some post processing to get the format we want:

csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
  line_count = 0
  for line in my_data:
    row_list = [val.strip() for val in line.split(",")]
    if line_count == 0:
      header = row_list
    else:
      row_dict = {}
      for i, key in enumerate(header):
        row_dict[key] = row_list[i]
      csv_mapping_list.append(row_dict)
    line_count += 1

Here is my very rough attempt at parsing a spreadsheet by hand. In this example, we open the CSV and read it line by line. For each line, we split it by comma and systematically trim each value using a list comprehension.

From there, we decide if the line we just parsed is the header or not. If it is, we save its value for later. Otherwise, we iterate over the row of values and map them into a dictionary using the header. I suppose it may have been simpler to use one of the methods from our How to Convert Two Lists into a Dictionary in Python article:

csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
  line_count = 0
  for line in my_data:
    row_list = [val.strip() for val in line.split(",")]
    if line_count == 0:
      header = row_list
    else:
      row_dict = {key: value for key, value in zip(header, row_list)}
      csv_mapping_list.append(row_dict)
    line_count += 1

In either case, this solution leaves a lot of room for modification. For instance, if your file has some other kind of delimiter, this may be the solution for you.

That said, be aware that the brute force solution has a major drawback. According to Juha-Matti Santala, splitting by comma can fail if one of the rows contains text with a comma in it (i.e. "Grifski, Jeremy",25,Blue). To make matters worse, this problem changes depending on which delimiter you use. Fortunately, there are better solutions to follow!

Parse a Spreadsheet with the CSV Reader Object

As mentioned previously, we don’t have to write our own CSV parser if we don’t want to. Instead, we can get by with the csv package. As you can probably imagine, it has a ton of CSV parsing functionality. In particular, it contains the reader object which we can use to read a CSV file like we did previously:

import csv

csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
  csv_reader = csv.reader(my_data, delimiter=",")
  line_count = 0
  for line in csv_reader:
     if line_count == 0:
       header = line
     else:
       row_dict = {key: value for key, value in zip(header, line)}
       csv_mapping_list.append(row_dict)
     line_count += 1

With the reader object, we haven’t managed to simplify our code that much. In fact, all we did was replace the list comprehension with the reader instantiation. Regardless, it’s a nice option for those who would prefer not to write their own parser.

Parse a Spreadsheet with the CSV DictReader Object

At this point, you’re probably wondering why we would even use the csv library. After all, it barely made a difference in the last two examples. Fortunately, there is a way to reduce our code a bit using a special class in the csv library called DictReader:

import csv

with open("/path/to/dict.csv") as my_data:
  csv_mapping_list = list(csv.DictReader(my_data))

And, there we have it! All the parsing we did in the first two sections has now been drastically reduced to just 3 lines of code.

That said, there is a minor difference in this solution. Instead of creating a list of dictionaries, we’ve created a list of OrderedDict objects. They can be used just like dictionaries, but their key order is fixed. Regardless, the elegance of this solution should more than make up for that fact, and in some cases we may even want to preserve key order.

A Little Recap

At this point, I find it nice to look at all the potential solutions at once:

# Brute force solution
csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
  line_count = 0
  for line in my_data:
    row_list = [val.strip() for val in line.split(",")]
    if line_count == 0:
      header = row_list
    else:
      row_dict = {key: value for key, value in zip(header, row_list)}
      csv_mapping_list.append(row_dict)
    line_count += 1

# CSV reader solution
import csv

csv_mapping_list = []
with open("/path/to/data.csv") as my_data:
  csv_reader = csv.reader(my_data, delimiter=",")
  line_count = 0
  for line in csv_reader:
     if line_count == 0:
       header = line
     else:
       row_dict = {key: value for key, value in zip(header, line)}
       csv_mapping_list.append(row_dict)
     line_count += 1

# CSV DictReader solution
import csv

with open("/path/to/dict.csv") as my_data:
  csv_mapping_list = list(csv.DictReader(my_data))

Up next, I’m looking to write an article on how to sort a list of dictionaries, so look out for that! Until then, thanks for sticking around. Hopefully, this article was helpful. If so, why not give it a share?

Series Navigation← How to Sort a List of Strings in PythonHow to Sort a List of Dictionaries in Python →
Advertisements

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.