Today, we’re going to get a little pedantic and try to define concepts like compiler and interpreter. Of course, I ultimately don’t think it matters what the exact definitions are for practical purposes, but I thought I would try to draw a distinction regardless.
Table of Contents
- What’s the Confusion?
- High-Level Overview of Programming Languages
- Processing Source Code
- The Difference Between Compilers and Interpreters
- Why the Confusion?
What’s the Confusion?
There are a lot of misconceptions that swirl around terms like compilers and interpreters. In fact, I ran a little poll with my classes this week and found out that most students describe an interpreter as a compiler that runs code one line at a time. If this is how you would have described an interpreter, then you’ve come to the right place for clarity!
Just so we’re clear, the misconception around what an interpreter actually is can be found all over the internet. Just searching up “compiler vs interpreter” will give you an article by my favorite website to hate, GeeksForGeeks, with the following definition:
An interpreter is a program that translates a programming language into a comprehensible language. The interpreter converts high-level language to an intermediate language. It contains pre-compiled code, source code, etc.
- It translates only one statement of the program at a time.
- Interpreters, more often than not are smaller than compilers.
But don’t worry! They’re not the only content farm that churns out this kind of definition. Right behind it, Programiz dumps a table comparing the two, which states that an interpreter “translates a program one statement at a time.”
Broadly, in this article, I’m going to try to argue that this is not the definition of an interpreter, and if you don’t believe me, I encourage you to browse the Wikipedia page. In addition, I’m going to try to outline the definitions of a few related terms. However, to do that, we’ll need to understand how these systems work.
High-Level Overview of Programming Languages
Programming languages have basically two concepts that govern them: syntax and semantics.
To start, syntax determines the rules for the structure of the code. If the proper syntax is not followed, then the program is not legal (i.e., it cannot be compiled or interpreted).
As you can probably imagine, syntax is a term stolen from natural language, where it refers to the way we structure sentences. In English, “every” sentence must have a subject and a verb in that order. In other words, “she runs” is syntactically valid while “runs she” is not.
Just following the syntactic rules of English is not enough to constitute a meaningful sentence. For example, the following sentence is syntactically valid but completely meaningless: the guitar listens to a happy door.
The same applies to programming languages. In other words, if syntax describes the structure of the code, semantics describes the meaning of the code. I often joke with my students that I’m going to make a programming language with unexpected semantics, like if statements that actually loop or arithmetic operators that do the opposite (e.g., subtraction that does addition).
Ultimately, for a compiler or interpreter to do its job, it needs to be intimately familiar with both the proper syntax of the language and its semantics. Next, we’ll see why!
Processing Source Code
Generally speaking, the syntax of a programming language is defined through a grammar. There are many different kinds of grammars, which are out of the scope of this article, but feel free to look up context-free and attribute grammars. In short, they define the general structures of the language and how they’re related.
By knowing the grammar, we can check a sample program for syntactic validity. In other words, according to the rules in the grammar, is some sample code a valid program? If not, throw an error to the user.
Now, you could easily do this by hand by checking the code against your grammar, but we make tools that do this for us. They’re called compilers and interpreters.
Regardless of which one we’re talking about, they have a lot of overlap. For example, the general first step of a compiler or interpreter is to tokenize the input file. In short, tokenization refers to the process of breaking up the source code into “tokens.” You get to choose what a token is, but a rough process would be to split the source code by whitespace.
Therefore, the output of the tokenization step should be a sequence of tokens. For example, here’s a short Java program:
public class Example { public static void main(String[] args) { System.out.println("Hi"); } }
If you were to tokenize this input, you might try splitting it by whitespace. In which case, your output would be a sequence like the following:
[ "public", "class", "Example", "{", "public", "static", "void", "main(String[]", "args)", ... ]
Of course, this isn’t quite good enough. We would probably need to account for braces and parentheses, so they show up as their own tokens. That said, you get the idea! The output of tokenization would be a list of tokens.
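To make that concrete, here’s a minimal sketch of a tokenizer in Python. The `tokenize` function and its regular expression are illustrative inventions, not how any real lexer is written, but they show the idea of producing punctuation as separate tokens rather than naively splitting on whitespace:

```python
import re

def tokenize(source):
    """Break source code into a list of tokens.

    A toy sketch: real lexers also handle string literals, numbers,
    comments, operators, and so on.
    """
    # Match an identifier/keyword (dots allowed, so "System.out.println"
    # stays together), or any single punctuation character.
    token_pattern = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*|[{}()\[\];\"]|\S")
    return token_pattern.findall(source)

tokenize("public class Example { }")
# -> ['public', 'class', 'Example', '{', '}']
```

Notice that with this version, `main(String[] args)` correctly splits into seven tokens instead of the two whitespace-separated chunks we got above.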
Next, the sequence of tokens is fed into a parser. Much like grammars, there are probably 1,000 different ways to parse a set of tokens. The easiest way is to look at the first token and compare it with the grammar. If there is a substitution rule that fits, you might try to expand it. If not, there’s a syntax error.
To put it another way, think again about how English works. We can take a sentence and splice it up into tokens. As a form of meta-humor, we’ll use the previous sentence as an example:
[ "We", "can", "take", "a", "sentence", "and", "splice", "it", "up", "into", "tokens" ]
Next, we can look at the first token and see if it fits our grammar rules. “We” is a pronoun, which is an acceptable start to a sentence, so great start! “can” is a verb (in fact, it’s a special verb known as an auxiliary verb), which is also a reasonable next token. Basically, you follow this process until you exhaust all the tokens or you run into an unexpected token.
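The English example above can be sketched in code, too. Here’s a toy “parser” for a made-up grammar where a sentence must be a pronoun followed by a verb; the word sets and the `parse_sentence` function are invented for illustration, not taken from any real grammar formalism:

```python
# Toy grammar: SENTENCE -> PRONOUN VERB
PRONOUNS = {"we", "she", "they"}
VERBS = {"can", "runs", "take"}

def parse_sentence(tokens):
    """Check a token sequence against the toy grammar, raising a
    'syntax error' at the first unexpected token."""
    if not tokens or tokens[0].lower() not in PRONOUNS:
        raise SyntaxError(f"expected a pronoun, got {tokens[:1]}")
    if len(tokens) < 2 or tokens[1].lower() not in VERBS:
        raise SyntaxError(f"expected a verb, got {tokens[1:2]}")
    return True

parse_sentence(["She", "runs"])    # valid
# parse_sentence(["runs", "She"])  # raises SyntaxError
```

This is the same process a real parser follows, just with grammar rules for tokens like identifiers and braces rather than pronouns and verbs.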
The Difference Between Compilers and Interpreters
Up to this point, both compilers and interpreters tokenize (sometimes called lexical analysis) and parse their input. While I didn’t explicitly state it, the output of a parser is usually an abstract syntax tree. Without getting too into the weeds, an abstract syntax tree is the syntax of the input program structured as a tree. That way, the tree can be traversed for various purposes.
Where compilers and interpreters differ is in what happens with the abstract syntax tree (AST)—though, an interpreter may never even need to generate one. It should be clear why in a moment.
Specifically, a compiler is going to traverse the AST to produce code. In other words, the output of a compiler is code. It could be bytecode. It could be machine code. It could be code in some other language. It doesn’t really matter as long as the output is code.
On the other hand, an interpreter is going to traverse the AST to evaluate it. In other words, an interpreter is going to run the code given to it directly.
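We can see the distinction with a toy AST for arithmetic. Both functions below traverse the exact same tree; the names `interpret` and `compile_to_source`, and the tuple-based AST, are my own illustrative inventions. One produces a value, the other produces code:

```python
# A tiny AST: ("num", 3) is a literal, ("add", left, right) is addition.
def interpret(node):
    """An interpreter traverses the AST and evaluates it directly."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "add":
        return interpret(node[1]) + interpret(node[2])
    raise ValueError(f"unknown node: {kind}")

def compile_to_source(node):
    """A compiler traverses the same AST but emits code (here, Python
    source, though it could just as easily be bytecode or assembly)."""
    kind = node[0]
    if kind == "num":
        return str(node[1])
    if kind == "add":
        return f"({compile_to_source(node[1])} + {compile_to_source(node[2])})"
    raise ValueError(f"unknown node: {kind}")

ast = ("add", ("num", 1), ("add", ("num", 2), ("num", 3)))
interpret(ast)          # -> 6
compile_to_source(ast)  # -> "(1 + (2 + 3))"
```

The interpreter hands you the answer; the compiler hands you a program that, when run, produces the answer.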
Ultimately, the difference is that an interpreter executes the code and a compiler does not. Therefore, what you often read about an interpreter executing code “line-by-line” or “statement-by-statement” is more descriptive of a REPL (i.e., a Read-Eval-Print Loop) than of an interpreter in general. Though, “statement-by-statement” doesn’t even make sense on its own, since not all languages contain statements, but that’s getting overly pedantic.
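In fact, a REPL is simple enough to sketch in a few lines. This toy version reads its “lines” from a list so it terminates (a real REPL would read from stdin in an infinite loop), and it cheats by using Python’s own eval; the `repl` function itself is an invented example:

```python
def repl(lines):
    """A minimal REPL sketch: Read one entry, Evaluate it, Print the
    result, and Loop."""
    results = []
    for line in lines:          # Read
        value = eval(line)      # Evaluate (toy: delegate to Python's eval)
        print(value)            # Print
        results.append(value)   # ...and Loop
    return results

repl(["1 + 1", "2 * 3"])  # prints 2, then 6
```

Note that the one-entry-at-a-time behavior lives in the loop, not in the evaluator. That’s exactly why “runs code one line at a time” describes the REPL wrapper rather than interpretation itself.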
Why the Confusion?
While the definitions above might seem clear-cut, the concepts themselves are actually steeped in gray area.
For starters, some interpreters will actually compile code to machine code right before execution (i.e., via a JIT compiler). In other words, in addition to running code, an interpreter might produce code. Then, the next time that code runs, the interpreter can execute the compiled machine code directly instead of interpreting it again.
In other cases, the compiler and interpreter might be two different steps. For example, you can compile Java code to bytecode. Then, you can interpret the bytecode with the JVM, which itself may contain a JIT compiler for producing machine code. Python operates somewhat similarly by first compiling the code to bytecode then immediately executing it, though the compilation step is hidden from the user.
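You can actually watch Python do this split yourself, using the real built-ins `compile()` and `exec()` along with the standard-library `dis` module:

```python
import dis

# compile() performs the compilation step explicitly: source code in,
# a bytecode (code) object out. Nothing has executed yet.
code = compile("x = 1 + 2", "<example>", "exec")

# The interpreter (the CPython virtual machine) then executes the bytecode.
namespace = {}
exec(code, namespace)
assert namespace["x"] == 3

# dis.dis shows the bytecode instructions the compiler produced.
dis.dis(code)
```

Normally `python script.py` does both steps back to back (caching the bytecode in `.pyc` files), which is why the compilation step is easy to miss.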
To make matters even worse, some of the stuff I said above is misleading. Languages themselves are neither compiled nor interpreted. They’re just languages. We can make tools that do either or both. In fact, you could write your own Java interpreter right now, if you wanted to.
This all falls back on my older discussion on defining what a programming language is. Several times on this site I’ve argued that it’s silly to even entertain the question of whether or not “X” is a programming language. Frankly, languages don’t do anything. It’s the tooling around those languages that gives them meaning.
At any rate, this will hopefully be my last article before I have my doctorate, so this is my last chance to act like a pretentious grad student. Jokes aside, I ultimately don’t think it matters how you conceptualize compiler and interpreter. Many of these details are abstracted away at this point and have almost no effect on day-to-day software development.
That said, if you find yourself in a situation where someone says “um, actually an interpreter …”, you will be somewhat better prepared to deal with it. Or at the very least, I’ve led you down some rabbit hole to better understand how compilers and interpreters work. If you’re curious about that, I think I have some related articles you might like:
- Making Sense of the Python Interpreter
- Looking back on this article is quite funny because I even use phrases like “line-by-line” to introduce the concept of an interpreter. The article should perhaps be titled “Making Sense of the Python REPL,” though it’s not wrong to call IDLE an interpreter. It’s just a special kind of interpreter known as an interactive interpreter.
- The Lisp Programming Language: Interpreter Design
- Explain Like I’m Five: Context-Free Grammars
Finally, feel free to take your support another step by checking out my list of ways to grow the site. Otherwise, see you next time!