Human Review of AI Output Is Not the Path Forward

Recently, I’ve noticed that a lot of folks still try to have a nuanced take about generative AI. Like, they might say, “AI has problems, but it’s fine as long as a human reviews the output.” Do we really think that’s the way forward? I sure don’t.

Is It Okay If a Human Reviews It?
Reviewing Synthetic “Work” Is Untenable
Proposing a Solution
But the Cat’s Out of the Bag

Is It Okay If a Human Reviews It?

A sentiment I see a lot with the current rapid adoption of generative AI is that it’s okay “as long as a human reviews the output.” I get why people say this. They believe the technology is good, possibly even necessary, but it makes mistakes. Therefore, all we need to do is double check its work.

I’ve seen this sentiment in a lot of places, but I most recently saw it in the r/law subreddit, where folks were debating the role of AI in cases. Apparently, a lot of lawyers are getting caught using AI in court filings because of case law hallucinations. Either the sources are completely hallucinated, or real cases are cited that do not back up the argument.

Naturally, as someone who is also required to cite other people’s work as a part of my job, I see no place for generative AI. Yet, I know I hold the minority position. After all, I’ve already reviewed several education papers where the authors essentially brag about using AI to conduct their studies. Given how things have gone in the courtroom, it’s only a matter of time before reviewers start calling out hallucinations in research papers (if that’s not happening already).

Yet, seeing these stories might actually fill you with hope: “See! Human intervention is working as intended.” Shouldn’t I be glad that people are catching AI mistakes?

The short answer is… no, I’m not glad that people are successfully finding mistakes in AI output. All this tells me is that manual review is only scratching the surface. Each story is just a single example of a hallucination that was caught. What about the plethora of hallucinations that slipped through?

Reviewing Synthetic “Work” Is Untenable

Given what we’ve discussed so far, I don’t think it’s safe to say that “AI is fine as long as a human reviews the output.” There is just no way that position is tenable long term. Let me pitch you a few reasons why.

To start, transitioning people from “doing the work” to “reviewing the work” is a surefire way to deskill those people. I am a perfect example of this. As an educator and maintainer, I very rarely write code these days. I primarily debug student code, suggest strategies, explain theory, and review pull requests. Almost certainly my ability to develop software has atrophied over the last few years. Do you want to deskill an entire population of programmers, lawyers, and more by transitioning their roles to simply reviewing work done for them?

Likewise, as I suggested in the last section, are we certain that people will always conduct the most rigorous reviews? Like, the idea that “it’s fine if a human reviews it” only works if people are diligent in their reviews. In my experience, reviews are rarely, if ever, thorough. When I look at a pull request, I skim the code to make sure it makes sense, follows proper style, etc. Then, I trust the test cases to have verified the behavior. I’m not doing a line-by-line read of the entire change because I trust the person who worked on it. When you outsource your development to a chat bot, I can no longer implicitly trust your output.

And even if the reviews are thorough, we run into a new problem: generative AI is capable of producing synthetic output (read: slop) at a rate that no human could ever expect to achieve or review. Suddenly, reviewing code and whatnot becomes an overwhelming task. So you either hastily review it, or you let the backlog fill up. The latter case seems to be what’s happening with a lot of open source repositories.

Of course, all of the above is only possible if the reviewers actually have expertise in the thing they’re reviewing, even if we factor out deskilling. I am able to review pull requests because I have a background in software development. I can also review manuscripts because I have a background in qualitative research. What is the plan long term for reviewers? Are we still going to train them to write/code/etc., or are we going to train them to review output?

I bring this up because I’m not sure the path forward is “everyone becomes a critic of a craft.” Like, do we train people to be “code critics” in the same way we train film critics? Is that even possible? In my mind, it’s not. After all, we’ve all seen a video on how to make a fire, but how many of us could actually do it (assuming no prior knowledge)? Or better yet, could we properly critique someone doing it? I just don’t know how we go down the path of equating “seeing a ton of things” to “making a ton of things.” How can you evaluate a manuscript if you’ve never written one? How can you assess a pull request if you’ve never written code? If no one builds things, how can we expect them to have the expertise to review synthetic “work”?

Finally, let’s suppose we toss all the previous arguments aside and train people to be reviewers. There’s still a problem: being a reviewer sucks. I don’t like reviewing manuscripts. I don’t like reviewing pull requests. And I don’t like grading assignments. These are all tasks where you review someone’s work and help them improve. I do these things because they’re the right thing to do (e.g., paying it forward for the next generation), but they’re not something I enjoy doing. To be clear, I am absolutely not advocating for these kinds of tasks to be automated, but rather for workers to be properly compensated. It’s a lot of time and effort to critique someone’s work and help them grow. Not to mention that you’re never just reviewing one manuscript, pull request, or assignment, so the reviewer also pays the cost of context switching. Isn’t turning everyone into a reviewer going to fry some brains?

Proposing a Solution

While I get that a lot of well-meaning folks think it’s reasonable to review the synthetic output, I’m not sure that’s the solution long term. What disturbs me, however, is what many AI sycophants propose instead: remove people from the loop.

A proposed solution that comes up again and again, from folks in that law thread to people in my own life, is that every problem can be solved with another model. Are you worried about people poisoning your AI model? Use another AI model to filter the training data. Are you worried that humans will review code poorly? Use a different AI model for code review.

Like, I hate working as much as the next guy, but this solution doesn’t even pass the smell test. There’s not a single student out there who trusts tools like GPTZero to accurately check if their essay was written with AI. Why the hell would you trust a discriminator to differentiate between real and synthetic data? Isn’t the goal for these models to produce output that’s indistinguishable from reality? Likewise, do we really expect these models to know what “good” code looks like? What is “their” frame of reference? Uncle Bob’s Clean Code? Reddit?

This entire approach is about kicking the hallucinated can down the road. It’s why folks have completely moved on to agents. They know these problems are unsolvable, but they can mask them by letting an agent spin the LLM slot machine on their behalf. Eventually, if you spin enough times, you might get something with just the right combination of hallucinations to work.

But the Cat’s Out of the Bag

I wouldn’t mind this current era of tech sycophancy so much if there weren’t so many centrists giving credence to these AGI grifters. Like, I can’t even get away from these “it’s the new calculator” takes from regular people I chat with in real life.

Recently, I was given an exit interview for a workshop I was a part of for the last couple of semesters. In it, the interviewer asked me for some feedback, and I shared that I was glad they were promoting actual teaching strategies and not hopping on the AI hype train. The interviewer pushed back a bit and said, “well, it’s not like it’s going anywhere.”

I find this mindset so pathetic. It’s the exact mindset that allows the fossil fuel industry to continue dominating the energy sector. It’s why we don’t have nearly enough options for EVs. It’s why we don’t have trains. It’s why our infrastructure is just miles and miles of highways and parking lots. We just accept it because “the car isn’t going anywhere” or “gas isn’t going anywhere.”

Imagine if we said the same thing about any number of advancements that turned sour. “Sorry, the cat’s out of the bag on CFCs. The ozone layer is cooked. Nothing we can do.” “Sorry, the cat’s out of the bag on asbestos. Every house must be built with it. Nothing we can do.” Yet, somehow AI gets a pass because it makes the dumbest person you know feel like a genius. I hate this timeline.

Anyway, I’m going to call it there. Hopefully, you had some fun with this piece. It’ll go in a long list of posts in my already massive series about all the things I hate about AI. Turns out that even a year later, I’m still annoyed by this technological takeover. If you are too, then you might enjoy some of the other pieces in that series.

If not, no sweat. I would appreciate it if you stopped by my list of ways to grow the site anyway. Otherwise, take care!
