The Pesky Challenge of Evaluating AI Outputs

Julie Dirksen is a leading expert in instructional design, eLearning, and behavior change. She’s a frequent speaker at industry events, an eLearning Guild Guild-master, and the author of one of the bestselling books on instructional design, “Design for How People Learn.”

Originally posted to

One of the things that has bothered me since the beginning of the AI conversation is that most discussions of using AI or LLM outputs contain some phrase about the importance of “evaluating the output to make sure it’s correct” or something along those lines. Pretty much any responsible writing about AI contains a reference to the importance of not accepting the AI output at face value, but reading or viewing it to make sure it’s okay.

But here’s the thing. They say that very casually LIKE IT’S NOT THE HARD PART.

First of all, you need the expertise to judge an output, and second you need the discipline to exert the effort required to assess an output.

One of the early ChatGPT efforts I saw was somebody using the AI tool to write a short textbook on the teaching method of Direct Instruction (this is not intended as a slight on that person – it was clearly an experiment at the same time people were experimenting with ChatGPT to write Shakespearean sonnets about their favorite dog breed).

I have no reason to believe this was an effort to really create a textbook, but it gives us an example to work with. If you were actually using it as a tool to help create an actual textbook, what questions would you need to ask (ethical issues aside)?

The first question would probably need to be “Is this an accurate resource about Direct Instruction?” Possibly this person had the expertise to evaluate the output, but it seems like something that you would want an expert to review before distributing widely.

Second, you’d need to have the discipline to read the whole thing.

This scientific article was making the rounds on the internet in March. I’m utterly unqualified to comment on the accuracy of this article, but even I know if you start your introduction with the phrase “Certainly, here is a possible introduction for your topic: Lithium-metal batteries are…” then it means somebody wasn’t being careful with the copy-and-paste function. Academic writing can be a tedious process, and AI might be able to help with that, but if the authors of the article aren’t actually reading what the AI produces, that’s a problem.

It’s part of human nature to accept defaults. This isn’t always the case, but it is common enough that we should be very concerned about people having the discipline to stop and review AI outputs. I’m not convinced that it’s a realistic goal – quick copy and paste is too easy a behavior to admonish people out of – but that means we need to have other safeguards in place. Either that means guardrails built into the AI itself, which are being developed, or it means having at least a spot check or audit process in a workplace context.

I know one of the arguments has been that error rates in AI-generated material can be lower than error rates in human-generated material (Jennifer Solberg discusses a few good examples in this podcast), which raises the question of whether we should hold the AI to a 100% standard when we don’t hold humans to that standard. First, YES WE SHOULD. I suspect that there is more likely to be an internal logic with human errors, but that’s a complicated topic that is probably very context dependent.

Risk level should drive this level of vigilance. What is the consequence of this being wrong? Is it a wonky social media post, or is it a missed cancer diagnosis, or is it an inaccurately placed drone attack?

More thoughts to come on this, but for now, I think there are a few questions we should be asking:

  • Does this person have the knowledge and expertise to judge this output?
  • Is it reasonable to expect this person has the discipline to evaluate the outputs in detail?
  • What is the risk if output errors are not caught?

Love this @MilwJoe, the ability to synthesize, discern, evaluate - to critically think - is being brushed aside far too casually. “Oh, the tool/tech/platform will take care of that” isn’t the answer, and we are finding that the rule of “GIGO” (garbage in, garbage out) coined long ago is as true today as it was then.

Let us push for the use of tools within the framework of ethical, moral, and accurate application of the results being created and delivered. Perhaps our focus and conversations should be around the application more than the apps? Great share!!
