#87: The Map is Not The Territory

Understanding evals and the frontier of AI testing in mental health

Hi friend,

How will we know when AI is safe? Or when it is clinically effective? How can we compare one AI chatbot against another? 

Evals (evaluations) have emerged as one of the most common ways to answer these questions. These structured tests measure how an AI model behaves in specific scenarios by simulating how people use these products for their mental health.

The field is confusing. There are now more than sixty¹ evals in the mental health space alone. There's no shared standard. Some evals are public, while others are private and seen only by the companies that build them. Many people, understandably, don't know how they work. Everyone has different opinions on which evals are good, and some believe evals, at least in their current state, are not very useful at all.

Kevin Hou and I set out to understand this space. We've been gathering data and speaking to experts on AI in mental health. In this report, we give a primer on evals, discuss their current state, share their limitations and present what the frontier of AI testing looks like in 2026. In the appendix we also share a link to a rapid literature review² of recent research that uses evals to test AI performance in mental health.

Whether you know nothing about evals or are deep in the weeds of AI testing, I’m confident there’s something interesting in this for you.

Let’s get into it!
