#87: The Map is Not The Territory

Understanding evals and the frontier of AI testing in mental health

Hi friend,

How will we know when AI is safe? Or when it is clinically effective? How can we compare one AI chatbot against another? 

Evals (evaluations) have emerged as one of the most common ways to answer these questions. These structured tests measure how an AI model behaves in specific scenarios by simulating how people use these products for their mental health.

The field is confusing. There are now more than sixty¹ evals in the mental health space alone. There's no shared standard. Some evals are public, while others are private and seen only by the companies that build them. Many people, understandably, don't know how they work. Everyone has different opinions on which evals are good, and some believe evals, at least in their current state, are not very useful at all.

Kevin Hou and I set out to understand this space. We've been gathering data and speaking to experts on AI in mental health. In this report, we give a primer on evals, discuss their current state, share their limitations and present what the frontier of AI testing looks like in 2026. In the appendix we also share a link to a rapid literature review² of recent research that uses evals to test AI performance in mental health.

Whether you know nothing about evals or are deep in the weeds of AI testing, I’m confident there’s something interesting in this for you.

Let’s get into it!
