Episode 60: 10 Things I Hate About AI Evals with Hamel Husain

Update: 2025-09-30

Description

Most AI teams find "evals" frustrating, but ML Engineer Hamel Husain argues they’re just using the wrong playbook. In this episode, he lays out a data-centric approach to systematically measure and improve AI, turning unreliable prototypes into robust, production-ready systems.

Drawing from his experience getting countless teams unstuck, Hamel explains why the solution requires a "revenge of the data scientists." He details the essential mindset shifts, error analysis techniques, and practical steps needed to move beyond guesswork and build AI products you can actually trust.

We talk through:

The 10(+1) critical mistakes that cause teams to waste time on evals

Why "hallucination scores" are a waste of time (and what to measure instead)

The manual review process that finds major issues in hours, not weeks

A step-by-step method for building LLM judges you can actually trust

How to use domain experts without getting stuck in endless review committees

Guest Bryan Bischof's "Failure as a Funnel" for debugging complex AI agents

If you're tired of ambiguous "vibe checks" and want a clear process that delivers real improvement, this episode provides the definitive roadmap.

LINKS

Hamel's website and blog

Hugo speaks with Philip Carter (Honeycomb) about aligning your LLM-as-a-judge with your domain expertise

Hamel Husain on Lenny's podcast, which includes a live demo of error analysis

The episode of VG in which Hamel and Hugo talk about Hamel's "data consulting in Vegas" era

Upcoming Events on Luma

Watch the podcast video on YouTube

Hamel's AI evals course, which he teaches with Shreya Shankar (UC Berkeley): starts Oct 6 and this link gives 35% off! https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME

🎓 Learn more:

Hugo's course: Building LLM Applications for Data Scientists and Software Engineers — https://maven.com/s/course/d56067f338


Hugo Bowne-Anderson
