Lenny's Podcast: Product | Career | Growth
Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Update: 2025-09-25
Digest

This podcast emphasizes the critical role of "evals" in building successful AI products, defining them as systematic methods for measuring and improving AI applications, akin to data analytics for LLM applications. It debunks common misconceptions, such as AI being able to self-evaluate, and introduces the "benevolent dictator" concept for efficient eval processes. The discussion highlights error analysis, using techniques like open coding and axial coding, as a foundational step. It also covers building and validating LLM judges, the challenges of criteria drift, and the relationship between evals, A/B tests, and data science. Practical advice is offered on getting started, managing time investment, and the high ROI of continuous improvement through evals, ultimately positioning them as a core component of AI product strategy.

Outlines

00:00:00
The Importance and Definition of Evals in AI Product Development

Evals are systematic methods for measuring and improving AI applications, offering high ROI for product builders. They are more than unit tests, encompassing broader quality measurement and data analytics for LLM applications.

00:00:35
Demystifying Evals: Common Misconceptions and the Benevolent Dictator

This section addresses the misconception that AI can self-evaluate and introduces the "benevolent dictator" model to streamline the eval process by appointing a domain expert. Evals are presented as a critical skill for AI product builders.

00:10:06
Practical Error Analysis for AI Products

The process of building evals is demonstrated through error analysis of application logs (traces) to identify issues like incorrect information or conversational flow problems. This involves manual "open coding" to capture initial observations.

00:25:15
Synthesizing Errors and Building LLM Judges

Error analysis is synthesized using AI through "axial coding" to categorize failure modes. LLM judges are then built to automate the assessment of complex failure modes, providing pass/fail results after careful validation against human judgment.

01:03:17
Challenges and Nuances in Evals

The paper "Who Validates the Validators?" highlights challenges like "criteria drift." The discussion also clarifies that evals are essentially data science applied to products, addressing misconceptions about automated evaluation and the need for raw data analysis.

01:13:53
Evals for Different AI Products and Unified Approaches

Coding agents are discussed as a unique case where developers are domain experts, simplifying evals. A/B tests are framed as a form of evals, emphasizing grounding hypotheses in error analysis.

01:24:13
Practical Tips and Time Investment for Evals

Key advice includes not fearing data analysis, using LLMs for organization, and focusing on actionable improvements. Initial setup may take days, but ongoing efforts are minimal, yielding a high ROI.

01:33:54
Deep Dive into Evals Course Content and Perks

The course covers the full lifecycle of error analysis, automated evaluators, and improvement flywheels, including custom interfaces and cost optimization. Perks include a comprehensive book and an AI chatbot.

01:38:06
Lightning Round and Life Mottos

Recommendations include books and AI coding tools. Life mottos focus on continuous learning and considering other perspectives, with a vision for collaborative improvement through evals.

01:43:09
Appreciation and Next Steps

The speakers express mutual admiration and share how listeners can connect with them and their course, encouraging the sharing of successes.

Keywords

Evals


Systematic process for measuring and improving AI applications, especially LLMs, involving data analysis and iteration.

Error Analysis


Examining AI application logs to identify and categorize failures, a key first step in building evals.

Open Coding


Manual, free-form note-taking on observed issues in AI traces during error analysis.

Axial Coding


Categorizing open codes into broader failure modes or themes to organize identified errors.

Benevolent Dictator


A single domain expert leading the eval process for efficiency and streamlined decision-making.

LLM Judge


An AI model used to evaluate AI outputs based on predefined criteria for quality control.

Criteria Drift


The tendency of evaluators' standards to shift as they review more outputs, which makes consistent evaluation harder.

Data Science


The application of scientific methods to extract insights from data for product improvement.

AI-assisted coding


Tools that use AI to help developers write, debug, and optimize code, increasing productivity.

A/B Testing


A method to compare product versions and measure quality through metrics, considered a form of evals.

Q&A

  • What are evals in the context of AI product development?

    Evals are a systematic method for measuring and improving AI applications, essentially acting as data analytics for LLM applications. They help identify issues, iterate on solutions, and ensure the AI performs as intended.

  • Why is manual error analysis important before creating automated evals?

    Manual error analysis, through techniques like open coding, is crucial because humans have the context to identify subtle issues and “bad product smells” that AI might miss. This initial step grounds the eval process in real-world problems.

  • What is the role of a "benevolent dictator" in building evals?

    A benevolent dictator is a single, trusted individual with domain expertise who leads the eval process. This approach streamlines decision-making, prevents committee-based delays, and makes the complex task of eval creation more manageable.

  • How can AI be used to help with the eval process?

    AI can assist in synthesizing open codes into axial codes (failure modes) and can also function as an “LLM judge” to automate the evaluation of specific AI behaviors based on predefined prompts and criteria.

  • What are the two main types of evals discussed?

    The two main types are code-based evals, which are like automated unit tests for specific, quantifiable behaviors, and LLM-based evals, where an LLM acts as a judge for more complex or subjective failure modes.

  • Why is it important to validate LLM judges?

    LLM judges must be validated against human judgment to ensure their accuracy and reliability. Misaligned judges can lead to misleading metrics and a loss of trust in the evaluation process.

  • What is "criteria drift" and why is it a challenge in evals?

    Criteria drift occurs when evaluators' standards change as they review more data. This makes consistent evaluation difficult, as what is considered "good" or "bad" can shift, requiring careful management of the eval process.

  • Should evals replace traditional PRDs (Product Requirements Documents)?

    Evals complement, rather than replace, PRDs. While PRDs outline initial expectations, evals, informed by data analysis, help uncover unforeseen failure modes and evolving requirements, leading to more robust and adaptive AI products.

  • What are the main misconceptions people have about evals?

    Common misconceptions include believing that evaluation tools can automatically perform evals without human input, and neglecting the importance of directly analyzing raw data and user traces to understand product issues.

  • How do A/B tests relate to evals?

    A/B tests are considered a form of evals, as they involve systematically measuring quality through metrics and comparing different versions of a product. However, effective A/B tests should be informed by prior error analysis to ensure hypotheses are grounded in real-world issues.
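Two ideas from the answers above can be sketched in a few lines of Python: a code-based eval is a deterministic check on an output, and an LLM judge is validated by comparing its pass/fail verdicts with human labels on the same traces. This is a minimal sketch under stated assumptions, not the method shown in the episode. The specific rules (`MAX_CHARS`, the placeholder pattern) and the choice of alignment metrics are hypothetical, and the judge verdicts are assumed to have been collected already.

```python
import re

# Hypothetical deterministic rules; they are illustrative assumptions,
# not criteria taken from the episode.
MAX_CHARS = 600
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")  # unrendered template variable


def code_based_eval(reply: str) -> dict:
    """Run each deterministic check and report pass/fail per check."""
    return {
        "non_empty": bool(reply.strip()),
        "within_length": len(reply) <= MAX_CHARS,
        "no_unrendered_placeholder": PLACEHOLDER.search(reply) is None,
    }


def judge_alignment(human: list, judge: list) -> dict:
    """Compare judge verdicts with human labels (True means 'pass').

    Reports overall agreement plus agreement computed separately on the
    traces humans passed and the traces humans failed, so that a judge
    which passes everything does not look well aligned.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    pairs = list(zip(human, judge))
    on_human_pass = [j for h, j in pairs if h]
    on_human_fail = [j for h, j in pairs if not h]
    return {
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        "pass_agreement": (
            sum(on_human_pass) / len(on_human_pass) if on_human_pass else None
        ),
        "fail_agreement": (
            sum(not j for j in on_human_fail) / len(on_human_fail)
            if on_human_fail
            else None
        ),
    }


# Hypothetical reply and labels, for illustration only.
print(code_based_eval("Hi {{first_name}}, your tour is confirmed."))
print(judge_alignment(
    human=[True, True, True, False, False],
    judge=[True, True, False, False, True],
))
```

Splitting agreement by human label is a design choice: when most traces pass, overall agreement alone can hide a judge that rarely catches real failures.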

Show Notes

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

Brought to you by:

Fin—The #1 AI agent for customer service

Dscout—The UX platform to capture insights at every stage: from ideation to production

Mercury—The art of simplified finances

Where to find Shreya Shankar

• X: https://x.com/sh_reya

• LinkedIn: https://www.linkedin.com/in/shrshnk/

• Website: https://www.sh-reya.com/

• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain

• X: https://x.com/HamelHusain

• LinkedIn: https://www.linkedin.com/in/hamelhusain/

• Website: https://hamel.dev/

• Maven course: https://bit.ly/4myp27m

In this episode, we cover:

(00:00) Introduction to Hamel and Shreya

(04:57) What are evals?

(09:56) Demo: Examining real traces from a property management AI assistant

(16:51) Writing notes on errors

(23:54) Why LLMs can’t replace humans in the initial error analysis

(25:16) The concept of a “benevolent dictator” in the eval process

(28:07) Theoretical saturation: when to stop

(31:39) Using axial codes to help categorize and synthesize error notes

(44:39) The results

(46:06) Building an LLM-as-judge to evaluate specific failure modes

(48:31) The difference between code-based evals and LLM-as-judge

(52:10) Example: LLM-as-judge

(54:45) Testing your LLM judge against human judgment

(01:00:51) Why evals are the new PRDs for AI products

(01:05:09) How many evals you actually need

(01:07:41) What comes after evals

(01:09:57) The great evals debate

(01:15:15) Why dogfooding isn’t enough for most AI products

(01:18:23) OpenAI’s Statsig acquisition

(01:23:02) The Claude Code controversy and the importance of context

(01:24:13) Common misconceptions around evals

(01:22:28) Tips and tricks for implementing evals effectively

(01:30:37) The time investment

(01:33:38) Overview of their comprehensive evals course

(01:37:57) Lightning round and final thoughts

LLM Log Open Codes Analysis Prompt:

Please analyze the following CSV file. There is a metadata field which has a nested field called z_note that contains open codes for analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the z_note field, propose 5-6 categories that we can create axial codes from.
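For readers who want to do the extraction step locally before handing the notes to an LLM, the following is a minimal Python sketch. It assumes the CSV has a column named `metadata` whose value is a JSON object containing `z_note`; the prompt only says the field is nested, so the JSON encoding is an assumption. The sample rows, the `trace_id` column, and the category mapping are all hypothetical.

```python
import csv
import io
import json
from collections import Counter


def extract_open_codes(csv_text: str) -> list:
    """Collect non-empty z_note values from the metadata column.

    Assumes metadata is a JSON object stored as a string in each row.
    """
    codes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            metadata = json.loads(row.get("metadata") or "{}")
        except json.JSONDecodeError:
            continue  # skip rows whose metadata is not valid JSON
        note = metadata.get("z_note") if isinstance(metadata, dict) else None
        if isinstance(note, str) and note.strip():
            codes.append(note.strip())
    return codes


def tally_axial_codes(open_codes: list, mapping: dict) -> Counter:
    """Count open codes per axial category; unmapped notes are 'uncategorized'."""
    return Counter(mapping.get(code, "uncategorized") for code in open_codes)


# Build a tiny hypothetical log export in memory.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["trace_id", "metadata"])
writer.writerow(["t1", json.dumps({"z_note": "did not hand off to a human"})])
writer.writerow(["t2", json.dumps({"z_note": "offered a tour time that does not exist"})])
writer.writerow(["t3", json.dumps({})])  # no note recorded for this trace
writer.writerow(["t4", "not json"])      # malformed metadata is skipped

open_codes = extract_open_codes(buffer.getvalue())

# Hypothetical mapping from open code to axial category, for example one
# proposed by an LLM using the prompt above and then reviewed by a person.
mapping = {"did not hand off to a human": "human handoff issue"}
print(tally_axial_codes(open_codes, mapping).most_common())
```

Keeping an explicit "uncategorized" bucket makes it visible when the proposed categories do not yet cover the notes.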

Referenced:

• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• Mercor: https://mercor.com/

• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b

• Nurture Boss: https://nurtureboss.io/

• Braintrust: https://www.braintrust.dev/

• Andrew Ng on X: https://x.com/andrewyng

• Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w

• Julius AI: https://julius.ai/

• Brendan Foody on X—“evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/abs/10.1145/3654777.3676450

• Lenny’s post on X about evals: https://x.com/lennysan/status/1909636749103599729

• Statsig: https://statsig.com/

• Claude Code: https://www.anthropic.com/claude-code

• Cursor: https://cursor.com/

• Occam’s razor: https://en.wikipedia.org/wiki/Occam%27s_razor

• Frozen: https://www.imdb.com/title/tt2294629/

• The Wire on HBO: https://en.wikipedia.org/wiki/The_Wire

Recommended books:

• Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935

• Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/

• Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955

• Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.

Lenny may be an investor in the companies discussed.

My biggest takeaways from this conversation:



To hear more, visit www.lennysnewsletter.com
Lenny Rachitsky