Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
Digest
This podcast emphasizes the critical role of "evals" in building successful AI products, defining them as systematic methods for measuring and improving AI applications, akin to data analytics for LLM applications. It debunks common misconceptions, such as AI being able to self-evaluate, and introduces the "benevolent dictator" concept for efficient eval processes. The discussion highlights error analysis, using techniques like open coding and axial coding, as a foundational step. It also covers building and validating LLM judges, the challenges of criteria drift, and the relationship between evals, A/B tests, and data science. Practical advice is offered on getting started, managing time investment, and the high ROI of continuous improvement through evals, ultimately positioning them as a core component of AI product strategy.
Outlines

The Importance and Definition of Evals in AI Product Development
Evals are systematic methods for measuring and improving AI applications, offering high ROI for product builders. They are more than unit tests, encompassing broader quality measurement and data analytics for LLM applications.

Demystifying Evals: Common Misconceptions and the Benevolent Dictator
This section addresses the misconception that AI can self-evaluate and introduces the "benevolent dictator" model to streamline the eval process by appointing a domain expert. Evals are presented as a critical skill for AI product builders.

Practical Error Analysis for AI Products
The process of building evals is demonstrated through error analysis of application logs (traces) to identify issues like incorrect information or conversational flow problems. This involves manual "open coding" to capture initial observations.

Synthesizing Errors and Building LLM Judges
Error analysis is synthesized using AI through "axial coding" to categorize failure modes. LLM judges are then built to automate the assessment of complex failure modes, providing pass/fail results after careful validation against human judgment.

Challenges and Nuances in Evals
The research paper "Who Validates the Validators?" highlights challenges like "criteria drift." The discussion also frames evals as essentially data science applied to products, and it addresses the misconception that evaluation can be fully automated, stressing the need to analyze raw data directly.

Evals for Different AI Products and Unified Approaches
Coding agents are discussed as a unique case where developers are domain experts, simplifying evals. A/B tests are framed as a form of evals, emphasizing grounding hypotheses in error analysis.

Practical Tips and Time Investment for Evals
Key advice includes not fearing data analysis, using LLMs for organization, and focusing on actionable improvements. Initial setup may take days, but ongoing efforts are minimal, yielding a high ROI.

Deep Dive into Evals Course Content and Perks
The course covers the full lifecycle of error analysis, automated evaluators, and improvement flywheels, including custom interfaces and cost optimization. Perks include a comprehensive book and an AI chatbot.

Lightning Round and Life Mottos
Recommendations include books and AI coding tools. Life mottos focus on continuous learning and considering other perspectives, with a vision for collaborative improvement through evals.

Appreciation and Next Steps
The speakers express mutual admiration and share how listeners can connect with them and their course, encouraging the sharing of successes.
Keywords
Evals
Systematic process for measuring and improving AI applications, especially LLMs, involving data analysis and iteration.
Error Analysis
Examining AI application logs to identify and categorize failures, a key first step in building evals.
Open Coding
Manual, free-form note-taking on observed issues in AI traces during error analysis.
Axial Coding
Categorizing open codes into broader failure modes or themes to organize identified errors.
Benevolent Dictator
A single domain expert leading the eval process for efficiency and streamlined decision-making.
LLM Judge
An AI model used to evaluate AI outputs based on predefined criteria for quality control.
Criteria Drift
The tendency of evaluators' standards to shift as they review more outputs, which makes consistent evaluation harder.
Data Science
The application of scientific methods to extract insights from data for product improvement.
AI-assisted coding
Tools that use AI to help developers write, debug, and optimize code, increasing productivity.
A/B Testing
A method to compare product versions and measure quality through metrics, considered a form of evals.
Q&A
What are evals in the context of AI product development?
Evals are a systematic method for measuring and improving AI applications, essentially acting as data analytics for LLM applications. They help identify issues, iterate on solutions, and ensure the AI performs as intended.
Why is manual error analysis important before creating automated evals?
Manual error analysis, through techniques like open coding, is crucial because humans have the context to identify subtle issues and “bad product smells” that AI might miss. This initial step grounds the eval process in real-world problems.
What is the role of a "benevolent dictator" in building evals?
A benevolent dictator is a single, trusted individual with domain expertise who leads the eval process. This approach streamlines decision-making, prevents committee-based delays, and makes the complex task of eval creation more manageable.
How can AI be used to help with the eval process?
AI can assist in synthesizing open codes into axial codes (failure modes) and can also function as an “LLM judge” to automate the evaluation of specific AI behaviors based on predefined prompts and criteria.
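Once open codes have been grouped into axial codes, counting how often each failure mode occurs shows where to focus first. The following is a minimal sketch of that tally, not the speakers' own tooling; the trace IDs, notes, and category names are hypothetical examples.

```python
from collections import Counter

# Hypothetical labeled traces: (trace_id, open-code note, axial code).
# In practice the notes come from manual review and the axial codes from
# an LLM-assisted grouping that a human has checked.
labeled_traces = [
    ("t1", "offered a tour time that does not exist", "incorrect information"),
    ("t2", "did not hand off to a human when asked", "human handoff issue"),
    ("t3", "quoted the wrong rent", "incorrect information"),
    ("t4", "replied before the user finished typing", "conversational flow issue"),
    ("t5", "ignored a request for a leasing agent", "human handoff issue"),
    ("t6", "invented an amenity", "incorrect information"),
]


def failure_mode_counts(rows):
    """Return (axial_code, count) pairs, most frequent first."""
    return Counter(axial for _, _, axial in rows).most_common()


counts = failure_mode_counts(labeled_traces)
for axial_code, count in counts:
    print(f"{count:>3}  {axial_code}")
```

The most frequent categories are natural candidates for a dedicated eval, though frequency alone does not capture how severe a failure is.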
What are the two main types of evals discussed?
The two main types are code-based evals, which are like automated unit tests for specific, quantifiable behaviors, and LLM-based evals, where an LLM acts as a judge for more complex or subjective failure modes.
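As an illustration of the code-based kind, here is a small sketch of a deterministic check. The failure mode it tests (markdown formatting leaking into a plain-text reply, such as an SMS) is a hypothetical example chosen for illustration, and the function name and pattern are assumptions rather than anything specified in the episode.

```python
import re

# Hypothetical failure mode: the assistant sends markdown in a channel
# that only renders plain text. The pattern looks for bold markers,
# headings, bullet lines, and inline links.
MARKDOWN_PATTERN = re.compile(
    r"(\*\*[^*]+\*\*"          # **bold**
    r"|^#{1,6}\s"              # heading at the start of a line
    r"|^\s*[-*]\s"             # bullet at the start of a line
    r"|\[[^\]]+\]\([^)]+\))",  # [text](url)
    re.MULTILINE,
)


def passes_plain_text_check(reply: str) -> bool:
    """Return True when the reply contains no markdown formatting."""
    return MARKDOWN_PATTERN.search(reply) is None


examples = [
    "We have a tour slot at 3 pm tomorrow.",
    "Sure, **2 bedroom** units are available.",
    "Options:\n- Studio\n- One bedroom",
]
results = [passes_plain_text_check(text) for text in examples]
```

A check like this is cheap to run on every trace, which is why it suits narrow, well-defined behaviors; subjective failure modes are the ones left to an LLM judge.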
Why is it important to validate LLM judges?
LLM judges must be validated against human judgment to ensure their accuracy and reliability. Misaligned judges can lead to misleading metrics and a loss of trust in the evaluation process.
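One way to check a judge against human judgment is to compare its pass/fail verdicts with human labels on the same traces. The sketch below is an assumption about how that comparison might be computed, not the speakers' stated procedure; the labels are hypothetical, and `True` means "failure present."

```python
def judge_alignment(human_labels, judge_labels):
    """Compare judge verdicts with human labels (True = failure present).

    Returns overall agreement plus the true positive rate (failures the
    judge caught) and true negative rate (clean traces it left alone).
    A rate is None when there are no examples of that class.
    """
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")

    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)

    return {
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical labels for eight traces.
human = [True, True, False, False, False, True, False, False]
judge = [True, False, False, False, True, True, False, False]
report = judge_alignment(human, judge)
```

Reporting the two rates separately matters when failures are rare: a judge that always says "pass" can score high on overall agreement while catching no failures at all.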
What is "criteria drift" and why is it a challenge in evals?
Criteria drift occurs when evaluators' standards change as they review more data. This makes consistent evaluation difficult, as what is considered "good" or "bad" can shift, requiring careful management of the eval process.
Should evals replace traditional PRDs (Product Requirements Documents)?
Evals complement, rather than replace, PRDs. While PRDs outline initial expectations, evals, informed by data analysis, help uncover unforeseen failure modes and evolving requirements, leading to more robust and adaptive AI products.
What are the main misconceptions people have about evals?
Common misconceptions include believing that evaluation tools can automatically perform evals without human input, and neglecting the importance of directly analyzing raw data and user traces to understand product issues.
How do A/B tests relate to evals?
A/B tests are considered a form of evals, as they involve systematically measuring quality through metrics and comparing different versions of a product. However, effective A/B tests should be informed by prior error analysis to ensure hypotheses are grounded in real-world issues.
Show Notes
Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.
What you’ll learn:
1. WTF evals are
2. Why they’ve become the most important new skill for AI product builders
3. A step-by-step walkthrough of how to create an effective eval
4. A deep dive into error analysis, open coding, and axial coding
5. Code-based evals vs. LLM-as-judge
6. The most common pitfalls and how to avoid them
7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)
8. Insight into the debate between “vibes” and systematic evals
—
Brought to you by:
Fin—The #1 AI agent for customer service
Dscout—The UX platform to capture insights at every stage: from ideation to production
Mercury—The art of simplified finances
—
Where to find Shreya Shankar
• LinkedIn: https://www.linkedin.com/in/shrshnk/
• Website: https://www.sh-reya.com/
• Maven course: https://bit.ly/4myp27m
—
Where to find Hamel Husain
• X: https://x.com/HamelHusain
• LinkedIn: https://www.linkedin.com/in/hamelhusain/
• Website: https://hamel.dev/
• Maven course: https://bit.ly/4myp27m
—
In this episode, we cover:
(00:00:00) Introduction to Hamel and Shreya
(00:04:57) What are evals?
(00:09:56) Demo: Examining real traces from a property management AI assistant
(00:16:51) Writing notes on errors
(00:23:54) Why LLMs can’t replace humans in the initial error analysis
(00:25:16) The concept of a “benevolent dictator” in the eval process
(00:28:07) Theoretical saturation: when to stop
(00:31:39) Using axial codes to help categorize and synthesize error notes
(00:44:39) The results
(00:46:06) Building an LLM-as-judge to evaluate specific failure modes
(00:48:31) The difference between code-based evals and LLM-as-judge
(00:52:10) Example: LLM-as-judge
(00:54:45) Testing your LLM judge against human judgment
(01:00:51) Why evals are the new PRDs for AI products
(01:05:09) How many evals you actually need
(01:07:41) What comes after evals
(01:09:57) The great evals debate
(01:15:15) Why dogfooding isn’t enough for most AI products
(01:18:23) OpenAI’s Statsig acquisition
(01:23:02) The Claude Code controversy and the importance of context
(01:24:13) Common misconceptions around evals
(01:22:28) Tips and tricks for implementing evals effectively
(01:30:37) The time investment
(01:33:38) Overview of their comprehensive evals course
(01:37:57) Lightning round and final thoughts
—
LLM Log Open Codes Analysis Prompt:
Please analyze the following CSV file. There is a metadata field which has a nested field called z_note that contains open codes for analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the z_note field, propose 5-6 categories that we can create axial codes from.
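For readers who want to pull the open codes out themselves before handing them to an LLM, here is a rough sketch. It assumes the export has a column named `metadata` holding a JSON object with a `z_note` key; the prompt only says the field is nested, so the exact column name and JSON encoding are assumptions, and the sample rows are hypothetical.

```python
import csv
import io
import json


def extract_open_codes(csv_text, metadata_column="metadata", note_key="z_note"):
    """Collect non-empty notes from a JSON-encoded metadata column."""
    notes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get(metadata_column) or ""
        try:
            metadata = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip rows whose metadata is missing or not JSON
        note = metadata.get(note_key) if isinstance(metadata, dict) else None
        if isinstance(note, str) and note.strip():
            notes.append(note.strip())
    return notes


# Build a small hypothetical export so the quoting is handled by csv itself.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["trace_id", "metadata"])
writer.writerow(["t1", json.dumps({"z_note": "did not hand off to a human"})])
writer.writerow(["t2", json.dumps({"z_note": ""})])
writer.writerow(["t3", "not json"])
writer.writerow(["t4", json.dumps({"z_note": "quoted the wrong rent"})])

open_codes = extract_open_codes(buffer.getvalue())
```

The resulting list can be pasted into a prompt like the one above, or reviewed directly before proposing categories.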
—
Referenced:
• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• Mercor: https://mercor.com/
• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b
• Nurture Boss: https://nurtureboss.io/
• Braintrust: https://www.braintrust.dev/
• Andrew Ng on X: https://x.com/andrewyng
• Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w
• Julius AI: https://julius.ai/
• Brendan Foody on X—“evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948
• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/abs/10.1145/3654777.3676450
• Lenny’s post on X about evals: https://x.com/lennysan/status/1909636749103599729
• Statsig: https://statsig.com/
• Claude Code: https://www.anthropic.com/claude-code
• Cursor: https://cursor.com/
• Occam’s razor: https://en.wikipedia.org/wiki/Occam%27s_razor
• Frozen: https://www.imdb.com/title/tt2294629/
• The Wire on HBO: https://en.wikipedia.org/wiki/The_Wire
—
Recommended books:
• Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935
• Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/
• Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955
• Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/
Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.
—
Lenny may be an investor in the companies discussed.
To hear more, visit www.lennysnewsletter.com
























