706: Large Language Model Leaderboards and Benchmarks

Update: 2023-08-18

Description

In this episode, Caterina Constantinescu dives deep into Large Language Models (LLMs), spotlighting top leaderboards, evaluation benchmarks, and real-world user perceptions. Plus, discover the challenges of dataset contamination and the intricacies of platforms like HELM and Chatbot Arena.

Additional materials: www.superdatascience.com/706

Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.

In Channel

948: In Case You Missed It in November 2025

2025-12-12 (29:07)

947: How to Get Hired at Top Firms like Netflix and Spotify, with Jeff Li

2025-12-09 (01:08:49)

946: How Robotaxis Are Transforming Cities

2025-12-05 (08:03)

945: AI is a Joke, with Joel Beasley

2025-12-02 (55:37)

944: Gemini 3 Pro: Google’s Back on Top

2025-11-28 (08:21)

943: Creative Machines: AI in Music and Art, with Prof. Maya Ackerman

2025-11-25 (55:56)

942: Odds of AGI by 2040? LEAP Expert Forecasts and Workforce Implications

2025-11-21 (06:13)

941: Multi-Agent Human Societies, with Dr. Vijoy Pandey

2025-11-18 (01:05:39)

940: In Case You Missed It in October 2025

2025-11-14 (43:27)

939: Mixture-of-Experts and State-Space Models on Edge Devices, with Tyler Cox and Shirish Gupta

2025-11-11 (01:05:36)

938: Frontier AI Agents for Data Science, with Sphinx’s Rohan Kodialam

2025-11-07 (19:14)

937: How to Design AI-First Products, with Marc Dupuis

2025-11-04 (59:12)

936: LLMs Are Delighted to Help Phishing Scams

2025-10-31 (05:07)

935: Global Issues Accelerated by AI (with Solutions), feat. Stephanie Hare

2025-10-28 (01:17:05)

934: Is AI Replacing Junior Workers?

2025-10-24 (06:55)

933: Future-Proofing Your Career in the AI Era, feat. Sheamus McGovern

2025-10-21 (01:14:40)

932: Should You Build or Buy Your AI Solution? With Larissa Schneider

2025-10-17 (29:10)

931: Boost Your Profits with Mathematical Optimization, feat. Jerry Yurchisin

2025-10-14 (01:12:42)

930: In Case You Missed It in September 2025

2025-10-10 (37:25)

929: Dragon Hatchling: The Missing Link Between Transformers and the Brain, with Adrian Kosowski

2025-10-07 (01:13:51)

Jon Krohn