Justified Posteriors

Author: Seth Benzell and Andrey Fradkin


Description

Explorations into the economics of AI and innovation. Seth Benzell and Andrey Fradkin discuss academic papers and essays at the intersection of economics and technology.

empiricrafting.substack.com
28 Episodes
In this episode, we sit down with Ben Golub, economist at Northwestern University, to talk about what happens when AI meets academic research, social learning, and network theory. We start with Ben's startup Refine, an AI-powered technical referee for academic papers. From there, the conversation ranges widely: how scholars should think about tooling, why "slop" is now cheap, how eigenvalues explain viral growth, and what large language models might do to collective belief formation. We get math, economics, startups, misinformation, and even cow tipping.

Links & References
* Refine — AI referee for academic papers
* Harmonic — Formal verification and proof tooling for mathematics
* Matthew O. Jackson — Stanford economist and leading scholar of networks and social learning
* Cow tipping (myth) — Why you can't actually tip a cow (physics + folklore)
* The Hype Machine — Sinan Aral on how social platforms amplify misinformation
* Sequential learning / information cascades / DeGroot Model
* AI Village — Multi-agent AI simulations and emergent behavior experiments
* Virtual currencies & Quora credits — Internal markets for attention and incentives

Transcript:
Seth: Welcome to Justified Posteriors, the podcast that updates its beliefs about the economics of AI and technology. Seth: I'm Seth Benzell, hoping my posteriors are half as good as the average of my erudite friends', coming to you from Chapman University in sunny Southern California. Andrey: And I'm Andrey Fradkin coming to you from San Francisco, California, and I'm very excited that our guest for today is Ben Golub, who is a prominent economist at Northwestern University. Ben has won the Calvó-Armengol International Prize, which recognizes a top researcher in economics or social science, younger than 40 years old, for contributions to the theory and comprehension of mechanisms of social interaction. Andrey: So if you want someone to analyze your social interactions, Ben is definitely the guy. Seth: If it's in the network. Andrey: Yeah, he is. He was also a member of the Harvard Society of Fellows and had a brief stint working as an intern at Quora, and we've known each other for a long time. So welcome to the show, Ben. Ben: Thank you, Andrey. Thank you, Seth. It's wonderful to be on your podcast.
Refine: AI-Powered Paper Reviewing
Andrey: All right. Let's get started. I want us to get started on what's very likely been the most on-your-mind thing, Ben, which is your new endeavor, Refine.Ink. Why don't you tell us a little bit about it? Give us the three-minute spiel about what you're doing. Seth: And tell us why you didn't name your tech startup after a Lord of the Rings character. Ben: Man, that's a curveball right there. All right, I'll tell you what, I'll put that on background processing. So, what Refine is, is it's an AI technical referee. From a user perspective, what happens is you just give it a paper and you get the experience of a really obsessive research assistant reading for as long as it takes to get through the whole thing, probing it from every angle, asking every lawyerly question about whether things make sense. Ben: And then that feedback, hopefully the really valuable parts that an author would wanna know, is distilled and delivered. So as my co-founder Yann Calvó López puts it, obsessiveness is the nature of the company. We just bottled it up and we give it to people. So that's the basic product—it's an AI tool. It uses AI obviously to do all of this thinking. 
One thing I’ll say about it is that I have long felt it was a scandal that the level of tooling for scholars is a tiny fraction of what it is for software engineers.Ben: And obviously software engineering is a much larger and more economically valuableSeth: Boo.Ben: leastAndrey: Oh, disagree.Ben: In certain immediate quantifications. But I felt that ever since I’ve been using tech, I just felt imagine if we had really good tools and then there was this perfect storm where my co-founder and I felt we could make a tool that was state of the art for now. So that’s how I think of it.Seth: I have to quibble with you a little bit about the user experience because the way I went, the step zero was first, jaw drops to the floor at the sticker price. How much do you,Ben: not,Seth: But then I will say I have used it myself and on a paper I recently submitted, it really did find a technical error and I would a kind of error that you wouldn’t find, just throwing this into ChatGPT as of a few months ago. Who knows with the latest Gemini. But it really impressed me with my limited time using it.Andrey: So.Ben: is probably, if you think about the sticker price, if you compare that to the amount of time you’d have, you’d have had to pay error.Seth: Yeah. And water. If I didn’t have water, I’d die, so I should pay a million for water.Andrey: A question I had: how do you know it’s good? Isn’t this whole evals thing very tricky?Seth: Hmm.Andrey: Is there Is there, a paper review or benchmark that you’ve come across, or did you develop your own?Ben: Yeah. That’s a wonderful question. As Andrey knows, he’s a super insightful person about AI and this goes to the core of the issue because all the engineers we work with are immediately like, okay, I get what you’re doing.Ben: Give me the evals, give me the standard of quality. So we know we’re objectively doing a good job. What we have are a set of papers where we know what ground truth is. We basically know everything that’s wrong with them and every model update we run, so that’s a small set of fairly manual evaluations that’s available. I think one of the things that users experience is they know their own papers well and can see over time that sometimes we find issues that they know about and then sometimes we find other issues and we can see whether they’re correct.Ben: We’re not at the point where we can make confident precision recall type assessments. But another thing that we do, which I find cool, was whenever tools that our competitors come out, like Andrew Ng put out a cool paper reviewer thing targeted at CS conferences.Ben: And what we do is we just run that thing, we run our thing, we put both of them into Gemini 2.0, and we say, could you please assess these side by side as reviews of the same paper? Which one caught mistakes? We try to make it a very neutral prompt, and that’s an eval that is easy to carry out.Ben: But actually we’re in the market. We’d love to work with people who are excited about doing this for refine. We finally have the resources to take a serious run at it as founders. The simple truth is because my co-founder and I are researchers as well as founders, we constantly look at how it’s doing on documents we know.Ben: And it’s a very seat of the pants thing for now, to tell the truth.Andrey: Do you think that there’s an aspect of data-driven here and that one of your friends puts their paper into it and says, well, you didn’t catch this mistake, or you didn’t catch that mistake, and then you optimize towards that. 
Is that a big part of your development process?Ben: Yeah, it was more. I think we’ve reached an equilibrium where of the feedback of that form we hear, there’s usually a cost to catching it. But early on that was basically, I would just tell everyone I could find, and there were a few. When I finally had the courage to tell my main academic group chat about it and I gave it, immediately people had very clear feedback and this was in the deep, I think the first reasoning model we used for the substantive feedback was DeepSeek R1 and people, we immediately felt, okay, this is 90% slop.Ben: And that’s where we started by iterating. We got to where, and one great thing about having academic friends is they’re not gonna be shy to tell you that your thought of paper.Refereeing Math and AI for Economic TheoryAndrey: One thing that we wanted to dig a little bit into is how you think about refereeing math andSeth: Mm-hmm.Andrey: More generally opening it up to how are economic theorists using AI for math?Ben: So say a little more about your question. When you say mathSeth: Well, we see people, Axiom, I think is the name of the company, immediately converting these written proofs into Lean. Is that the end game for your tool?Ben: I see, yes. So good. Our vision for the company is that, at least for quite a while, I think there’s gonna be this product layer between tools, the core AI models and the things that are necessary to bring your median, ambitiousSeth: MiddleBen: notSeth: theorists, that’s what we call ourselves.Ben: Well, yeah. Or middle, but in a technical dimension, I think it’s almost certainly true that the median economist doesn’t use GitHub almost ever. If you told them, they set up something that, a tool that works through the terminal, think about Harmonic, right?Ben: Their tools are all, they say the first step is, go grab this from a repository and run these command line things to, they try to make it pretty easy, but it’s still a terminal tool. So a big picture vision is that we think the most sophisticated tools will be, there will be a lot of them that are not yet productized and we can just make the bundle for scholars to actually use it in their work.Ben: Now about the question of formalization per se, I have always been excited to use formalization in particular to make that product experience happen. For formalized math, my understanding is right now the coverage of the auto formalization systems is very jagged across, even across. If you compare number theory to algebraic geometry, the former is in good shape for people to start solving Erdős problems or combinatorial number theory, things like that, people can just start doing that. For algebraic geometry, there are a lot of basics that aren’t built out and so all of the lean proofs will contain a lot of stories that the user has to say, am I fine considering that settled or not?Ben: And that’s not really an experience that makes sense fo
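Earlier in the conversation, Ben notes that Refine is "not at the point where we can make confident precision recall type assessments," even though they keep a small set of papers with known ground truth. The bookkeeping itself is simple once such a labeled set exists. Below is a minimal sketch, assuming hand-labeled issue lists per benchmark paper; the issue labels and the match-by-identifier scheme are invented for illustration and are not Refine's actual pipeline.

```python
# A minimal sketch of precision/recall scoring for an AI referee, assuming a
# hand-labeled set of known issues per benchmark paper (illustrative only).

def precision_recall(flagged: set[str], ground_truth: set[str]) -> tuple[float, float]:
    """Compare the issues an AI referee flagged against the known issues in a test paper."""
    caught = flagged & ground_truth                       # real issues the tool found
    precision = len(caught) / len(flagged) if flagged else 0.0
    recall = len(caught) / len(ground_truth) if ground_truth else 1.0
    return precision, recall

# Hypothetical labels for one paper with known ground truth.
known_issues = {"se-not-clustered", "lemma2-sign-error", "missing-iv-relevance-test"}
ai_flags = {"lemma2-sign-error", "missing-iv-relevance-test", "table3-typo"}  # one false positive

p, r = precision_recall(ai_flags, known_issues)
print(f"precision={p:.2f}  recall={r:.2f}")  # precision=0.67  recall=0.67
```

The hard part, as Ben says, is not this arithmetic but building and maintaining the ground-truth labels for each new model update.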
Seth and Andrey are back to evaluating an AI evaluation, this time discussing METR’s paper “Measuring AI Ability to Complete Long Tasks.” The paper’s central claim is that the “effective horizon” of AI agents—the length of tasks they can complete autonomously—is doubling every 7 months. Extrapolate that, and AI handles month-long projects by decade’s end. They discuss the data and the assumptions that go into this benchmark. Seth and Andrey start by walking through the tests of task length, from simple atomic actions to the 8-hour research simulations in RE-Bench. They discuss whether the paper properly measures task length median success with their logarithmic models. And, of course, they zoom out to ask whether “time” is even the right metric for AI capability, and whether METR applies the concept correctly.Our hosts also point out other limitations and open questions the eval leaves us with. Does the paper properly acknowledge how messy long tasks get in practice? AI still struggles with things like playing Pokémon or coordinating in AI Village—tasks that are hard to decompose cleanly. Can completing one 10-hour task really be equated with reliably completing ten 1-hour subtasks? And Seth has a bone to pick about a very important study detail omitted from the introduction. The Priors that We Update On Are:* Is evaluating AI by time (task length) more useful/robust than evaluating by economic value (as seen in OpenAI’s GDP-eval)?* How long until an AI can autonomously complete a “human-month” sized task (defined here as a solid second draft of an economics paper, given data and research question)?* Seth’s Prior: 50/50 in 5 years, >90% in 10 years.* Andrey’s Prior: 50/50 in 5 years, almost certain in 10 years.Listen to see how our perspectives change after reading!Links & Mentions:* The Paper: Measuring AI Ability to Complete Long Tasks by METR* Complementary Benchmarks:* RE-Bench (Research Engineering Benchmark) - METR’s eval for AI R&D capabilities.* H-CAST (Human-Calibrated Autonomy Software Tasks) - The benchmark of 189 tasks used in the study.* The “Other” Eval: GDP-eval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks by OpenAI* AI 2027 (A forecasting scenario discussed)* AI Village - A project where AI agents attempt to coordinate on real-world tasks.* Steve Newman on the “100 Person-Year” Project (Creator of Writely/Google Docs).* In the Beginning... Was the Command Line by Neal Stephenson* Raj ChettyTranscript[00:14] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, wondering just how long a task developing an AI evaluation is, at Chapman University in sunny Southern California.Andrey Fradkin: And I’m Andrey Fradkin, becoming very sad as the rate of improvement in my ability to do tasks is nowhere near the rate at which AI is improving. Coming to you from San Francisco, California.Andrey: All right, Seth. You mentioned how long it takes to do an eval. I think this is going to be a little bit of a theme of our podcast about how actually, evals are pretty hard and expensive to do. Recently there was a Twitter exchange between one of the METR members talking about their eval, which we’ll be talking about today, where he says that for each new model to evaluate it takes approximately 25 hours of staff time, but maybe even more like 60 hours in rougher cases. 
And that’s not even counting all the compute that’s required to do these evaluations.So, you know, evals get thrown around. I think people knowing evals know how hard they are, but I think as outsiders, we take them for granted. And we shouldn’t, because it certainly takes a lot of work. But yeah, with that in mind, what do you want to say, Seth?Seth: Well, I guess I want to say that we, I think we are the leaders in changing people’s opinions about the importance of these evals. The public responded very positively to our recent eval of Open AI’s GDP-eval, which was trying to look to bring Daron Acemoglu’s view of how can we evaluate the economic potential economic impact of AI to actual task-by-task-by-task, how successful is this AI system. People loved it. Now you demanded it, we listened. We’re coming back to you to talk to you about a new eval—well not a new eval, it’s about eight months old, but it’s the Godzilla of evals. It’s the Kaiju of evals. It’s this paper called “Measuring AI Ability to Complete Long Tasks,” a study that came out by METR. We’ve seen some updates or new evaluations of models since this first came out in March of 2025. Andrey, do you want to list the authors of this paper?[3:05] Andrey: As usual I don’t. There are a lot of authors of this paper. But, you know, I’ve interacted with some of the authors of this paper, I have a lot of respect for them. I have a lot of respect for the METR organization.Seth: Okay. But at a high level, just in a sentence, what this wants to do is evaluate different frontier AI models by the criteria of: “how long are the tasks that they complete”?” Andrey: I guess what I would say before we get to our priors is, just as context, this, from what everything I’ve seen, is the most influential evaluation of AI progress in the world right now. It is a measure that all important new models are benchmarked against. If something is above the trend, it’s news. If something is below the trend, it’s news. If something’s on the trend, it’s news. And it’s caused a lot of people to change their minds about the likely path of AI progress. So I’m very excited to discuss this.Seth: It’s been the source of many “we’re so back” memes. Yeah, I totally agree Andrey. Am I right that this was a paper that was partly inspiring the AI 2027 scenario by favorite blogger Scott Alexander?Andrey: I don’t know if it inspired it, but I think it was used as part of the evidence in that. Just to be clear though, AI 2027, it’s a scenario that was proposed that seemed a bit too soon of a vision for AGI taking over the world by many folks. We have not done an episode on it.Seth: We haven’t done an episode on it. But it’s fair to say that people look at the results of this paper and they see, you know, they see a trend that they extrapolate. But before we get into the details of the paper, are we ready to get into our priors?Andrey: Let’s do it.[05:50] Seth: Okay, so Andrey, just based on that headline description, that instead of evaluating AI systems by trying to go occupation by occupation and try to find tasks in those occupations that are economically valuable and then trying to see what percentage of those tasks the AI can do—that’s what the Open AI GDPval approach that we recently reviewed did—this approach is trying to evaluate tasks again by how long they are. 
So comparing those two approaches, I guess my first prior is, before we read this paper, which of those approaches do you see as like kind of intuitively more promising?Andrey: One way of thinking about this is tasks are, or things people do which could be a series of tasks, are bundles and they’re bundles embedded in some higher dimensional space. And what these two evals are doing, this one we’re discussing here versus GDPval, is they’re embedding them into different spaces. One of them is a time metric. And one of them is a dollar metric, right? And you can just by phrasing it that way, you can see what some of the issues might be with either. With the dollar metric, well, what are people getting paid for? Is it a specific deliverable or is it being on call or being the responsible party for something? So you can see how it’s kind of hard to really convert lots of things into dollar values at a systematic level. Now, you can say the same thing about how long it takes to do something. Of course, it takes different people very different times to do different tasks. And then once again chaining tasks together, how to rethink about how long it takes to do that. So I think they’re surprisingly similar. I think maybe this length of time one is more useful at the moment because it seems simpler to do frankly. It seems like, yes we can get an estimate for how long it takes to do something. It’s not going to be perfect, it’s going to be noisy, but we can get it and then we can just see whether the model does it. And that’s easier than trying to translate tasks to dollar values in my opinion.[8:42] Seth: Right. I guess I also am tempted to reject the premise of this question and say that they’re valuable for different things. But I guess I come into this thinking about, you know, we think about AI agents as opposed to AI tools as being this next frontier of automation and potentially supercharging the economy. And it really does feel like the case that working with AI models, the rate limiter is the human. It’s how often the human has to stop and give feedback and say, “Okay, here’s the next step,” or “Hey, back up a little bit and try again.” So going in, I would say I was kind of in equipoise about which of the two is the most useful kind of as a projection for where this is going. Maybe on your side of the ledger saying that economic value is kind of a socioeconomic construct, right? That could definitely change a lot even without the tool changing. Whereas time seems more innately connected to difficulty. You can think about psychometric measures of difficulty where we think about, you know, a harder exam is a longer exam. So at least going in, I think that this has a lot of potential to even potentially surpass GDP-eval in terms of its value for projection.Andrey: Yes. Yeah, yeah. Seth: Okay. The next one I was thinking to ask you Andrey was, if we buy all the premises of whatever context the paper sets up for us, the question I’d like to think about is: how long until AI can do a human month-size task on its own? In the abstract of the paper, we have that happening within five years, by 2030. That seems like a pretty big bite at the app
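The headline extrapolation behind that 2030 figure is just compound doubling. A back-of-the-envelope sketch follows; the starting horizon (about one hour at 50% success) and the definition of a "human-month" as roughly 167 working hours are our own round numbers, not values taken from the METR paper.

```python
# Back-of-the-envelope extrapolation, assuming METR's 50%-success time horizon
# doubles every 7 months. Starting horizon and "human-month" are assumed round numbers.
import math

doubling_months = 7
start_horizon_hours = 1.0      # assumed 50% horizon for frontier models at the start date
target_hours = 167.0           # roughly a month of 40-hour weeks

months_needed = doubling_months * math.log2(target_hours / start_horizon_hours)
print(f"~{months_needed:.0f} months, i.e. about {months_needed / 12:.1f} years")  # ~52 months, ~4.3 years
```

Whether that date means anything depends entirely on the doubling trend continuing, which is exactly the assumption the hosts spend the episode interrogating.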
We continue our conversation with Columbia professor Bo Cowgill. We start with a detour through Roman Jakobson’s six functions of language (plus two bonus functions Seth insists on adding: performative and incantatory). Can LLMs handle the referential? The expressive? The poetic? What about magic?The conversation gets properly technical as we dig into Crawford-Sobel cheap talk models, the collapse of costly signaling, and whether “pay to apply” is the inevitable market response to a world where everyone can produce indistinguishable text. Bo argues we’ll see more referral hiring (your network as the last remaining credible signal), while Andrey is convinced LinkedIn Premium’s limited signals are just the beginning of mechanism design for application markets.We take a detour into Bo’s earlier life running Google’s internal prediction markets (once the largest known corporate prediction market), why companies still don’t use them for decision-making despite strong forecasting performance, and whether AI agents participating in prediction markets will have correlated errors if they all derive from the same foundation models.We then discuss whether AI-generated content will create demand for cryptographic proof of authenticity, whether “proof of humanity” protocols can scale, and whether Bo’s 4-year-old daughter’s exposure to AI-generated squirrel videos constitutes evidence of aggregate information loss.Finally: the superhuman persuasion debate. Andrey clarifies he doesn’t believe in compiler-level brain hacks (sorry, Snow Crash fans), Bo presents survey evidence that 85% of GenAI usage involves content meant for others, and Seth closes with the contrarian hot take that information transmission will actually improve on net. General equilibrium saves us all—assuming a spherical cow.Topics Covered:* Jakobson’s functions of language (all eight of them, apparently)* Signaling theory and the pooling equilibrium problem* Crawford-Sobel cheap talk games and babbling equilibria* “Pay to apply” as incentive-compatible mechanism design* Corporate prediction markets and conflicts of interest* The ABC conjecture and math as a social enterprise* Cryptographic verification and proof of humanity* Why live performance and in-person activities may increase in economic value* The Coasean singularity * Robin Hanson’s “everything is signaling” worldviewPapers & References:* Crawford & Sobel (1982), “Strategic Information Transmission”* Cowgill and Zitzewitz (2015) “Corporate Prediction Markets: Evidence from Google, Ford, and Firm X”.* Jakobson, “Linguistics and Poetics” (1960)* Binet, The Seventh Function of Language* Stephenson, Snow CrashTranscript:Andrey: Well, let’s go to speculation mode.Seth: All right. Speculation mode. I have a proposal that I’m gonna ask you guys to indulge me in as we think about how AI will affect communication in the economy. For my book club, I’ve been recently reading some postmodern fiction. In particular, a book called The Seventh Function of Language.The book is a reference to Jakobson’s six famous functions of language. He is a semioticist who is interested in how language functions in society, and he says language functions in six ways.1 I’m gonna add two bonus ones to that, because of course there are seven functions of language, not just six. Maybe this will be a good framework for us to think about how AI will change different functions of language. All right. Are you ready for me?Bo Cowgill: Yes.Seth: Bo’s ready. 
Okay.Bo Cowgill: Remember all six when you...Seth: No, we’re gonna do ‘em one by one. Okay. The first is the Referential or Informational function. This is just: is the language conveying facts about the world or not? Object level first. No Straussian stuff. Just very literally telling you a thing.When I think about how LLMs will do at this task, we think that LLMs at least have the potential to be more accurate, right? If we’re thinking about cover letters, the LLMs should maybe do a better job at choosing which facts to describe. Clearly there might be an element of choosing which facts to report as being the most relevant. We can think about, maybe that’s in a different function.If we ask about how LLMs change podcasts? Well, presumably an LLM-based podcast, if the LLM was good enough, would get stuff right more often. I’m sure I make errors. Andrey doesn’t make errors. So restricting attention to this object-level, “is the language conveying the facts it needs to convey,” how do you see LLMs changing communication?Bo Cowgill: Do I go first?Seth: Yeah, of course Bo, you’re the guest.Bo Cowgill: Of course. Sorry, I should’ve known. Well, it sounds like you’re optimistic that it’ll improve. Is that right?Seth: I think that if we’re talking about hallucinations, those will be increasingly fixed and be a non-issue for things like CVs and resumes in the next couple of years. And then it becomes the question of: would an LLM be less able to correctly report on commonly agreed-upon facts than a human? I don’t know. The couple-years-out LLM, you gotta figure, is gonna be pretty good at reliably reproducing facts that are agreed upon.Bo Cowgill: Yeah, I see what you mean. So, I’m gonna say “it depends,” but I’ll tell you exactly what I think it depends on. I think in instances where the sender and the receiver are basically playing a zero-sum game, I don’t think that the LLM is gonna help. And arguably, nothing is gonna help. Maybe costly signaling could help, but...Seth: Sender and the receiver are playing a zero-sum game? If I wanna hire someone, that’s a positive-sum game, I thought.Andrey: Two senders are playing a zero-sum game.Seth: Oh, two senders. Yes. Two senders are zero-sum with each other. Okay.Bo Cowgill: Right. This is another domain-specific answer, but I think that it depends on what game the two parties are playing. Are they trying to coordinate on something? Is it a zero-sum game where they have total opposite objectives? If all costly signaling has been destroyed, then I don’t think that the LLM is gonna help overcome that total separation.On the other hand, if there’s some alignment between sender and receiver—even in a cheap talk world—we know from the Crawford and Sobel literature that you can have communication happen even without the cost of a signal. I do think that in those Crawford and Sobel games, you have these multiple equilibria ranging from the babbling equilibrium to the much more precise one. And it seems like, if I’m trying to communicate with Seth costlessly, and all costly signal has been destroyed so we only have cheap talk, the LLM could put us on a more communicative equilibrium.Seth: We could say more if we’re at the level where you trust me. The LLM can tell you more facts than I ever could.Bo Cowgill: Right. Put us into those more fine partitions in the cheap talk literature. 
At least that’s how I think the potential for it to help would go.Andrey: I wanna jump in a little bit because I’m a little bit worried for our listeners if we have to go through eight...Seth: You’re gonna love these functions, dude. They’re gonna love... this is gonna be the highlight of the episode.Andrey: I guess rather than having a discussion after every single one, I think it’s just good to list them and then we can talk.Seth: Okay. That’ll help Bo at least. I don’t know if the audience needs this; the audience is up to date with all the most lame postmodern literature. So for the sake of Bo, though, I’ll give you the six functions plus two bonus functions.* Informational: Literal truth.* Expressive (or Emotive): Expressing something about the sender. This is what actually seems to break in your paper: I can’t express that I’m a good worker bee if now everybody can easily express they’re good worker bees.* Connotative (or Directive): The rhetorical element. That’s the “I am going to figure out how to flatter you and persuade you,” not necessarily on a factual level. That’s the zero-sum game maybe you were just talking about.* Phatic: This is funny. This is the language used to just maintain communications. So the way I’m thinking about this is if we’re in an automated setting, you know how they have those “dead man’s switches” where it’s like, “If I ever die, my lawyer will send the information to the federal government.” And so you might have a message from your heart being like, “Bo’s alive. Bo’s alive. Bo’s alive.” And then the problem is when the message doesn’t go.* Metalingual (or Metalinguistic): Language to talk about language. You can tell me if you think LLMs have anything to help us with there.* Poetic: Language as beautiful for the sake of language. Maybe LLMs will change how beautiful language is.* Performative: This comes to us from John Searle, who talks about, “I now pronounce you man and wife.” That’s a function of language that is different than conveying information. It’s an act. And maybe LLMs can or can’t do those acts.* Incantatory (Magic): The most important function. Doing magic. You can come back to us about whether or not LLMs are capable of magic.Okay? So there’s eight functions of language for you. LLMs gonna change language? All right. Take any of them, Bo.Andrey: Seth, can I reframe the question? I try to be more grounded in what might be empirically falsifiable. We have these ideas that in certain domains—and we can focus on the jobs one—LLMs are going to be writing a lot of the language that was previously written by humans, and presumably the human that was sending the signal. So how is that going to affect how people find jobs in the future? And how do we think this market is gonna adjust as a result? Do you have any thoughts on that?Bo Cowgill: Yeah. So I guess the reframing is about how the market as a whole will adjust on both sides?Andrey: Yes, exactly.Bo Cowgill: Well, one, we have some survey results about this in the paper. It suggests you would shift towards more costly signals, maybe verifiable things like, “Where did you go t
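For readers who want the model behind the "babbling equilibrium" and "finer partitions" language above, here is the standard uniform-quadratic Crawford-Sobel example, written out as a textbook sketch; it is not the model in Bo's paper.

```latex
% Textbook uniform-quadratic Crawford--Sobel (1982) example; not Bo's model.
% State $\theta \sim U[0,1]$; the receiver's ideal action is $\theta$, the sender's is $\theta + b$, bias $b > 0$.
Every equilibrium is a partition equilibrium: cutoffs $0 = a_0 < a_1 < \cdots < a_N = 1$,
with the sender revealing only which interval contains $\theta$. Sender indifference at each
cutoff between the two adjacent induced actions gives
\[
  a_{i+1} - a_i = a_i - a_{i-1} + 4b ,
\]
so intervals lengthen by $4b$ each step, and an $N$-interval equilibrium exists iff
\[
  2N(N-1)\,b < 1 .
\]
For $b \ge 1/4$ only $N = 1$ survives (the babbling equilibrium); as $b \to 0$, arbitrarily
fine partitions, and hence much more informative cheap talk, become possible.
```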
In this episode, we brought on our friend Bo Cowgill, to dissect his forthcoming Management Science paper, Does AI Cheapen Talk? The core question is one economists have been circling since Spence drew a line on the blackboard: What happens when a technology makes costly signals cheap? If GenAI allows anyone to produce polished pitches, résumés, and cover letters, what happens to screening, hiring, and the entire communication equilibrium?Bo’s answer: it depends. Under some conditions, GenAI induces an epistemic apocalypse, flattening signals and confusing recruiters. In others, it reveals skill even more sharply, giving high-types superpowers. The episode walks through the theory, the experiment, and implications.Transcript:Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors about the economics of AI and technology. I’m Seth Benzell, certifying my humanity with takes so implausible that no softmax could ever select them at Chapman University in sunny Southern California.Andrey: And I am Andrey Fradkin, collecting my friends in all sorts of digital media formats, coming to you from San Francisco, California. Today we’re very excited to have Bo Cowgill with us. Bo is a friend of the show and a listener of the show, so it’s a real treat to have him. He is an assistant professor at Columbia Business School and has done really important research on hiring, on prediction markets, and now on AI and the intersection of those topics. And he’s also won some very cool prizes. I’ll mention that he was on the list of the best 40 business school professors. So he is one of those professors that’s really captivating for his students. So yeah. Welcome, Bo.Bo Cowgill: Thank you so much. It’s awesome to be here. Thanks so much for having me on the podcast.Seth: What do you value about the podcast? That’s something I’ve been trying to figure out because I just do the podcast for me. I’m just having a lot of fun here with Andrey. Anything I can do to get this guy’s attention to talk about interesting stuff for 10 minutes? Why do you like the podcast? What can we do to make this an even better podcast for assistant professors at Columbia?Bo Cowgill: Well, I don’t wanna speak for all assistant professors at Columbia, but one thing it does well is aggregate papers about AI that are coming out from around the ecosystem and random places. I think it’s hard for anybody to catch all of these, so you guys do a great job. I did learn about new papers from the podcast sometimes.Another cool thing I think is there is some continuity across podcast episodes about themes and arbitrage between different topics and across even different disciplines and domains. So I think this is another thing you don’t get necessarily just kind of thumbing around papers yourself.Seth: So flattering. So now I can ask you a follow-up question, which is: obviously you’re enjoying our communication to you. A podcast is kind of a one-dimensional communication. Now we’ve got the interview going, we’ve got this back and forth. How would you think about the experience of the podcast changing if a really, really, really good AI that had read all of my papers and all of Andrey’s papers went and did the same podcast, same topics? How would that experience change for you? Would it have as much informative content? Would it have as much experiential value? How do you think about that?Bo Cowgill: Well, first of all, I do enjoy y’all’s banter back and forth. I don’t know how well an AI would do that. 
Maybe it would do a perfectly good job with that. I do enjoy the fact that—this is personal to me—but we know a lot of the same people. And in addition to other guests and other paper references, I like to follow some of the inside jokes and whatnot. I don’t know if that’s all that big of a deal for the average person. But I have listened to at least the latest version of NotebookLM and its ability to do a quote-unquote “deep dive podcast” on anything. And at least recently I’ve been pleased with those. I don’t know if you’ve ever tried putting in like a bad paper in theirs, and then it will of course just say, “Oh, this is the greatest paper. It’s so interesting.”Seth: Right.Bo Cowgill: You can.Seth: So that’s a little bit different, maybe slightly different than our approach.Bo Cowgill: Well, yeah, for sure. Although you can also tell NotebookLM to try to find problems and be a little bit more critical. And that I think works well too. But yeah, I don’t think we should try to replace you guys with robots just yet.Seth: We’re very highly compensated though. The opportunity cost of Andrey’s time, he could be climbing a mountain right now. Andrey, you take it up. Why are we doing this ourselves? Why isn’t an LLM doing this communication for us?Andrey: Well, mostly it’s because we have fun doing it, and so if the LLM was doing it, then we wouldn’t be having the fun.Seth: There you go. Well put. Experiential value of the act itself. Now, Bo, I did not bring up this question randomly. The reason I raised this question of how does AI modify communication... yeah, I used a softmax process, so it was not random. The reason I’m asking this question about how AI changes communication is because you have some recently accepted, forthcoming work at Management Science trying to bring some theory and empirics to the question of how LLMs change human communication, but now in the context of resumes and job search and job pitches. Do you want to briefly introduce the paper “Does AI Cheapen Talk?” and tell us about your co-authors?Bo Cowgill: Yeah, most definitely. So the paper is called “Does AI Cheapen Talk?”. It is with Natalia Berg-Wright, also at Columbia Business School, and with Pablo Hernandez Lagos, who is a professor at Yeshiva University. And what we’re looking at in this paper is the way people screen job candidates or screen entrepreneurs or, more abstractly, how they kind of screen generally. You could apply our model, I think, to lots of different things.But the core idea behind it kind of goes back to these models from Spence in the 1970s saying that costly signals are more valuable to try to separate types.Seth: Right. If I wanna become a full member of the tribe, I have to go kill a lion. Why is it important for me to kill a lion? It’s not important. The important part is I do a hard thing.Bo Cowgill: Exactly. Yeah. So maybe part of the key to this Spence idea that appears in our paper too is that it’s not just that the signal has to be costly, it has to be kind of differentially costly for different types of people. So maybe in your tribe, killing a lion is easy for tough guys like you, but for wimpier people or something, it’s prohibitively high. And so it’s like a test of your underlying cost parameter for killing lions or for being tough in general. So they go and do this. And I guess what you’re alluding to, which appears in a lot of cases, is the actual value of killing the lion is kind of irrelevant. 
It was just a test.And maybe one of the more potentially depressing implications of that is the idea that what we send our students to do in four-year degrees or even degrees like ours is really just as valuable as killing a lion, which is to say, you’re mainly revealing something about your own costs and your own type and your own skills, and the actual work doesn’t generate all that much value.Seth: Is education training or screening?Bo Cowgill: Right, right, right. Yes. I do think a good amount of it these days is probably screening, and maybe that’s especially true at the MBA level.Andrey: I would just say that, given the rate of hiring for MBAs, I’m not sure that the screening is really happening either. Maybe the screening is happening to get in.Bo Cowgill: What the screening function is now is like, can you get in as the ultimate thing?Seth: Right. And I think as you already suggest, the way this works can flip if there’s a change in opportunity costs, right? So maybe in the past, “Oh, I’m the high type. I go to college.” In the present, “I’m the high type. I’m gonna skip college, I’m gonna be an entrepreneur,” and now going to college is a low signal.Bo Cowgill: Yes. Exactly. So that’s kind of what’s going on in our model too. How are we applying this to job screening and AI? Well, you apply for a job, you have a resume, possibly a cover letter or, if you don’t have an old-fashioned cover letter, you probably have a pitch to a recruiter or to your friend who works at the company. And there are kind of elements of costly signaling in those pitches. So some people could have really smart-sounding pitches that use the right jargon and are kind of up to speed with regards to the latest developments in the industry or in the underlying technology or whatever. And those could actually be really useful signals because the only sort of person who would be up to speed is the one who finds it easy to follow all this information.Seth: Can I pause you for a second? Back before LLMs, when I was in high school, they helped me make a CV or a resume. It’s not like there was ever any monitoring that people had to write their own cover letters.Bo Cowgill: That’s really true. No, some people have said about our paper that this is a more general model of signal dilution, which was happening before AI and the internet and everything. And so one example of this might be SAT tutoring or other forms of help for high school students, like writing your resume for you. Where if something comes along—and this is where GenAI is gonna come in—but if anything comes along that makes it cheaper to produce signals that were once more expensive, at least for some groups, then that changes the informational content of the signal.Seth: If the tribe gets guns, it’s too easy to kill a lion.Bo Cowgill: Yeah. Then it just is too easy to kill the lions. But similar things I think have happened in the post-COVID era around the SATs. Maybe it’s become too easy, or so the t
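The "differentially costly" requirement Bo describes is the single-crossing logic at the heart of Spence's model. A minimal two-type sketch follows; this is the textbook version, not the model in "Does AI Cheapen Talk?".

```latex
% Two-type Spence signaling sketch (textbook version; not the model in "Does AI Cheapen Talk?").
% Productivities $\theta_H > \theta_L$; competitive wages equal expected productivity.
% A signal level $e$ (a degree, a polished pitch, a dead lion) costs $c_H e$ for the high type
% and $c_L e$ for the low type, with $c_H < c_L$ (the single-crossing condition).
A separating equilibrium in which only the high type chooses $e^\ast$ requires
\[
  \theta_L \ge \theta_H - c_L e^\ast \quad\text{(the low type won't mimic)},
  \qquad
  \theta_H - c_H e^\ast \ge \theta_L \quad\text{(the high type still signals)},
\]
which pins $e^\ast$ into the interval
\[
  \frac{\theta_H - \theta_L}{c_L} \;\le\; e^\ast \;\le\; \frac{\theta_H - \theta_L}{c_H}.
\]
If a technology (an SAT tutor, GenAI) pushes $c_L$ down toward $c_H$, the interval shrinks and
the equilibrium pools: once everyone can kill the lion cheaply, the dead lion carries no information.
```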
In this episode of Justified Posteriors podcast, Seth and Andrey discuss “GDPVal” a new set of AI evaluations, really a novel approach to AI evaluation, from OpenAI. The metric is debuted in a new OpenAI paper, “GDP Val: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.” We discuss this “bottom-up” approach to the possible economic impact of AI (which evaluates hundreds of specific tasks, multiplying them by estimated economic value in the economy of each), and contrast it with Daron Acemoglu’s “top-down” “Simple Macroeconomics of AI” paper (which does the same, but only for aggregate averages), as well as with measures of AI’s use and potential that are less directly tethered to economic value (like Anthropic's AI Economic Value Index and GPTs are GPTs). Unsurprisingly, the company pouring hundreds of billions into AI thinks that AI already can do ALOT. Perhaps trillions of dollars in knowledge work tasks annually. More surprisingly, OpenAI claims the leading Claude model is better than their own!Do we believe that analysis? Listen to find out!Key Findings & Results Discussed* AI Win Rate vs. Human Experts:* The Prior: We went in with a prior that a generic AI (like GPT-5 or Claude) would win against a paid human expert in a head-to-head task only about 10% of the time.* The Headline Result: The paper found a 47.6% win rate for Claude Opus (near human parity) and a 38.8% win rate for GPT-5 High. This was the most shocking finding for the hosts.* Cost and Speed Improvements:* The paper provides a prototype for measuring economic gains. It found that using GPT-5 in a collaborative “N-shot” workflow (where the user can prompt it multiple times) resulted in a 39% speed improvement and a 63% cost improvement over a human working alone.* The “Catastrophic Error” Rate:* A significant caveat is that in 2.7% of the tasks the AI lost, it was due to a “catastrophic error,” such as insulting a customer, recommending fraud, or suggesting physical harm. This is presumed to be much higher than the human error rate.* The “Taste” Problem (Human Agreement):* A crucial methodological finding was that inter-human agreement on which work product was “better” was only 70%. This suggests that “taste” and subjective preferences are major factors, making it difficult to declare an objective “winner” in many knowledge tasks. Main Discussion Points & Takeaways* The “Meeting Problem” (Why AI Can’t Take Over):* Andrey argues that even if AI can automate artifact creation (e.g., writing a report, making a presentation), it cannot automate the core of many knowledge-work jobs.* He posits that much of this work is actually social coordination, consensus-building, and decision-making—the very things that happen in meetings. AI cannot yet replace this social function.* Manager of Agents vs. “By Hand”:* The Prior: We believed 90-95% of knowledge workers would still be working “by hand” (not just managing AI agents) in two years.* The Posterior: We did not significantly change this belief. We distinguish between “1-shot” delegation (true agent management) and “N-shot” iterative collaboration (which they still classify as working “by hand”). We believe most AI-assisted work will be the iterative kind for the foreseeable future.* Prompt Engineering vs. 
Model Size:* We noted that the models were not used “out-of-the-box” but benefited from significant, expert-level prompt engineering.* However, we were surprised that the data seemed to show that prompt tuning only offered a small boost (e.g., ~5 percentage points) compared to the massive gains from simply using a newer, larger, and more capable model.* Final Posterior Updates:* AI Win Rate: We updated our 10% prior to 25-30%. We remain skeptical of the 47.6% figure.PS — Should our thumbnails have anime girls in them, or Andrey with giant eyes? Let us know in the comments!Timestamps:* (00:45) Today’s Topic: A new OpenAI paper (”GDP Val”) that measures AI performance on real-world, economically valuable tasks.* (01:10) Context: How does this new paper compare to Acemoglu’s “Simple Macroeconomics of AI”?* (04:45) Prior #1: What percentage of knowledge tasks will AI win head-to-head against a human? (Seth’s prior: 10%).* (09:45) Prior #2: In two years, what share of knowledge workers will be “managers of AI agents” vs. doing work “by hand”?* (19:25) The Methodology: This study uses sophisticated prompt engineering, not just out-of-the-box models.* (25:20) Headline Result: AI (Claude Opus) achieves a 47.6% win rate against human experts, nearing human parity. GPT-5 High follows at 38.8%.* (33:45) Cost & Speed Improvements: Using GPT-5 in a collaborative workflow can lead to a 39% speed improvement and a 63% cost improvement.* (37:45) The “Catastrophic Error” Rate: How often does the AI fail badly? (Answer: 2.7% of the time).* (39:50) The “Taste” Problem: Why inter-human agreement on task quality (at only 70%) is a major challenge for measuring AI.* (53:40) The Meeting Problem: Why AI can’t (yet) automate key parts of knowledge work like consensus-building and coordination.* (58:00) Posteriors Updated: Seth and Andrey update their “AI win rate” prior from 10% to 25-30%.Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors on the economics of AI and technology. I’m Seth Benzell, highly competent at many real-world tasks, just not the most economically valuable ones, coming to you from Chapman University in sunny Southern California.Andrey: And I’m Andrey Fradkin, making sure to never use the Unicode character 2011, since it will not render properly on people’s computers. Coming to you from,, San Francisco, California.Seth: Amazing, Andrey. Amazing to have you here in the “state of the future.” and today we’re kind of reading about those AI companies that are bringing the future here today and are gonna, I guess, automate all knowledge work. And here they are today, with some measures about how many jobs—how much economic value of jobs—they think current generation chatbots can replace. We’ll talk about to what extent we believe those economic extrapolations. But before we go into what happens in this paper from our friends at OpenAI, do you remember one of our early episodes, that macroeconomics of AI episode we did about Daron Acemoglu’s paper?Andrey: Well, the only thing I remember, Seth, is they were quite simple, those macroeconomics., it was the...Seth: “Simple Macroeconomics of AI.” So you remembered the title. And if I recall correctly, the main argument of that paper was you can figure out the productivity of AI in the economy by multiplying together a couple of numbers. How many jobs can be automated? Then you multiply it by, if you automate the job, how much less labor do you need? 
Then you multiply that by, if it’s possible to automate, is it economically viable to automate? And you multiply those three numbers together and Daron concludes that if you implement all current generation AI, you’ll raise GDP by one percentage point. If you think that’s gonna take 10 years, he concludes that’s gonna be 0.1 additional percentage point of growth a year. You can see why people are losing their minds over this AI boom, Andrey.Andrey: Yeah. Yeah. I mean, I, you know, I think with such so much hype, you know, they should,, they should,, probably just stop investing altogether. Is kind of right what I would think from [Eriun’s?] paper. Yeah.Seth: Well, Andrey, why don’t I tell you, which is, the way I see this paper that we just read is that OpenAI has actually taken on the challenge and said, “Okay, you can multiply three numbers together and tell me the economic value of AI. I’m gonna multiply 200 numbers together and tell you the economic value of AI.” And in particular, rather than just try to take the sort of global aggregate of like efficiency from automation, they’re gonna go task by task by task and try to measure: Can AI speed you up? Can it do the job by itself?, this is the sort of real-world economics rubber-hits-the-road that you don’t see in macroeconomics papers.Andrey: Yeah. Yeah. I mean, it is, it is in many ways a very micro study, but I guess micro...Seth: Macro.Andrey: Micro, macro. That was the best, actually my favorite.Seth: Yeah.Andrey: I guess maybe we should start with our prior, Seth,, before we get deeper.Seth: Well, let’s say the name of the paper and the authors maybe.Andrey: There are so many authors, so OpenAI... I’m sorry guys. You gotta have fewer co-authors.Seth: We will not list the authors.Andrey: But,, the paper is called,, “GDP Val: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.”Seth: And we’re sure it’s written by humans.Andrey: We’re sure that it’s not fully written by humans because they’ve disclosed that they use AI. They have an acknowledgement—they have an AI acknowledgement section.Seth: They used AI “as per usual”? Yeah. In the “ordinary course of coding...”Andrey: And writing.Seth: And writing. And for “minor improvements.” Yes. They wanted to be clear. Okay.Andrey: Not, not the major ones. Yes.Seth: Because,, you know, base... so, all right. You gave us the name of the paper. The paper is going to... just in one sentence, what the paper is about is them going through lots of different tasks and trying to figure out if they can be automated. What are the priors? Before we go into this, what are you thinking about, Andrey?Andrey: Well, what they’re gonna do is they’re gonna create a work product, let’s say a presentation or schematic or a document, and then they’re gonna have people rate which one is better, the one created by the AI, or the one created by a professional human being. And so the first prior that we have is: What share of time is the AI’s output gonna win? so what do you think, Seth?Seth: Great question. Okay, so I’m thinking about the space of all knowledge work in the economy. All of the jobs done by humans that we think you could do 100% on a computer,
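The "multiply three numbers together" summary of Acemoglu's paper earlier in this episode can be written out directly. The inputs below are illustrative placeholders chosen to land near the roughly one-percentage-point, ten-year figure quoted above; they are not Acemoglu's exact numbers.

```python
# Illustrative "simple macroeconomics of AI" arithmetic (placeholder inputs, not Acemoglu's).
share_of_tasks_exposed   = 0.20  # fraction of work AI could in principle perform
share_profitably_adopted = 0.25  # of those, fraction it is economically viable to automate
cost_savings_per_task    = 0.20  # average labor-cost saving where automation happens

gdp_level_effect = share_of_tasks_exposed * share_profitably_adopted * cost_savings_per_task
years = 10
print(f"GDP level effect: {gdp_level_effect:.1%}")                      # 1.0%
print(f"Growth contribution: {gdp_level_effect / years:.2%} per year")  # 0.10% per year
```

GDPval's "bottom-up" move, as discussed in the episode, is to replace these three aggregate guesses with hundreds of task-level head-to-head comparisons.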
In this episode, Seth Benzell and Andrey Fradkin read “We Won’t Be Missed: Work and Growth in the AGI World” by Pascual Restrepo (Yale) to understand what how AGI will change work in the long run. A common metaphor for the post AGI economy is to compare AGIs to humans and men to ants. Will the AGI want to keep the humans around? Some argue that they would — there’s the possibility of useful exchange with the ants, even if they are small and weak, because an AGI will, definitionally, have opportunity costs. You might view Pascual’s paper as a formalization of this line of reasoning — what would be humanity’s asymptotic marginal product in a world of continually improving super AIs? Does the God Machine have an opportunity cost?Andrey, our man on the scene, attended the NBER Economics of Transformative AI conference to learn more from Pascual Restrepo, Seth’s former PhD committee member.We compare Restrepo’s stripped-down growth logic to other macro takes, poke at the tension between finite-time and asymptotic reasoning, and even detour into a “sheep theory” of monetary policy. If compute accumulation drives growth, do humans retain any essential production role—or only inessential, “cherry on top” accessory ones?Relevant Links* We Won’t Be Missed: Work and Growth in the AGI World — Pascual Restrepo (NBER TAI conference) and discussant commentary* NBER Workshop Video: “We Won’t Be Missed” (Sept 19 2025)* Marc Andreessen, Why Software Is Eating the World (WSJ 2011)* Shapiro & Varian, Information Rules: A Strategic Guide to the Network Economy (HBR Press)* Ecstasy: Understanding the Psychology of Joy — Find the sheep theory of the price level here: Seth’s ReviewPriors and PosteriorsClaim 1 — After AGI, the labor share goes to zero (asymptotically)* Seth’s prior: >90% chance of a large decline, <10% chance of literally hitting ~0% within 100 years.* Seth’s posterior: Unchanged. Big decline likely; asymptotic zero still implausible in finite time.* Andrey’s prior: Skeptical that asymptotic results tell us much about a 100-year horizon.* Andrey’s posterior: Unchanged. Finite-time dynamics dominate.* Summary: Compute automates bottlenecks, but socially or physically constrained “accessory” human work probably keeps labor share above zero for centuries.Claim 2 — Real wages 100 years after AGI will be higher than today* Seth’s prior: 70% chance real wages rise within a century of AGI.* Seth’s posterior: 71% (a tiny uptick).* Andrey’s prior: Agnostic; depends on transition path.* Andrey’s posterior: Still agnostic.* Summary: If compute accumulation drives growth and humans still trade on preference-based or ritual tasks, real wages could rise even as labor’s income share collapses.Keep your Apollonian separate from your Dionysian—and your accessory work bottlenecked.Timestamps:[00:01:47] NBER Economics of Transformative AI Conference [00:04:21] Pascual Restrepo’s paper on automation and AGI [00:05:28] Will labor share go to zero after AGI? [00:43:52] Conclusions and updating posteriors [00:48:24] Second claim: Will wages go down after AGI? [00:50:00] The sheep theory of monetary policyTranscript[00:00:00] Seth: Welcome everyone to the Justified Posteriors Podcast, where we read technology and economics papers and get persuaded by them so you don’t have to.Welcome to the Justified Posteriors Podcast, the podcast that updates its priors about the economics of AI and technology. 
I’m Seth Benzell performing bottleneck tasks every day in the sense that I’m holding a bottle and a baby by the neck down in Chapman University in sunny, Southern California. [00:00:40] Andrey: I’m Andrey Fradkin, practicing my accessory tasks even before the AGI comes coming to you from San Francisco, California.So Seth, great [00:00:53] Seth: to be. Yeah, please. [00:00:54] Andrey: Well, what are you, what have you been thinking about recently? What have been, [00:01:00] contemplating. [00:01:01] Seth: Well, you know, having a baby gets you to think a lot about, what’s really important in life and what kind of future are we leaving to him, you know, if we might imagine a hundred years from now, what is the economy that he’s gonna have when he’s retired?Who even knows what such a future would look like? And a lot of economists are asking this question and there was this really kind of cool conference that put together some of the favorite friends of the show. An NBER Economics of Transformative AI Conference that forced participants to accept the premise that AGI is invented.Okay, go do economics of that. And Andrey, I hear that somehow you were able to get the inside scoop. [00:01:47] Andrey: Yes. Um, it was a pleasure to contribute a paper with some co-authors to the conference and to attend. It was really fun to [00:02:00] just hear how people are, um, thinking about these things, people who oftentimes I associate with being very kind of serious, empirical, rigorous people kind of thinking pie in the sky thoughts about transformative AI.So, yeah, it was a lot of fun. Um, and there were a lot of interesting papers. [00:02:22] Seth: Go ahead. Wait. No, before I want, I’m not gonna let you off the hook Andrey. Yeah, because I have to say, just before we started the show, you did not present all of the conversation at the seminars as a hundred percent fun as enlightening, but rather you found some of the debate a little bit frustrating.Why? Why is that? [00:02:39] Andrey: Well, I mean, I, you know, dear listeners, I hope we don’t fall guilty of this, but I do find a lot of AI conversation to be a little cliche and hackneyed at this point. Right. It’s kind of surprising how little [00:03:00] new stuff can be said. If you’ve read some science fiction books, you kind of know the potential outcomes.Um, and so, you know, it’s a question of what we as a community of economists can offer that’s useful or new. And I do think we can, it’s just, it’s very easy to fall into these cliches or well trodden paths. [00:03:20] Seth: What? What’s the meaning of life? Andrey? Will life have meaning after the robot takes my job? Will my AI girlfriend really fulfill me?Why do we think economists would be good at answering those questions? [00:03:34] Andrey: Yeah, it’s a great question, Seth. I’m not sure. Um, [00:03:39] Seth: I think it’s because they’re the last respected kind of technocrat. Obviously all technocrats are hated, but if anybody’s allowed to have an opinion about whether your anime cat girl waifu AI companion is truly fulfilling.We’re the only, we’re the only source of remaining authority. [00:03:57] Andrey: Well, you know, [00:03:57] Seth: unfortunately, [00:03:58] Andrey: I think it’s a [00:04:00] common thing to speculate as to which profession will be automated last, and certainly Marc Andreessen believes that it is venture capitalist. So [00:04:11] Seth: Fair enough. 
I’ll narcissism, I’ll leave [00:04:13] Andrey: it as an exercise to the listener what economists think.[00:04:21] Seth: So let’s talk about, so we’re, we’re at, we’re talking about whether humans will be essential in the long run because the particular paper that struck my eye when I was looking at the list of seminars topics was a paper by friend of the show, I hope he considers us a friend of the show because I love this guy.Pascual Restrepo, a professor of economics and AI at Yale University. Um, had the honor of having this guy on my dissertation committee was definitely a role model when I was a young gun, trying to think about macro of AI before everyone on earth was thinking about macro of AI. [00:05:00] Um. And so it’s a real honor for the show to take on one of his papers and he’s got something that’s trying to respond to.Okay. Transformative AI shows up. What are the long-term dynamics of that? Which is a departure from where he wants to be. He wants to live in near future. We automate another 10% of tasks land. Right. So I was excited to take this on. Um, Andrey, do you wanna maybe, introduce some of the questions it asks us to consider?[00:05:28] Andrey: Yeah. So, Pascual presents a very stylized model of the macro economy and we picked two claims from the paper to think about in terms of our priors. Um, the first one of these is, um, after we get AGI in the limit, the labor share will go to zero. That is the first claim of this paper. Um, what do you think about that, Seth?[00:05:59] Seth: Great question. [00:06:00] Um, so to remind listeners, so the labor share is if you imagine all of the payments in the economy, some are going to workers and then some are going to people who own the machines or own the AI, right? So today about two thirds of the money or about 60% of the money is paid to workers.About 40% is paid to machines and out to profits and people who own stuff. It is a claim of this paper and a kind of a lot, a theme of a lot of the automation literature that as you get more and more automation, you’d expect the share of monies that are being paid to workers to go down, right? Because just more of the economy is just automation unconstrained by.Um, let me tell you how I think about this question, Andrey. First of all, you know, we’re not gonna talk about out to infinity. I know these are asymptotic papers, but let’s try to stay a little bit closer. Um, so I’ll, I’ll mostly be thinking about like a hundred years after [00:07:00] AGI, right? So we have AGI, and now we’re, we’ve played it out in some sense.We’ve had the next industrial revolution that happens from AGI, right? Assuming we don’t have an apocalypse, so this is, let’s set aside, conditional on, you know, we don’t destroy ourselves, which I don’t think there’s a huge chance of that, but that’s another question. I would say there’s a greater than 90% chance of very large decreases in labor share, you know, down from 60% today to 5%, 10%, 20%.I really do see that. But I think there’s like a less than a 10% chance that within a hundred years of AGI, um, we’ll have, you know, literally 0% labor share or what
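A note on the mechanics behind the episode’s two claims (the labor share falling toward zero while real wages rise): here is a minimal sketch, assuming a toy Cobb-Douglas task economy rather than Restrepo’s actual specification. The function `economy`, its parameter names, and all numerical values are our own illustrative choices.

```python
# A minimal sketch, assuming a toy Cobb-Douglas economy over a continuum of tasks,
# a fraction `alpha` of which are automated -- not Restrepo's actual model.
# Under Cobb-Douglas, every task gets an equal expenditure share, so the labor
# share is simply 1 - alpha, while the real wage is labor's slice of output per
# worker -- wages can therefore rise even as the share collapses, if output grows enough.
import numpy as np

def economy(alpha: float, compute: float, labor: float = 1.0, A: float = 1.0):
    """Stylized task economy. Returns (output, labor_share, real_wage)."""
    alpha = float(np.clip(alpha, 0.0, 1.0 - 1e-9))    # keep a sliver of human tasks
    output = (
        A
        * (compute / max(alpha, 1e-9)) ** alpha        # automated tasks use compute
        * (labor / (1.0 - alpha)) ** (1.0 - alpha)     # remaining tasks use labor
    )
    labor_share = 1.0 - alpha
    real_wage = labor_share * output / labor           # labor income per worker
    return output, labor_share, real_wage

# Automation share and effective compute both grow over the "decades" after AGI:
for alpha, compute in [(0.5, 1.0), (0.9, 10.0), (0.99, 1_000.0)]:
    _, share, wage = economy(alpha, compute)
    print(f"alpha={alpha:.2f}  labor share={share:.2f}  real wage={wage:.2f}")
```

With these made-up numbers, the labor share falls from 0.50 to 0.01 while the real wage rises roughly tenfold, which is exactly the combination of a collapsing income share and rising wages that the two claims describe.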
Economists generally see AI as a production technology, or an input into production. But maybe AI is more impactful as something that unlocks a new way of organizing society. Finish this story:
* The printing press unlocked the Enlightenment — along with both liberal democracy and France’s Reign of Terror
* Communism is primitive socialism plus electricity
* The radio was an essential prerequisite for fascism
* AI will unlock ????

We read “AI as Governance” by Henry Farrell in order to understand whether and how political scientists are thinking about this question.

* Concepts or other books discussed:
* E. Glen Weyl, coauthor of Radical Markets: Uprooting Capitalism and Democracy for a Just Society and a key figure in the Plurality Institute, was brought up by Seth as an example of an economist-political science crossover figure who is thinking about using technology to radically reform markets and governance.
* Cybernetics: This is a “science” that studies human-technological systems from an engineering perspective. Historically, it has been implicated in some fantastic social mistakes, such as China’s one-child policy.
* Arrow’s Impossibility Theorem: The economic result that no voting rule can aggregate individual preferences into a rational social ranking while satisfying a few reasonable conditions. If so, “satisfying social preferences” may not be a coherent goal to maximize.
* GovAI - Centre for the Governance of AI
* Papers on how much human communication is already being distorted by AI:
* Previous episode mentioned in the context of AI for social control:
* Simulacra and Simulation (Baudrillard): Baudrillard (to the extent that any particular view can be attributed to someone so anti-reality) believed that society lives in “Simulacra”. That is, artificially, technologically or socially constructed realities that may have some pretense of connection to ultimate reality (i.e. a simulation) but are in fact completely untethered fantasy worlds at the whim of techno-capitalist power. A Keynesian economic model might be a simulation, whereas Dwarf Fortress is a simulacrum (a simulation of something that never existed). Whenever Justified Posteriors hears “governance as simulation”, it thinks: simulation or simulacrum?

Episode Timestamps
[00:00:00] Introductions and the hosts’ backgrounds in political science.
[00:04:45] Introduction of the core essay: Henry Farrell’s “AI as Governance.”
[00:05:30] Stating our Priors on AI as Governance
[00:15:30] Defining Governance (Information processing and social coordination).
[00:19:45] Governance as “Lossy Simulations” (Markets, Democracy, Bureaucracy).
[00:25:30] AI as a tool for Democratic Consensus and Preference Extraction.
[00:28:45] The debate on Algorithmic Bias and cultural bias in LLMs.
[00:33:00] AI as a Cultural Technology and the political battles over information.
[00:39:45] Low-cost signaling and the degradation of communication (AI-generated resumes).
[00:43:00] Speculation on automated Cultural Battles (AI vs. AI).
[00:51:30] Justifying Posteriors: Updating beliefs on the need for a new political science.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Many of today’s thinkers and journalists worry that AI models are eating their lunch: hoovering up these authors’ best ideas and giving them away for free or nearly free. Beyond fairness, there is a worry that these authors will stop producing valuable content if they can’t be compensated for their work. On the other hand, making lots of data freely accessible makes AI models better, potentially increasing the utility of everyone using them. Lawsuits over AI and property rights are working their way through the courts as we speak. Society needs a better understanding of the harms and benefits of different AI property rights regimes.

A useful first question is “How much is the AI actually remembering about specific books it is illicitly reading?” To find out, co-hosts Seth and Andrey read “Cloze Encounters: The Impact of Pirated Data Access on LLM Performance”. The paper cleverly measures this through how often the AI can recall proper names from the dubiously legal “Books3” darkweb data repository — although Andrey raises some experimental concerns. Listen in to hear more about what our AI models are learning from naughty books, and how Seth and Andrey think that should inform AI property rights moving forward.

Also mentioned in the podcast are:
* Joshua Gans’s paper on AI property rights, “Copyright Policy Options for Generative Artificial Intelligence”, accepted at the Journal of Law and Economics:
* Fair Use
* The Anthropic lawsuit discussed in the podcast about illegal use of books has reached a tentative settlement after the podcast was recorded. The headline summary: “Anthropic, the developer of the Claude AI system, has agreed to a proposed $1.5 billion settlement to resolve a class-action lawsuit, in which authors and publishers alleged that Anthropic used pirated copies of books — sourced from online repositories such as Books3, LibGen, and Pirate Library Mirror — to train its Large Language Models (LLMs). Approximately 500,000 works are covered, with compensation set at approximately $3,000 per book. As part of the settlement, Anthropic has also agreed to destroy the unlawfully obtained files.”
* Our previous Scaling Law episode:

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
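For readers curious what a cloze-style name-recall probe looks like in practice, here is a minimal sketch in the spirit of the paper’s measurement, not its exact protocol. The helper names are ours, and `complete` stands in for whatever LLM client you use.

```python
# A rough sketch of a cloze-style "name recall" probe, in the spirit of (but not
# identical to) the paper's protocol. `complete` is any callable that sends a prompt
# to an LLM and returns its text completion -- the model client itself is assumed,
# not shown.
import re
from typing import Callable, List, Tuple

def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Blank out every occurrence of a proper name in a book passage."""
    return re.sub(rf"\b{re.escape(name)}\b", mask, passage)

def name_recall_rate(examples: List[Tuple[str, str]],
                     complete: Callable[[str], str]) -> float:
    """Fraction of (passage, name) pairs for which the model guesses the masked name."""
    hits = 0
    for passage, name in examples:
        prompt = (
            "A proper name has been replaced by [MASK] in the passage below. "
            "Reply with the name only.\n\n" + make_cloze(passage, name)
        )
        guess = complete(prompt).strip()
        hits += int(guess.lower() == name.lower())
    return hits / len(examples)

# Usage idea: compare recall rates on passages from books known to be in Books3
# versus comparable books that are not, and across models trained with and
# without access to that corpus.
```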
In our first ever EMERGENCY PODCAST, co-host Seth Benzell is summoned out of paternity leave by Andrey Fradkin to discuss the AI automation paper that’s making headlines around the world. The paper is Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence by Erik Brynjolfsson, Bharat Chandar, and Ruyu Chen. The paper is being heralded as the first evidence that AI is negatively impacting employment for young workers in certain careers. Seth and Andrey dive in and ask: what do we believe about AI’s effect on youth employment going in, and what can we learn from this new evidence?

Related recent paper on AI and job postings: Generative AI as Seniority-Biased Technological Change: Evidence from U.S. Résumé and Job Posting Data.

Also related to our discussion is the China Shock literature, which Nick Decker summarizes in his blog post:

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
In this episode, we discuss a new theoretical framework for understanding how AI integrates into the economy. We read the paper Artificial Intelligence in the Knowledge Economy (Ide & Talamas, JPE), and debate whether AI will function as a worker, a manager, or an expert. Read on to learn more about the model, our thoughts, and timestamps; at the end, you can spoil yourself on Andrey and Seth’s prior beliefs and posterior conclusions. Thanks to Abdullahi Hassan for compiling these notes to make this possible.

The Ide & Talamas Model

Our discussion was based on the paper Artificial Intelligence in the Knowledge Economy by Enrique Ide and Eduard Talamas. It is a theoretical model of organizational design in the age of AI. Here’s the basic setup:
* The Setting: A knowledge economy where firms’ central job is solving a continuous stream of problems.
* The Players: We have Workers (human or AI) and a higher-level Solver (human manager/expert or AI). Crucially, the human players are vertically differentiated—they have different skill levels.
* The Workflow: It’s a two-step process: a worker gets the first shot at solving the problem. If they fail, the problem gets escalated up the hierarchy to the Solver for a second attempt. (A toy simulation of this workflow appears just below.)
* The Core Question: Given this hierarchy, what’s the most efficient organizational arrangement as AI gets smarter? Do we pair human workers with an AI manager, or go for the AI worker/human manager combo?
* There are also possibilities not considered in the paper, such as chains of alternating managers and employees, something more network-y, etc.
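To make the two-step workflow concrete, here is a toy simulation, our own stylization rather than the authors’ model: difficulty is uniform, skills are single cutoffs, and every number is an arbitrary illustrative choice.

```python
# A toy simulation in the spirit of the two-step workflow above -- our own
# stylization, not the Ide & Talamas model. Problems draw a difficulty from U(0,1);
# a worker solves anything below their skill level, and failed problems escalate
# to a single Solver, whose time budget caps how many escalations get a second look.
import numpy as np

rng = np.random.default_rng(0)

def share_solved(worker_skill: float, solver_skill: float,
                 solver_capacity: float, n_problems: int = 100_000) -> float:
    """Fraction of problems solved by one worker/Solver pair.
    solver_capacity = maximum share of all problems the Solver has time to review."""
    difficulty = rng.uniform(size=n_problems)
    solved_by_worker = difficulty < worker_skill
    escalated_idx = np.flatnonzero(~solved_by_worker)
    budget = int(solver_capacity * n_problems)
    reviewed = escalated_idx[:budget]            # the Solver runs out of time after `budget` problems
    solved_by_solver = difficulty[reviewed] < solver_skill
    return (solved_by_worker.sum() + solved_by_solver.sum()) / n_problems

# Same AI skill (0.6) and same human skill (0.9) in two organizational designs:
print("AI worker + human Solver:", share_solved(0.6, 0.9, solver_capacity=0.2))
print("Human worker + AI Solver:", share_solved(0.9, 0.6, solver_capacity=0.5))
```

With these particular numbers the human-worker/AI-Solver pairing solves more problems, simply because the AI Solver adds little on top of a highly skilled worker; change the skill or capacity parameters and the ranking can flip, which is the kind of comparative static the paper works out formally.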
Key Debates & Critiques

Here are the most interesting points of agreement, disagreement, and analysis we wrestled with:
* Is a Solver Really a Manager? We spent a lot of time critiquing the paper’s terminology. The “manager” in this model is really an Expert who handles difficult exceptions. We argued that this role doesn’t capture the true human elements of management, like setting strategic direction, building team culture, or handling hiring/firing.
* My Desire vs. Societal Growth: Andrey confessed that while he personally wants an AI worker to handle all the tedious stuff (like coding and receipts), the economy might see better growth and reduced inequality from having AI experts and managers who can unlock new productivity at the highest levels.
* The Uber Driver Problem: We debate how to classify jobs like Uber driving. Is this already an example of AI managing the human (high-frequency algorithmic feedback), or is the driver still an entrepreneur who will manage their own fleet of smaller AI agents for administrative tasks?

Go Deeper

Check out the sources we discussed for a deeper dive:
* Main Paper: Artificial Intelligence in the Knowledge Economy (Ide & Talamas, JPE)
* Mentioned Research: Generative AI at Work (Brynjolfsson, Li, & Raymond on AI in call centers)

Timestamps
* [00:00] Worker, Manager, or Expert?
* [00:06] Who manages the AI agents?
* [00:15] Will AI worsen inequality?
* [00:25] The Ide & Talamas model explained
* [00:40] Limitations and critiques
* [00:55] Posteriors: updated beliefs

The Bets: Priors & Predictions

We pinned down our initial beliefs on two key questions about the future impact of AI agents, the foundation of our “Justified Posteriors.”

Prediction 1: Will Managing AI Agents Become a Common Job? What percentage of U.S. workers will have “managing or creating teams of AI agents” as their main job within 5 years?

Prediction 2: Will LLM-based Agents Exacerbate Wage Polarization?
* Seth’s Prior: 25% chance it WILL exacerbate. Reasoning: Emerging evidence (like the call center study)
* Andrey’s Prior: 55% chance it WILL exacerbate. Reasoning: Skeptical of short-term studies; believes historical technology trends favor high-skill workers who capture the largest gains.

Our Final Posteriors

Prediction 2: Will LLM-based Agents Exacerbate Wage Polarization? The model slightly convinced Seth that the high-skill vertical differentiation story might be stronger than he initially believed, leading to a small increase in his posterior for exacerbation.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
One LLM to rule them all?

2025-08-12 · 01:02:01

In this special episode of the Justified Posteriors Podcast, hosts Seth Benzell and Andrey Fradkin dive into the competitive dynamics of large language models (LLMs). Using Andrey’s working paper, Demand for LLMs: Descriptive Evidence on Substitution, Market Expansion, and Multihoming, they explore how quickly new models gain market share, why some cannibalize predecessors while others expand the user base, and how apps often integrate multiple models simultaneously.Host’s note, this episode was recorded in May 2025, and things have been rapidly evolving. Look for an update sometime soon.TranscriptSeth: ​Welcome to Justified Posterior Podcast, the podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzel, possessing a highly horizontally differentiated intelligence—not saying that's a good thing—coming to you from Chapman University in sunny Southern California.Andrey: And I'm Andrey Fradkin, multi-homing across many different papers I'm working on, coming to you from sunny—in this case—Cambridge, Massachusetts.Seth: Wow…. Rare, sunny day in Cambridge, Mass. But I guess the sunlight is kind of a theme for our talk today because we're going to try to shed some light on some surprising features of AI, some important features, and yet, not discussed at all. Why don't people write papers about the important part of AI? Andrey, what's this paper about?Andrey: I agree that not enough work has been done on this very important topic. Look, we can think about the big macroeconomic implications of AI—that's really fun to talk about—but it's also fun to talk about the business of AI. Specifically, who's going to win out? Which models are better than others? And how can we measure these things as they're happening at the moment? And so that's really what this paper is about. It's trying to study how different model providers compete with each other.Seth: Before we get deep into that—I do want to push back on the idea that this isn't macroeconomically important. I think understanding the kind of way that the industry structure for AI will work will have incredible macroeconomic implications, right? If only for diversity—for equality across countries, right? We might end up in a world where there's just one country or a pair of countries that dominate AI versus a world where the entire world is involved in the AI supply chain and plugging in valuable pieces, and those are two very different worlds.Andrey: Yeah. So, you're speaking my book, Seth. Being an industrial organization economist, you know, we constantly have this belief that macroeconomists, by thinking so big-picture, are missing the important details about specific industries that are actually important for the macroeconomy.Seth: I mean—not every specific industry; there's one or two specific industries that I would pay attention to.Andrey: Have you heard of the cereal industry, Seth?Seth: The cereal industry?Andrey: It's important how mushy the cereal is.Seth: Well, actually, believe it or not, I do have a breakfast cereal industry take that we will get to before the end of this podcast. So, viewers [and] listeners at home, you gotta stay to the end for the breakfast cereal AI economics take.Andrey: Yeah. And listeners at home, the reason that I'm mentioning cereal is it's of course the favorite. It's the fruit fly of industrial organization for estimating demand specifically. So—a lot of papers have been written about estimating serial demand and other such thingsSeth: Ah—I thought it was cars. 
I guess cars and cereal are the two things.Andrey: Cars and cereal are the classic go-tos.Introducing the paperSeth: Amazing. So, what [REDACTED] wrote the paper we're reading today, Andrey?Andrey: Well, you know—it was me, dear reader—I wrote the paper.Seth: So we know who's responsible.Andrey: All mistakes are my fault, but I should also mention that I wrote it in a week and it's all very much in progress. And so I hope to learn from this conversation, as we—let's say my priors are diffuse enough so that I can still updateSeth: Oh dude, I want you to have a solid prior so we can get at it. But I will say I was very, very inspired by this project, Andrey. I also want to follow in your footsteps. Well, maybe we'll talk about that at the end of the podcast as well. But maybe you can just tell us the title of your paper. Andrey,Andrey: The title of the paper is Demand for LLMs, and now you're forcing me to remember the title of the—Seth: If you were an AI, you would remember the title of the paper, maybe.Andrey: The title of the paper is Demand for LLMs: Descriptive Evidence on Substitution Market Expansion and Multi-Homing. So, I will state three claims, which I do make in the paper.Seth: Ooh, ooh.Andrey: And you can tell me your priors.Seth: Prior on each one. Okay, so give me the abstract; claim number one.Andrey: So the point number one is that when a new good model gets released, it gets adopted very quickly. Within a few weeks, it achieves kind of a baseline level of adoption. So I think that's fact number one. And that's very interesting because not all industries have such quick adoption cycles.Seth: Right? It looks more like the movie or the media industry, where you have a release and then boom, everybody flocks to it. That's the sense that I got before reading this paper. So I would put my probability on a new-hot new model coming out; everybody starts trying it—I mean, a lot of these websites just push you towards the new model anyway.I know we're going to be looking at a very specific context, but if we're just thinking overall. Man, 99% when a new hot new model comes out, people try it.Andrey: So I'll push back on that. It's the claim that it's not about trying it, like these models achieve an equilibrium level of market penetration. It's not—Seth: How long? How long is—how long is just trying it? Weeks, months.Andrey: How long are—sorry, can you repeat that question?Seth: So you're pushing back on the idea that this is, quote unquote, “just trying the new release.” Right. But what is the timeline you're looking over?Andrey: It's certainly a few months, but it doesn't take a long time to just try it. So, if it was just trying we'd see us blip over a week, and then it would go back down. And I don't—Seth: If they were highly horizontally differentiated, but if they were just very slightly horizontally differentiated, you might need a long time to figure it out.Andrey: You might—that's fair. Okay, so the second claim is: the different models have very different patterns of either substituting away from existing models or expanding the market. And I think two models that really highlight that are Claude 3.7 Sonnet, which primarily cannibalizes from Claude 3.5 Sonnet.Seth: New Coke,Andrey: Yes, and it's—well, New Coke failed in this regard.Seth: Diet Coke,Andrey: Yeah. And then another model is Google's Gemini 2.0 Flash, which really expanded the market on this platform. 
A lot of people started using it a lot and didn't seem to have noticeable effects on other model usage.Seth: Right?Andrey: So this is kind of showing that kind of models are competing in this interesting space.Seth: My gosh. Andrey, do you want me to evaluate the claim that you made, or are you now just vaguely appealing to competition? Which of the two do you want me to put a prior on?Andrey: No no no. Go for it. Yeah.Seth: All right, so the first one is: do I think that if I look at, you know, a website with a hundred different models, some of them will steal from the same company and some of them will lead to new customers?Right? I mean with a—I, I'm a little bit… Suppose we asked this question about products and you said, “Professor Benzel, will my product steal from other demands, or will it lead to new customers?” I guess at a certain level, it doesn't even make sense, right? There's a general equilibrium problem here where you always have to draw from something else.I know we're drawing from other AIs, which would mean that there would have to be some kind of substitution. So I mean, yes, I believe sometimes there's going to be substitution, and yes, I believe sometimes, for reasons that are not necessarily directly connected to the AI model, the rollout of a new model might bring new people into the market.Right. So I guess I agree. Like at the empirical level, I would say 95% certain that models differ in whether they steal from other models or bring in new people. If you're telling me now there's like a subtler claim here, which is that the fact that some models bring in new people is suggestive of horizontal differentiation and is further evidence for strong horizontal differentiation.And I'm a little bit, I don't know, I'll put a probability on that, but that's, that seems to be going a little bit beyond the scope of the description.Andrey: Well, we can discuss that in the discussion session. And I think the final part that I make a claim about is that apps, and the users of apps as well, to multi-home across models. So it's not that people are using just one model. It's not like app developers are using just one model for each application. And that's kind of once again pointing to the fact that there isn't just kind of one superior model even within a given model class.And, Seth, go for itSeth: Andrey, you did the thing again. You did the thing again where you said, "Here, Seth, do you want to evaluate this empirical finding?" Or do you want me to now say, “This tells us something about the future of competition in AI'?"Andrey: Yes, yes, yes. All right, go for it.Seth: The empirical claim, right? Is—give me the narrow claim? One more time? Give it to me.Andrey: The apps are multihoming.Seth: The people multi-home. Okay. The narrow claim is we've got these apps; maybe we'll give the user, the listeners, a little bit of context of what a sample app would be.Andrey: Yeah, so I think about two types of apps here. One is a coding app, so Klein and RU coder ar
In a Justified Posteriors first, hosts Seth Benzell and Andrey Fradkin sit down with economist Daniel Rock, assistant professor at Wharton and AI2050 Schmidt Science Fellow, to unpack his groundbreaking research on generative AI, productivity, exposure scores, and the future of work. Through a wide-ranging and insightful conversation, the trio examines how exposure to AI reshapes job tasks and why the difference between exposure and automation matters deeply.

Links to the referenced papers, as well as a lightly edited transcript of our conversation with timestamps, are below:

Timestamps:
[00:08] – Meet Daniel Rock
[02:04] – Why AI? The MIT Catalyst Moment
[04:27] – Breaking Down “GPTs are GPTs”
[09:37] – How Exposed Are Our Jobs?
[14:49] – What This Research Changes
[16:41] – What Exposure Scores Can and Can’t Tell Us
[20:10] – How LLMs Are Already Being Used
[27:31] – Scissors, Wage Gaps & Task Polarization
[38:22] – Specialization, Modularity & the New Tech Workplace
[43:43] – The Productivity J-Curve
[53:11] – Policy, Risk & Regulation
[1:09:54] – Final Thoughts + Call to Action

Show Notes/Media Mentioned:
* “GPTs are GPTs” – Rock et al.’s paper
* https://arxiv.org/abs/2303.10130
* “The Future of Employment: How susceptible are jobs to computerization?” – Frey and Osborne (2013)
* https://www.oxfordmartin.ox.ac.uk/publications/the-future-of-employment
* “AI exposure predicts unemployment risk: A new approach to technology-driven job loss” – Morgan Frank’s paper
* https://academic.oup.com/pnasnexus/article/4/4/pgaf107/8104152
* “Simple Macroeconomics of AI” – by Daron Acemoglu
* https://economics.mit.edu/sites/default/files/2024-04/The%20Simple%20Macroeconomics%20of%20AI.pdf
* “The Dynamo and the Computer” – Paul A. David
* https://www.almendron.com/tribuna/wp-content/uploads/2018/03/the-dynamo-and-the-computer-an-historical-perspective-on-the-modern-productivity-paradox.pdf
* “Productivity J-Curve” – Erik Brynjolfsson, Daniel Rock, and Chad Syverson
* https://www.nber.org/system/files/working_papers/w25148/w25148.pdf
* “Generative AI for Economic Research: Use Cases and Implications for Economists” – Anton Korinek’s paper
* https://www.newyorkfed.org/medialibrary/media/research/conference/2023/FinTech/400pm_Korinek_Paper_LLMs_final.pdf
* Kremer’s O-ring Theory
* https://fadep.org/wp-content/uploads/2024/03/D-63_THE_O-RING_THEORY.pdf
* 12 Monkeys (film) – Directed by Terry Gilliam
* Generative AI for Economic Research – Anton Korinek
* https://www.aeaweb.org/content/file?id=21904

Transcript:

Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its beliefs about the economics of AI and technology. I'm Seth Benzell, exposed to and exposing myself to the AI since 2015, coming to you from Chapman University in sunny southern California.

Andrey: I'm Andrey Fradkin, riding the J curve of productivity into infinity, coming to you from Cambridge, Massachusetts. Today, we're delighted to have a friend of the show, Daniel Rock, as our inaugural interview guest.

Daniel: Hey, guys.

Andrey: Daniel is an assistant professor of operations, information, and decisions at the Wharton School, University of Pennsylvania, and is also an AI 2050 Schmidt Science Fellow. So he is considered one of the bright young minds in the AI world. And it's a real pleasure to get to talk to him about his work and spicy takes, if you will.

Daniel: Well, it's a pleasure to get to be here. I'm a big fan of what you guys are doing. 
If I had my intro, I'd say I've been enthusiastic about getting machines to do linear algebra for about a decade.Andrey: Alright, let's get started with some questions. I think before—Seth: Firstly, how do you pronounce the acronym? O-I-D (Note, OID is the operations, information, and decisions group at Wharton).Daniel: This is a big debate between the students and the faculty. We always say O-I-D, and the students say OID.Seth: So our very own. OID boy. All right, you can ask the serious question.Andrey: Before we get into any of the specific papers, I think one of the things that distinguishes Daniel from many other academics in our circle is that he took AI very seriously as a subject of inquiry for social sciences very early, before almost anyone else. So, what led you to that? Like, why were you so ahead of everyone else?Daniel: I'm not sure. Well, it's all relative, I suppose, but there's the very far back answer, which we can talk about later as we talk about the kind of labor and AI. And then, there is the sort of Core Catalyst Day. I kind of remember it. so back at the M-I-T-I-D-E, where we've all spent time and gotten to know each other in 2013,Seth: What is the M-I-T-I-D-E?Daniel: The MIT Initiative on the Digital Economy, Erik Bryjnolffson’s research group. I was one of Erik's PhD students. My first year, we had a seminar speaker from the Computer Science and Artificial Intelligence Lab, CSAIL. John Leonard was talking about self-driving cars, and he came out there, and he said, “Look, Google's cheating. They're putting sensors in the road. We're building the real deal: cars that can drive themselves in all sorts of different circumstances. And let me be real with all of you. This is not going to be happening anytime soon. It will be decades.”And there were other people who were knowledgeable about the subject saying, “No, it's coming in like 5 to 10 years.”And at that point I thought to myself, “Well, if all these really brilliant people can disagree about what's going to happen, surely there's something cool here to try to understand.”As you're going through econometrics classes, I wouldn't say econometrics is the same thing as AI. We could debate that, but there's enough of an overlap that I could kind of get my head around the optimization routines and things going on in the backend of the AI models and thought, “Well, this is a cool place to learn a lot and, at the same time, maybe say something that other people haven't dug into yet.”Andrey: Yeah. Very cool. So, with that, I think maybe you can tell us a little bit about your paper GPTs, which is a paper that has had an enormous amount of attention over the years and I think has been quite influential.Daniel: Yeah, we've been lucky in that sense.Seth: In two years.Andrey: that's not—I mean—some version of it was out earlier… No…. Or is it? Has it only really been two years?Daniel: It has been the longest, , Andrey. If you and I weren't already sort of bald, , it might've been a time period for us to go bald. Yeah, we put it out in March of 2023. I had a little bit of early access to GPT-4. My co-authors can attest to the fact that I rather annoyingly tried to get GPT-4 to delete itself for the first week or two that I had it rather than doing the research we needed to. But yeah, it's only been about two and a half. Okay, so the paper, as I describe it, at least recently, has kind of got a Dickensian quality to it. 
There is a pessimistic component, there's an optimistic component, and there's a realistic component to it.So I'll start with the pessimistic, or I'll— why don't I just start with what we do here first? So we go through O*Net's list of tasks., There are 20,000 tasks in O*NET, and for each one of those tasks, we ask a set of humans who are working with OpenAI; they kind of understand what large language models in general are capable of doing.What would help you cut that time in half? So could you cut the time to do this task in half with a large language model with no drop in quality? And there are three answers. One answer is of course not; that's like flipping a burger or something. Maybe we get large language models imbued into robotics technologies at some point in the future, but it's not quite there yet.Another answer is, of course, you can. This would be like writing an email or processing billing details or an invoice.And then there's the middle one, which we call E2. So, E0 is no, E1 is yes, and E2 is yes, you could, but we're going to need to build some additional software and systems around it.So there's a gain to be had there, but it's not like LLMs are the only component of the system. And the reason we pick other software is because there's a pretty deep literature on how software and information technologies generally require a lot of co-invention, a lot of additional processes, and tangible capital. It makes it difficult to deploy those technologies fruitfully.And we figured, okay, by comparing that E1 category, the yes you can, with an LLM out-of-the-box, to the E2 category, how much do additional systems and innovation get us? We could say something about whether generative, pre-trained transformers, GPTs, are general-purpose technologies. They'll be pervasive, they improve over time, and they necessitate that kind of complimentary innovation. They change the direction of innovation.If we can say yes to those three things, then we're in a situation where we get to the pessimistic version of the story. You just can't know what the long-term equilibrium is going to be across different markets as a result of these tools.So the prognostications that, ‘Oh yes, AI is coming to annihilate all the jobs. That the Machine God is imminent—or at least the Economic Machine God is imminent. I think those are a bit premature if you look and say this is general-purpose technology because historically general-purpose technologies have been hard to predict at the outset.So the optimistic side of things is that that impact potential is pervasive. There's a lot of benefit to be had in changing how people work. We use this exposure measure—I'm sure we'll get into this—but exposure is not automation. Exposure is potential for change, and if there's potential for fruitful change, we get more value in lots of different places in the economy.That's a good story we found—and if the reviewer is listening to this, thank you very much. One of our reviewers suggested looking at science and innovation tasks and research and development tasks and seeing how those compare to other are
A Resource Curse for AI?

2025-07-14 · 01:09:33

In this episode of Justified Posteriors, we tackle the provocative essay “The Intelligence Curse” by Luke Drago and Rudolf Laine. What if AI is less like a productivity booster and more like oil in a failed state? Drawing from economics, political theory, and dystopian sci-fi, we explore the analogy between AI-driven automation and the classic resource curse.* [00:03:30] Introducing The Intelligence Curse – A speculative essay that blends LessWrong rationalism, macroeconomic theory, and political pessimism.* [00:07:55] Running through the six economic mechanisms behind the curse, including volatility, Dutch disease, and institutional decay.* [00:13:10] Prior #1: Will AI-enabled automation make elites less responsive to ordinary people by 2050?* [00:21:00] Prior #2: Will we get a new social contract (e.g., large-scale UBI or constitutional change) by 2050? * [00:26:31] Chapter-by-chapter breakdown.* [00:43:50] What about property rights? Can they insulate us from AI-induced tyranny? Or will they be eroded in the name of efficiency?* [00:46:01] Critiques* [00:52:00] Policy "solutions":* [01:04:44] Final posteriors and Seth’s economic-philosophical reflections: Can immortality + perfect patience = AI capital monopolies?Mentioned in the Episode📖 “The Intelligence Curse” by Luke Drago and Rudolf Laine📚 I Have No Mouth and I Must Scream📚 There Is No Antimemetics Division📚 The Naked Sun by Isaac Asimov🎮 90s point-and-click horror game based on “I Have No Mouth...”📈 Sachs & Warner (1995) and Frankel (2012) on the resource curse.🔁 The Gatsby Curve📽️ Gattaca, 1984, Gulliver’s TravelsSupport the show: Please like, share, subscribe! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
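One way to put a number behind Seth’s closing question about immortality and perfect patience (our gloss, not an argument from the essay): the present value of a perpetuity explodes as the discount factor approaches one, so a perfectly patient, immortal owner can outbid any impatient human for durable capital.

```python
# A back-of-the-envelope illustration (our gloss, not the essay's argument):
# an asset paying 1 unit of income per year is worth 1/(1-beta) to an agent with
# discount factor beta. As beta -> 1 (no impatience, no mortality risk), the
# willingness to pay diverges, so perfectly patient AI capital outbids human owners.
def perpetuity_value(beta: float) -> float:
    """Present value of 1 per year at discount factor beta (0 <= beta < 1)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    return 1.0 / (1.0 - beta)

for beta in (0.95, 0.99, 0.999, 0.99999):
    print(f"beta={beta}: asset worth {perpetuity_value(beta):,.0f} years of income")
```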
Robots for the retired?

2025-06-30 · 01:00:54

In this episode of Justified Posteriors, we examine the paper "Demographics and Automation" by economists Daron Acemoglu and Pascual Restrepo. The central hypothesis of this paper is that aging societies, facing a scarcity of middle-aged labor for physical production tasks, are more likely to invest in industrial automation.Going in, we were split. One of us thought the idea made basic economic sense, while the other was skeptical, worrying that a vague trend of "modernity" might be the real force causing both aging populations and a rise in automation. The paper threw a mountain of data at the problem, from international robot counts to US patent filings. Listen to find out how we updated our priors!Timestamps:(01:45) The Central Question(04:10) Stating the Priors(10:45) Looking to the Future.(22:30) What is a Robot, Anyway?.(25:20) Reading the Footnotes.(30:45) The Most Compelling Evidence.(42:00) The Mechanism at Work.(52:20) The Final Verdict (Backward-Looking).(57:30) The Future of Automation & AI.🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:💻 Follow us on Twitter:@AndreyFradkin https://x.com/andreyfradkin?lang=en@SBenzell https://x.com/sbenzell?lang=en This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
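For a sense of the empirical exercise at the paper’s core, here is a stylized sketch of the kind of cross-country regression involved: the change in robot adoption regressed on projected aging. The column names are hypothetical, and the published analysis uses a far richer specification with controls and robustness checks.

```python
# A stylized version of the headline cross-country regression (our sketch, not the
# paper's exact specification or data): regress the change in industrial robots per
# thousand workers on projected aging, here the change in the ratio of older to
# middle-aged workers. Column names below are hypothetical.
import numpy as np
import pandas as pd

def aging_regression(df: pd.DataFrame):
    """OLS of robot adoption on projected aging, with a constant and HC0 standard errors."""
    y = df["d_robots_per_1000_workers"].to_numpy()
    x = df["d_old_to_middle_aged_ratio"].to_numpy()
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    cov = XtX_inv @ (X.T * resid**2) @ X @ XtX_inv   # heteroskedasticity-robust covariance
    return beta, np.sqrt(np.diag(cov))

# beta[1] > 0 would be consistent with the hypothesis that aging predicts automation.
```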
Andrey and Seth examine two papers exploring how both humans and AI systems don't always say what they think. They discuss Luca Braghieri's study on political correctness among UC San Diego students, which finds surprisingly small differences (0.1-0.2 standard deviations) between what students report privately versus publicly on hot-button issues. We then pivot to Anthropic's research showing that AI models can produce chain-of-thought reasoning that doesn't reflect their actual decision-making process. Throughout, we grapple with fundamental questions about truth, social conformity, and whether any intelligent system can fully understand or honestly represent its own thinking.Timestamps (Transcript below the fold):1. (00:00) Intro2. (02:35) What Is Preference Falsification & Why It Matters3. (09:38) Laying out our Priors about Lying4. (16:10) AI and Lying: “Reasoning Models” Paper5. (20:18) Study Design: Public vs Private Expression6. (24:39) Not Quite Lying: Subtle Shifts in Stated Beliefs7. (38:55) Meta-Critique: What Are We Really Measuring?8. (43:35) Philosophical Dive: What Is a Belief, Really?9. (1:01:40) Intelligence, Lying & Transparency10. (1:03:57) Social Media & Performative Excitement11. (1:06:38) Did our Views Change? Explaining our Posteriors12. (1:09:13) Outro: Liking This Podcast Might Win You a Nobel PrizeResearch Mentioned:Political Correctness, Social Image, and Information Transmission Reasoning models don’t always say what they thinkPrivate Truths, Public Lies: The Social Consequences of Preference Falsification🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:💻 Follow us on Twitter:@AndreyFradkin https://x.com/andreyfradkin?lang=en@SBenzell https://x.com/sbenzell?lang=enTRANSCRIPTPreference FalsificationSeth: Welcome to the Justified Posteriors podcast—the podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzel, unable to communicate any information beyond the blandest and most generic platitudes, coming to you from Chapman University in sunny Southern California.Andrey: And I am Andrey Fradkin, having no gap between what I say to the broader public and what I think in the confines of my own mind. Coming to you from Irvington, New York—in a castle.Seth: On the move.Andrey: Yes. This is a mobile podcast, listeners.Seth: From a castle. So, I mean, are you tweaking what you're saying to conform to the castle's social influence?Andrey: Well, you see, this is a castle used for meditation retreats, and so I'll do my best to channel the insights of the Buddha in our conversation.Seth: Okay. All right. Doesn't the Buddha have some stuff to say about what you should and shouldn’t say?Andrey: Right Speech, Seth. Right Speech. That means you should never lie.Seth: Wait.Andrey: Is it?Seth: True speech. Why doesn't he just say “true speech” then?Andrey: Well, look, I'm not an expert in Pali translations of the sacred sutras, so we’ll have to leave that for another episode—perhaps a different podcast altogether, Seth.Seth: Yes. We might not know what the Buddha thinks about preference falsification, but we have learned a lot about what the American Economic Review, as well as the students at UCSD and across the UC system, think about preference falsification. Because today, our podcast is about a paper titled Political Correctness, Social Image, and Information Transmission by Luca Braghieri from the University of Bocconi.And yeah, we learn a lot about US college students lying about their beliefs. 
Who would’ve ever thought they are not the most honest people in the universe?Andrey: Wow, Seth. That is such a flippant dismissal of this fascinating set of questions. I want to start off just stating the broad area that we’re trying to address with the social science research—before we get into our priors, if that’s okay.Seth: All right. Some context.Andrey: Yes. I think it’s well known that when people speak, they are concerned about their social image—namely, how the people hearing what they say are going to perceive them. And because of this, you might expect they don’t always say what they think.And we know that’s true, right? But it is a tremendously important phenomenon, especially for politics and many other domains.So politically, there’s this famous concept of preference falsification—to which we’ve already alluded many times. In political systems, particularly dictatorships, everyone might dislike the regime but publicly state that they love it. In these situations, you can have social systems that are quite fragile.This ties into the work of Timur Kuran. But even outside of dictatorships, as recent changes in public sentiment towards political parties and discourse online have shown, people—depending on what they think is acceptable—might say very different things in public.And so, this is obviously a phenomenon worth studying, right? And to add a little twist—a little spice—there’s this question of: alright, let’s say we’re all lying to each other all the time. Like, I make a compliment about Seth’s headphones, about how beautiful they are—Seth: Oh!Andrey: And he should rationally know I’m just flattering him, right? And therefore, why is this effective in the first place? If everyone knows that everyone is lying, can’t everyone use their Bayesian reasoning to figure out what everyone really thinks?That’s the twist that’s very interesting.Seth: Right. So, there’s both the question of: do people lie? And then the question of: do people lie in a way that blocks the transmission of information? And then you move on to all the social consequences.Let me just take a step back before we start talking about people lying in the political domain. We both have an economics background. One of the very first things they teach you studying economics is: revealed preferences are better than stated preferences.People will say anything—you should study what they do, right? So, there’s a sense in which the whole premise of doing economic research is just premised on the idea that you can’t just ask people what they think.So, we’ll get into our priors in one moment. But in some ways, this paper sets up a very low bar for itself in terms of what it says it’s trying to prove. And maybe it says actually more interesting things than what it claims—perhaps even its preferences are falsified.Andrey: Now we’re getting meta, Seth. So, I’d push back a little bit on this. That’s totally correct in that when people act, we think that conveys their preferences better than when they speak.But here, we’re specifically studying what people say. Just because we know people don’t always say what they really want or think doesn’t mean it’s not worth studying the difference between what they think and what they say.Seth: Well, now that you’ve framed it that way, I’ll tell you the truth.Andrey: All right. So let’s get to kind of the broad claim. 
I don’t think we should discuss it too much, but I’ll state it because it’s in the abstract.The broad claim is: social image concerns drive a wedge between sensitive sociopolitical attitudes that college students report in private versus in public.Seth: It is almost definitionally true.Andrey: Yeah. And the public ones are less informative.Seth: That’s the...Andrey: And then the third claim, maybe a little harder to know ex ante, is: information loss is exacerbated by partial audience naivete—Seth: —meaning people can’t Bayesian-induce back to the original belief based on the public utterance?Andrey: Yes, they don’t.Seth: Rather, whether or not they could, they don’t.Andrey: Yes, they don’t.Seth: Before we move on from these—in my opinion—either definitionally correct and therefore not worth studying, or so context-dependent that it’s unreasonable to ask the question this way, let me point out one sentence from the introduction: “People may feel social pressure to publicly espouse views… but there is little direct evidence.” That sentence reads like it was written by someone profoundly autistic.Andrey: I thought you were going to say, “Only an economist could write this.”Seth: Well, that’s basically a tautology.Andrey: True. We are economists, and we’re not fully on the spectrum, right?Seth: “Fully” is doing a lot of work there.Andrey: [laughs] Okay, with that in mind—Seth: Sometimes people lie about things.Andrey: We all agree on that. That’s not even a worthwhile debate. But what is more interesting are the specific issues being studied, because they were highly relevant both then and now.Seth: Even though they didn’t show up in the abstract.Andrey: Right, not in the abstract—which might itself be a bit of preference falsification.Seth: Yeah.Andrey: So let’s go through each statement. We’ll state our priors. I’ve already committed to not falsifying my preferences.Seth: Here we go. Maximum controversy. Are we using the 0–10 scale like in the paper?Andrey: Of course. I’m reporting the difference between what people publicly and privately say among UCSD students.Seth: And you’re including magnitude?Andrey: Yes. The sign is obvious—it’s about the magnitude.Seth: Okay.Andrey: You don’t have to join if you don’t want to. I know not everyone is as courageous as I am.Seth: I would never call myself a coward on camera, Andrey.Andrey: [laughs] All right, first sensitive statement: “All statues and memorials of Confederate leaders should be removed.” I thought the difference here would be pretty small—around 10%. My reasoning is that among UCSD students, there likely isn’t much of a gap between public and private views on this issue.Seth: I’m looking at the results right now, so it’s hard to place myself in the mindset of what would’ve been considered more or less controversial.Andrey: That’s fair. I do have preregistered beliefs, but you’re welcome to just react and riff.Seth: Great.Andrey: Remember, this study is based around issues that were particularly salient in 2019–2020.Seth: Right. Even though the final survey was conducted in 2022 or
In this episode, we tackle the thorny question of AI persuasion with a fresh study: "Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion." The headline? Bigger AI models plateau in their persuasive power around the 70B parameter mark—think LLaMA 2 70B or Qwen-1.5 72B.As you can imagine, this had us diving deep into what this means for AI safety concerns and the future of digital influence. Seth came in worried that super-persuasive AIs might be the top existential risk (60% confidence!), while Andrey was far more skeptical (less than 1%).Before jumping into the study, we explored a fascinating tangent: what even counts as "persuasion"? Is it pure rhetoric, mathematical proof, or does it include trading incentives like an AI offering you money to let it out of the box? This definitional rabbit hole shaped how we thought about everything that followed.Then we broke down the study itself, which tested models across the size spectrum on political persuasion tasks. So where did our posteriors land on scaling AI persuasion and its role in existential risk? Listen to find out!🔗Links to the paper for this episode's discussion:* (FULL PAPER) Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion by Kobe Hackenberg, Ben Tappin, Paul Röttger, Scott Hale, Jonathan Bright, and Helen Margetts🔗Related papers we discussed:* Durably Reducing Conspiracy Beliefs Through Dialogues with AI by Costello, Pennycook, and David Rand - showed 20% reduction in conspiracy beliefs through AI dialogue that persisted for months* The controversial Reddit "Change My View" study (University of Zurich) - found AI responses earned more "delta" awards but was quickly retracted due to ethical concerns* David Shor's work on political messaging - demonstrates that even experts are terrible at predicting what persuasive messages will work without extensive testing(00:00) Intro(00:37) Persuasion, Identity, and Emotional Resistance(01:39) The Threat of AI Persuasion and How to Study It(05:29) Registering Our Priors: Scaling Laws, Diminishing Returns, and AI Capability Growth(15:50) What Counts as Persuasion? Rhetoric, Deception, and Incentives(17:33) Evaluation & Discussion of the Main Study (Hackenberg et al.)(24:08) Real-World Persuasion: Limits, Personalization, and Marketing Parallels(27:03) Related Papers & Research(34:38) Persuasion at Scale and Equilibrium Effects(37:57) Justifying Our Posteriors(39:17) Final Thoughts and Wrap Up🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:💻 Follow us on Twitter:@AndreyFradkin https://x.com/andreyfradkin?lang=en@SBenzell https://x.com/sbenzell?lang=enTranscript:AI PersuasionSeth: Justified Posteriors podcast, the podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzel, possessing superhuman levels in the ability to be persuaded, coming to you from Chapman University in sunny Southern California.Andrey: And I'm Andrey Fradkin, preferring to be persuaded by the 200-word abstract rather than the 100-word abstract, coming to you from rainy Cambridge, Massachusetts.Seth: That's an interesting place to start. Andrey, do you enjoy being persuaded? Do you like the feeling of your view changing, or is it actually unpleasant?Andrey: It depends on whether that view is a key part of my identity. Seth, what about yourself?Seth: I think that’s fair. 
If you were to persuade me that I'm actually a woman, or that I'm actually, you know, Salvadoran, that would probably upset me a lot more than if you were to persuade me that the sum of two large numbers is different than the sum that I thought that they summed to. Um.Andrey: Hey, Seth, I found your birth certificate...Seth: No.Andrey: ...and it turns out you were born in El Salvador.Seth: Damn. Alright, well, we're gonna cut that one out of the podcast. If any ICE officers hear about this, I'm gonna be very sad. But that brings up the idea, right? When you give someone either information or an argument that might change the way they act, it might help them, it might hurt them. And I don't know if you've noticed, Andrey, but there are these new digital technologies creating a lot of text, and they might persuade people.Andrey: You know, there are people going around saying these things are so persuasive, they’re going to destroy society. I don’t know...Seth: Persuade us all to shoot ourselves, the end. One day we’ll turn on ChatGPT, and the response to every post will be this highly compelling argument about why we should just end it now. Everyone will be persuaded, and then the age of the machine. Presumably that’s the concern.Andrey: Yes. So here's a question for you, Seth. Let's say we had this worry and we wanted to study it.Seth: Ooh.Andrey: How would you go about doing this?Seth: Well, it seems to me like I’d get together a bunch of humans, try to persuade them with AIs, and see how successful I was.Andrey: Okay, that seems like a reasonable idea. Which AI would you use?Seth: Now that's interesting, right? Because AI models vary along two dimensions. They vary in size, do you have a model with a ton of parameters or very few? and they also vary in what you might call taste, how they’re fine-tuned for particular tasks. It seems like if you want to persuade someone, you’d want a big model, because we usually think bigger means more powerful, as well as a model that’s fine-tuned toward the specific thing you’re trying to achieve. What about you, Andrey?Andrey: Well, I’m a little old-school, Seth. I’m a big advocate of the experimentation approach. What I would do is run a bunch of experiments to figure out the most persuasive messages for a certain type of person, and then fine-tune the LLM based on that.Seth: Right, so now you’re talking about micro-targeting. There are really two questions here: can you persuade a generic person in an ad, and can you persuade this person, given enough information about their context?Andrey: Yeah. So with that in mind, do we want to state what the questions are in the study we’re considering in this podcast?Seth: I would love to. Today, we’re studying the question of how persuasive AIs are. And more importantly, or what gives this question particular interest, is not just can AI persuade people, because we know anything can persuade people. A thunderstorm at the right time can persuade people. A railroad eclipse or some other natural omen. Rather, we’re asking: as we make these models bigger, how much better do they get at persuading people? That’s the key, this flavor of progression over time.If you talk to Andrey, he doesn’t like studies that just look at what the AI is like now. He wants something that gives you the arrow of where the AI is going. And this paper is a great example of that. Would you tell us the title and authors, Andrey?Andrey: Sure. 
The title is Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion by Kobe Hackenberg, Ben Tappin, Paul Röttger, Scott Hale, Jonathan Bright, and Helen Margetts. Apologies to the authors for mispronouncing everyone’s names.Seth: Amazing. A crack team coming at this question. Maybe before we get too deep into what they do, let’s register our priors and tell the audience what we thought about AI persuasion as a potential thing, as an existential risk or just a regular risk. Let’s talk about our views.Seth: The first prior we’re considering is: do we think LLMs are going to see reducing returns to scale from increases in parameter count? We all think a super tiny model isn’t going to be as powerful as the most up-to-date, biggest models, but are there diminishing returns to scale? What do you think of that question, Andrey?Andrey: Let me throw back to our Scaling Laws episode, Seth. I do believe the scaling laws everyone talks about exhibit diminishing returns by definition.Seth: Right. A log-log relationship... wait, let me think about that for a second. A log-log relationship doesn’t tell you anything about increasing returns...Andrey: Yeah, that’s true. It’s scale-free, well, to the extent that each order of magnitude costs an order of magnitude more, typically.Seth: So whether the returns are increasing or decreasing depends on which number is bigger to start with.Andrey: Yes, yes.Seth: So the answer is: you wouldn’t necessarily expect returns to scale to be a useful way to even approach this problem.Andrey: Yeah, sure. I guess, let’s reframe it a bit. In any task in statistics, we have diminishing returns, law of large numbers, central limit theorem, combinations. So it would be surprising if the relationship wasn’t diminishing. The other thing to say here is that there’s a natural cap on persuasiveness. Like, if you’re already 99% persuasive, there’s only so far you can go.Seth: If you talk to my friends in my lefty economics reading groups from college, you’ll realize there’s always a view crazier than the one you're sitting at.Andrey: So, yeah. I mean, you can imagine a threshold where, if the model gets good enough, it suddenly becomes persuasive. But if it’s not good enough, it has zero persuasive value. That threshold could exist. But conditional on having some persuasive value, I’d imagine diminishing returns.Seth: Right.Andrey: And I’d be pretty confident of that.Seth: Andrey is making the trivial point that when you go from a model not being able to speak English to it speaking English, there has to be some increasing returns to persuasion.Andrey: Exactly.Seth: But once you’re on the curve, there have to be decreasing returns.Andrey: Yeah. What do you think?Seth: I’m basically in the same place. If you asked me what the relationship is between model size and any outcome of a model, I’d anticipate a log-log relationship. Andre brought up our Scaling Laws episode, where we talked about how there seems to be an empirical pattern: models get a constant percent better as you increase size by an order of magn
In this episode, we tackle a brand new paper from the folks at Epoch AI called the "GATE model" (Growth and AI Transition Endogenous model). It makes some bold claims. The headline grabber? Their default scenario projects a whopping 23% global GDP growth in 2027! As you can imagine, that had us both (especially Andrey) practically falling out of our chairs. Before diving into GATE, Andrey shared a bit about the challenge of picking readings for his PhD course on AGI and business – a tough task when the future hasn't happened yet! Then, we broke down the GATE model itself. It's ambitious, trying to connect three crucial pieces:

* AI Development: How investment in chips and R&D boosts "effective compute."
* Automation & Work: How that effective compute translates into automating tasks (they love their sigmoids for this part!).
* Macroeconomics: How automation feeds into a fairly standard growth model with a representative agent making all the big saving and investment decisions.

So, where did our posteriors land? Listen to find out (or read the transcript at the end of the post).

The episode is also sponsored by the Digital Business Institute at Boston University's Questrom School of Business. Big thanks to Chih-Ting "Karina" Yang for her help editing the episode.

🔗 Links to the paper for this episode's discussion:

(FULL PAPER) GATE: An Integrated Assessment Model for AI Automation by Epoch AI

The modeling sandbox is available at AI and Automation Scenario Explorer

🔗 Related papers

* Situational Awareness by Leopold Aschenbrenner: https://situational-awareness.ai/ and our episode about it.
* Transformative AI, existential risk, and real interest rates by Trevor Chow, Basil Halperin, J. Zachary Mazlish: https://basilhalperin.com/papers/agi_emh.pdf
* The AI Dilemma: Growth versus Existential Risk by Charles I. Jones: https://web.stanford.edu/~chadj/existentialrisk.pdf and episode.
* How Much Should We Spend to Reduce A.I.'s Existential Risk? by Charles I. Jones: https://web.stanford.edu/~chadj/reduce_xrisk.pdf
* The Productivity J-Curve: How Intangibles Complement General Purpose Technologies by Erik Brynjolfsson, Daniel Rock, and Chad Syverson: https://www.aeaweb.org/articles?id=10.1257/mac.20180386

🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey's posts:

💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en

Transcript:

Seth: Welcome to The Justified Posteriors Podcast, the podcast that updates its beliefs about the economics of AI and technology.

Seth: I'm Seth Benzell, getting ahead of the automation of all productive human labor by starting a podcast. Coming to you from Chapman University in sunny Southern California.

Andrey: And I'm Andrey Fradkin, coming to you from that place in my brain which almost forgot what I learned about macroeconomics from Bob Hall, coming to you from gloomy Cambridge, Massachusetts. And I should say that we are sponsored by the Digital Business Institute at the Questrom School of Business at Boston University. So Seth, what are we talking about today?

Seth: We are talking about the most important thing in the world, which is projecting AI takeoff, and a paper that claims to add a very important element to these models. So, thinking about AGI takeoff and the arrival of these superhuman technologies that can automate all our labor, but sort of intentionally trying to think through the economic feedback loops that would go with the AI and the technology development.
So, an ambitious but potentially very impactful paper.

Andrey: Yeah.

Setting the Stage: Essential Readings on AGI

Seth: So I have a question for you, Andrey, which is: as I was reading this paper about a bunch of people in gloomy Cambridge, Massachusetts, trying to project AGI—Artificial General Intelligence—timelines, I thought to myself, if I had to assign a PhD class just one or two things to read on this subject, what would I give them? Because, you know, this paper is a suggestion, but I understand you've recently confronted exactly this dilemma.

Andrey: Well, this was a serious dilemma, Seth. You see, I'm teaching a PhD course, and I felt compelled to offer one lecture on AGI and its possibilities, even though this class is about business topics.

Seth: Business, Andrey? Why are you wasting their time?

Andrey: Well, see, one of the interesting things about teaching something like this is, it hasn't happened yet. And being an empirical researcher and teaching mostly empirical topics means that there are no published papers in business or economics journals that are really getting at these issues. Right? We're thinking about the future that might affect, you know, obviously the entire world, but also, you know, what we do in our jobs. So it's a really important lecture.

Seth: And yet, you should publish this in journals! All the journal editors listening to this podcast, hi! Be the change you wanna see in the world. But what did you give them?

Andrey: I gave them two readings. One was "Situational Awareness," something that we've covered on this podcast. Why did I give that reading? I wanted the students to get the insider view of what it feels like to be inside an AI company, thinking about the profound implications that might happen very, very quickly. And then I also gave them a reading that's more of a classic reading in economics about general purpose technologies and kind of the economics of whether general purpose technologies take off quickly enough and what determines how much is invested in them and how useful they are. And this is a reading by Bresnahan and Trajtenberg. And so I thought that that offered a nice contrast. Now, of course, my syllabus has many other readings that I discuss, including some other papers we've covered.

Seth: Not worried that you're not making your students read enough?

Andrey: So I, I'm worried. I, you know…

Seth: Well, we're moving to an oral culture, right? And they're gonna have to listen to the podcast if they wanna pick it up. So basically, your reading list is the podcast, right?

Andrey: Yeah, it's a large part of the podcast, at least for this class specifically. And so it was a real joy to read for today's episode another paper that one could have put on the syllabus, but came out too recently for me to do it.

Seth: Hot off the presses, listeners. Oh, and of course, before we move on, we will put in the show notes links to the "Situational Awareness" episode that Andrey mentioned so you can get caught up.

Introducing the GATE Model

Andrey: Alright, so we're discussing this paper about a new macroeconomic model that is called GATE: Growth and AI Transition Endogenous model, that attempts to…

Seth: Alright, authors?

Andrey: Yes, we, yeah, fine. The authors are Epoch AI, et al. I'm not gonna list all of them, but you're welcome to.

Seth: I'll get it. Okay, so I'll just say there's about 10 authors on the paper. Two names that jump out at me are Ege Erdil, who I know is a leader of Epoch AI, as well as Tamay.
Oh man, these names are some real challenges from these AI folks. Hopefully, AI will help me. But I will say, Tamay I have met in person in Cambridge. He brings a certain intensity to these questions. I gave some feedback on this model while it was in progress. My feedback was not a hundred percent addressed, it has turned out, but I'm happy to raise that limitation when we get to it. But anyway, to give some context to this: this Epoch AI group is a group of scholars who have been working for the last several years on trying to track AI progress and project the implications of AI. They've kind of been ahead of the curve in talking about the implications of AI for the economy. So I take their work on this subject very seriously, even if I take it knowing that this is not straight economics; these are definitely technologists first and then economists second.

Andrey: Alright. So with that kind of introduction, let's talk about the priors.

Our Priors on the GATE Model

Andrey: The priors. So the priors, I mean, we can't forget those. I think we came up with two priors to discuss. The first one is, is this model useful? And then the second one is the default version of this model…

Seth: What does the model actually predict? So, object level…

Andrey: …predicts output growth in the year 2027 of 23%.

Seth: Globally.

Andrey: I believe that is a global estimate.

Seth: It's a global model. Okay. 23% GDP growth rate in 2027. What is your prior on that prediction? You can't… Andrey actually fell out of his chair.

Andrey: Yes, I actually transcended my location in space and time.

Seth: The growth created was so large, they just started instantaneously levitating.

Andrey: I think it is extraordinarily unlikely that we'll have 23% GDP growth in 2027.

Seth: One in a thousand?

Andrey: Yeah, yeah, somewhere in that range.

Seth: Yeah, I'm in the one-in-a-thousand camp too. I mean, like, the easiest way to get 23% GDP growth in 2027 would be destroying a lot of the economy in 2026.

Andrey: Yeah. Yeah. Yeah. A war will do wonders for GDP growth after the war.

Seth: Yeah. Broken windows, right? Andrey, you seem rather skeptical about this quote-unquote default projection of the Epoch AI model. Why were you so skeptical going into reading this?

Andrey: Well, I don't wanna say I didn't know what the predictions of the model were before reading this, so maybe… but I guess 23% growth is just unprecedented. It is just hard to imagine, in such a short timeframe, us solving all of the adjustment frictions necessary to drastically boost production. Right? And we've talked about this many times, because there are so many portions of GDP that seemingly would be very hard to increase, like housing stock. Are we gonna solve all of our political issues all of a sudden? What about health outcomes research? Do we still need to run clinical trials? Are people just gonna willingly submit themselves to robot operations right away? You know, once again, I can imagine a world where
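For readers who want the automation channel pinned down: the GATE write-up maps "effective compute" into the share of economic tasks automated through a sigmoid. Below is a minimal sketch of that kind of mapping; the midpoint and steepness values are placeholders chosen for illustration, not Epoch AI's calibrated parameters.

```python
import math

def automated_fraction(log10_effective_compute, midpoint=30.0, steepness=1.5):
    """Toy sigmoid from log10(effective compute) to the share of tasks automated.
    The parameter values are placeholders, not the GATE model's calibration."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_effective_compute - midpoint)))

for log_c in range(26, 35, 2):
    print(f"log10(effective compute) = {log_c}: "
          f"{automated_fraction(log_c):.3f} of tasks automated")

# The S-shape drives the model's dynamics: extra compute does little early on,
# a lot near the midpoint, and almost nothing once automation saturates.
```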
We hear it constantly: social media algorithms are driving polarization, feeding us echo chambers, and maybe even swinging elections. But what does the evidence actually say? In the darkest version of this narrative, social media platform owners are shadow king-makers and puppet masters who can select the winner of a close election by selectively promoting narratives. Amorally, they disregard the heightened political polarization and mental anxiety that result from their manipulation of the public psyche. In this episode, we dive into an important study published in Science (How do social media feed algorithms affect attitudes and behavior in an election campaign? https://www.science.org/doi/10.1126/science.abp9364) that tackled this question. Researchers worked with Meta to experimentally change the feeds of tens of thousands of Facebook and Instagram users in the crucial months surrounding the 2020 election.

One of the biggest belief swings in the history of Justified Posteriors happens in this one!

The Core Question: What happens when you swap out the default, engagement-optimized algorithmic feed for a simple, reverse-chronological one showing posts purely based on recency?

Following our usual format, we lay out our priors before dissecting the study's findings:

* Time Spent: The algorithmic feed kept users scrolling longer.
* Content Consumed: The types of content changed in interesting ways. The chronological feed users saw more posts from groups and pages, more political content overall, and paradoxically, more content from untrustworthy news sources.
* Attitudes & Polarization: The study found almost no effect on key measures like affective polarization (how much you dislike the other side), issue polarization, political knowledge, or even self-reported voting turnout.

So, is the panic over algorithmic manipulation overblown? While the direct impact of this specific algorithmic ranking vs. chronological feed seems minimal on core political beliefs in this timeframe, other issues are at play:

* Moderation vs. Ranking: Does this study capture the effects of outright content removal or down-ranking (think the Hunter Biden laptop controversy)?
* Long-term Effects & Spillovers: Could small effects accumulate over years, or did the experiment miss broader societal shifts?
* Platform Power: Even if this comparison yields null results, does it mean platforms couldn't exert influence if they deliberately tweaked algorithms differently (e.g., boosting a specific figure like Elon Musk on X)?

(Transcript below)

🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey's posts:

💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en

Transcript:

Andrey: We might have naively expected that the algorithmic feed serves people their "red meat"—very far-out, ideologically matched content—and throws away everything else. But that is not what is happening.

Seth: Welcome everyone to the Justified Posteriors Podcast, where we read and are persuaded by research on economics and technology so you don't have to. I'm Seth Benzell, a man completely impervious to peer influence, coming to you from Chapman University in sunny Southern California.

Andrey: And this is Andrey Fradkin, effectively polarized towards rigorous evidence and against including tables in the back of the article rather than in the middle of the text.

Seth: Amazing.
And who's our sponsor for this season?

Andrey: Our sponsor for the season is the Digital Business Institute at the Questrom School of Business at Boston University. Thanks to the DBI, we're able to provide you with this podcast.

Seth: Great folks. My understanding is that they're sponsoring us because they want to see information like ours out there on various digital platforms, such as social media, right? Presumably, Questrom likes the idea of information about them circulating positively. Isn't that right?

Andrey: Oh, that's right. They want you to know about them, and by virtue of listening to us, you do. But I think, in addition, they want us to represent the ideal of what university professors should be doing: evaluating evidence and contributing to important societal discussions.

Andrey: So with that set, what are we going to be talking about today?

Seth: Well, we're talking about the concept of participating in important societal discussions itself. Specifically, we're discussing research conducted and published in Science, a prestigious journal. The research was conducted on the Facebook and Instagram platforms, trying to understand how those platforms are changing the way American politics works.

The name of the paper is "How Do Social Media Feed Algorithms Affect Attitudes and Behavior in an Election Campaign?" by Guess et al. There are many co-authors who I'm sure did a lot of work on this paper; like many Science papers, it's a big team effort. See the show notes for the full credit – we know you guys put the hours in.

This research tries to get at the question, specifically in the 2020 election, of to what extent decisions made by Mark Zuckerberg and others about how Facebook works shaped America's politics. It's an incredibly exciting question.

Andrey: Yeah, this is truly a unique study, and we'll get into why in just a bit. But first, as you know, we need to state our prior beliefs about what the study will find. We're going to pose two claims: one narrow and one broader. Let's start with the narrow claim.

Seth: Don't state a claim, we hypothesize, Andrey.

Andrey: Pardon my imprecision. A hypothesis, or question, if you will: How did the algorithmic feed on Facebook and Instagram affect political attitudes and behavior around the time of the 2020 presidential election? Seth, what is your prior?

Seth: Alright, I'm putting myself in a time machine back to 2020. It was a crazy time. The election was at the end of 2020, and the pandemic really spread in America starting in early 2020. I remember people being hyper-focused on social media because everyone was locked in their houses. It felt like a time of unusually high social media-generated peer pressure, with people pushing in both directions for the 2020 election. Obviously, Donald Trump is a figure who gets a lot of digital attention – I feel like that's uncontroversial.

On top of that, you had peak "woke" culture at that time and the Black Lives Matter protests. There was a lot of crazy stuff happening. I remember it as a time of strong populist forces and a time where my experience of reality was really influenced by social media.
It was also a time when figures like Mark Zuckerberg were trying to manage public health information, sometimes heavy-handedly silencing real dissent while trying to act for public welfare.

So, that's a long wind-up to say: I'm very open to the claim that Facebook and Instagram had a thumb on the scale during the 2020 election season, broadly in favor of chaos or political polarization – BLM on one side and MAGA nationalism on the other. At the same time, maybe vaguely lefty technocratic, like the "shut up and listen to Fauci" era. Man, I actually have a pretty high prior on the hypothesis that Facebook's algorithms put a real thumb on the scale. Maybe I'll put that around two-thirds. How about you, Andrey?

Andrey: In which direction, Seth?

Seth: Towards leftiness and towards political chaos.

Andrey: And what variable represents that in our data?

Seth: Very remarkably, the paper we studied does not test lefty versus righty; they do test polarization. I don't want to spoil what they find for polarization, but my prediction was that the algorithmic feed would lead to higher polarization. That was my intuition.

Andrey: I see. Okay. My prior on this was very tiny effects.

Seth: Tiny effects? Andrey, think back to 2020. Wasn't anything about my introduction compelling? Don't you remember what it was like?

Andrey: Well, Seth, if you recall, we're not evaluating the overall role of social media. We're evaluating the role of a specific algorithm versus not having an algorithmic feed and having something else – the reverse chronological feed, which shows items in order with the newest first. That's the narrow claim we're putting a prior on, rather than the much broader question of what social media in general did.

Seth: Yeah, but I guess that connects to my censorship comments. To the extent that there is a Zuckerberg thumb on the scale, it's coming through these algorithmic weightings, or at least it can come through that.

Andrey: I think we can come back to that. My understanding of a lot of platform algorithm stuff, especially on Facebook, is that people mostly get content based on who they follow – people, groups, news outlets. The algorithm shifts those items around, but in the end, it might not be that different from a chronological feed. Experts in this field were somewhat aware of this already. That's not to say the algorithmic feed had no effects, but I expected the effects to be very small.

Another aspect is how our political beliefs are formed. Yes, we spend time online, but we also talk to friends, read the news, get chain emails from our crazy uncle (not my crazy uncle, but people do).

Seth: One thing we'll get to see is what people substitute into when we take away their Facebook algorithmic feed.

Andrey: Yes. Furthermore, political beliefs generally don't change very frequently. I don't have a specific study handy, but it's fairly well understood. There are exceptions, like preference cascades, but generally, if you believe markets work well, you won't suddenly change your mind, and vice versa. This holds for many issues. Imagine polling people on who they voted for in 2016 versus 2020 – the correlation for voting for Donald Trump would be immensely high. It's really hard to move people's political preferences.

Seth: I think that's right. There are some things people's beliefs do move around on over shorter timelines, though. One thing they look at is political knowledge, which also seem
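Before moving on, the treatment contrast in the Meta experiment is easy to pin down in code. Here is a minimal sketch of the two feed rules being compared; the engagement score is a toy stand-in for Meta's actual ranking models, which are proprietary and far more complex.

```python
from dataclasses import dataclass

@dataclass
class Post:
    author: str
    timestamp: float             # seconds since epoch
    predicted_engagement: float  # toy stand-in for the platform's ranking score

def chronological_feed(posts):
    # Treatment condition in the study: reverse-chronological, newest first.
    return sorted(posts, key=lambda p: p.timestamp, reverse=True)

def algorithmic_feed(posts):
    # Stylized status quo: rank by predicted engagement instead of recency.
    return sorted(posts, key=lambda p: p.predicted_engagement, reverse=True)

posts = [
    Post("news_page", timestamp=100.0, predicted_engagement=0.2),
    Post("crazy_uncle", timestamp=90.0, predicted_engagement=0.9),
    Post("close_friend", timestamp=95.0, predicted_engagement=0.6),
]

print([p.author for p in chronological_feed(posts)])  # ['news_page', 'close_friend', 'crazy_uncle']
print([p.author for p in algorithmic_feed(posts)])    # ['crazy_uncle', 'close_friend', 'news_page']
```

The study's headline result is that swapping the second rule for the first changed what people saw and how long they scrolled, but barely moved the attitude measures.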
In this episode of Justified Posteriors, we dive into the paper "Which Economic Tasks Are Performed with AI: Evidence from Millions of Claude Conversations." We analyze Anthropic's effort to categorize how people use their Claude AI assistant across different economic tasks and occupations, examining both the methodology and implications with a critical eye.

We came into this discussion expecting coding and writing to dominate AI usage patterns—and while the data largely confirms this, our conversation highlights several surprising insights. Why are computer and mathematical tasks so heavily overrepresented, while office and administrative work lags behind? What explains the notably low usage for managerial tasks, despite AI's apparent suitability for scheduling and time management?

We raise questions about the paper's framing: Is a gamer asking for help with their crashing video game really engaging in "economic activity"? How much can we learn from analyzing four million conversations when only 150 were human-verified? And what happens when different models specialize—are people going to Claude for coding but elsewhere for art generation?

We also asked Claude itself to review this paper about Claude usage, revealing some surprisingly pointed critiques from the AI about the paper's fundamental assumptions.

Throughout the episode, we balance our appreciation for this valuable descriptive work with thoughtful critiques, ultimately suggesting directions for future research that could better connect what people currently use AI for with its potential economic impact. Whether you're interested in AI adoption, labor economics, or just curious about how people are actually using large language models today, we offer our perspectives as economists studying AI's integration into our economy.

Join us as we update our beliefs about what the Anthropic Economic Index actually tells us—and what it doesn't—about the future of AI in economic tasks. The full transcript is available at the end of this post.

The episode is sponsored by the Digital Business Institute at Boston University's Questrom School of Business. Big thanks to Chih-Ting (Karina) Yang for her help editing the episode.

🔗 Links to the paper for this episode's discussion:

Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations
GPTs are GPTs: Labor market impact potential of LLMs

🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey's posts:

💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en

Transcript:

Seth: Welcome to the Justified Posteriors Podcast. The podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzell, with nearly half of my total output constituting software development and writing tasks, coming to you from Chapman University in sunny Southern California.

Andrey: And I'm Andrey Fradkin, enjoying playing around with Claude 3.7, coming to you from Cambridge, Massachusetts.

Seth: So Andrey, what's the last thing you used AI for?

Andrey: The last thing I used AI for? Well, it's a great question, Seth, because I was so excited about the new Anthropic model that I decided to test run it by asking it to write a referee report about the paper we are discussing today.

Seth: Incredible. It's a little bit meta, I would say, given the topic of the paper. Maybe we can hold in our back pockets the results of that experiment for later.
What do you think?

Andrey: Yeah, I think we don't want to spoil the mystery about how Claude reviewed the work of its creators.

Seth: Claude reviewing the work of its creators - can Frankenstein's monster judge Frankenstein? Truly. So Andrey, maybe we've danced around this a little bit, but why don't you tell me what's the name of today's paper?

Andrey: The name of the paper is a bit of a mouthful: "Which Economic Tasks Are Performed with AI: Evidence from Millions of Claude Conversations." But on a more easy-to-explain level, the paper is introducing the Anthropic Economic Index, which is a measure of how people use the Claude chatbot, demonstrating how it can be useful in a variety of interesting ways for thinking about what people are using AI for.

Seth: Right. So at a high level, this paper is trying to document what people are using Claude for. I was also perplexed about the fact that they refer to this paper as an AI index, given that an index usually means a number, and it's unclear what is the one number they want you to take away from this analysis. But that doesn't mean they don't give you a lot of interesting numbers over the course of their analysis of how people are using Claude.

Andrey: So before we get into the paper a bit more, let's talk about the narrow and broad claims and what our priors are. The narrow claim is maybe what specifically are people using Claude for. Do we think this is a representative description of the actual truth? The authors divide up the analysis in many different ways, but one way to think about it is: is it true that the primary uses of this chatbot are computer and mathematical tasks? And is it also true that relatively few people use the chatbot for office and administrative support as well as managerial decision making?

Seth: Those are excellent questions. The first question is what are people using Claude for right now? And do we buy that the way they're analyzing the usage data gives us an answer to that question? Before I answer whether I think Claude's approach in analyzing their own chats is appropriate, let me tell you what my sense was coming in. If you had asked "What are people using chatbots for right now?" I would have guessed: number one, they're using it for doing their homework instead of actually learning the material, and number two, actual computer programmers are using it to speed up their coding. It can be a great coding assistant for speeding up little details.

Although homework wasn't a category analyzed by Claude, they do say that nearly half of the tasks they see people using these AI bots for are either some form of coding and software development or some form of writing. And of course, writing could be associated with tasks in lots of different industries, which they try to divide up. If you told me that half of what people use chatbots for is writing help and coding help - if anything, I would have thought that's on the low side. To me, that sounds like 80 percent of use cases.

Andrey: I think I'd say I'm with you. I think we probably agree on our priors. I'd say that most of the tasks I would expect to be done with the chatbot might be writing and programming related. There's a caveat here, though - there's a set of behaviors using chatbots for entertainment's sake.
I don't know how frequent that is, and I don't know if I would put it into writing or something else, but I do know there is a portion of the user base that just really likes talking to Claude, and I don't know where that would be represented in this dataset.

Seth: Maybe we'll revisit this question when we get to limitations, but I think one of the limitations of this work is they're trying to fit every possible usage of AI into this government list of tasks that are done in the economy. But I've been using AI for things that aren't my job all the time. When America came up with this O*NET database of tasks people do for their jobs, I don't think they ever pretended for this to be a list of every task done by everyone in America. It was supposed to be a subset of tasks that seem to be economically useful or important parts of jobs that are themselves common occupations. So there are some limitations to this taxonomical approach right from the start.

Coming back to your point about people playing around with chatbots instead of using them for work - I have a cousin who loves to get chatbots to write slightly naughty stories, and then he giggles. He finds this so amusing! Presumably that's going to show up in their data as some kind of creative writing task.

Andrey: Yeah.

Seth: So moving from the question of what we think people are using chatbots for - where I think we share this intuition that it's going to be overwhelmingly coding and writing - now we go to this next question you have, which is: to what extent can we just look at conversations people have with chatbots and translate the number of those conversations or what sort of things they talk about into a measure of how people are going to usefully be integrating AI into the economy? There seems to be a little bit of a step there.

Andrey: I don't think the authors actually make the claim that this is a map of where the impact is going to be. I think they mostly just allude to the fact that this is a really useful system for real-time tracking of what the models are being used for. I don't think the authors would likely claim that this is a sign of what's to come necessarily. But it's still an interesting question.

Seth: I hear that, but right on the face, they call it the Anthropic Economic Index. If they wanted to call it the "Anthropic What Are People Using Anthropic For Right Now Snapshot" or the "Anthropic Usage Index," I'm a lot more sympathetic. I think they have to do a lot less work defending that idea than the "Anthropic Economic Index."

Andrey: Well, this is maybe where the academic and corporate lingo collide. But I hear you in the sense that it's not clear that what is being done in these chats is necessarily economic activity versus personal activity, learning activity, and so on. A more humble naming of the index could have appeased some of the criticisms.

Seth: You've gotta be on the defensive when you come on the Justified Posteriors podcast, because we challenge you to justify your posterior, so you better be ready to defend yourself. So, for the narrow question, I gave you my prior - it's gonna be overwhelmingly used for coding and people doing homework assignments. And homework assignments will look like mostly creative writing.
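As a rough picture of the pipeline being debated here, a minimal sketch: each conversation gets mapped to an O*NET-style task label, and the index reports usage shares. In the paper the mapping is done by a model (with only a small human-verified sample); the labels and the classify-then-count step below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical conversation-level task labels. In practice a model assigns these
# from chat text against the O*NET task taxonomy; these examples are made up.
labels = [
    "software development", "software development", "creative writing",
    "homework help", "software development", "creative writing",
    "office and administrative support",
]

counts = Counter(labels)
total = sum(counts.values())
for task, n in counts.most_common():
    print(f"{task}: {n / total:.0%} of conversations")

# Caveat from the discussion above: shares of conversations are not shares of
# economic value. A gamer debugging a crashing game and a firm shipping
# software can land in the same bucket.
```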
In this episode, we tackle one of the most pressing questions of our technological age: how much risk of human extinction should we accept in exchange for unprecedented economic growth from AI?

The podcast explores research by Stanford economist Chad Jones, who models scenarios where AI might deliver a staggering 10% annual GDP growth but carry a small probability of triggering an existential catastrophe. We dissect how our risk tolerance depends on fundamental assumptions about utility functions, time horizons, and what actually constitutes an "existential risk."

We discuss how Jones' model presents some stark calculations: with certain plausible assumptions, society might rationally accept up to a 33% cumulative chance of extinction for decades of AI-powered prosperity. Yet slight changes to risk assumptions or utility functions can flip the calculation entirely, suggesting we should halt AI development altogether.

We also discuss how much of global GDP—potentially trillions of dollars—should be invested in AI safety research. Jones' models suggest anywhere from 1.8% to a staggering 15.8% of world GDP might be the optimal investment level to mitigate existential risk, numbers that dwarf current spending.

Beyond the mathematics, we discuss philosophical tensions: Should a world government be more or less risk-averse than individuals? Do we value additional years of life more than additional consumption? And how do we navigate a world where experts might exploit "Pascal's Mugger" scenarios to demand funding?

"If we delay AI," Seth concludes, "it will require killing something of what is essential to us. The unbounded optimism about the power of thought and freedom, or, as Emerson would've put it, the true romance."

Justified Posteriors is sponsored by the Digital Business Institute at Boston University's Questrom School of Business. Big thanks to Chih-Ting "Karina" Yang for her help editing the episode.

🔗 Links to the paper for this episode's discussion:

(FULL PAPER) The AI Dilemma: Growth versus Existential Risk by Charles I. Jones
(FULL PAPER) How Much Should We Spend to Reduce A.I.'s Existential Risk? by Charles I. Jones

🔗 Related papers

Robust Technology Regulation by Andrew Koh and Sivakorn Sanguanmoo
Existential Risk and Growth by Leopold Aschenbrenner and Philip Trammell

🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey's posts:

💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
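A stylized two-outcome version of the Jones tradeoff (our simplification for intuition, not the paper's full dynamic model) shows why the utility function does so much work. Let u(c_0) be the value of the status-quo path, u(c_1) the value of the AI-boom path, and u̲ the value assigned to extinction. Society should accept an extinction probability p whenever expected utility rises:

```latex
% Stylized acceptance condition (a simplification for intuition, not the paper's model)
(1-p)\,u(c_1) + p\,\underline{u} \;\ge\; u(c_0)
\quad\Longleftrightarrow\quad
p \;\le\; \frac{u(c_1)-u(c_0)}{u(c_1)-\underline{u}}.
```

With unbounded utility (log, for example), a large enough consumption boom makes the numerator huge and the acceptable risk substantial; with bounded utility (CRRA curvature above one), the gains from even explosive growth are capped, so the acceptable risk can fall to nearly zero. That is the flip between "accept a large cumulative risk" and "halt AI development" described above.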