Interviewing Eugene Vinitsky on self-play for self-driving and what else people do with RL
Description
Eugene Vinitsky is a professor in the New York University Department of Civil and Urban Engineering. He’s one of my original reinforcement learning friends from when we were both doing our Ph.D.s in RL at UC Berkeley circa 2020. Eugene has extensive experience in self-driving, open-endedness, multi-agent reinforcement learning, and self-play with RL. In this conversation we focus on a few key topics:
* His latest results on self-play for self-driving and what they say about the future of RL,
* Why self-play is confusing and how it relates to the recent takeoff of RL for language models, and
* The future of RL in LMs and elsewhere.
This is a conversation where we take the time to distill very cutting-edge research directions down to their core essence. I felt like we were learning in real time what recent developments mean for RL, how RL has different scaling laws than the rest of deep learning, and what is truly salient about self-play.
The main breakthrough we discuss is scaling up self-play techniques for large-scale, simulated reinforcement learning. Scaling RL in simulation has already become economical in single-agent domains. Now, the door is open to complex, multi-agent scenarios where more diversity is needed to find solutions (in this case, that’s what self-play provides).
Eugene’s Google Scholar | Research Lab | Linkedin | Twitter | BlueSky | Blog (with some great career advice).
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show outline & links
We cover many papers in this podcast. Also, as an experiment, here’s a Deep Research report on “all the papers that appeared in this podcast transcript.”
In this episode, we cover:
* Self-play for self-driving, mostly around the recent paper Robust Autonomy Emerges from Self-Play (Cusumano-Towner et al. 2025). The simulator they built powering this is Gigaflow. More discussion on HackerNews. (Here’s another self-play for self-driving paper, and another from Eugene from earlier this year.) A few highlights (a code sketch of the core idea follows this list):
“All simulated agents use the same neural net with the same weights, albeit with randomized rewards and conditioning vector to allow them to behave as different types of vehicles with different types of aggressiveness. This is like driving in a world where everyone is different copies of you, but some of your copies are in rush while others are patient. This allows backprop to optimize for a sort of global utility across the entire population.”
“The resulting policy simulates agents that are human-like, even though the system has never seen humans drive.”
* Large Language Models are In-context Preference Learners — how language models can come up with reward functions that will be applied to RL training directly. Related work from Stanford.
* Related literature from Interconnects! The first includes literature we mention on learning locomotion for quadrupeds with deep RL (special shoutout as usual to Marco Hutter’s group).
* Recent and relevant papers: Value-based RL Scales Predictably and Magnetic control of tokamak plasmas through deep reinforcement learning.
* Other things we mention:
* Cruise, Tesla, and Waymo’s autonomy stacks (speculation) and how the self-driving industry has changed since we were / were considering working in it.
* Evo 2 foundation model for biology.
* Eugene is working with a new startup on some LLM and RL stuff. If you’re interested in this episode, ping eugene@aitco.dev. Not a paid promotion.
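As a companion to the Gigaflow highlights above, here is a minimal, hypothetical sketch of the shared-policy idea: one network whose weights every simulated agent shares, with per-agent randomized reward weights and a conditioning vector that sets driving style. All class names, dimensions, and reward terms here are illustrative assumptions, not the paper’s actual code.

```python
# A minimal sketch (not the Gigaflow implementation): every simulated agent
# shares one policy network, but each agent gets randomized reward weights and
# a conditioning vector that changes its driving style (e.g., aggressiveness).
import numpy as np
import torch
import torch.nn as nn


class ConditionedPolicy(nn.Module):
    """One set of weights shared by all agents; behavior varies via conditioning."""

    def __init__(self, obs_dim: int, cond_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cond_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # The conditioning vector is simply concatenated to the observation.
        return self.net(torch.cat([obs, cond], dim=-1))


def sample_agent_conditioning(num_agents: int, cond_dim: int) -> torch.Tensor:
    # Each copy of the policy gets its own conditioning vector, so the same
    # weights can act as many different "personalities" in one scene.
    return torch.rand(num_agents, cond_dim)


def sample_reward_weights(num_agents: int, num_terms: int = 3) -> np.ndarray:
    # Randomized per-agent trade-offs between reward terms
    # (e.g., progress vs. comfort vs. collision avoidance).
    return np.random.dirichlet(np.ones(num_terms), size=num_agents)


# Rollout sketch: all agents act with the SAME weights in one forward pass,
# so gradient updates improve a kind of population-level utility.
policy = ConditionedPolicy(obs_dim=32, cond_dim=4, act_dim=2)
cond = sample_agent_conditioning(num_agents=8, cond_dim=4)
reward_weights = sample_reward_weights(num_agents=8)  # would weight each agent's reward in the simulator
obs = torch.zeros(8, 32)      # placeholder observations from a simulator
actions = policy(obs, cond)   # one batched action for every agent in the scene
```

The design choice that matters is that experience from every agent updates the same weights, which is what lets a single policy behave like an entire diverse population.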
Chapters
* 00:00:00 Introduction & RL Fundamentals
* 00:11:27 Self‑Play for Self‑Driving Cars
* 00:31:57 RL Scaling in Robotics and Other Domains
* 00:44:23 Language Models and In-Context Preference Learning
* 00:55:31 Future of RL and Grad School Advice
Transcript
I attempted to generate this with ElevenLabs’ new Scribe tool, but found the formatting annoying and reverted back to Alessio’s smol-podcaster. If you’re interested in working part-time as an editorial aide to Interconnects, please get in touch.
Nathan Lambert [00:01:27 ]: Hey, Eugene. Welcome to the show.
Eugene Vinitsky [00:01:29 ]: Hey, Nathan. Thanks for having me. Excited to be here.
Nathan Lambert [00:01:32 ]: Yeah, so I'll have said this in the intro as well, but we definitely go well back in all the way to Berkeley days and RL days, I think.
I will embarrass you a little bit now on the live read, which is like, you were one of the people, when I was switching into RL, who it seemed like had figured out how to get into AI from a different background, and that's what I was trying to do in 2017 and 2018.
So that was kind of fun, and now we're just friends, which is good.
Eugene Vinitsky [00:02:01 ]: Yeah, we were both figuring out. If I had any lead over you there, I was also frantically trying to figure it out, because I was coming from a weird background.
Nathan Lambert [00:02:11 ]: There are definitely a lot of people that do that now and over-attribute small time deltas to big strategic plans, which was probably what it was.
And we're just going to do some of our normal conversations on RL and self-play.
I think the backstory of this is you told me that your recent paper from some of your time at Apple, I don't want to pin down the timing too specifically, was something, paraphrasing, like the most exciting RL thing you've ever been a part of.
And major RL projects are not that frequent.
I think if you segment out all the language model excitement in the past 10 years, there's really a few major milestones, and it's good to kind of talk about them.
So we can kind of start, I think, with basic things, like how do you define reinforcement learning, and it will kind of build up to this self-driving project.
Eugene Vinitsky [00:03:05 ]: Yeah, so I think RL is kind of a big thing, but at a really basic level, you have this process of taking actions in the world.
You're seeing the state of the world.
If you're taking actions in the world, you sometimes receive a reward that tells you the value of that action, and you're trying to kind of optimize your cumulative behavior over time.
So that, you know, over long trajectories, you're optimizing those costs.
That's both, you know, the hard thing and the exciting thing: if you do it well, you can really optimize really long-horizon behaviors.
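For reference, the objective Eugene is describing here, optimizing cumulative behavior over a long trajectory, is usually written as the expected discounted return (standard RL notation, not a formula from the episode):

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right],
\qquad \tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)
```

Here the policy π picks action a_t after observing state s_t, r is the per-step reward, and the discount γ sets how heavily long-horizon behavior counts.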
Nathan Lambert [00:03:41 ]: Yeah, I agree.
And it's funny because now it's finally, the language models are finally doing this long chain of thought, and I don't really think that's the same.
I think the interactive notion will come up a lot here where these long context behaviors are many, many actions interacting with the world relative to one really, really long action, which is kind of odd.
Eugene Vinitsky [00:04:04 ]: Yeah, I guess, yeah, it mixes things, right?
Because it has very long state, right?
It's got very long contexts, and it's generating its own context.
But in the end, there's really one action at the end that, like, kind of determines how everything went, you know?
Nathan Lambert [00:04:23 ]: Yeah, yeah, yeah, we'll get into this.
And then the next thing that we kind of need to set up is what do you define self-play as?
I think this word has been particularly broken in recent times with language models, and I'm hoping we can get a fairly specific criteria for what is self-play and what ar