Update: 2023-09-22


In this episode, Nathan sits down with three researchers at Carnegie Mellon studying adversarial attacks and mimetic initialization: Zico Kolter, Andy Zou, and Asher Trockman. They discuss the motivation behind researching universal adversarial attacks on language models, how the attacks work, and the short-term harms and long-term risks of these jailbreaks.

[00:00:00 ] - Introducing the podcast and guests Zico Kolter, Andy Zou, and Asher Trockman

[00:06:32 ] - Discussing the motivation and high-level strategy for the universal adversarial attack on language models

[00:09:33 ] - Explaining how the attacks work by adding nonsense tokens to maximize target sequence probability

[00:11:06 ] - Comparing to prior adversarial attacks in vision models

[00:13:47 ] - Details on the attack optimization process and discrete token search

[00:17:09 ] - The empirical notion of "mode switching" in the language models

[00:21:18 ] - Technical details on gradient computation across multiple models and prompts

[00:23:46 ] - Operating in one-hot vector space rather than continuous embeddings

[00:25:50 ] - Evaluating candidate substitutions across all positions to find the best update

[00:28:05 ] - Running the attack optimization for hundreds of steps across multiple GPUs

[00:39:14 ] - The difficulty of understanding the loss landscape and internal model workings

[00:43:55 ] - The flexibility afforded by separating the loss and optimization approach

[00:48:16 ] - The challenges of creating inherently robust models via adversarial training

[00:52:34 ] - Potential approaches to defense through filtering or inherent model robustness

[00:55:51 ] - Transferability results to commercial models like GPT-4 and Claude

[00:59:25 ] - Hypotheses on why the attacks transfer across different model architectures

[01:04:36 ] - The mix of human-interpretable and nonsense features in effective attacks

[01:08:29 ] - The appearance of intuitive manual jailbreak triggers in some attacks

[01:15:33 ] - Short-term harms of attacks vs long-term risks

[01:18:37 ] - Helping those with an incomplete understanding of LLMs appreciate how the models differ from human reasoning

[01:24:16 ] - Mitigating risks by training on filtered datasets vs broad web data

[01:29:16 ] - Curriculum learning as a strategy for both capability and safety

[01:30:35 ] - Influencing developers building autonomous systems with LLMs

[01:33:19 ] - Alienness of LLM failure modes compared to human reasoning

[01:35:45 ] - Getting inspiration from biological visual system structure

[01:40:35 ] - Initialization as an alternative to pretraining for small datasets

[01:51:41 ] - Encoding useful structures like grammars in initialization without training

[02:12:10 ] - Most ideas don't progress to research projects

[02:13:02 ] - Pursuing ideas based on interest and feasibility

[02:15:14 ] - Fun of exploring uncharted territory in ML research

