Publications (* denotes equal contribution)
Competition Dynamics Shape Algorithmic Phases of In-Context Learning
Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, and Hidenori Tanaka
ICLR, 2025 (Spotlight)
We introduce a framework for understanding in-context learning (ICL) using a synthetic task based on Markov chain
mixtures, and find that this task replicates most of the previously described ICL phenomena. We identify four distinct
algorithmic phases, each blending unigram or bigram statistics with fuzzy retrieval or inference. These phases compete
dynamically, producing sharp transitions in ICL behavior as training conditions such as data diversity and context size
change. I'm proud to have led the interpretability work, quantifying neuron memorization and tracking how attention
heads evolve during training. Check it out!
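For a feel of the setup, here is a minimal sketch of the kind of Markov-chain-mixture data the task is built on. This is an illustration only, not the paper's released code; the vocabulary size, number of chains, and sequence length are assumed values.

```python
# Minimal sketch of a Markov-chain-mixture data generator (illustrative, not the
# paper's code). VOCAB, N_CHAINS, and SEQ_LEN are assumed values.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_CHAINS, SEQ_LEN = 10, 512, 256

# Each chain is a random row-stochastic transition matrix over the vocabulary.
chains = rng.dirichlet(np.ones(VOCAB), size=(N_CHAINS, VOCAB))

def sample_sequence(chains, seq_len, rng):
    """Pick one chain from the mixture and roll out a token sequence from it."""
    T = chains[rng.integers(len(chains))]   # transition matrix of the sampled chain
    seq = [rng.integers(T.shape[0])]        # uniform initial state
    for _ in range(seq_len - 1):
        seq.append(rng.choice(T.shape[0], p=T[seq[-1]]))
    return np.array(seq)

batch = np.stack([sample_sequence(chains, SEQ_LEN, rng) for _ in range(8)])
print(batch.shape)  # (8, 256); a small transformer trained on such batches exhibits ICL
```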
Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, and David Krueger
NeurIPS workshop on Foundation Model Interventions, 2024 (Spotlight)
We propose a robust evaluation pipeline for behavioral steering interventions in LLMs, addressing gaps in current
methods such as subjective metrics and a lack of comparability. Our pipeline aligns with downstream tasks, considers
model likelihoods, enables cross-behavioral comparisons, and includes baselines. Testing interventions such as
Contrastive Activation Addition (CAA) and Inference-Time Intervention (ITI), we find that their efficacy varies by
behavior, that results are often overstated, and that the critical distinction between promoting and suppressing a
behavior is often overlooked.
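For flavor, here is a hedged sketch of activation-addition style steering in the spirit of CAA, not the paper's evaluation pipeline; the model (gpt2), layer index, steering strength, and contrastive prompt pair are illustrative assumptions.

```python
# Hedged sketch of activation-addition style steering (in the spirit of CAA),
# not the paper's evaluation pipeline. The model (gpt2), LAYER, ALPHA, and the
# contrastive prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, ALPHA = 6, 4.0  # assumed injection layer and steering strength

def mean_residual(prompt, layer):
    """Mean residual-stream activation at `layer` for a single prompt."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer][0].mean(dim=0)

# Steering vector: activation difference between a contrastive prompt pair.
steer = mean_residual("I love my neighbors.", LAYER) - mean_residual("I hate my neighbors.", LAYER)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; shift the hidden states by the steering vector.
    return (output[0] + ALPHA * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("The new neighbors are", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # remove the hook to restore unsteered behavior
```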
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea
ICML, 2024 (Oral)
We study a popular alignment algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces
toxicity. We first examine how toxicity is represented and elicited in pre-trained language models (GPT2-medium,
Llama2-7b). We then apply DPO to reduce toxicity and find that the capabilities learned during pre-training are not
removed, but rather bypassed. We use this insight to demonstrate a simple method for un-aligning the models,
reverting them to their toxic behavior.
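For reference, a minimal sketch of the DPO objective the paper analyzes; the sequence log-probabilities and beta value below are toy numbers, and this is not the authors' training code.

```python
# Minimal sketch of the DPO loss: the policy is pushed to prefer the chosen
# response (w) over the rejected one (l) relative to a frozen reference model.
# The log-prob values and beta are toy assumptions, not results from the paper.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from sequence log-probs of chosen (w) / rejected (l) responses."""
    policy_margin = policy_logp_w - policy_logp_l
    ref_margin = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin))

# Toy example: the policy already prefers the non-toxic (chosen) continuation.
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-13.5))
print(loss)  # loss shrinks as the policy's preference margin grows past the reference's
```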