Itamar Pres

I'm a PhD student at MIT, where I work with Jacob Andreas in the Language and Intelligence Group at CSAIL. My primary focus is on making artificial intelligence systems safer and more interpretable. My work is supported by the NSF Graduate Research Fellowship Program.

At the University of Michigan, I was part of the Language and Information Technologies Lab, where I worked alongside Dr. Andrew Lee and Prof. Rada Mihalcea, using interpretability to study toxicity and personas in LLMs.

I previously interned with the Krueger AI Safety Lab at the University of Cambridge, working alongside Prof. David Krueger, Dr. Ekdeep Singh Lubana, and Dr. Laura Ruis to develop new inference-time methods for controlling model behavior.

I have also worked with Hidenori Tanaka at the Harvard Center for Brain Science to mechanistically study in-context learning.

I first started doing interpretability research with Neel Nanda through the ML Alignment & Theory Scholars Program.

News

Jun 2026 Self-CTRL is out on arXiv.
May 2026 Our position paper on self-consistency was accepted to ICML 2026.
Aug 2025 Started my PhD at MIT CSAIL, supported by the NSF GRFP.
Jan 2025 Competition Dynamics accepted to ICLR 2025 as a Spotlight.

Selected publications (* denotes equal contribution)

	Self-CTRL: Self-Consistency Training with Reinforcement Learning
	Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, and Jacob Andreas
	Preprint, 2026
	We introduce a method that optimizes for consistency between a language model's self-explanations and its behavior on related inputs, either updating explanations to better predict behavior or modifying behavior to align with explanations. On formal probabilistic reasoning tasks, our method raises the correlation between self-reported and measured latent biases. In a constitutional AI setting, it improves a third-party auditor's refusal-prediction accuracy.

	Position: It's Time to Optimize for Self-Consistency
	Itamar Pres, Belinda Z. Li, Laura Ruis, Zifan Carl Guo, Keya Hu, Mehul Damani, Isha Puri, Ekdeep Singh Lubana, and Jacob Andreas
	ICML, 2026
	We argue that many failures in current LMs arise from a shared modeling assumption: that behavior can be specified and evaluated independently on single-output pairs. In this position paper, we propose self-consistency as a framework for understanding these failures. We observe that a wide variety of techniques designed to improve specific aspects of LM behavior (targeting properties as diverse as adversarial robustness and factual coherence) can be understood as special cases of a common "consistency optimization" procedure and addressed with a standard set of optimization tools. The same framework can be used to specify emerging model capabilities, such as introspection and self-improvement, by constraining a model's behavior to be consistent with its own descriptions of that behavior.

	Competition Dynamics Shape Algorithmic Phases of In-Context Learning
	Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, and Hidenori Tanaka
	ICLR, 2025 (Spotlight)
	We introduce a framework for understanding in-context learning (ICL) using a synthetic task based on Markov chain mixtures. We find this task replicates most of the previously described ICL phenomena. We identify four distinct algorithmic phases, blending unigram or bigram statistics with fuzzy retrieval or inference. These phases compete dynamically, revealing sharp transitions in ICL behavior due to changes in training conditions, such as data diversity and context size. I’m proud to have led the interpretability work, quantifying neuron memorization and tracking attention head evolution during training. Check it out!

	Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
	Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, and David Krueger
	NeurIPS workshop on Foundation Model Interventions, 2024 (Spotlight)
	We propose a robust evaluation pipeline for behavioral steering interventions in LLMs, addressing gaps in current methods like subjective metrics and lack of comparability. Our pipeline aligns with downstream tasks, considers model likelihoods, enables cross-behavioral comparisons, and includes baselines. Testing interventions like Contrastive Activation Addition (CAA) and Inference-Time Intervention (ITI), we find their efficacy varies by behavior, with results often overstated and critical distinctions between promoting and suppressing behaviors overlooked.

	A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
	Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea
	ICML, 2024 (Oral)
	We study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. We first study how toxicity is represented and elicited in pre-trained language models (GPT2-medium, Llama2-7b). We then apply DPO to reduce toxicity and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the models, reverting them back to their toxic behavior.
