Itamar Pres

I'm a PhD student at MIT, where I work with Jacob Andreas in the Language and Intelligence Group at CSAIL. My primary focus is on making artificial intelligence systems safer and more interpretable. My work is supported by the NSF Graduate Research Fellowship Program.

At Michigan, I was involved with the Language and Information Technologies Lab, where I worked alongside Dr. Andrew Lee and Prof. Rada Mihalcea, using interpretability to study toxicity and personas in LLMs.

I previously interned with the Krueger AI Safety Lab at the University of Cambridge, working alongside Prof. David Krueger, Dr. Ekdeep Singh Lubana, and Dr. Laura Ruis to develop new inference-time methods for controlling model behavior.

I have also worked with Hidenori Tanaka at the Harvard Center for Brain Science to mechanistically study in-context learning.

I first started doing interpretability research with Neel Nanda through the ML Alignment & Theory Scholars Program.

News
  • Jun 2026 Self-CTRL is out on arXiv.
  • May 2026 Our position paper on self-consistency was accepted to ICML 2026.
  • Aug 2025 Started my PhD at MIT CSAIL, supported by the NSF GRFP.
  • Jan 2025 Competition Dynamics accepted to ICLR 2025 as a Spotlight.
Selected publications (* denotes equal contribution)
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, and Jacob Andreas
Preprint, 2026

We introduce a method that optimizes for consistency between a language model's self-explanations and its behavior on related inputs, either updating explanations to better predict behavior or modifying behavior to align with explanations. On formal probabilistic reasoning tasks, our method raises the correlation between self-reported and measured latent biases. In a constitutional AI setting, it improves a third-party auditor's refusal-prediction accuracy.

Position: It's Time to Optimize for Self-Consistency
Itamar Pres*, Belinda Z. Li*, Laura Ruis*, Zifan Carl Guo, Keya Hu, Mehul Damani, Isha Puri, Ekdeep Singh Lubana, and Jacob Andreas
ICML, 2026

We argue that many failures in current LMs arise from a shared modeling assumption: that behavior can be specified and evaluated independently on single input-output pairs. In this position paper, we propose self-consistency as a framework for understanding these failures. We observe that a wide variety of techniques designed to improve specific aspects of LM behavior (targeting properties as diverse as adversarial robustness and factual coherence) can be understood as special cases of a common "consistency optimization" procedure and addressed with a standard set of optimization tools. The same framework can be used to specify emerging model capabilities, such as introspection and self-improvement, by constraining a model's behavior to be consistent with its own descriptions of that behavior.

Competition Dynamics Shape Algorithmic Phases of In-Context Learning
Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, and Hidenori Tanaka
ICLR, 2025 (Spotlight)

We introduce a framework for understanding in-context learning (ICL) using a synthetic task based on Markov chain mixtures. We find that this task reproduces most previously described ICL phenomena. We identify four distinct algorithmic phases that blend unigram or bigram statistics with fuzzy retrieval or inference. These phases compete dynamically, revealing sharp transitions in ICL behavior as training conditions, such as data diversity and context size, change. I'm proud to have led the interpretability work, quantifying neuron memorization and tracking attention head evolution during training. Check it out!

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, and David Krueger
NeurIPS workshop on Foundation Model Interventions, 2024 (Spotlight)

We propose a robust evaluation pipeline for behavioral steering interventions in LLMs, addressing gaps in current evaluation methods such as subjective metrics and a lack of comparability. Our pipeline aligns with downstream tasks, considers model likelihoods, enables cross-behavioral comparisons, and includes baselines. Testing interventions such as Contrastive Activation Addition (CAA) and Inference-Time Intervention (ITI), we find that their efficacy varies by behavior, that reported results are often overstated, and that critical distinctions between promoting and suppressing behaviors are overlooked.

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea
ICML, 2024 (Oral)

We study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. We first study how toxicity is represented and elicited in pre-trained language models (GPT2-medium, Llama2-7b). We then apply DPO to reduce toxicity and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the models, reverting them back to their toxic behavior.

