MATS - NeurIPS 2025

🏛 MATS

4 papers across 3 sessions

Poster Session 4

1 paper

Thursday, December 4, 2025 · 4:30 PM → 7:30 PM

Exhibit Hall C,D,E

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

#1004 · David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, Joseph Bloom

Poster Session 5

2 papers

Friday, December 5, 2025 · 11:00 AM → 2:00 PM

Exhibit Hall C,D,E

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

#1517 Spotlight · Satvik Golechha, Adrià Garriga-Alonso

We create a sandbox for LLM-agents to elicit goal-directed open-ended strategic deception, evaluate this deceptive capability, and show that linear probes do very well at detecting it, even OOD.

Large language models can learn and generalize steganographic chain-of-thought under process supervision

#1008 · ROBERT MC CARTHY, Joey SKAF, Luis Ibanez-Lissen, Vasil Georgiev, Connor Watts, Hannes Whittingham, Lorena Gonzalez-Manzano, Cameron Tice, Edward Young, Puria Radmard, David Lindner

We show that penalizing certain CoT reasoning makes LLMs learn encoding schemes that generalize to unseen examples.

Poster Session 6

1 paper

Friday, December 5, 2025 · 4:30 PM → 7:30 PM

Exhibit Hall C,D,E

Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

#4615 · Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda

VLMs often perform worse at recalling facts than their LLM backbones because visual representations are formed too late in the forward pass to trigger the LLMs factual recall circuit.