1 paper across 1 session
We show, in theory and in practice, that if causal abstraction permits non-linear alignment maps, any neural network (even a randomly initialized one) can be perfectly aligned to any algorithm, rendering this interpretability approach vacuous unless the class of alignment maps is constrained.
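A minimal sketch of why unconstrained non-linear maps trivialize alignment (an illustrative construction, not the paper's actual one): take a fixed random network and a target algorithm (XOR with one intermediate variable); because distinct inputs almost surely yield distinct hidden states, a non-linear lookup map over stored hidden states reproduces the algorithm's variable exactly. The network names and the nearest-neighbor map below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Random network": a fixed, untrained MLP mapping 2-bit inputs to hidden states.
W1 = rng.normal(size=(2, 16))
W2 = rng.normal(size=(16, 16))

def random_net_hidden(x):
    return np.tanh(np.tanh(x @ W1) @ W2)

# Target "algorithm": XOR, whose single intermediate variable we try to align.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
algo_var = np.array([0, 1, 1, 0])  # XOR of the two input bits

# Non-linear "alignment" map: nearest-neighbor lookup over stored hidden states.
# Distinct inputs give distinct hidden states (almost surely for random weights),
# so this map realizes the algorithm's variable perfectly -- without the network
# computing anything XOR-like.
stored = random_net_hidden(inputs)

def align(h):
    idx = np.argmin(np.linalg.norm(stored - h, axis=1))
    return algo_var[idx]

preds = [int(align(random_net_hidden(x))) for x in inputs]
print(preds)  # matches algo_var exactly: [0, 1, 1, 0]
```

The lookup is itself a legitimate non-linear function, which is the point: once arbitrary non-linear maps are allowed, perfect alignment carries no evidence that the network implements the algorithm.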