Full Professor, Harvard University
2 papers at NeurIPS 2025
Interpretability methods based on linear, orthogonal features fall short for modern neural representations, which are often hierarchical and nonlinear. Better results come from aligning methods with the true structure of these representations.
We show that Sparse Autoencoders (SAEs) are inherently biased by their internal assumptions toward detecting only a subset of the concepts present in model activations, highlighting the need for concept-geometry-aware design of new SAE architectures.
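To make the "internal assumptions" concrete, here is a minimal sketch of a standard SAE (not the paper's implementation; names and hyperparameters are illustrative). The decoder is a linear dictionary, so every concept the SAE can recover must be a sparse, roughly linear direction in activation space; concepts with hierarchical or nonlinear geometry fall outside that model.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Vanilla SAE: activations are reconstructed as sparse linear
    combinations of learned dictionary directions."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)       # activation -> concept codes
        self.decoder = nn.Linear(d_dict, d_model)       # linear dictionary: one direction per concept

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))             # nonnegative codes, pushed toward sparsity
        recon = self.decoder(codes)                     # sparse linear combination of directions
        return recon, codes


def sae_loss(x, recon, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty on the codes.
    # This objective only rewards concepts expressible as sparse,
    # linear directions -- the bias described above.
    return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    sae = SparseAutoencoder(d_model=512, d_dict=4096)
    x = torch.randn(8, 512)                             # stand-in for model activations
    recon, codes = sae(x)
    print(sae_loss(x, recon, codes).item())
```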