1 paper across 1 session
We show, in theory and in practice, that if causal abstraction permits non-linear alignment maps, any neural network (even a randomly initialized one) can be perfectly aligned to any algorithm, rendering this interpretability approach vacuous unless the class of alignment maps is constrained.
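A minimal sketch of why unconstrained non-linear maps trivialize alignment (an illustrative construction, not the paper's actual one): take a fixed random network and a target algorithm (XOR with one intermediate variable); because distinct inputs almost surely yield distinct hidden states, a non-linear lookup map over stored hidden states reproduces the algorithm's variable exactly. The network names and the nearest-neighbor map below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Random network": a fixed, untrained MLP mapping 2-bit inputs to hidden states.
W1 = rng.normal(size=(2, 16))
W2 = rng.normal(size=(16, 16))

def random_net_hidden(x):
    return np.tanh(np.tanh(x @ W1) @ W2)

# Target "algorithm": XOR, whose single intermediate variable we try to align.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
algo_var = np.array([0, 1, 1, 0])  # XOR of the two input bits

# Non-linear "alignment" map: nearest-neighbor lookup over stored hidden states.
# Distinct inputs give distinct hidden states (almost surely for random weights),
# so this map realizes the algorithm's variable perfectly -- without the network
# computing anything XOR-like.
stored = random_net_hidden(inputs)

def align(h):
    idx = np.argmin(np.linalg.norm(stored - h, axis=1))
    return algo_var[idx]

preds = [int(align(random_net_hidden(x))) for x in inputs]
print(preds)  # matches algo_var exactly: [0, 1, 1, 0]
```

The lookup is itself a legitimate non-linear function, which is the point: once arbitrary non-linear maps are allowed, perfect alignment carries no evidence that the network implements the algorithm.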