7 papers across 3 sessions
Using crosscoders (an SAE variant) to identify chat-tuning-specific concepts, we show that many apparently chat-only concepts are artifacts of the L1 sparsity loss, and that a BatchTopK crosscoder instead robustly reveals genuine, interpretable chat-specific concepts.
We propose a sparse autoencoder that maps the semantics of vision and language representations into a unified concept set.
We propose a data selection method that leverages sparse, monosemantic neuronal activations learned via a sparse autoencoder to improve task-specific instruction tuning for large language models.
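The sparse-autoencoder machinery recurring in these summaries can be illustrated concretely. Below is a minimal sketch (all shapes and variable names are hypothetical, not from any of the papers) of a BatchTopK SAE forward pass: rather than keeping the top-k latents per sample, it keeps the top k × batch_size activations across the whole batch, so sparsity capacity can shift between easy and hard samples.

```python
import numpy as np

# Minimal BatchTopK sparse-autoencoder forward pass.
# Hypothetical sizes, random weights; a sketch, not a trained model.
rng = np.random.default_rng(0)
d_model, d_sae, batch, k = 16, 64, 8, 4

W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=(batch, d_model))

# Encoder pre-activations, then ReLU.
pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)

# BatchTopK: zero everything below the (k * batch)-th largest activation
# across the flattened batch, instead of a per-sample top-k.
n_keep = k * batch
threshold = np.partition(pre.ravel(), -n_keep)[-n_keep]
z = np.where(pre >= threshold, pre, 0.0)

x_hat = z @ W_dec + b_dec
mse = np.mean((x - x_hat) ** 2)

# On average k latents fire per sample, but individual samples
# may use more or fewer than k.
print((z != 0).sum(), (z != 0).sum(axis=1))
```

The per-sample nonzero counts printed at the end vary around k, which is the point of the batch-level threshold: the fixed sparsity budget is shared across the batch rather than imposed uniformly on every sample.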