Full Professor, Northeastern University
3 papers at NeurIPS 2025
We propose a novel feature attribution method that disentangles attributions based on a feature's value and its position within a sequence.
We propose, analyze, and validate a method for guiding LLM behavior at inference time by applying steering vectors to query and value representations.
We propose H-SPLID, which learns salient features via latent space decomposition, and we provide theoretical guarantees with respect to input perturbations.