Full Professor, Stanford University
3 papers at NeurIPS 2025
We analyze the effective depth of LLMs and find that they are unlikely to compose subresults; deeper models merely spread the same type of computation as shallower ones over more layers.
We develop tests that can prove a given text was produced with a particular language model by correlating the text with the order of the examples used to train that model.
We propose Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly performs concept steering and suppression.