Researcher, Amazon
3 papers at NeurIPS 2025
Soft Thinking enables large language models to reason more accurately and efficiently by using probability-weighted concept tokens in a continuous space, rather than committing to discrete tokens at each step.
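The core mechanism can be illustrated with a minimal sketch: instead of committing to the argmax token, the model feeds back a probability-weighted mixture of token embeddings. Everything below (vocabulary size, embedding values, logits) is a toy illustration, not the paper's actual model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy embedding table (vocab_size=3, hidden_dim=2); values are illustrative.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 1.0, 0.0]
probs = softmax(logits)

# Discrete decoding commits to a single token's embedding at each step.
hard_input = embeddings[max(range(len(probs)), key=probs.__getitem__)]

# Soft Thinking instead feeds back the probability-weighted mixture of
# all token embeddings -- a "concept token" in continuous space.
soft_input = [sum(p * e[d] for p, e in zip(probs, embeddings))
              for d in range(len(embeddings[0]))]
```

The soft input preserves information from every plausible next token in proportion to its probability, rather than discarding all but the top choice.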
We propose Think-RM, a training framework for generative reward models that enables long-horizon reasoning, and introduce a pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards.
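Pairwise preference rewards are commonly modeled Bradley–Terry style, where the probability of preferring one response over another is a sigmoid of their score difference. The sketch below is an illustrative assumption about such a reward signal, not Think-RM's exact objective.

```python
import math

def pairwise_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry model: P(a preferred over b) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Equal scores give no preference; a higher score is preferred.
tie = pairwise_preference_prob(1.0, 1.0)      # 0.5
a_better = pairwise_preference_prob(2.0, 0.0) # > 0.5
```

A pairwise pipeline can use this preference probability directly as a policy-optimization signal, rather than first distilling preferences into a pointwise scalar reward.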