Researcher, Anthropic
2 papers at NeurIPS 2025
SWE-smith auto-generates large-scale software engineering tasks by breaking tests in Python repos, enabling a new open-source model that achieves state-of-the-art results on SWE-bench.
We conduct a user study to evaluate how well language models help humans internalize the models' reasoning, revealing that strong model performance alone doesn't guarantee effective reasoning transfer.