2 papers across 2 sessions
GSO: SWE Agents Struggle at Reasoning and Engineering for Software Optimization
We automatically collect software engineering tasks from GitHub at scale, build a decontaminated SWE agent benchmark from them, and discover contamination in some well-known LLMs.