1 paper across 1 session
We automatically collect software engineering tasks from GitHub at scale, build a decontaminated benchmark for SWE agents from them, and find evidence of contamination in some well-known LLMs.