2 papers across 2 sessions
We automatically collect software engineering tasks from GitHub at scale, build a decontaminated SWE agent benchmark from them, and discover contamination in some well-known LLMs.
We present robust sampling strategies for active statistical inference, a new method for inference with AI-assisted data collection.