2 papers across 2 sessions
We propose a benchmark for evaluating the instruction-following ability of large language models in agentic scenarios.