2 papers across 1 session
Ultra-realistic benchmark environments and evaluation framework for web agents
We introduce VivaBench, an extensible benchmark that simulates multi-turn medical conversations. We demonstrate that LLM agents are clinically knowledgeable but limited in their ability to gather information and reach a diagnosis from incomplete presentations.