3 papers across 2 sessions
We introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal vision-language models (VLMs).
We introduce VivaBench, an extendable benchmark that simulates multi-turn medical conversations. We demonstrate that LLM agents are clinically knowledgeable but limited in their ability to gather information and to diagnose from incomplete presentations.