2 papers across 2 sessions
Contextualizing MLLM-based agents with grounded scene graphs boosts their performance.
We present Open CaptchaWorld, a benchmark that tests multimodal LLM agents on solving real-world CAPTCHAs via multi-step reasoning and interaction, revealing a large gap between the performance of current models and that of humans.