Poster Session 5 · Friday, December 5, 2025 11:00 AM → 2:00 PM
#1602

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

NeurIPS OpenReview

Abstract

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video.
Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner.
We demonstrate this generality by evaluating the agent not only on a pure vision-based real-world benchmark (OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve a 7.27% accuracy gain over Claude-Computer-Use on OSWorld.
Code and evaluation scripts are included in the supplementary material and will be released as open source.
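
To make the idea of a modular architecture that routes decoupled steps to different agents more concrete, the following is a minimal illustrative sketch, not the authors' implementation. All names here (`Step`, `tool_agent`, `vision_agent`, `AGENTS`, `run`) are hypothetical, and the two agents are stand-in functions rather than real models. It only shows how a step-by-step loop might dispatch each step to either a tool-based or a vision-based handler.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One decoupled sub-task. The 'kind' labels are assumptions for illustration."""
    kind: str      # "tool" or "vision" (hypothetical labels)
    payload: str   # instruction handed to the selected agent


def tool_agent(step: Step) -> str:
    # Stand-in for a tool-based agent (e.g., one that would run a command).
    return f"tool:{step.payload}"


def vision_agent(step: Step) -> str:
    # Stand-in for a pure vision agent (e.g., one that would act on a screenshot).
    return f"vision:{step.payload}"


# Registry mapping a step kind to the agent that handles it.
AGENTS: Dict[str, Callable[[Step], str]] = {
    "tool": tool_agent,
    "vision": vision_agent,
}


def run(steps: List[Step]) -> List[str]:
    """Process steps in order, dispatching each one to its registered agent."""
    history: List[str] = []
    for step in steps:
        handler = AGENTS.get(step.kind)
        if handler is None:
            raise ValueError(f"no agent registered for kind {step.kind!r}")
        history.append(handler(step))
    return history


if __name__ == "__main__":
    plan = [Step("tool", "list files"), Step("vision", "click the OK button")]
    for result in run(plan):
        print(result)
```

Under this sketch, swapping in a different model for one kind of step only requires changing the corresponding registry entry, which is one plausible reading of the modularity the abstract describes.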