We have built a highly modular, multimodal general-purpose agent that can interact with a computer via text, images, audio, and video.