1 paper across 1 session
We enable vision-language models to efficiently encode and utilize multimodal multilingual knowledge through a lightweight, continuous memory mechanism.