Multi-Modal Interactive Agent Layer for Few-Shot Universal Cross-Domain Retrieval and Beyond

Few-Shot Universal Cross-Domain Retrieval ⋅ Vision-Language Models ⋅ Cross-Domain Interaction ⋅ Agent Layer

NeurIPS ⋅ Project Page ⋅ Poster ⋅ OpenReview

Abstract

This paper is the first to address the challenge of few-shot universal cross-domain retrieval (FS-UCDR), in which models trained with limited data must generalize to novel retrieval scenarios where queries come from entirely unknown domains and categories. To this end, we formally define the FS-UCDR task and propose the Multi-Modal Interactive Agent Layer (MAIL), which enhances cross-modal interaction in vision-language models (VLMs) by aligning the parameter updates of target layer pairs across modalities.

Specifically, MAIL freezes the selected target layer pair and introduces a trainable agent layer pair to approximate localized parameter updates. A bridge function is then introduced to couple the agent layer pair, enabling gradient communication across modalities to facilitate update alignment.
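
The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each target layer is a plain linear map, that each agent layer is an additive weight delta, and that the bridge function is a single trainable matrix `B` that mixes the text-side agent into the image-side update. The names `A_img`, `A_txt`, `B`, `effective_deltas`, and `forward` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hypothetical feature width shared by both modalities

# Frozen target layer pair (one per modality); these are never updated.
W_img = rng.normal(size=(d, d))
W_txt = rng.normal(size=(d, d))

# Trainable agent layer pair, modelled here as additive weight deltas.
# Zero initialisation means training starts from the pretrained behaviour.
A_img = np.zeros((d, d))
A_txt = np.zeros((d, d))

# Assumed bridge function: a trainable linear map coupling the two agents.
B = rng.normal(scale=0.1, size=(d, d))


def effective_deltas(A_img, A_txt, B):
    """Return the weight updates applied on top of the frozen layer pair."""
    delta_txt = A_txt
    # Coupling: the image-side update also depends on the text-side agent,
    # so a loss on the image branch sends gradient to A_txt through B.
    delta_img = A_img + B @ A_txt
    return delta_img, delta_txt


def forward(x_img, x_txt, A_img, A_txt, B):
    """Forward pass of both branches with frozen weights plus agent updates."""
    delta_img, delta_txt = effective_deltas(A_img, A_txt, B)
    y_img = (W_img + delta_img) @ x_img
    y_txt = (W_txt + delta_txt) @ x_txt
    return y_img, y_txt


x_img = rng.normal(size=d)
x_txt = rng.normal(size=d)

# With zero agents the frozen model is reproduced exactly.
y0_img, y0_txt = forward(x_img, x_txt, A_img, A_txt, B)

# Perturbing only the text agent changes the image output as well.
A_txt_perturbed = A_txt.copy()
A_txt_perturbed[0, 0] = 1.0
y1_img, y1_txt = forward(x_img, x_txt, A_img, A_txt_perturbed, B)
```

In this sketch, changing `A_txt` alone shifts both `y1_txt` and `y1_img`, which is one simple way a bridge can let the two modalities influence each other's updates. The actual form of the bridge function in MAIL may differ.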

The proposed MAIL offers four key advantages:

(1) its cross-modal interaction mechanism improves knowledge acquisition from limited data, making it highly effective in low-data scenarios;
(2) during inference, MAIL integrates seamlessly into the VLM via reparameterization, preserving inference complexity;
(3) extensive experiments validate the superiority of MAIL, which achieves substantial performance gains over data-efficient UCDR methods while requiring significantly fewer training samples;
(4) beyond UCDR, MAIL also performs competitively on few-shot classification tasks, underscoring its strong generalization ability.
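
The reparameterization point can be checked with a small NumPy sketch. It assumes a linear target layer and treats the learned agent update as an additive weight delta, which is an assumption made here for illustration rather than a detail confirmed by the abstract. The names `W_frozen`, `delta`, `forward_train`, and `forward_merged` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 3  # hypothetical layer dimensions

# Pretrained (frozen) target layer weight.
W_frozen = rng.normal(size=(d_out, d_in))
# Stand-in for the update learned by the agent layer during training.
delta = rng.normal(scale=0.01, size=(d_out, d_in))


def forward_train(x):
    """Training-time view: frozen branch plus a separate agent branch."""
    return W_frozen @ x + delta @ x


# Reparameterization: fold the agent update into the frozen weight once.
W_merged = W_frozen + delta


def forward_merged(x):
    """Inference-time view: a single matrix product, as in the original layer."""
    return W_merged @ x


x = rng.normal(size=d_in)
same_output = np.allclose(forward_train(x), forward_merged(x))
same_shape = W_merged.shape == W_frozen.shape
```

Because the merged weight has the same shape as the original one, the layer costs the same at inference as it did before adaptation, and the two-branch and merged forms agree up to floating-point error under these assumptions.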
