Poster Session 3 · Thursday, December 4, 2025, 11:00 AM – 2:00 PM
#4810
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
Abstract
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures.
However, existing methods often face two key limitations:
- insufficient modeling of interactions within the same modality across scales (e.g., between 5x and 20x)
- inadequate alignment between the visual and textual modalities at the same scale.
To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of:
- parent–child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships
- heterogeneous intra-scale edges linking visual and textual nodes at the same scale.
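The two edge types above can be sketched as follows. This is a minimal, hypothetical illustration (function and variable names are assumptions, not taken from the paper): each coarse 5x patch is linked to the fine 20x patches it spatially contains, and patch nodes at each scale are linked to class-description text nodes at that same scale.

```python
# Hypothetical sketch of the unified graph's edge construction.
# Assumption: 20x patch coordinates are on a grid `ratio` times finer than 5x,
# so a 20x patch at (fx, fy) lies inside the 5x patch at (fx // ratio, fy // ratio).

def build_unified_graph(coarse_coords, fine_coords, num_classes, ratio=4):
    """coarse_coords / fine_coords: lists of (x, y) grid positions at 5x / 20x."""
    # hierarchical parent-child edges: (coarse_index, fine_index)
    parent_child = []
    for ci, (cx, cy) in enumerate(coarse_coords):
        for fi, (fx, fy) in enumerate(fine_coords):
            if fx // ratio == cx and fy // ratio == cy:
                parent_child.append((ci, fi))

    # heterogeneous intra-scale edges: every patch node at a scale connects
    # to every class-text node at that same scale
    coarse_vt = [(ci, t) for ci in range(len(coarse_coords)) for t in range(num_classes)]
    fine_vt = [(fi, t) for fi in range(len(fine_coords)) for t in range(num_classes)]
    return parent_child, coarse_vt, fine_vt
```

In practice such edge lists would feed a heterogeneous graph layer; the sketch only makes the two edge types concrete.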
To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch–text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results highlight the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL.
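A cross-scale textual alignment loss of the kind described above could look like the following InfoNCE-style sketch. This is an assumption about the general form, not the authors' implementation: same-class text embeddings at 5x and 20x are treated as positives, different classes as negatives.

```python
# Minimal sketch (not the paper's code) of a hierarchical contrastive loss
# aligning class-text embeddings across scales, InfoNCE-style.
import numpy as np

def hierarchical_contrastive_loss(coarse_txt, fine_txt, tau=0.07):
    """coarse_txt, fine_txt: (num_classes, dim) L2-normalized text embeddings
    at 5x and 20x. Positives are same-class pairs across scales."""
    sim = coarse_txt @ fine_txt.T / tau                        # (C, C) similarities
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                        # NLL of own class
```

With perfectly aligned (orthonormal, matching) embeddings the loss approaches zero; misaligned classes raise it, pulling each coarse text node toward its fine-scale counterpart.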