Interaction-Centric Knowledge Infusion and Transfer for Open Vocabulary Scene Graph Generation

Lin Li, Chuhan ZHANG, Dong Zhang, Chong Sun, Chen Li, Long Chen

HKUST · AI Chip Center for Emerging Smart Systems · Tencent

Open Vocabulary Scene Graph Generation · Weakly Supervised Pre-training · Interaction-Centric Knowledge Infusion · Interaction-Centric Knowledge Transfer

NeurIPS

Abstract

Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge of large-scale pre-trained models. Existing OVSGG methods typically adopt a two-stage pipeline:

1) Infusing knowledge into large-scale models via pre-training on large datasets;
2) Transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning.

However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer.

To this end, we propose ACC, an interACtion-Centric end-to-end OVSGG framework that follows an interaction-driven paradigm to minimize these mismatches.

For interaction-centric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge.
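The idea behind interaction-guided query selection can be illustrated with a minimal sketch. It is not ACC's actual implementation: the function name `select_interacting_pairs`, the per-query `interaction_scores`, and the `top_k` cutoff are all hypothetical and chosen only for illustration. The sketch assumes each object query already has a score estimating how likely it is to take part in some relation, keeps the highest-scoring queries, and forms subject-object pairs only among them.

```python
from itertools import permutations
from typing import List, Tuple


def select_interacting_pairs(
    interaction_scores: List[float],
    top_k: int,
) -> List[Tuple[int, int]]:
    """Keep the top_k object queries by interaction score and pair them.

    interaction_scores[i] is a hypothetical estimate of how likely object
    query i participates in any relation (an assumption of this sketch).
    Returns ordered (subject_index, object_index) pairs among kept queries.
    """
    # Rank query indices from most to least likely to be interacting.
    ranked = sorted(
        range(len(interaction_scores)),
        key=lambda i: interaction_scores[i],
        reverse=True,
    )
    # Keep the top_k queries; sort the kept indices for a stable output order.
    kept = sorted(ranked[:top_k])
    # Ordered pairs: (person, horse) and (horse, person) are distinct candidates.
    return list(permutations(kept, 2))


# Toy example: queries 0 and 1 are two "person" instances, query 2 is a "horse".
# Only person 1 is riding the horse, so person 0 gets a low interaction score.
scores = [0.05, 0.90, 0.80]
pairs = select_interacting_pairs(scores, top_k=2)
print(pairs)  # [(1, 2), (2, 1)]
```

In this toy case the non-interacting person (query 0) never enters a candidate pair, so it cannot be confused with the interacting instance of the same category during matching.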

Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.