We introduce faithful interaction explanations of CLIP and SigLIP models (FIxLIP), offering a unique perspective on interpreting image–text similarity predictions.