Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#4411
ShoeFit: A New Dataset and Dual-image-stream DiT Framework for Virtual Footwear Try-On
Abstract
Virtual footwear try-on (VFTON), a critical yet underexplored area in virtual try-on (VTON), aims to synthesize faithful try-on results given diverse footwear and model images while maintaining 3D consistency and texture authenticity.
Unlike conventional garment-focused VTON methods, VFTON presents unique challenges:
- Data Scarcity, arising from the difficulty of pairing product shoe images with models wearing the identical shoes;
- Viewpoint Misalignment, where the source shoe views rarely align with the target foot pose, leading to incomplete texture information and detail distortion; and
- Background-induced Color Distortion, where complex footwear materials interact with environmental lighting, causing unintended color contamination.
To address these challenges, we introduce MVShoes, a multi-view shoe try-on dataset of 7,305 well-annotated image triplets covering diverse footwear categories and challenging try-on scenarios. We further propose ShoeFit, a dual-stream DiT architecture designed to mitigate viewpoint misalignment through Multi-View Conditioning with 3D Rotary Position Embedding, and to alleviate background-induced distortion with LayeredRefAttention, which leverages background features to modulate footwear latents.
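The abstract does not specify how the 3D Rotary Position Embedding is constructed, so the following is only an illustrative sketch of a common generic recipe: the channel dimension is split into three groups, and each group is rotated by standard 1D RoPE using one spatial coordinate (x, y, or z). All function names and shapes here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis.
    x: (..., d) with d even; pos: position(s) broadcastable to x's batch dims."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)       # (half,) frequency ladder
    angles = np.asarray(pos)[..., None] * freqs     # (..., half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1_i, x2_i) pair by its angle; this preserves the norm.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, xyz):
    """Split channels into three even groups and rotate each by one axis of xyz."""
    d = x.shape[-1]
    assert d % 6 == 0, "need d divisible by 6 (three even-sized groups)"
    g = d // 3
    parts = [rope_1d(x[..., i * g:(i + 1) * g], xyz[..., i]) for i in range(3)]
    return np.concatenate(parts, axis=-1)

# Rotary embeddings are pure rotations, so the query/key norms are unchanged
# and attention logits depend only on relative 3D positions.
q = np.random.default_rng(0).normal(size=(12,))
q_rot = rope_3d(q, np.array([1.0, 2.0, 3.0]))
print(np.allclose(np.linalg.norm(q), np.linalg.norm(q_rot)))  # True
```

Because the rotation is position-dependent, dot products between rotated queries and keys encode relative viewpoint offsets, which is the usual motivation for applying RoPE to multi-view conditioning tokens.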
The proposed framework effectively decouples shoe appearance from environmental interference while preserving high-quality texture detail through separate denoising and conditioning branches. Extensive quantitative and qualitative experiments demonstrate that our method substantially improves rendering fidelity and robustness on challenging real-world product shoes, establishing a new benchmark for high-fidelity footwear try-on synthesis. The dataset and benchmark will be made publicly available upon acceptance of the paper.