1 paper across 1 session
Scaling up the variance of text-token embeddings in MM-DiTs before joint attention lets rare prompts emerge without retraining, extra data, or changes to the denoising process, improving text-to-image generation, video generation, and editing.
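
As a rough illustration of what a variance scale-up could look like, the sketch below spreads a set of text-token embeddings away from their per-dimension mean so that the variance grows by a chosen factor while the mean stays fixed. The function name `scale_up_variance`, the `factor` parameter, the choice to center on the mean across tokens, and the toy values are all assumptions made for illustration; the summary above does not specify the exact formulation, and in an actual MM-DiT this step would act on the text-token embeddings just before the joint attention block.

```python
import math
from statistics import fmean, pvariance


def scale_up_variance(tokens, factor):
    """Return new token embeddings whose per-dimension variance is `factor` times larger.

    Hypothetical helper: `tokens` is a list of equal-length float lists
    (one list per text token). The per-dimension mean is left unchanged.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    dims = len(tokens[0])
    # Mean of each embedding dimension, taken across the text tokens.
    means = [fmean(tok[d] for tok in tokens) for d in range(dims)]
    # Multiplying deviations by sqrt(factor) multiplies the variance by factor.
    gain = math.sqrt(factor)
    return [
        [means[d] + gain * (tok[d] - means[d]) for d in range(dims)]
        for tok in tokens
    ]


# Toy example: three text tokens with two embedding dimensions each.
text_tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 8.0]]
scaled = scale_up_variance(text_tokens, factor=4.0)

# Ratio of variances in the first dimension, after versus before scaling.
ratio = pvariance([t[0] for t in scaled]) / pvariance([t[0] for t in text_tokens])
print(scaled, ratio)
```

The scaled tokens would then be passed to joint attention in place of the originals. Because the sketch only rescales existing embeddings, it needs no training step, which is consistent with the training-free claim above.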