2 papers across 2 sessions
We propose ParaStep, a step-wise parallelization method for diffusion inference built on a reuse-then-predict mechanism: by exploiting the similarity of adjacent denoising steps, it reduces latency with minimal quality loss.
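To make the mechanism concrete, below is a minimal single-process sketch of the "reuse" half of a reuse-then-predict schedule. Everything here is an assumption for illustration: `toy_eps` stands in for the expensive denoiser, `parastep_like` and the update rule are hypothetical, and the real method would run the follower steps on separate devices rather than sequentially as shown.

```python
import numpy as np

def toy_eps(x, t):
    # Stand-in for an expensive diffusion denoiser eps(x, t).
    return np.tanh(x) / (t + 1)

def parastep_like(x, timesteps, group=2):
    """Toy reuse-then-predict-style schedule (hypothetical API).

    Within each group of `group` steps, only the leader runs a real
    forward pass; the remaining steps reuse its output, exploiting
    adjacent-step similarity. Those reusing steps are exactly the
    ones that could execute in parallel on other workers.
    """
    eps = None
    for i, t in enumerate(timesteps):
        if i % group == 0:       # leader step: real forward pass
            eps = toy_eps(x, t)
        x = x - 0.1 * eps        # followers reuse the leader's eps
    return x

x0 = parastep_like(np.random.randn(4), timesteps=range(50, 0, -1))
```

Larger groups skip more forward passes (more parallelism, lower latency) at the cost of reusing staler outputs, which is where the quality/latency trade-off lives.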
New Transformer architectures, FAL and FAL+, carefully reuse the output of the first attention layer: FAL cuts tensor-parallel (TP) communication, yielding up to 44% faster training, while FAL+ also improves perplexity.
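A minimal PyTorch sketch of the reuse idea, assuming a hypothetical `FALStyleBlock`: only the first block computes self-attention, and later blocks add its cached output instead, which is what lets a TP implementation skip per-layer attention all-reduces. This is an illustration of first-attention reuse, not the papers' actual FAL/FAL+ designs.

```python
import torch
import torch.nn as nn

class FALStyleBlock(nn.Module):
    """Hypothetical block that reuses a shared first-layer attention output."""

    def __init__(self, d, heads, is_first):
        super().__init__()
        self.is_first = is_first
        if is_first:
            self.ln_attn = nn.LayerNorm(d)
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, shared_attn=None):
        if self.is_first:
            h = self.ln_attn(x)
            shared_attn, _ = self.attn(h, h, h)  # the only attention pass
        x = x + shared_attn                      # later blocks reuse it
        x = x + self.mlp(self.ln_mlp(x))
        return x, shared_attn

d, heads = 64, 4
blocks = nn.ModuleList(FALStyleBlock(d, heads, is_first=(i == 0)) for i in range(4))
x, shared = torch.randn(2, 16, d), None
for blk in blocks:
    x, shared = blk(x, shared)
```

Under tensor parallelism, each attention layer normally ends in an all-reduce; computing attention once and broadcasting its output is the communication saving the FAL summary refers to.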