5 papers across 3 sessions
SageAttention3: Microscaling FP4 Attention for Plug-and-Play Inference Acceleration and An Exploration of 8-Bit Attention for Training.
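As a rough illustration of the microscaling idea, the sketch below simulates FP4 quantization of attention inputs with one shared scale per small block of values. The block size of 16 and the E2M1-style FP4 grid are assumptions for illustration, not the paper's exact recipe.

```python
import torch

FP4_GRID = torch.tensor(  # representable magnitudes of an E2M1-style FP4 format
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_microscaled(x: torch.Tensor, block: int = 16):
    """Simulated FP4 quantization: one shared scale per block of values."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    # Per-block scale maps the block's max magnitude onto the top FP4 value.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP4_GRID[-1]
    xs = x / scale
    # Round each value to the nearest representable FP4 magnitude.
    idx = (xs.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = FP4_GRID[idx] * xs.sign()
    return (q * scale).reshape(orig_shape)  # dequantized simulation

scores = torch.randn(2, 8, 64)
print((quantize_fp4_microscaled(scores) - scores).abs().mean())  # quantization error
```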
This paper presents FlashBias to speed up attention with bias, bringing a 1.5x speedup for AlphaFold and a 2x speedup for SwinV2.
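One way attention with bias becomes cheap, and plausibly the kind of structure such a method exploits, is when the bias matrix admits a low-rank factorization that can be folded into extended query/key matrices, so a standard fused attention kernel handles it for free. The sketch below (function names and the rank are illustrative, not FlashBias's actual kernel) shows the folding is exact.

```python
import torch
import torch.nn.functional as F

def attention_with_lowrank_bias(q, k, v, u, w, scale):
    """softmax(scale * q k^T + u w^T) v, with the bias folded into q/k.

    q, k: (n, d); v: (n, dv); u, w: (n, r) low-rank factors of the bias.
    """
    # Concatenating the factors reproduces the biased logits exactly:
    # [scale*q | u] @ [k | w]^T = scale * q k^T + u w^T
    q_ext = torch.cat([q * scale, u], dim=-1)
    k_ext = torch.cat([k, w], dim=-1)
    logits = q_ext @ k_ext.T
    return F.softmax(logits, dim=-1) @ v

n, d, r = 32, 16, 4
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
u, w = torch.randn(n, r), torch.randn(n, r)
out = attention_with_lowrank_bias(q, k, v, u, w, scale=d ** -0.5)

# Reference: materialize the bias explicitly and check the folding matches.
ref = F.softmax(q @ k.T * d ** -0.5 + u @ w.T, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))
```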
KV cache retrieval for large language models using a nonlinear hashing function.
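A minimal sketch of the idea: hash queries and cached keys with a small nonlinear network into binary codes, then attend only to the keys nearest in Hamming distance. The MLP hash and the top-k budget here are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class NonlinearHash(nn.Module):
    """Small MLP followed by sign thresholding -> binary hash codes."""
    def __init__(self, dim: int, bits: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, bits))

    def forward(self, x):  # -> {0, 1} codes of shape (..., bits)
        return (self.net(x) > 0).float()

def retrieve_topk(query, keys, hasher, k: int):
    """Indices of the k cached keys closest to the query in Hamming space."""
    qc, kc = hasher(query), hasher(keys)     # binary codes
    hamming = (qc != kc).sum(dim=-1)         # distance to each cached key
    return hamming.topk(k, largest=False).indices

dim, n_cache = 64, 1024
hasher = NonlinearHash(dim)
keys = torch.randn(n_cache, dim)             # stands in for cached K vectors
query = torch.randn(dim)
idx = retrieve_topk(query, keys, hasher, k=32)  # attend only to these entries
print(idx.shape)  # torch.Size([32])
```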
We propose a convolution-like linearization strategy that accelerates pre-trained diffusion transformers for ultra-high-resolution image generation.
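A convolution-like restriction of attention can be sketched as a sliding local window: each query attends only to nearby keys, much like a convolutional receptive field, so cost grows linearly with sequence length. The window size and naive loop below are illustrative, not the paper's optimized formulation.

```python
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window: int = 8):
    """q, k: (n, d); v: (n, dv). Token i attends only to tokens within
    +/- window positions, giving a conv-like local receptive field."""
    n, d = q.shape
    out = torch.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        logits = q[i] @ k[lo:hi].T * d ** -0.5
        out[i] = F.softmax(logits, dim=-1) @ v[lo:hi]
    return out

n, d = 64, 32
q, k, v = (torch.randn(n, d) for _ in range(3))
print(local_window_attention(q, k, v).shape)  # torch.Size([64, 32])
```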