Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#3512 Spotlight

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

NeurIPS Project Page Slides OpenReview

Abstract

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions. First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX 5090, a 5x speedup over the fastest FlashAttention on RTX 5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way.
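The microscaling FP4 format named in the title can be illustrated with a toy sketch: values are split into small blocks that each share one scale factor, and every element is rounded to the nearest FP4 (E2M1) representable magnitude. The block size of 16 and the mapping of the block maximum to FP4's largest value are assumptions for illustration; the paper's actual kernel runs on Blackwell Tensor Cores, not NumPy.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_fp4_quantize(x, block=16):
    """Toy microscaling FP4 quantizer (assumes block divides x.size)."""
    xb = x.reshape(-1, block)
    # One shared scale per block, mapping the block max to FP4's max (6.0).
    scale = np.abs(xb).max(axis=1, keepdims=True) / FP4_E2M1[-1]
    scale = np.where(scale == 0, 1.0, scale)
    # Round each scaled magnitude to the nearest E2M1 code.
    scaled = np.abs(xb) / scale
    idx = np.abs(scaled[..., None] - FP4_E2M1).argmin(axis=-1)
    q = np.sign(xb) * FP4_E2M1[idx]
    return q, scale

def mx_fp4_dequantize(q, scale, shape):
    """Reconstruct an approximation of the original tensor."""
    return (q * scale).reshape(shape)
```

Because the scale is shared only within a small block, a single outlier inflates the quantization step for at most `block` neighboring values, which is the core accuracy advantage of microscaling over per-tensor low-bit scaling.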
Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works such as FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks.
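The general idea of low-bit attention can be sketched as follows: quantize Q and K to INT8 with per-row scales, perform the QK^T matmul in integers, then dequantize before the softmax and PV product, which stay in full precision. This is a generic INT8 sketch for intuition only, not the paper's actual quantization scheme or kernel; the per-row scaling choice is an assumption.

```python
import numpy as np

def int8_quant(x):
    """Symmetric per-row INT8 quantization (illustrative choice)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def attention_int8(Q, K, V):
    """Attention forward with INT8 QK^T; softmax and PV in full precision."""
    qq, sq = int8_quant(Q)
    qk, sk = int8_quant(K)
    # Integer matmul (int32 accumulation), then rescale back to float.
    S = (qq.astype(np.int32) @ qk.astype(np.int32).T) * (sq * sk.T)
    S = S / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)
    return P @ V
```

Extending this to training additionally requires low-bit matmuls in the backward pass (gradients through P, Q, K, and V), which is where the paper reports lossless fine-tuning but slower pretraining convergence.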