3 papers across 3 sessions
We take an important step toward pure FP8 LLM training by enabling stable FP8 dot-product attention, setting new throughput records.
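The paper's actual kernel is not reproduced here; as a rough sketch of the idea, the snippet below simulates FP8 dot-product attention in PyTorch, assuming per-tensor scaling and a higher-precision softmax for stability. The helpers `quantize_fp8` and `fp8_sdpa` are hypothetical names, and a real implementation would feed the FP8 tensors to hardware FP8 GEMMs rather than dequantizing first.

```python
import math
import torch

F8 = torch.float8_e4m3fn  # 4-bit exponent / 3-bit mantissa FP8 format
F8_MAX = torch.finfo(F8).max

def quantize_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-tensor FP8 quantization: scale so the max maps into FP8 range."""
    scale = x.abs().amax().clamp(min=1e-12) / F8_MAX
    return (x / scale).to(F8), scale

def fp8_sdpa(q, k, v):
    """Simulated FP8 scaled dot-product attention.

    Q, K, V are quantized to FP8 and dequantized before each matmul;
    the softmax stays in higher precision for numerical stability.
    """
    (q8, sq), (k8, sk), (v8, sv) = map(quantize_fp8, (q, k, v))
    qd, kd, vd = q8.float() * sq, k8.float() * sk, v8.float() * sv
    scores = qd @ kd.transpose(-2, -1) / math.sqrt(q.shape[-1])
    probs = torch.softmax(scores, dim=-1)  # high-precision softmax
    return probs @ vd

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))
out = fp8_sdpa(q, k, v)
ref = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print((out - ref).abs().max())  # small error from FP8 rounding
```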
FALQON accelerates LoRA fine-tuning by up to 3$\times$ by merging adapters into an FP8-quantized backbone, removing the redundant quantization overhead of small matrices.
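FALQON's exact procedure differs in detail, but the core merge step can be sketched as below, assuming per-tensor FP8 scales and the standard LoRA parameterization $W + \frac{\alpha}{r} BA$. The names `merge_lora_into_fp8` and `quantize_fp8` are illustrative, not the paper's API.

```python
import torch

F8 = torch.float8_e4m3fn
F8_MAX = torch.finfo(F8).max

def quantize_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-tensor FP8 quantization (assumed scheme, not FALQON's exact one)."""
    scale = w.abs().amax().clamp(min=1e-12) / F8_MAX
    return (w / scale).to(F8), scale

def merge_lora_into_fp8(w8, w_scale, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into an FP8 backbone weight.

    Dequantize the backbone, add the low-rank update B @ A (scaled by
    alpha / rank, the usual LoRA convention), and requantize once.
    After the merge, the forward pass is a single FP8 matmul, with no
    separate small-matrix LoRA path to quantize at every step.
    """
    w = w8.float() * w_scale
    w = w + (alpha / rank) * (lora_b @ lora_a)
    return quantize_fp8(w)

d_out, d_in, r = 4096, 4096, 16
w8, s = quantize_fp8(torch.randn(d_out, d_in))
a = torch.randn(r, d_in) * 0.01   # stand-in for trained LoRA factors
b = torch.randn(d_out, r) * 0.01
w8_new, s_new = merge_lora_into_fp8(w8, s, a, b, alpha=32, rank=r)
```

The point of the merge is that quantizing the tiny $B$ and $A$ matrices separately on every step costs more than it saves; folding them into the backbone amortizes quantization to a single pass over $W$.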