MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization

Rizhen Hu, Yutong He, Ran Yan, Mou Sun, Binhang Yuan, Kun Yuan

Fault tolerance ⋅ Memory efficiency ⋅ Computation efficiency ⋅ Distributed training

NeurIPS

Abstract

As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources.

To address this challenge, we propose Memory- and Computation-efficient Fault-tolerant Optimization (MeCeFO), a novel algorithm that ensures robust training with minimal overhead. When a computing node fails, MeCeFO seamlessly transfers its training task to a neighboring node while employing memory- and computation-efficient algorithmic optimizations to minimize the extra workload imposed on the neighboring node handling both tasks.

MeCeFO leverages three key algorithmic designs:

Skip-connection, which drops the multi-head attention (MHA) module during backpropagation for memory- and computation-efficient approximation;
Recomputation, which reduces activation memory in feedforward networks (FFNs); and
Low-rank gradient approximation, enabling efficient estimation of FFN weight matrix gradients.
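
As a rough illustration of the third design, the sketch below shows one generic way to approximate the weight gradient of a linear layer at low rank. It is an assumption-laden example, not MeCeFO's actual procedure: the abstract does not specify how the projection is chosen, so the random orthonormal projection, the function names (`exact_weight_grad`, `low_rank_weight_grad`, `random_orthonormal`), and the NumPy setting are all hypothetical.

```python
import numpy as np


def exact_weight_grad(x, grad_out):
    """Exact gradient of a linear layer y = x @ W with respect to W.

    x:        (tokens, d_in)  input activations
    grad_out: (tokens, d_out) gradient of the loss with respect to y
    """
    return x.T @ grad_out


def random_orthonormal(d_in, rank, seed=0):
    """Hypothetical projection: a (d_in, rank) matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d_in, rank)))
    return q


def low_rank_weight_grad(x, grad_out, proj):
    """Approximate x.T @ grad_out by proj @ proj.T @ x.T @ grad_out.

    Only the projected activations x @ proj, of shape (tokens, rank),
    are needed, so less activation memory and fewer multiply-adds are
    required when rank is much smaller than d_in.
    """
    x_small = x @ proj            # (tokens, rank)
    core = x_small.T @ grad_out   # (rank, d_out)
    return proj @ core            # (d_in, d_out), rank at most `rank`


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((32, 16))
    g = rng.standard_normal((32, 8))
    approx = low_rank_weight_grad(x, g, random_orthonormal(16, 4))
    exact = exact_weight_grad(x, g)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative error at rank 4: {rel_err:.3f}")
```

With a full-rank projection (`rank == d_in`) the result coincides with the exact gradient; a smaller rank trades accuracy for memory and compute, which is the kind of trade-off the low-rank design targets on a node that has taken over a failed neighbor's workload.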

Theoretically, MeCeFO matches the convergence rate of conventional distributed training, with a rate of $\mathcal{O}(1/\sqrt{nT})$, where $n$ is the data parallelism size and $T$ is the number of iterations. Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18% drop in throughput, demonstrating $5.0\times$ to $6.7\times$ greater resilience than previous SOTA approaches.
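
To make the stated rate concrete, the short derivation below shows what a bound of order $1/\sqrt{nT}$ implies for the iteration count. It is a sketch under assumptions: the abstract does not say which quantity is bounded, and the constant $C$ and target accuracy $\varepsilon$ are hypothetical symbols introduced only for illustration.

```latex
% Assume the bounded quantity is at most C / sqrt(nT) for some constant C > 0.
\[
  \frac{C}{\sqrt{nT}} \le \varepsilon
  \quad\Longleftrightarrow\quad
  T \ge \frac{C^{2}}{n\,\varepsilon^{2}} .
\]
```

Under these assumptions, doubling the data parallelism size $n$ halves the number of iterations $T$ needed to reach the same $\varepsilon$, which is the linear speedup usually associated with an $\mathcal{O}(1/\sqrt{nT})$ rate.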