PseuZO: Pseudo-Zeroth-Order Algorithm for Training Deep Neural Networks

Pengyun Yue, Xuanlin Yang, Mingqing Xiao, Zhouchen Lin

Peking University · Pazhou Laboratory · Microsoft · Zhongguancun Academy

zeroth-order optimization · LLM fine-tuning

Abstract

Zeroth-order optimization (ZO) has received wide attention in machine learning, especially when computing the full gradient is expensive or even impossible. Recently, ZO has emerged as an important paradigm for memory-efficient fine-tuning of large language models (LLMs), circumventing the memory overhead of backpropagation. However, existing ZO gradient estimators exhibit dimension-dependent variance scaling as $\Theta(d)$, leading to dimension-dependent convergence rates without further assumptions on the objective function, which is prohibitive at the parameter scale of LLMs.
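To make the $\Theta(d)$ scaling concrete, the following is a minimal sketch of a standard two-point ZO gradient estimator with Gaussian perturbations, evaluated on a toy quadratic. The function names (`two_point_zo_gradient`, `quadratic`, `mean_squared_error`) and the test objective are illustrative assumptions, not part of PseuZO or its code release. For this quadratic the expected squared error of a single estimate is $(d+1)\|x\|^2$, so the printed values should grow roughly linearly in $d$.

```python
import random


def two_point_zo_gradient(f, x, mu, rng):
    """Two-point ZO estimate ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def quadratic(x):
    """Toy objective (an assumption for illustration): f(x) = 0.5 * ||x||^2, so grad f(x) = x."""
    return 0.5 * sum(xi * xi for xi in x)


def mean_squared_error(d, n_samples, seed=0, mu=1e-3):
    """Average squared distance between single-sample ZO estimates and the true gradient."""
    rng = random.Random(seed)
    x = [1.0 / d ** 0.5] * d  # unit-norm point, so the true gradient also has unit norm
    total = 0.0
    for _ in range(n_samples):
        g_hat = two_point_zo_gradient(quadratic, x, mu, rng)
        total += sum((gh - xi) ** 2 for gh, xi in zip(g_hat, x))
    return total / n_samples


# The gradient norm is fixed at 1, yet the estimator error grows with the dimension d.
for d in (4, 32, 256):
    print(d, round(mean_squared_error(d, 2000), 2))
```

The estimator is unbiased on this objective, so the growth in error comes entirely from variance, which is the quantity the dimension-dependent convergence rates inherit.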

To address this problem, we present a Pseudo-Zeroth-Order (PseuZO) framework for optimizing composite objective functions, especially large-scale models: $\min_{x \in X} F(x) = \mathbb{E}_{z}\, g \circ h(x; z)$, where $h$ represents complex, high-dimensional representations and $g$ is a task-specific loss. While existing zeroth-order methods estimate gradients from the final loss value, our PseuZO algorithm works with the model output $o = h(x)$ and the gradient of the loss with respect to that output, $e = \nabla_{o} g(o)$: it estimates the Jacobian matrix of $h(x)$ and applies an exponential moving average to the Jacobian estimators to reduce the variance. Moreover, we use a sliding window technique to reduce memory costs.
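The ingredients named above can be sketched on a toy problem. The code below is a hedged illustration under stated assumptions, not the authors' implementation: it assumes a one-sample forward-difference Jacobian estimate of the form $\hat{J} = \mu^{-1}(h(x + \mu u) - h(x))\,u^{\top}$, an exponential moving average $M \leftarrow \beta M + (1-\beta)\hat{J}$, and a chain-rule gradient surrogate $M^{\top} e$. The linear map `A`, the target `y`, the step size, and all function names are hypothetical. The dense $m \times d$ matrix `M` is only feasible at toy scale and does not represent the sliding window technique.

```python
import random


def matvec(A, x):
    """Dense matrix-vector product for list-of-lists matrices."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def jacobian_estimate(h, x, mu, rng):
    """One-sample estimate of the Jacobian of h at x: outer((h(x + mu*u) - h(x)) / mu, u)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    o = h(x)
    o_pert = h([xi + mu * ui for xi, ui in zip(x, u)])
    delta = [(op - oi) / mu for op, oi in zip(o_pert, o)]
    return [[di * uj for uj in u] for di in delta]


def ema_update(M, J, beta):
    """Exponential moving average of Jacobian estimates: beta * M + (1 - beta) * J."""
    return [[beta * m + (1.0 - beta) * j for m, j in zip(m_row, j_row)]
            for m_row, j_row in zip(M, J)]


def pseudo_gradient(M, e):
    """Chain-rule surrogate M^T e, where e is the loss gradient at the model output."""
    return [sum(M[i][j] * e[i] for i in range(len(M))) for j in range(len(M[0]))]


# Toy composite objective (an assumption): h(x) = A x and g(o) = 0.5 * ||o - y||^2.
A = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
y = [1.0, -1.0]


def h(x):
    return matvec(A, x)


def loss(x):
    return 0.5 * sum((oi - yi) ** 2 for oi, yi in zip(h(x), y))


rng = random.Random(0)
x = [0.0, 0.0, 0.0]
M = [[0.0] * 3 for _ in range(2)]
print("initial loss:", loss(x))
for _ in range(300):
    M = ema_update(M, jacobian_estimate(h, x, 1e-3, rng), beta=0.9)
    e = [oi - yi for oi, yi in zip(h(x), y)]  # e = grad_o g(o), computed exactly
    x = [xi - 0.02 * gi for xi, gi in zip(x, pseudo_gradient(M, e))]
print("final loss:", loss(x))
```

In this sketch only the Jacobian of $h$ is estimated from function values, while $e$ is computed exactly from the output, which is the sense in which the loss is never queried as a black box. Averaging the Jacobian estimates over steps is what lets the moving average damp the single-sample noise.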

Our algorithm achieves an $O(\max\{\alpha_{1} L \epsilon^{-2}, \alpha_{1} L \sigma_{2}^{2} \epsilon^{-4}\})$ convergence rate, where $\alpha_{1}$ is the effective dimension of $F$. Experimental results demonstrate that PseuZO outperforms MeZO and MeZO-SVRG on classification, multiple-choice, and generation tasks in both full-parameter and PEFT fine-tuning settings by accelerating convergence in the early stages of training. For instance, under the same computation time on the SST2 task, PseuZO gets 9.8% higher accuracy than MeZO (91.2% vs. 82.4%). With the sliding window technique, PseuZO achieves a 70% to 80% memory reduction compared to FO-SGD across different model sizes, since PseuZO introduces only a small dimension-independent memory overhead, which enables efficient scaling of the model size. The code is available at https://github.com/YangBigMn/PseuZO.