Poster Session 3 · Thursday, December 4, 2025 11:00 AM → 2:00 PM
#1004

PseuZO: Pseudo-Zeroth-Order Algorithm for Training Deep Neural Networks

NeurIPS OpenReview Code

Abstract

Zeroth-order Optimization (ZO) has received wide attention in machine learning, especially when computing the full gradient is expensive or even impossible. Recently, ZO has emerged as an important paradigm for memory-efficient fine-tuning of large language models (LLMs), circumventing the memory overhead of backpropagation. However, existing ZO gradient estimators have a variance that grows with the parameter dimension, which leads to dimension-dependent convergence rates unless further assumptions are placed on the objective function. This is prohibitive at the parameter scale of LLMs.
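The dimension dependence of the variance can be observed in a small experiment. The sketch below is not taken from the paper: it assumes a standard two-point estimator with Gaussian perturbations and a linear toy loss whose true gradient has unit norm. For that setting the expected squared error of a single estimate is d + 1, where d is the number of parameters. The names `zo_gradient` and `estimator_error` are illustrative.

```python
import numpy as np


def zo_gradient(loss, x, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate with a Gaussian direction."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    # Central difference of the scalar loss along the random direction u.
    slope = (loss(x + mu * u) - loss(x - mu * u)) / (2.0 * mu)
    return slope * u


def estimator_error(d, trials=2000, seed=0):
    """Mean squared error of the estimate on a linear loss in d dimensions."""
    rng = np.random.default_rng(seed)
    a = np.ones(d) / np.sqrt(d)  # true gradient, unit norm (toy assumption)

    def loss(x):
        return float(a @ x)

    x = np.zeros(d)
    errors = [np.sum((zo_gradient(loss, x, rng=rng) - a) ** 2)
              for _ in range(trials)]
    return float(np.mean(errors))


if __name__ == "__main__":
    for d in (10, 100, 1000):
        print(d, estimator_error(d))
```

The true gradient has the same norm in every run, so any growth in the printed error comes from the estimator alone.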
To address this problem, we present a Pseudo-Zeroth-Order (PseuZO) framework for optimizing composite objective functions of the form $F(x) = g(h(x))$, as arise in large-scale models, where $h$ represents complex, high-dimensional representations and $g$ is a task-specific loss. While existing zeroth-order methods estimate gradients from the final loss value, our PseuZO algorithm estimates the Jacobian matrix of $h$ from the model output $h(x)$, combines it with the gradient of the loss function with respect to the model output, and applies an exponential moving average to the Jacobian estimators to reduce the variance. Moreover, we use a sliding window technique to reduce memory costs.
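The idea of estimating a Jacobian from output perturbations and smoothing it over time can be sketched on a toy problem. This is a sketch under stated assumptions, not the authors' implementation (see the linked repository for that). It writes the model as `h` and the loss gradient as `grad_g`, uses a Gaussian two-point Jacobian estimate, and stores the averaged Jacobian as a dense matrix, which is only feasible for small problems. The sliding window technique from the abstract is not reproduced. All function names, the linear toy model, and the hyperparameters are hypothetical.

```python
import numpy as np


def jacobian_estimate(h, x, mu, rng):
    """One-sample estimate of the Jacobian of h at x, shape (m, d)."""
    u = rng.standard_normal(x.shape)
    # Central difference of the model OUTPUT along u; approximates J(x) @ u.
    delta_o = (h(x + mu * u) - h(x - mu * u)) / (2.0 * mu)
    # E[(J u) u^T] = J for u ~ N(0, I), so the outer product estimates J.
    return np.outer(delta_o, u)


def pseudo_zo_descent(h, grad_g, x0, steps=1000, lr=0.05, beta=0.9,
                      mu=1e-3, seed=0):
    """Descent using an exponential moving average of Jacobian estimates."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    jac_ema = None
    for _ in range(steps):
        jac_hat = jacobian_estimate(h, x, mu, rng)
        if jac_ema is None:
            jac_ema = jac_hat
        else:
            jac_ema = beta * jac_ema + (1.0 - beta) * jac_hat
        e = grad_g(h(x))              # loss gradient w.r.t. the model output
        x = x - lr * (jac_ema.T @ e)  # chain rule: grad F = J^T e
    return x


def demo(d=50, m=2, seed=0):
    """Toy problem (assumption): h(x) = A x, g(o) = 0.5 * ||o - y||^2."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, m)))
    A = q.T                           # (m, d) with orthonormal rows
    y = np.linspace(1.0, 2.0, m)

    def h(x):
        return A @ x

    def grad_g(o):
        return o - y

    def loss(x):
        return 0.5 * float(np.sum((h(x) - y) ** 2))

    x0 = np.zeros(d)
    x_final = pseudo_zo_descent(h, grad_g, x0, seed=seed + 1)
    return loss(x0), loss(x_final)


if __name__ == "__main__":
    print(demo())
```

In this sketch the loss gradient with respect to the output is computed exactly, because the output has only `m` entries. Only the Jacobian of the high-dimensional map `h` is estimated from function evaluations.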
Our algorithm achieves a convergence rate governed by an effective dimension of the problem. Experimental results demonstrate that PseuZO outperforms MeZO and MeZO-SVRG on classification, multiple-choice, and generation tasks in both full-parameter and PEFT fine-tuning settings by boosting convergence in the early stages of training. For instance, under the same computation time on the SST2 task, PseuZO gets 9.8% higher accuracy than MeZO (91.2% vs. 82.4%). With the sliding window technique, PseuZO reduces memory usage compared to FO-SGD across model sizes, since it introduces only a small, dimension-independent memory overhead, which enables efficient scaling of the model size. The code is available at https://github.com/YangBigMn/PseuZO.