Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu

Peking University· AI for Science Institute, Beijing

Scaling Laws Learning Rate Schedule Kernel Regression LLM Pre-Training Loss Dynamics

Abstract

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model.

The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional.

We further instantiate the theory for three representative LRSs---constant, exponential decay, and warmup–stable–decay (WSD)---and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena:

higher-capacity models are more data- and compute-efficient;
learning-rate decay improves training efficiency; and
WSD-type schedules outperform pure decay.

Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.