logo
today local_bar
Poster Session 6 · Friday, December 5, 2025 4:30 PM → 7:30 PM
#5108 Spotlight

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

NeurIPS OpenReview

Abstract

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model.
The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional.
We further instantiate the theory for three representative LRSs---constant, exponential decay, and warmup–stable–decay (WSD)---and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena:
  1. higher-capacity models are more data- and compute-efficient;
  2. learning-rate decay improves training efficiency; and
  3. WSD-type schedules outperform pure decay.
Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.