1 paper across 1 session
We prove that stochastic momentum can improve the scaling-law exponents over SGD on power-law random features when its hyperparameters are chosen to depend appropriately on the data dimension or model size.