5 papers across 3 sessions
MLP loss landscapes contain "channels to infinity," in which pairs of neurons evolve to form gated linear units with diverging output weights; these channels look like flat minima but in fact have slowly decreasing loss.
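A rough illustration of the mechanism (my notation; the paper's exact construction may differ): take two neurons with nearly identical input weights $w \pm \delta/2$ and opposite output weights $\pm a$. A first-order Taylor expansion gives

$$a\,\sigma\big((w + \tfrac{\delta}{2})^\top x\big) - a\,\sigma\big((w - \tfrac{\delta}{2})^\top x\big) = a\,\delta^\top x\,\sigma'(w^\top x) + O\big(a\,\|\delta\|^2\big),$$

so as $\|\delta\| \to 0$ and $a \to \infty$ with $a\delta \to u$ held fixed, the pair converges to the gated linear unit $(u^\top x)\,\sigma'(w^\top x)$ while the output weights diverge.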
We prove that the gradient descent map is non-singular for any neural network with piecewise analytic activation functions.
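For reference, the gradient descent map in question (standard setup; the paper states the precise hypotheses) is

$$G_\eta(\theta) = \theta - \eta\,\nabla L(\theta), \qquad DG_\eta(\theta) = I - \eta\,\nabla^2 L(\theta) \ \text{(wherever $L$ is twice differentiable)},$$

and non-singularity roughly says that $G_\eta$ does not collapse positive-measure sets of parameters onto lower-dimensional sets, i.e. its Jacobian determinant is nonzero almost everywhere; properties of this kind underlie arguments that gradient descent avoids strict saddles from almost every initialization.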
We prove that the interaction of parameter symmetry and equivariance constraints can create critical points and minima in the loss landscape.
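One classical mechanism of this flavor, sketched here as background rather than as the paper's argument: if a group $G$ acts orthogonally on parameters and the loss is invariant, $L(g\theta) = L(\theta)$ for all $g$, then differentiating shows $\nabla L(\theta) \in \mathrm{Fix}(G)$ whenever $\theta \in \mathrm{Fix}(G)$, so

$$\nabla\big(L|_{\mathrm{Fix}(G)}\big)(\theta) = 0 \;\Longrightarrow\; \nabla L(\theta) = 0,$$

i.e. critical points of the symmetry-constrained problem are automatically critical points of the full loss, whether or not they are minima of the unconstrained problem.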
We show that Schedule-Free methods effectively navigate the river structure of the loss landscape, enabling scalable language model training without decay schedules or extra memory.
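For context, here is a minimal sketch of a Schedule-Free SGD step in the form popularized by Defazio et al.; warmup, weight decay, and Adam-style variants are omitted, and the variable names are mine, not the paper's.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=1000):
    """Sketch of Schedule-Free SGD: constant learning rate, no decay schedule.

    z : base SGD iterate
    x : running (Polyak-style) average of z -- the point you evaluate/deploy
    y : interpolation of z and x where the gradient is taken
    """
    z = np.array(x0, dtype=float)      # base iterate
    x = z.copy()                       # averaged iterate
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x  # gradient location: mostly the average when beta ~ 0.9
        z = z - lr * grad_fn(y)        # plain SGD step with a constant learning rate
        c = 1.0 / (t + 1)              # uniform averaging weight
        x = (1 - c) * x + c * z        # fold the new z into the running average
    return x

# Example on a quadratic bowl f(w) = 0.5 * ||w||^2, whose gradient is w:
# w_star = schedule_free_sgd(lambda w: w, np.ones(3))   # -> close to the origin
```

The method keeps only the two buffers z and x, so, as the summary above notes, there is no decay schedule to tune and no memory beyond what momentum SGD already stores.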