4 papers across 3 sessions
We analyse the convergence of one-hidden-layer ReLU networks trained by gradient flow on n data points when the input dimension is very large.
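To make the setting concrete, the sketch below follows gradient flow for a one-hidden-layer ReLU network via a small-step Euler discretisation. It is an illustrative toy, not the paper's construction: the sizes n, d, m, the Gaussian data, and the 1/sqrt(d) scalings are all assumptions.

```python
import jax
import jax.numpy as jnp

def f(params, X):
    W, a = params
    return jax.nn.relu(X @ W.T) @ a   # one hidden ReLU layer, linear output

def loss(params, X, y):
    return 0.5 * jnp.mean((f(params, X) - y) ** 2)

key = jax.random.PRNGKey(0)
n, d, m = 20, 1000, 50                # n points in dimension d >> n, m hidden units (assumed)
kx, ky, kw, ka = jax.random.split(key, 4)
X = jax.random.normal(kx, (n, d)) / jnp.sqrt(d)
y = jax.random.normal(ky, (n,))
params = (jax.random.normal(kw, (m, d)),              # hidden-layer weights W
          jax.random.normal(ka, (m,)) / jnp.sqrt(m))  # output weights a

eta = 1e-2                            # small step: Euler discretisation of the flow
grad_loss = jax.jit(jax.grad(loss))
for _ in range(2000):
    g = grad_loss(params, X, y)
    params = tuple(p - eta * gp for p, gp in zip(params, g))
print(loss(params, X, y))
```

For small enough eta the iterates track the gradient-flow trajectory, which is why gradient flow is the standard idealisation of gradient descent in this kind of analysis.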
We prove divergence results for the gradient flow of deep neural networks with analytic activations and polynomial target functions.
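As an assumption-laden sketch of this setting (not the paper's setup), one can run the discretised gradient flow of a deep tanh network, tanh being analytic, fitted to a polynomial target, and monitor the loss along the trajectory; a finite toy run can suggest, but never certify, the divergence behaviour proved in the paper. The widths, depth, and the particular polynomial are hypothetical.

```python
import jax
import jax.numpy as jnp

def deep_net(weights, X):
    h = X
    for W in weights[:-1]:
        h = jnp.tanh(h @ W)           # tanh is an analytic activation
    return (h @ weights[-1])[:, 0]

def loss(weights, X, y):
    return 0.5 * jnp.mean((deep_net(weights, X) - y) ** 2)

key = jax.random.PRNGKey(1)
d, widths = 5, (5, 32, 32, 1)         # assumed depth and widths
kX, *kws = jax.random.split(key, len(widths))
X = jax.random.normal(kX, (100, d))
y = X[:, 0] ** 3 - 2.0 * X[:, 1] * X[:, 2]   # a polynomial target (assumed)

weights = [jax.random.normal(k, (a, b)) / jnp.sqrt(a)
           for k, a, b in zip(kws, widths[:-1], widths[1:])]

eta = 1e-3
grad_loss = jax.jit(jax.grad(loss))
for t in range(5000):
    gs = grad_loss(weights, X, y)
    weights = [W - eta * g for W, g in zip(weights, gs)]
    # monitor loss(weights, X, y) and parameter norms along the trajectory
```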
We characterize the structure of the embeddings learned by gradient descent, showing that the attention mechanism provably selects important tokens.
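The sketch below renders the mechanism in question: single-head scaled dot-product attention produces, for each token, a probability distribution over the tokens, and the average mass a token receives is one crude readout of its "importance". All dimensions and weight matrices here are hypothetical placeholders, not the trained model the paper analyses.

```python
import jax
import jax.numpy as jnp

def attention_weights(E, Wq, Wk):
    # scaled dot-product scores; row i is the distribution token i attends with
    scores = (E @ Wq) @ (E @ Wk).T / jnp.sqrt(Wq.shape[1])
    return jax.nn.softmax(scores, axis=-1)

key = jax.random.PRNGKey(2)
T, d, dk = 8, 16, 16                  # T tokens with d-dimensional embeddings (assumed)
kE, kq, kk = jax.random.split(key, 3)
E = jax.random.normal(kE, (T, d))     # token embeddings
Wq = jax.random.normal(kq, (d, dk)) / jnp.sqrt(d)
Wk = jax.random.normal(kk, (d, dk)) / jnp.sqrt(d)

A = attention_weights(E, Wq, Wk)      # (T, T); each row sums to 1
importance = A.mean(axis=0)           # average attention mass received per token
print(jnp.argsort(-importance))       # tokens ranked by how strongly they are attended to
```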