3 papers across 3 sessions
ACCO is a new and principled optimization technique with provable guarantees for Sharded Distributed LLM Training
We prove that a softmax self-attention layer trained via GD can solve the so-called single-locator regression problem