4 papers across 2 sessions
We analyze the flow of tokens across attention layers and use these insights to enhance the performance of Transformers.
An efficient technique for enforcing hard constraints on the gradients of a DNN by editing its parameters.
Sinusoidal initialization replaces random weight seeding with a deterministic, structured scheme that balances weights and neuron activations from the outset, yielding faster, more stable training and higher accuracy across diverse models.
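The summary does not spell out the exact formula, but as a rough illustration of what a deterministic, structured scheme could look like, the sketch below fills each neuron's incoming weights with a sine wave whose frequency depends on the neuron index, scaled to keep activation variance balanced. The frequencies, phases, and 1/sqrt(fan_in) scale here are assumptions for illustration, not the paper's published recipe.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_init_(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative sinusoidal initializer (hypothetical variant).

    Each output neuron i gets a sine wave over its input positions with a
    neuron-dependent frequency, scaled by 1/sqrt(fan_in) so activations
    start out roughly variance-balanced. The paper's exact scheme may differ.
    """
    out_features, in_features = weight.shape
    scale = 1.0 / math.sqrt(in_features)  # variance-preserving scale (assumption)
    pos = torch.arange(in_features, dtype=torch.float32)  # input positions 0..in-1
    # One frequency per output neuron, spread across rows (assumption)
    freq = 2 * math.pi * (torch.arange(out_features, dtype=torch.float32) + 1) / in_features
    with torch.no_grad():
        # Row i of the weight matrix becomes scale * sin(freq_i * pos)
        weight.copy_(scale * torch.sin(freq.unsqueeze(1) * pos.unsqueeze(0)))
    return weight

# Usage: replace the default random init of a linear layer.
layer = nn.Linear(256, 128)
sinusoidal_init_(layer.weight)
```

One appeal of such a scheme is that distinct frequencies give each neuron a structured, mutually diverse weight pattern without any randomness, so training starts from the same well-conditioned point on every run.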