2 papers across 2 sessions
A state-of-the-art sequence parallelism method for linear attention, built on a new collective communication operation.
Efficient Long Context Fine-tuning through Dynamic Data Scheduling