2 papers across 2 sessions
A sparse attention mechanism balances efficiency, long-range random-access flexibility, and length-generalization ability.
We propose joint recall, a novel synthetic task, and a hybrid sparse attention mechanism with context-dependent sparsity for better sub-quadratic long-context modeling.
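As a rough illustration of what "context-dependent sparsity" can mean in practice, the PyTorch sketch below lets each query attend only to its top-k highest-scoring keys, so the sparsity pattern depends on content rather than on fixed positions. This is a generic technique, not either paper's actual mechanism; `topk_sparse_attention` and `k_top` are made-up names, and a real sub-quadratic kernel would avoid materializing the full score matrix, which this toy version does only for clarity.

```python
# Minimal sketch of content-dependent sparse attention (illustrative only):
# each query keeps its k_top best-matching keys and masks out the rest.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_top=16):
    """q, k, v: (batch, seq_len, dim). Causal attention where every query
    attends to at most k_top keys, chosen by score (context-dependent)."""
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d**0.5           # (b, n, n) full scores
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))  # preserve causality
    k_top = min(k_top, n)
    kth = scores.topk(k_top, dim=-1).values[..., -1:]   # per-query k-th score
    sparse = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(sparse, dim=-1) @ v                # (b, n, d)

q = k = v = torch.randn(2, 64, 32)
out = topk_sparse_attention(q, k, v, k_top=8)
print(out.shape)  # torch.Size([2, 64, 32])
```

Because the selected keys can lie anywhere in the prefix, this kind of selection retains long-range random access, unlike a fixed sliding-window pattern; the cost of the toy version is still quadratic, which is exactly the gap efficient sparse-attention designs aim to close.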