Researcher, NVIDIA
3 papers at NeurIPS 2025
We present JetLM, a new family of language models that matches leading full-attention models while significantly improving generation throughput.
We present the first systematic study of lossy latency–quality trade-offs in LLM agents, introduce the HFTBench and StreetFighter benchmarks, and propose an adaptive mixed-precision framework for real-world latency-sensitive tasks.
We propose a method to speed up video diffusion generation through efficient attention.