Research Scientist, Google
3 papers at NeurIPS 2025
We propose an inference strategy for long-video QA that substantially improves the accuracy of a VLM by curating its input context.
We uncover the emergent open-vocabulary semantic segmentation capability of diffusion transformers and show that amplifying this property enhances both segmentation and image generation.
We make neural network training cheaper and more accurate by progressively dropping parts of the data after each epoch.