5 papers across 3 sessions
Our empirically and theoretically grounded method, which treats diversity as a reward, achieves new state-of-the-art average performance across 7 benchmarks on state-of-the-art LLMs with domain-undetermined data.
We propose a diversity-aware policy optimization method for LLM reasoning that introduces a token-level diversity objective focused on positive samples, achieving larger performance gains on mathematical benchmarks while generating more diverse solutions.
Learning proof system dynamics, pruning proof search based on diversity and expected outcome
We introduce an RL algorithm leveraging reparameterization and distance-based diversity regularization to train intractable multimodal policies for diversity-critical tasks.
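The distance-based diversity regularization mentioned above can be illustrated with a minimal sketch: a bonus proportional to the mean pairwise distance among a batch of sampled actions, added to the task reward. This is an assumed, simplified form for illustration only (the function name, the Euclidean distance choice, and the `weight` parameter are all hypothetical, not taken from the paper):

```python
import math

def pairwise_diversity_bonus(actions, weight=0.1):
    """Hypothetical distance-based diversity regularizer: returns the mean
    pairwise Euclidean distance among sampled actions, scaled by `weight`.
    In an RL loop this bonus would be added to the task reward to encourage
    the policy to spread its samples apart."""
    n = len(actions)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance between two sampled action vectors
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(actions[i], actions[j])))
            total += dist
            pairs += 1
    return weight * total / pairs

# Identical samples earn no bonus; spread-out samples do.
print(pairwise_diversity_bonus([(0.0, 0.0), (0.0, 0.0)]))  # 0.0
print(pairwise_diversity_bonus([(0.0, 0.0), (3.0, 4.0)]))  # 0.5 (weight * 5.0)
```

In practice such a bonus is combined with reparameterized sampling so the diversity term stays differentiable with respect to the policy parameters.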
Generating diverse plausible images with GFlowNets, using diverse conditional latent representations in conditional image generation