Researcher, Tencent AI Lab
3 papers at NeurIPS 2025
We provide an O(\epsilon^{-4}) iteration-complexity policy optimization algorithm for robust constrained Markov decision processes (CMDPs).
We address distributional shift across diverse human preferences with two robust DPO variants: Wasserstein DPO (WDPO) and Kullback–Leibler DPO (KLDPO). We establish finite-sample guarantees, derive tractable gradient-based algorithms for the otherwise hard DRO objectives, and demonstrate strong empirical robustness.
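As an illustrative sketch only (not the paper's implementation): a KL-constrained DRO objective over preference data admits an exponential-tilting reweighting of the per-example DPO loss, so higher-loss examples are upweighted toward the worst case. The function names, the temperature `tau`, and the toy log-probabilities below are all assumptions for illustration.

```python
import numpy as np

def dpo_losses(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Per-example DPO loss: -log sigmoid(beta * (policy log-ratio margin
    between the preferred and dispreferred responses))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(m) = log(1 + exp(-m)), computed stably.
    return np.logaddexp(0.0, -margin)

def kl_dro_weights(losses, tau):
    """KL-ball worst-case reweighting (assumed form): weights proportional
    to exp(loss / tau), normalized. Larger losses receive larger weight;
    tau -> infinity recovers the uniform (non-robust) average."""
    w = np.exp((losses - losses.max()) / tau)  # shift by max for stability
    return w / w.sum()

# Toy example with made-up log-probabilities for two preference pairs.
losses = dpo_losses(
    beta=0.1,
    logp_w=np.array([-1.0, -2.0]), logp_l=np.array([-1.5, -1.2]),
    ref_logp_w=np.array([-1.2, -1.8]), ref_logp_l=np.array([-1.4, -1.5]),
)
robust_loss = float(kl_dro_weights(losses, tau=0.5) @ losses)
```

The tilted average is never below the uniform average, reflecting the pessimism of the DRO objective; the actual WDPO/KLDPO algorithms in the paper may differ.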