4 papers across 2 sessions
We propose Rectified Policy Optimization (RePO), which mitigates "safety compensation" by replacing the average safety metric with stricter safety constraints.
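A minimal sketch of the rectification idea (not the paper's implementation): under an average safety metric, overly safe samples can offset unsafe ones, whereas rectifying the shortfall per sample penalizes every violation. The safety scores and the threshold `tau` below are hypothetical placeholders.

```python
import numpy as np

def average_safety_penalty(safety_scores: np.ndarray, tau: float) -> float:
    # Penalize only the mean shortfall: very safe samples can offset
    # unsafe ones ("safety compensation").
    return max(0.0, tau - float(safety_scores.mean()))

def rectified_safety_penalty(safety_scores: np.ndarray, tau: float) -> float:
    # Rectify the shortfall per sample before averaging, so surplus safety
    # on one sample cannot compensate for a violation on another.
    return float(np.maximum(0.0, tau - safety_scores).mean())

scores = np.array([0.9, 0.9, 0.1])  # one clearly unsafe response
tau = 0.5
print(average_safety_penalty(scores, tau))    # 0.0   -> violation hidden
print(rectified_safety_penalty(scores, tau))  # ~0.13 -> violation penalized
```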
Creating safe, reward-maximizing policies from offline data by casting the problem as a min-max optimization and solving it with no-regret algorithms.
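A minimal sketch of the min-max recipe, assuming a toy setting with per-action reward and cost estimates standing in for the offline data (not the paper's algorithm): the primal player best-responds to the Lagrangian, and the dual player updates the multiplier with projected online gradient ascent, a standard no-regret algorithm. All numbers are illustrative.

```python
import numpy as np

rewards = np.array([1.0, 0.6, 0.2])  # estimated reward per action
costs   = np.array([0.9, 0.4, 0.1])  # estimated safety cost per action
budget  = 0.5                        # constraint: expected cost <= budget

lam, eta, T = 0.0, 0.5, 200
avg_policy = np.zeros_like(rewards)

for t in range(T):
    # Primal best response to the current Lagrangian r(a) - lam * c(a).
    a = int(np.argmax(rewards - lam * costs))
    avg_policy[a] += 1.0 / T
    # Dual no-regret step: raise lam when the constraint is violated.
    lam = max(0.0, lam + eta * (costs[a] - budget))

print(avg_policy)                 # mixed policy from the averaged iterates
print(float(avg_policy @ costs))  # average cost close to the budget
```

The averaged iterates, rather than the last one, approximate the min-max solution; this is the usual guarantee for no-regret dynamics in saddle-point problems.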
We view offline optimization through the new lens of a distributional translation task, which can be modeled with a generalized Brownian Bridge diffusion process mapping between the low-value and high-value input distributions.
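A minimal sketch of the underlying bridge, assuming Gaussian noise and illustrative shapes (not the paper's generalized construction): a Brownian bridge pins x_t to a low-value sample at t = 0 and a high-value sample at t = 1, with variance sigma^2 * t * (1 - t); a bridge diffusion model is trained around this interpolation.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    # Draw x_t ~ N((1 - t) * x0 + t * x1, sigma^2 * t * (1 - t) * I),
    # which is exactly x0 at t = 0 and exactly x1 at t = 1.
    rng = rng or np.random.default_rng()
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x_low  = rng.standard_normal(4)  # stand-in for a low-value input
x_high = x_low + 2.0             # stand-in for a high-value input
for t in (0.0, 0.5, 1.0):
    print(t, brownian_bridge_sample(x_low, x_high, t, rng=rng))
```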
Learning Reconfigurable Representations for Multimodal Federated Learning with Heterogeneous Missing Data