2 papers across 2 sessions
For off-policy RL, we give a theoretical analysis of the role of an additive reward correction in improving performance, accompanied by experiments on bandits and LLM post-training.