2 papers across 2 sessions
Instance-level adaptive KL penalty control method for Direct Preference Optimization