Principal Researcher, FAR.AI
1 paper at NeurIPS 2025
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that lead the trained policy to be honest or to evade the detector.