We find that the reasoning traces of an RL-trained model often contain illegible segments, which could compromise chain-of-thought monitoring for detecting malicious behavior.