This paper proposes Panacea, a post-fine-tuning method that mitigates harmful fine-tuning attacks on large language models, maintaining safety alignment without sacrificing performance across different tasks and models.
Defences that aim to detect individual malicious or suspicious samples are insufficient against LLM misuse fine-tuning attacks.