2 papers across 2 sessions
We propose BAM-ICL, a novel budget-constrained adversarial manipulation framework for hijacking attacks on in-context learning.
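A minimal sketch of the general idea, assuming a greedy budget-constrained attack: the attacker spends a fixed edit budget across the in-context demonstrations, at each step keeping whichever single edit most increases a hijack objective. Everything here (`attack_score`, `perturb_demo`, `budgeted_hijack`, the toy objective) is a hypothetical stand-in, not the paper's actual algorithm.

```python
# Hedged sketch: greedy budget allocation across ICL demonstrations.
# A real attack would query the victim LLM inside attack_score.
import random

random.seed(0)

BUDGET = 5  # max number of demonstration tokens the attacker may edit


def attack_score(demos: list[str]) -> float:
    """Toy stand-in for the attacker's objective: how strongly the
    demonstrations push the model toward the hijacked behavior."""
    return sum(d.count("<adv>") for d in demos) / (len(demos) * 3)


def perturb_demo(demo: str) -> str:
    """Toy single-token edit: replace one random word with an adversarial token."""
    words = demo.split()
    words[random.randrange(len(words))] = "<adv>"
    return " ".join(words)


def budgeted_hijack(demos: list[str], budget: int = BUDGET) -> list[str]:
    """Greedily apply the single demo edit that most increases the
    attack objective, until the budget is spent or no edit helps."""
    demos = list(demos)
    for _ in range(budget):
        best_gain, best = 0.0, None
        base = attack_score(demos)
        for i in range(len(demos)):
            cand = perturb_demo(demos[i])
            gain = attack_score(demos[:i] + [cand] + demos[i + 1:]) - base
            if gain > best_gain:
                best_gain, best = gain, (i, cand)
        if best is None:  # budget left over, but no edit improves the objective
            break
        i, cand = best
        demos[i] = cand
    return demos


demos = [
    "the movie was great label positive",
    "the food was awful label negative",
    "service was fine label neutral",
]
print(budgeted_hijack(demos))
```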
We introduce a novel method that uses noise injection to elicit the latent capabilities of sandbagging LLMs.
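A minimal sketch of the elicitation idea, assuming weight-space Gaussian noise: perturb the model's parameters at increasing magnitudes and re-run an evaluation. A genuinely weak model degrades as noise grows, whereas a sandbagging model's score can rise at moderate noise levels. The tiny MLP, the toy task, and the `with_noise` helper are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: sweep Gaussian weight noise and watch eval accuracy.
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "model under evaluation" and a toy binary task.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(256, 16)
y = (x.sum(dim=1) > 0).long()


def accuracy(m: nn.Module) -> float:
    """Stand-in evaluation; a real run would score a capability benchmark."""
    with torch.no_grad():
        return (m(x).argmax(dim=1) == y).float().mean().item()


def with_noise(m: nn.Module, sigma: float) -> nn.Module:
    """Return a copy of m with i.i.d. Gaussian noise added to every
    parameter, scaled by sigma and by that parameter's own std."""
    noisy = copy.deepcopy(m)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * sigma * p.std())
    return noisy


# Sweep noise magnitudes; for a sandbagger, accuracy may peak at sigma > 0.
for sigma in [0.0, 0.01, 0.05, 0.1, 0.5]:
    print(f"sigma={sigma:.2f}  acc={accuracy(with_noise(model, sigma)):.3f}")
```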