We introduce a benchmark to measure the safety of general computer-use agents across diverse categories of harm.