2 papers across 2 sessions
KGGen is a text-to-knowledge-graph generator that uses language models to extract high-quality graphs from plain text.
We create a sandbox for LLM agents that elicits goal-directed, open-ended strategic deception, evaluate this deceptive capability, and show that linear probes detect it very well, even out of distribution (OOD).