2 papers across 1 session
This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, achieving cross-domain scalability without relying on human judgment.
The Factorio Learning Environment is an evaluation for frontier models that offers exponentially scaling challenges.