1 paper across 1 session
This paper introduces a novel method for generating benchmarks to evaluate semantic similarity metrics on LLM outputs, achieving cross-domain scalability without relying on human judgment.