5 papers across 3 sessions
CSCR embeds both prompts and LLMs into a shared space using fast logit or perplexity fingerprints. A cost‑banded InfoNCE loss trains the space to balance quality against cost. It generalizes to unseen models and out‑of‑distribution prompts.
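To make the cost-banded contrastive objective concrete, here is a minimal NumPy sketch of an InfoNCE-style loss in which each prompt's candidate set is restricted to models in its assigned cost band. The function name, the `band_edges` bucketing, and the temperature default are illustrative assumptions; the paper's exact formulation and fingerprint features are not spelled out in this note.

```python
import numpy as np

def cost_banded_infonce(prompt_emb, model_emb, pos_idx,
                        model_cost, prompt_band, band_edges, tau=0.07):
    """Sketch of a cost-banded InfoNCE loss (names/defaults are assumptions).

    prompt_emb:  (B, d) prompt embeddings
    model_emb:   (M, d) model embeddings (e.g. from logit/perplexity fingerprints)
    pos_idx:     (B,)  index of the positive model for each prompt
    model_cost:  (M,)  per-model cost
    prompt_band: (B,)  cost band assigned to each prompt
    band_edges:  boundaries used to bucket model costs into bands
    """
    # cosine similarities between prompts and models, scaled by temperature
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    m = model_emb / np.linalg.norm(model_emb, axis=1, keepdims=True)
    logits = (p @ m.T) / tau                       # (B, M)

    # mask out models whose cost band differs from the prompt's band,
    # so contrast happens only among comparably priced candidates
    model_band = np.digitize(model_cost, band_edges)
    off_band = model_band[None, :] != prompt_band[:, None]
    logits = np.where(off_band, -np.inf, logits)

    # standard cross-entropy over the in-band candidates
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(pos_idx)), pos_idx].mean()
```

Masking off-band models (rather than penalizing cost directly) is one plausible way a "cost-banded" loss could balance quality against cost: within a band the space is trained purely for quality, while the band assignment carries the cost constraint.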
The authors propose the first efficient, training-free online routing algorithm for high-volume LLM serving under token-budget constraints, reporting significant gains in both routing performance and cost efficiency.