Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#4815

SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs

NeurIPS · OpenReview · Code

Abstract

Recent multimodal large language models (MLLMs) marry modality-specific vision or audio encoders with a shared text decoder. While the encoder is compute-intensive but memory-light, the decoder is the opposite; yet state-of-the-art serving stacks still time-multiplex these complementary kernels, idling SMs or HBM in turn.
We introduce SpaceServe, a serving system that space-multiplexes MLLMs: it decouples all modality encoders from the decoder and co-locates them on the same GPU using fine-grained SM partitioning available in modern runtimes. A cost-model-guided Space-Inference Scheduler (SIS) dynamically assigns SM slices, while a Time-Windowed Shortest-Remaining-First (TWSRFT) policy batches encoder requests to minimise completion latency and smooth decoder arrivals.
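The TWSRFT idea can be sketched in a few lines: collect encoder requests whose arrivals fall within the same time window, then dispatch each window's requests in shortest-remaining-work order. The sketch below is illustrative only, under assumed names (`EncoderRequest`, `twsrft_batches`) and a simple FLOPs-based work estimate; it is not the paper's implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class EncoderRequest:
    # Hypothetical request record: ordering key is estimated remaining
    # encoder work (a proxy for runtime); arrival time and id are excluded
    # from comparison so the heap orders purely by remaining work.
    remaining_work: float
    arrival: float = field(compare=False)
    req_id: int = field(compare=False)

def twsrft_batches(requests, window):
    """Illustrative Time-Windowed Shortest-Remaining-First batching:
    group requests arriving within `window` of the window opener, then
    emit each group's ids in shortest-remaining-work order."""
    batches = []
    requests = sorted(requests, key=lambda r: r.arrival)
    i = 0
    while i < len(requests):
        window_end = requests[i].arrival + window
        heap = []
        # Admit every request that arrives before the window closes.
        while i < len(requests) and requests[i].arrival <= window_end:
            heapq.heappush(heap, requests[i])
            i += 1
        # Drain in shortest-remaining-work-first order.
        batches.append([heapq.heappop(heap).req_id for _ in range(len(heap))])
    return batches
```

Batching by window smooths the arrival pattern seen by the decoder, while the shortest-remaining-first drain order keeps per-window completion latency low.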
Evaluation shows that SpaceServe reduces time-per-output-token by 4.81× on average and up to 28.9× on NVIDIA A100 GPUs. SpaceServe is available at https://github.com/gofreelee/SpaceServe.