Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#4815

SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs

NeurIPS · OpenReview · Code

Abstract

Recent multimodal large language models (MLLMs) marry modality-specific vision or audio encoders with a shared text decoder. While the encoder is compute-intensive but memory-light, the decoder is the opposite; yet state-of-the-art serving stacks still time-multiplex these complementary kernels, idling SMs or HBM in turn.
We introduce SpaceServe, a serving system that space-multiplexes MLLMs: it decouples all modality encoders from the decoder and co-locates them on the same GPU using fine-grained SM partitioning available in modern runtimes. A cost-model-guided Space-Inference Scheduler (SIS) dynamically assigns SM slices, while a Time-Windowed Shortest-Remaining-First (TWSRFT) policy batches encoder requests to minimise completion latency and smooth decoder arrivals.
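The TWSRFT idea can be sketched in a few lines: collect encoder requests whose arrivals fall within the same time window, then dispatch each window's requests in shortest-remaining-work order. The sketch below is illustrative only, under assumed names (`EncoderRequest`, `twsrft_batches`) and a simple FLOPs-based work estimate; it is not the paper's implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class EncoderRequest:
    # Hypothetical request record: ordering key is estimated remaining
    # encoder work (a proxy for runtime); arrival time and id are excluded
    # from comparison so the heap orders purely by remaining work.
    remaining_work: float
    arrival: float = field(compare=False)
    req_id: int = field(compare=False)

def twsrft_batches(requests, window):
    """Illustrative Time-Windowed Shortest-Remaining-First batching:
    group requests arriving within `window` of the window opener, then
    emit each group's ids in shortest-remaining-work order."""
    batches = []
    requests = sorted(requests, key=lambda r: r.arrival)
    i = 0
    while i < len(requests):
        window_end = requests[i].arrival + window
        heap = []
        # Admit every request that arrives before the window closes.
        while i < len(requests) and requests[i].arrival <= window_end:
            heapq.heappush(heap, requests[i])
            i += 1
        # Drain in shortest-remaining-work-first order.
        batches.append([heapq.heappop(heap).req_id for _ in range(len(heap))])
    return batches
```

Batching by window smooths the arrival pattern seen by the decoder, while the shortest-remaining-first drain order keeps per-window completion latency low.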
Evaluation shows that SpaceServe reduces time-per-output-token by 4.81× on average and up to 28.9× on NVIDIA A100 GPUs. SpaceServe is available at https://github.com/gofreelee/SpaceServe.