Poster Session 5 · Friday, December 5, 2025 11:00 AM → 2:00 PM
#2303

MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study

NeurIPS Project Page OpenReview

Abstract

Analyzing ancient manuscripts has traditionally been a labor-intensive and time-consuming task for philologists. While recent advances in large multimodal models (LMMs) have demonstrated their potential across diverse domains, their effectiveness in manuscript study remains underexplored.
In this paper, we introduce MS-Bench, the first comprehensive benchmark co-developed with archaeologists, comprising 5,076 high-resolution images dating from the 4th to the 14th century and 9,982 expert-curated questions across nine sub-tasks aligned with archaeological workflows. Using four prompting strategies, we systematically evaluate 32 LMMs on their effectiveness, robustness, and cultural contextualization.
Our analysis reveals scale-driven gains in performance and reliability, the impact of prompting strategies (CoT has a two-sided effect, while visual retrieval-augmented prompts provide a consistent boost), and task-specific preferences depending on each LMM's visual capabilities.
Although current LMMs are not yet capable of replacing domain expertise, they demonstrate promising potential to accelerate manuscript research through future human–AI collaboration.