Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#5403

ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions

NeurIPS OpenReview

Abstract

Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information.
To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch supports controllable diversity and difficulty in the generated tasks, and ensures response precision through tool planning and distillation together with tool-based self-repair mechanisms.
The effectiveness of ChemOrch is evaluated based on:
  1. the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints;
  2. the dynamic generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and
  3. the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning.
Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs. The code is available at https://anonymous.4open.science/r/ChemOrch-854A.
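
To make the two-stage process described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not ChemOrch's actual implementation: the function names (`molar_mass_tool`, `generate_instruction`, `construct_response`), the template-based instruction generator, and the tiny atomic-mass table are all hypothetical stand-ins. A real pipeline would presumably use an LLM and real chemistry tools in their place.

```python
import re

# Hypothetical stand-in for a chemistry tool: a tiny atomic-mass table (g/mol).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass_tool(formula: str) -> float:
    """Toy tool: molar mass of a simple formula such as 'C2H6O' (no parentheses)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return round(total, 3)


def generate_instruction(formula: str, difficulty: str) -> str:
    """Stage 1 (sketch): task-controlled instruction generation from templates."""
    if difficulty == "easy":
        return f"What is the molar mass of {formula} in g/mol?"
    return f"How many moles are in 10 g of {formula}?"


def construct_response(formula: str, difficulty: str, draft: float) -> dict:
    """Stage 2 (sketch): check a draft answer against the tool and repair it if needed."""
    mass = molar_mass_tool(formula)
    expected = mass if difficulty == "easy" else round(10 / mass, 4)
    repaired = abs(draft - expected) > 1e-3  # tool output disagrees with the draft
    return {"answer": expected if repaired else draft, "repaired": repaired}


if __name__ == "__main__":
    print(generate_instruction("H2O", "easy"))
    # A deliberately imprecise draft answer triggers the repair path.
    print(construct_response("H2O", "easy", draft=18.0))
```

In this sketch the `difficulty` argument plays the role of task control, and the comparison against the tool output plays the role of tool-based self-repair: a draft answer that disagrees with the tool is replaced by the tool-derived value.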