Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

From arXiv; accepted by KDD 2026. Code and datasets are available at https://github.com/phenixace/S2-TOMG-Bench and https://huggingface.co/datasets/phenixace/S2-TOMG-Bench

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings: they measure an LLM's ability to retrieve a single, pre-defined answer rather than its creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction-tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs, such as GPT-4o and Claude-3.5, on S^2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our code and datasets are fully accessible through the GitHub repository: https://github.com/phenixace/S2-TOMG-Bench and Hugging Face Datasets: https://huggingface.co/datasets/phenixace/S2-TOMG-Bench.
