Large Language Models (LLMs) excel at textual reasoning and are beginning to develop spatial understanding, prompting the question of whether these abilities can be combined for complex, domain-specific tasks. This question is essential in fields like materials science, where a deep understanding of 3D atomic structures is fundamental. While initial studies have successfully applied LLMs to tasks involving crystal generation or coordinate understanding, a standardized benchmark for systematically evaluating their core reasoning abilities across diverse atomic structures has been notably absent. To address this gap, we introduce the AtomWorld benchmark, which evaluates LLMs on tasks based on Crystallographic Information Files (CIFs), a standard structure representation format. These tasks, including structural editing, CIF perception, and property-guided modeling, reveal a critical limitation: current models, despite establishing promising baselines, consistently fail at structural understanding and spatial reasoning. Our experiments show that these models make frequent errors on structure modification tasks, and even on basic CIF format comprehension, potentially leading to cumulative errors in downstream analysis and materials insights. By defining these standardized tasks, AtomWorld lays the groundwork for advancing LLMs toward robust atomic-scale modeling, crucial for accelerating materials research and automating scientific workflows.