Large language models are increasingly applied to materials science reasoning, yet their behavior under physically structured distribution shifts remains poorly understood. We introduce SCALAR (Structural Consistency And Logic Across Regimes), a benchmark for evaluating geometric scale generalization and its connection to structural hallucination, consistency, and reasoning in materials foundation models. Given canonical crystal representations, models must reason about derived nanoparticle structures obtained through supercell expansion and geometric truncation, across length scales ranging from a few atoms to over 18,000 atoms and totaling $\approx$100,000 structures derived from DFT-validated unit cells. SCALAR defines three tasks: (i) CIF-to-property prediction; (ii) a chain-of-thought variant with explicit, physics-grounded reasoning; and (iii) inverse retrieval, which identifies crystals from candidate sets given target properties. Outputs are evaluated via structured metrics capturing numeric error, hallucination, cross-prompt consistency, monotonic reasoning, output validity, and retrieval regret. Experiments across diverse foundation models reveal large, model-dependent shifts under explicit reasoning, which often reduces hallucination and error but frequently destabilizes consistency or validity. These results demonstrate that geometric scale generalization cannot be inferred from accuracy alone. Supplementary materials are available at https://github.com/KurbanIntelligenceLab/SCALAR.
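The structure-generation procedure the abstract describes, supercell expansion of a unit cell followed by geometric truncation, can be sketched as follows. This is a minimal illustration, not SCALAR's actual pipeline: it assumes a cubic unit cell given as fractional coordinates, tiles it into an $n \times n \times n$ supercell, and applies a spherical truncation about the supercell center to carve out a nanoparticle. The function name `build_nanoparticle` and all parameters are hypothetical.

```python
import numpy as np

def build_nanoparticle(frac_coords, lattice_constant, n_cells, radius):
    """Tile a cubic unit cell into an n x n x n supercell, then keep only
    atoms within `radius` of the supercell centroid (spherical truncation).
    Returns Cartesian coordinates (angstroms) of the resulting nanoparticle.
    Illustrative sketch only; SCALAR's exact construction may differ."""
    # Integer lattice translations for every cell in the supercell.
    shifts = np.array([[i, j, k]
                       for i in range(n_cells)
                       for j in range(n_cells)
                       for k in range(n_cells)], dtype=float)
    # Replicate the basis atoms across all cells, then convert
    # fractional coordinates to Cartesian (cubic cell: scale by a).
    all_frac = (frac_coords[None, :, :] + shifts[:, None, :]).reshape(-1, 3)
    cart = all_frac * lattice_constant
    # Spherical truncation about the centroid of the supercell.
    center = cart.mean(axis=0)
    keep = np.linalg.norm(cart - center, axis=1) <= radius
    return cart[keep]

# Hypothetical example: simple cubic cell (a = 3 angstroms), one atom
# per cell, expanded 10x10x10 and truncated to a 9-angstrom sphere.
unit = np.array([[0.0, 0.0, 0.0]])
particle = build_nanoparticle(unit, lattice_constant=3.0,
                              n_cells=10, radius=9.0)
print(len(particle))  # → 136
```

Sweeping `n_cells` and `radius` is what produces structures spanning a few atoms up to the tens-of-thousands regime the benchmark covers.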