Language models can hallucinate when performing complex and detailed mathematical reasoning. Physics provides a rich domain for assessing mathematical reasoning capabilities, where physical context imbues symbols with meaning and their use must satisfy complex semantics (\textit{e.g.,} units, tensorial order), leading to instances where inference may be algebraically coherent, yet unphysical. In this work, we assess the ability of Language Models (LMs) to perform fine-grained mathematical and physical reasoning using a curated dataset encompassing multiple notations and physics subdomains. We improve zero-shot scores using synthetic in-context examples, and demonstrate non-linear degradation of derivation quality with perturbation strength via the progressive omission of supporting premises. We find that the models' mathematical reasoning is not physics-informed in this setting, where physical context is predominantly ignored in favour of reverse-engineering solutions.