Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks across multiple domains, including mathematical and scientific reasoning. We demonstrate that, with carefully designed prompts, LLMs can accurately carry out key calculations from research papers in theoretical physics. We focus on a widely used approximation method in quantum physics: the Hartree-Fock method, which requires an analytic multi-step calculation to derive an approximate Hamiltonian and the corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it correctly derives the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in the remaining 2. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, these calculations require graduate-level skill in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in the templates and (ii) automatically scoring the calculation steps, demonstrating good results in both cases. This strong performance is a first step toward developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.
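The idea of a multi-step prompt template with placeholders for problem-specific information can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the step wording, the placeholder names (`system`, `operators`, `interaction`, `expectation`), the helper `fill_templates`, and the example values are hypothetical and are not taken from the paper's actual templates.

```python
from string import Template

# Hypothetical step templates: the wording and placeholder names are
# illustrative assumptions, not the templates used in the paper.
STEP_TEMPLATES = [
    Template("Construct the kinetic Hamiltonian for $system using $operators."),
    Template("Add the interaction term, $interaction, in second-quantized form."),
    Template(
        "Apply the Hartree-Fock mean-field decoupling to the quartic term, "
        "keeping the expectation values $expectation."
    ),
]

# Hypothetical problem-specific information, of the kind that would be
# extracted from a paper to fill in the placeholders.
EXAMPLE_INFO = {
    "system": "a square-lattice Hubbard model",
    "operators": "fermionic creation and annihilation operators",
    "interaction": "an on-site repulsion U n_up n_down",
    "expectation": "<c^dagger_{i,s} c_{j,s}>",
}


def fill_templates(templates, info):
    """Return one filled prompt per step.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete extraction is caught before any prompt is sent.
    """
    return [template.substitute(info) for template in templates]


if __name__ == "__main__":
    for number, prompt in enumerate(fill_templates(STEP_TEMPLATES, EXAMPLE_INFO), 1):
        print(f"Step {number}: {prompt}")
```

Using `Template.substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing placeholder fails loudly instead of leaving an unfilled `$name` in a prompt.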