Navigating the particle accelerator literature has become increasingly challenging as the volume of contributions has surged in recent years. The devices themselves are intricate and hard to fully understand, even within an individual facility. To address this, we introduce PACuna, a language model fine-tuned on publicly available accelerator resources such as conference papers, pre-prints, and books. We automated data collection and question generation to minimize expert involvement, and we make the data publicly available. PACuna demonstrates proficiency in answering intricate accelerator questions, as validated by experts. Our approach shows that adapting language models to scientific domains, by fine-tuning on technical texts and on auto-generated corpora that capture the latest developments, can produce pre-trained models that answer some intricate questions that commercially available assistants cannot, and that can serve as intelligent assistants for individual facilities.