Navigating the landscape of particle accelerators has become increasingly challenging with the recent surge in contributions. These intricate devices are difficult to comprehend, even within individual facilities. To address this, we introduce PACuna, a language model fine-tuned on publicly available accelerator resources such as conference proceedings, pre-prints, and books. We automated data collection and question generation to minimize expert involvement, and we make the data publicly available. PACuna demonstrates proficiency in addressing intricate accelerator questions, as validated by experts. Our approach shows that adapting language models to scientific domains, by fine-tuning on technical texts and on auto-generated corpora that capture the latest developments, can in turn produce pre-trained models that answer some intricate questions that commercially available assistants cannot, and that can serve as intelligent assistants for individual facilities.