Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape is a significant challenge for both new and experienced researchers; the difficulty hinders knowledge sharing and slows the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG)-based system designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline that uses Selenium to retrieve documents from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA's entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premises, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts and only then searches that analysis's full documentation, resolving potential ambiguities between different analyses. We demonstrate that the prototype outperforms a standard keyword-based baseline in retrieval on realistic queries, and we discuss future work towards a comprehensive research agent for large experimental collaborations.
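The two-tiered lookup described above can be illustrated with a minimal sketch. This is not MITRA's implementation: the function names (`embed`, `cosine`, `two_tier_search`), the analysis identifiers, and the toy texts are all hypothetical, and a bag-of-words vector stands in for the embedding model and vector database. The sketch only shows the control flow: match the query against abstracts first, then rank chunks from the selected analysis alone.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def two_tier_search(query, abstracts, chunks, top_k=2):
    """Tier 1: pick the analysis whose abstract best matches the query.
    Tier 2: rank only that analysis's document chunks against the query."""
    q = embed(query)
    analysis_id = max(abstracts, key=lambda aid: cosine(q, embed(abstracts[aid])))
    ranked = sorted(
        chunks[analysis_id], key=lambda c: cosine(q, embed(c)), reverse=True
    )
    return analysis_id, ranked[:top_k]


# Hypothetical toy data: two analyses whose documents contain near-identical
# sentences, which a single flat index could confuse.
ABSTRACTS = {
    "ANA-A": "Search for heavy resonances decaying to muon pairs",
    "ANA-B": "Measurement of the top quark pair production cross section",
}
CHUNKS = {
    "ANA-A": [
        "Muon candidates are required to pass isolation criteria.",
        "The dominant background is Drell-Yan production.",
    ],
    "ANA-B": [
        "Jets are clustered with the anti-kt algorithm.",
        "The dominant background is single top quark production.",
    ],
}

if __name__ == "__main__":
    query = "What is the dominant background in the top quark pair cross section measurement?"
    print(two_tier_search(query, ABSTRACTS, CHUNKS, top_k=1))
```

Both toy analyses contain a sentence beginning "The dominant background is", so a flat search over all chunks could surface either one. Restricting the second tier to the analysis chosen in the first tier is what removes that ambiguity in this sketch.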