In the era of graph-based retrieval-augmented generation (RAG), link prediction is an important preprocessing step for improving the quality of fragmented or incomplete domain-specific data used for graph retrieval. Knowledge management in the process industry uses RAG-based applications to optimize operations, ensure safety, and facilitate continuous improvement by leveraging operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records are often kept separate even though they belong to a single event or process. This fragmentation hinders the recommendation of previously implemented solutions to users, which is crucial for timely problem-solving at live production sites. To address this problem, we develop a record linking model and frame record linking as a cross-document coreference resolution (CDCR) task. Record linking adapts the task definition of CDCR and combines two state-of-the-art CDCR models with the principles of natural language inference (NLI) and semantic text similarity (STS) to perform link prediction. The evaluation shows that our record linking model outperforms the best versions of our baselines, i.e., NLI and STS, by 28% (11.43 percentage points) and 27.4% (11.21 percentage points), respectively. Our work demonstrates that common NLP tasks can be combined and adapted to a domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.
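To make the link prediction setting concrete, the following is a minimal sketch of pairwise record linking under stated assumptions. It is not the model described in the abstract: the token-overlap score `jaccard` is only a crude stand-in for an STS model, the threshold of 0.3 is arbitrary, the grouping of linked records into events via union-find is an illustrative choice, and the example records are hypothetical English entries rather than real German shift log data.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for an STS model score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def predict_links(records, threshold=0.3):
    """Return index pairs (i, j) whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]


def group_events(n, links):
    """Merge linked records into event clusters with a simple union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical shift book entries, invented for illustration only.
records = [
    "pump P101 leaking at seal",
    "seal replaced on pump P101",
    "reactor R2 temperature alarm",
]

links = predict_links(records)
events = group_events(len(records), links)
print(links)   # pairs of records predicted to belong to the same event
print(events)  # records grouped into events
```

In this toy setup the first two entries share enough tokens to be linked, so they form one event, while the third stays on its own. A learned pairwise scorer (for example, one based on NLI or STS) would replace `jaccard` without changing the surrounding structure.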