In the era of graph-based retrieval-augmented generation (RAG), link prediction is an important preprocessing step that improves the quality of fragmented or incomplete domain-specific data before graph retrieval. Knowledge management in the process industry uses RAG-based applications to optimize operations, ensure safety, and facilitate continuous improvement by leveraging operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records are often kept separate even though they belong to a single event or process. This fragmentation hinders recommending previously implemented solutions to users, which is crucial for timely problem-solving at live production sites. To address this problem, we develop a record linking (RL) model and frame record linking as a cross-document coreference resolution (CDCR) task. RL adapts the task definition of CDCR and combines two state-of-the-art CDCR models with the principles of natural language inference (NLI) and semantic text similarity (STS) to perform link prediction. The evaluation shows that our RL model outperforms the best versions of our baselines, i.e., NLI and STS, by 28% (11.43 points) and 27.4% (11.21 points), respectively. Our work demonstrates that common NLP tasks can be combined and adapted to the domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.