Large Language Models (LLMs) need to adopt Retrieval-Augmented Generation (RAG) to generate factual responses that are better suited to knowledge-based applications in the design process. We present a data-driven method to identify explicit facts of the form head entity :: relationship :: tail entity from patented artefact descriptions. We train RoBERTa Transformer-based sequence classification models using our proprietary dataset of 44,227 sentences. After classifying the tokens in a sentence as entities or relationships, our method uses another classifier to identify the specific relationship tokens for a given pair of entities. We compare the performance of our method against linear classifiers and Graph Neural Networks (GNNs), both of which incorporate BERT Transformer-based token embeddings to predict associations among the entities and relationships. We apply our method to 4,870 patents related to fan systems and populate a knowledge base comprising around 3 million facts. Using the knowledge base, we demonstrate the retrieval of generalisable and specific domain knowledge for contextualising LLMs.
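To make the fact format and the retrieval step concrete, the following is a minimal sketch of how facts of the form head entity :: relationship :: tail entity could be stored and retrieved to contextualise an LLM prompt. It is an illustration only, not the implementation described above: the names `Fact`, `KnowledgeBase`, and `build_context`, the exact-match lookup by entity, and the three example facts are all hypothetical and are not drawn from the patent data.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One explicit fact: head entity :: relationship :: tail entity."""
    head: str
    relationship: str
    tail: str


class KnowledgeBase:
    """Hypothetical in-memory store that indexes facts by their entities."""

    def __init__(self):
        self._by_entity = defaultdict(list)

    def add(self, fact):
        # Index the fact under both entities (case-insensitive); the set
        # avoids indexing it twice when head and tail are the same string.
        for entity in {fact.head.lower(), fact.tail.lower()}:
            self._by_entity[entity].append(fact)

    def retrieve(self, entity):
        # Exact-match lookup is an assumption made for brevity.
        return list(self._by_entity.get(entity.lower(), []))


def build_context(kb, entity):
    """Format the retrieved facts as plain-text lines for an LLM prompt."""
    return "\n".join(
        f"{f.head} :: {f.relationship} :: {f.tail}" for f in kb.retrieve(entity)
    )


# Made-up example facts, for illustration only.
kb = KnowledgeBase()
kb.add(Fact("fan blade", "attached to", "hub"))
kb.add(Fact("motor", "drives", "hub"))
kb.add(Fact("housing", "encloses", "motor"))

context = build_context(kb, "hub")
prompt = f"Known facts:\n{context}\n\nQuestion: What is the hub connected to?"
```

In this sketch the retrieved facts are simply prepended to the question. A knowledge base of around 3 million facts would likely need a more scalable index and a ranking step, which are outside the scope of this example.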