Large Language Models (LLMs) need Retrieval-Augmented Generation (RAG) to generate factual responses suited to knowledge-based applications in the design process. We present a data-driven method to identify explicit facts of the form head entity :: relationship :: tail entity from patented artefact descriptions. We train RoBERTa Transformer-based sequence classification models on our proprietary dataset of 44,227 sentences. After classifying the tokens in a sentence as entities or relationships, our method uses a second classifier to identify the specific relationship tokens for a given pair of entities. We compare its performance against linear classifiers and Graph Neural Networks (GNNs), both of which incorporate BERT Transformer-based token embeddings, for predicting associations among entities and relationships. We apply our method to 4,870 fan-system-related patents and populate a knowledge base of around 3 million facts. Using this knowledge base, we demonstrate the retrieval of both generalisable and domain-specific knowledge for contextualising LLMs.
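To make the two-stage pipeline concrete, the sketch below shows the data flow only: tokens already tagged as entities (`E`) or relationships (`R`) are paired, and the relationship tokens between each entity pair are assembled into a head :: relationship :: tail fact. The `Fact` dataclass, the label names, and the rule-based pairing are illustrative assumptions; in the method described above, both the token tagging and the relationship selection are performed by trained Transformer-based classifiers, not by these simple rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A head entity :: relationship :: tail entity triple (hypothetical structure)."""
    head: str
    relationship: str
    tail: str

def extract_facts(tagged_tokens):
    """Assemble facts from (token, label) pairs, where label is 'E' (entity)
    or 'R' (relationship). Consecutive entity pairs become head and tail,
    and the 'R' tokens between them form the relationship. This rule-based
    step stands in for the paper's second, learned classifier."""
    facts = []
    entity_positions = [i for i, (_, lab) in enumerate(tagged_tokens) if lab == "E"]
    for a, b in zip(entity_positions, entity_positions[1:]):
        rel_tokens = [tok for tok, lab in tagged_tokens[a + 1:b] if lab == "R"]
        if rel_tokens:
            facts.append(Fact(tagged_tokens[a][0], " ".join(rel_tokens), tagged_tokens[b][0]))
    return facts

# Example with pre-tagged tokens from an invented fan-system sentence:
tagged = [("impeller", "E"), ("drives", "R"), ("the", "O"), ("airflow", "E")]
print(extract_facts(tagged))
```

Running this prints `[Fact(head='impeller', relationship='drives', tail='airflow')]`; in the full method, such triples populate the knowledge base queried at RAG time.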