Automating prior art search in patent research with artificial intelligence (AI) depends on building large datasets for machine learning (ML) and making them available. This work offers a comprehensive solution to the problem of creating research infrastructure for this field, including datasets and tools for calculating search quality criteria. The paper introduces the authors' concept of semantic clusters of patent documents, which determine the state of the art in a given subject, and provides a definition of such clusters. Prior art search is framed as the task of identifying the elements of the semantic cluster of patent documents in the subject area specified by the document under consideration. A generator of user-configurable ML datasets, based on collections of U.S. and Russian patent documents, is described. The generator first builds a database of links to the documents in semantic clusters and then, based on user-defined parameters, forms a dataset of semantic clusters in JSON format for ML. A collection of publicly available patent documents was created, containing 14 million semantic clusters of U.S. patent documents and 1 million clusters of Russian patent documents. To evaluate ML outcomes, the paper proposes calculating search quality scores that account for the semantic clusters of the documents being searched. To automate the evaluation, it describes a utility developed by the authors for assessing the quality of prior art document search.
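As a rough illustration of cluster-aware evaluation, the sketch below scores ranked search results against semantic clusters loaded from JSON. It is not the authors' utility: the record layout (`query`, `cluster`), the document identifiers, and the function names are all hypothetical, and recall@k and mean average precision are standard retrieval metrics used here as stand-ins, since the abstract does not specify which scores are computed.

```python
import json

# Hypothetical record layout: the abstract only says clusters are emitted as JSON.
# "query" is the document under consideration; "cluster" lists its prior art.
SAMPLE = """
[
  {"query": "US-1000", "cluster": ["US-0001", "US-0002", "RU-0003"]},
  {"query": "US-2000", "cluster": ["US-0004"]}
]
"""


def recall_at_k(ranked, relevant, k):
    """Fraction of the cluster's documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Average precision of one ranked list (assumes no duplicate results)."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this hit's rank
    return total / len(relevant)


def evaluate(clusters, results, k=10):
    """Average both metrics over all query documents in the dataset."""
    recalls, precisions = [], []
    for record in clusters:
        relevant = set(record["cluster"])
        ranked = results.get(record["query"], [])  # no results counts as a miss
        recalls.append(recall_at_k(ranked, relevant, k))
        precisions.append(average_precision(ranked, relevant))
    n = len(clusters)
    if n == 0:
        return {"recall@k": 0.0, "MAP": 0.0}
    return {"recall@k": sum(recalls) / n, "MAP": sum(precisions) / n}


if __name__ == "__main__":
    clusters = json.loads(SAMPLE)
    # Made-up output of some search system: ranked document ids per query.
    results = {
        "US-1000": ["US-0001", "X-9999", "RU-0003"],
        "US-2000": ["Y-9999", "US-0004"],
    }
    print(evaluate(clusters, results, k=2))
```

Treating every member of a cluster as equally relevant is a simplifying assumption of this sketch; the scores proposed in the paper may weight cluster members differently.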