In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We also develop PatentEval, a benchmark for systematically assessing language models in this context. Our study includes a human-annotated comparative analysis of various models, ranging from those specifically adapted during training to tasks in the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explore and evaluate several metrics designed to approximate human judgments in patent text evaluation, analyzing the extent to which they align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.