This paper presents Patent-CR, the first dataset created for the patent claim revision task in English. It includes both initial patent applications rejected by patent examiners and the final granted versions. Unlike typical text revision tasks, which predominantly focus on enhancing sentence quality through grammar correction and coherence improvement, patent claim revision aims to ensure that claims meet stringent legal criteria. These criteria extend beyond novelty and inventiveness to include clarity of scope, technical accuracy, language precision, and legal robustness. We assess various large language models (LLMs) through professional human evaluation, including general LLMs of different sizes and architectures, text revision models, and domain-specific models. Our results indicate that LLMs often introduce ineffective edits that deviate from the target revisions. In addition, domain-specific models and fine-tuning show promising results. Notably, GPT-4 outperforms the other tested LLMs, but its revisions still fall short of the examination standard. Furthermore, we demonstrate inconsistencies between automated and human evaluation results, and find that GPT-4-based automated evaluation correlates most strongly with human judgment. This dataset, along with our preliminary empirical research, offers valuable insights for further exploration of patent claim revision.