A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models

From arXiv: 58 pages, 8 tables, 6 figures. Substantial changes made to version 2: new Section 4.1 added (including a new table); minor normalization issue corrected in the values listed in Appendix B; content of former Appendix C moved to Section 3; new Appendix C added. Minor changes made to version 3 (style, typos, language).

This work proposes to measure the scope of a patent claim as the reciprocal of the self-information contained in the claim. A probability of occurrence of the claim is obtained from a language model, and this probability is used to compute the self-information. Grounded in information theory, the approach rests on the assumption that an unlikely concept is more informative than a usual concept, insofar as it is more surprising. In turn, the more surprising the information required to define the claim, the narrower its scope. Five language models are considered, ranging from the simplest models (each word or character is assigned an identical probability), to intermediate models (using average word or character frequencies), to a large language model (GPT2). Interestingly, the scope resulting from the simplest language models is proportional to the reciprocal of the number of words or characters in the claim, a metric already used in previous works. The approach is applied to multiple series of patent claims directed to distinct inventions, where each series consists of claims devised to have a gradually decreasing scope. The performance of the language models is assessed with respect to several ad hoc tests. The more sophisticated the model, the better the results: the GPT2 probability model outperforms the models based on word and character frequencies, which themselves outperform the simplest models based on word or character counts. Still, the character count appears to be a more reliable indicator than the word count.
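
As a rough illustration of the definition above, the following minimal sketch computes scope as the reciprocal of self-information for the two simpler families of models. It is not the paper's implementation. It assumes that the claim probability is the product of independent per-word probabilities, that self-information is measured in bits (base-2 logarithm), and that words are split on whitespace. The function names, the example claims, the vocabulary size, and the frequency table `TOY_FREQ` are all hypothetical.

```python
import math


def self_information(token_probs):
    """Self-information in bits of a sequence of independent tokens:
    I = -log2(prod p_i) = -sum(log2 p_i)."""
    return -sum(math.log2(p) for p in token_probs)


def claim_scope(token_probs):
    """Scope defined as the reciprocal of the self-information."""
    return 1.0 / self_information(token_probs)


def uniform_scope(claim, vocab_size):
    """Simplest model: every word gets the same probability 1/vocab_size,
    so I = n * log2(vocab_size) and the scope is proportional to 1/n."""
    n_words = len(claim.split())
    return claim_scope([1.0 / vocab_size] * n_words)


def frequency_scope(claim, word_freq, unseen_prob=1e-6):
    """Frequency-based model: each word gets its relative frequency from a
    lookup table; unseen words fall back to a small assumed probability."""
    words = claim.lower().split()
    return claim_scope([word_freq.get(w, unseen_prob) for w in words])


# Hypothetical claims of decreasing scope and made-up word frequencies.
broad = "a chair"
narrow = "a chair comprising four legs"
TOY_FREQ = {"a": 0.02, "chair": 1e-4, "comprising": 1e-4,
            "four": 1e-3, "legs": 1e-4}

print(uniform_scope(broad, 1024), uniform_scope(narrow, 1024))
print(frequency_scope(broad, TOY_FREQ), frequency_scope(narrow, TOY_FREQ))
```

Under the uniform model the scope depends only on the word count, which is why it reduces to the reciprocal word-count metric mentioned above. With a large language model such as GPT2, the per-token probabilities would instead be conditioned on the preceding tokens, but the reciprocal relationship between scope and self-information stays the same.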
