This work proposes measuring the scope of a patent claim as the reciprocal of the self-information contained in that claim. Grounded in information theory, this approach rests on the assumption that a rare concept is more informative than a common one because it is more surprising. The self-information is calculated from the probability of occurrence of the claim, as estimated by a language model. Five language models are considered, ranging from the simplest models (each word or character is drawn from a uniform distribution), to intermediate models (using average word or character frequencies), to a large language model (GPT2). Interestingly, the simplest language models reduce the scope measure to the reciprocal of the word or character count, a metric already used in previous works. The approach is applied to nine series of patent claims directed to distinct inventions, where the claims in each series have a gradually decreasing scope. The performance of the language models is then assessed with respect to several ad hoc tests. The more sophisticated the model, the better the results: the GPT2 model outperforms the models based on word and character frequencies, which in turn outperform the models based on word and character counts.
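As an illustration only, the scope measure described above can be sketched for the two simplest families of models. The snippet is not the authors' implementation: the whitespace tokenization, the toy corpus, the vocabulary size of 1024, the base-2 logarithm, and all function names are assumptions made for this example. It shows that, under a uniform model, the self-information is proportional to the word count, so the scope reduces to a constant times the reciprocal of that count.

```python
import math
from collections import Counter


def self_information_uniform(tokens, vocab_size):
    """Self-information in bits when every token is drawn uniformly
    from a vocabulary of vocab_size entries (assumed model)."""
    return len(tokens) * math.log2(vocab_size)


def self_information_frequency(tokens, freqs):
    """Self-information in bits under a unigram model:
    the sum of -log2 p(token) over the claim's tokens."""
    return sum(-math.log2(freqs[t]) for t in tokens)


def scope(self_information):
    """Scope as the reciprocal of the self-information."""
    return 1.0 / self_information


# Hypothetical toy corpus used only to estimate word frequencies.
corpus = ("a device comprising a sensor and a processor "
          "coupled to the sensor").split()
counts = Counter(corpus)
total = sum(counts.values())
freqs = {word: count / total for word, count in counts.items()}

# Two hypothetical claims: the second adds features to the first.
broad = "a device comprising a sensor".split()
narrow = "a device comprising a sensor and a processor".split()

VOCAB_SIZE = 1024  # assumed vocabulary size; log2(1024) = 10 bits per word

scope_broad_uniform = scope(self_information_uniform(broad, VOCAB_SIZE))
scope_narrow_uniform = scope(self_information_uniform(narrow, VOCAB_SIZE))
scope_broad_freq = scope(self_information_frequency(broad, freqs))
scope_narrow_freq = scope(self_information_frequency(narrow, freqs))

print(scope_broad_uniform, scope_narrow_uniform)
print(scope_broad_freq, scope_narrow_freq)
```

In both models the narrower claim receives a smaller scope, because each added word contributes a positive amount of self-information. A large language model would replace the per-word frequencies with conditional token probabilities, which this sketch does not attempt.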