We propose an unsupervised method for extracting keywords and keyphrases from text, based on a pre-trained language model (LM) and Shannon's information maximization. Specifically, the method extracts the phrases with the highest conditional entropy under the LM. The resulting set of keyphrases solves a relevant information-theoretic problem: when provided as side information, it minimizes the expected binary code length needed to compress the text with the LM and an entropy encoder. Alternatively, the resulting set can be viewed as an approximation, via a causal LM, of the set of phrases that minimizes the entropy of the text when conditioned on it. Empirically, the method yields results comparable to those of the most commonly used methods on various keyphrase extraction benchmarks.
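To make the selection criterion concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy add-one-smoothed bigram model in place of the pre-trained causal LM, and it assumes a phrase is scored by summing the entropy of the LM's next-token distribution over the positions the phrase covers. The names `BigramLM` and `top_entropy_phrases`, the `<s>` start symbol, and the example sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class BigramLM:
    """Toy add-one-smoothed bigram model (assumed stand-in for a pre-trained causal LM)."""

    def __init__(self, corpus_tokens):
        self.vocab = sorted(set(corpus_tokens) | {"<unk>"})
        self.counts = defaultdict(Counter)
        for prev, cur in zip(corpus_tokens, corpus_tokens[1:]):
            self.counts[prev][cur] += 1

    def next_token_probs(self, prev):
        """Return p(w | prev) for every w in the vocabulary."""
        row = self.counts.get(prev, Counter())
        total = sum(row.values()) + len(self.vocab)
        return {w: (row[w] + 1) / total for w in self.vocab}

    def entropy(self, prev):
        """Shannon entropy, in bits, of the next-token distribution given prev."""
        probs = self.next_token_probs(prev)
        return -sum(p * math.log2(p) for p in probs.values())


def top_entropy_phrases(lm, tokens, max_len=2, k=3):
    """Rank every n-gram (n <= max_len) by summed per-position entropy; return the top k."""
    # Entropy of predicting token t from token t-1; the first token is
    # conditioned on a hypothetical start symbol "<s>".
    per_position = [lm.entropy(prev) for prev in ["<s>"] + tokens[:-1]]
    scored = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            score = sum(per_position[i:i + n])
            scored.append((score, " ".join(tokens[i:i + n])))
    scored.sort(key=lambda pair: -pair[0])  # highest entropy first
    return scored[:k]


# Hypothetical background corpus and input text, for illustration only.
background = "the model reads the text and the model writes the text".split()
lm = BigramLM(background)
text = "the model compresses the text".split()
for score, phrase in top_entropy_phrases(lm, text):
    print(f"{score:.3f} bits  {phrase}")
```

Because the scores are summed, longer spans tend to rank higher. A real system would need some length normalization or a fixed candidate set, and the abstract does not say which choice the method makes.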