We propose an unsupervised, corpus-independent method for extracting keywords from a single text. It is based on the spatial distribution of words and on how this distribution responds to a random permutation of the words. Compared with existing methods (e.g. YAKE), our method has three advantages. First, it is significantly more effective at extracting keywords from long texts. Second, it allows inference of two types of keywords: local and global. Third, it uncovers basic themes in texts. Additionally, our method is language-independent and applies to short texts. The results were evaluated by human annotators with prior knowledge of the texts, which were drawn from our database of classical literary works; inter-annotator agreement ranges from moderate to substantial. Our results are further supported by human-independent arguments based on the average length of the extracted content words and on the average number of nouns among the extracted words. We discuss how keywords relate to higher-order textual features and reveal a connection between keywords and chapter divisions.
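To make the core idea concrete, the following is a minimal sketch of how a word's spatial distribution can be compared before and after a random permutation of the text. The abstract does not specify the statistic used, so everything here is an assumption for illustration: the coefficient of variation of the gaps between successive occurrences stands in for the method's actual measure, and the names `gap_cv`, `positions_by_word`, `clustering_scores`, `min_count`, and `n_shuffles` are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def gap_cv(positions):
    """Coefficient of variation of the gaps between successive occurrences.

    Assumed stand-in statistic: large values mean the word occurs in bursts.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return 0.0
    m = mean(gaps)
    return pstdev(gaps) / m if m > 0 else 0.0


def positions_by_word(tokens):
    """Map each token to the sorted list of indices where it occurs."""
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        index[tok].append(i)
    return index


def clustering_scores(tokens, min_count=5, n_shuffles=20, seed=0):
    """Score = statistic in the original text minus its mean over shuffles.

    Hypothetical scoring rule: a positive score means the word is more
    clustered in the real text than in randomly permuted copies of it.
    """
    rng = random.Random(seed)
    original = positions_by_word(tokens)
    candidates = [w for w, pos in original.items() if len(pos) >= min_count]

    baseline = {w: 0.0 for w in candidates}
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # random permutation of the word order
        permuted = positions_by_word(shuffled)
        for w in candidates:
            baseline[w] += gap_cv(permuted[w]) / n_shuffles

    return {w: gap_cv(original[w]) - baseline[w] for w in candidates}
```

Under these assumptions, one would tokenize a text, call `clustering_scores(tokens)`, and rank words by descending score. Words that occur at a near-constant rate tend to score at or below zero, while words concentrated in particular passages score higher. Separating local from global keywords would need a further criterion that this sketch does not attempt.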