Domain dependence and annotation subjectivity pose challenges for supervised keyword extraction. Based on the premises that second-order keyness patterns exist at the community level and are learnable from annotated keyword extraction datasets, this paper proposes a supervised ranking approach to keyword extraction that ranks keywords with keyness patterns consisting of independent features (such as sublanguage domain and term length) and three categories of dependent features: heuristic features, specificity features, and representativity features. The approach uses two convolutional-neural-network-based models to learn keyness patterns from keyword datasets and overcomes annotation subjectivity by training the two models with a bootstrap sampling strategy. Experiments demonstrate that the approach not only achieves state-of-the-art performance on ten keyword datasets in general supervised keyword extraction, with an average top-10 F-measure of 0.316, but also delivers robust cross-domain performance, with an average top-10 F-measure of 0.346 on four datasets that are excluded from the training process. This cross-domain robustness is attributed to three factors: community-level keyness patterns are limited in number and moderately independent of language domains; independent features are distinguished from dependent features; and the sampling training strategy balances excess risk against the lack of negative training data.
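The top-10 F-measure reported above can be illustrated with a minimal sketch. The function name `top_k_f_measure` and the exact-string matching rule are assumptions made for illustration; the abstract does not state how candidates are normalized (for example, by stemming) or how per-document scores are averaged across a dataset.

```python
def top_k_f_measure(ranked, gold, k=10):
    """F-measure of the top-k ranked keywords against gold keywords.

    Assumption: a candidate counts as correct only on exact string
    match; the paper's actual matching rule may differ.
    """
    # Drop duplicate candidates while preserving rank order, then cut at k.
    top = list(dict.fromkeys(ranked))[:k]
    gold_set = set(gold)

    hits = sum(1 for term in top if term in gold_set)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0

    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a dataset-level score would be the mean of this value over all documents in the dataset.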