In text documents such as news articles, the content and key events usually revolve around a subset of the entities mentioned in the document. These entities, often called salient entities, give the reader useful cues to the document's aboutness. Identifying entity salience has proved helpful in several downstream applications, including search, ranking, and entity-centric summarization. Prior work on salient entity detection has mainly focused on machine learning models that require heavy feature engineering. We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches. To this end, we conduct a comprehensive benchmark on four publicly available datasets using models representative of the medium-sized pre-trained language model family. Additionally, we show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.