Keyword extraction is a foundational task in natural language processing, underpinning countless real-world applications. One of these is contextual advertising, where keywords help predict the topical congruence between ads and their surrounding media contexts to enhance advertising effectiveness. Recent advances in artificial intelligence have improved keyword extraction capabilities but have also raised concerns about computational cost. Moreover, although the end-user experience is of vital importance, human evaluation of keyword extraction performance remains under-explored. This study provides a comparative evaluation of prevalent keyword extraction algorithms at different levels of complexity, represented by TF-IDF, KeyBERT, and Llama 2. To evaluate their effectiveness, a mixed-methods approach is employed, combining quantitative benchmarking with qualitative assessments from 855 participants across four survey-based experiments. The findings demonstrate that KeyBERT achieves an effective balance between user preferences and computational efficiency compared to the other algorithms. We observe a clear overall preference for gold-standard keywords, but there is a misalignment between algorithmic benchmark performance and user ratings. This reveals a long-overlooked gap between traditional precision-focused metrics and user-perceived algorithm efficiency. The study underscores the importance of human-in-the-loop evaluation methodologies and proposes analytical tools to facilitate their implementation.