Term frequency-inverse document frequency (tf-idf) and its many variants form a class of term weighting functions that are widely used in information retrieval. Although tf-idf was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence-from-randomness paradigm have since been advanced. In this work, we present an empirical study showing that the hypergeometric test of statistical significance corresponds very closely to a common tf-idf variant on selected real-data document retrieval and summarization tasks. These findings suggest that a fundamental mathematical connection between this tf-idf variant and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We offer this empirical case study as a first step toward explaining the long-standing effectiveness of tf-idf on a statistical significance testing foundation.
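To make the two quantities being compared concrete, the following minimal Python sketch computes a tf-idf score and the negative logarithm of a hypergeometric upper-tail probability for a single term in a single document. It is an illustration only, and everything in it is an assumption. The abstract does not say which tf-idf variant is studied or how the hypergeometric test is parameterized. Here the variant is taken to be tf · ln(N_docs / df), and the test is set up with the collection's tokens as the population, the term's collection-wide occurrences as successes, and the document's tokens as draws. All function and parameter names are hypothetical.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def neg_log_hypergeom_tail(k: int, successes: int, draws: int, population: int) -> float:
    """Return -ln P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Assumed mapping (not taken from the source):
      population = total tokens in the collection
      successes  = occurrences of the term in the collection
      draws      = tokens in the document
      k          = occurrences of the term in the document
    """
    hi = min(successes, draws)
    lo = max(k, 0, draws - (population - successes))
    if lo > hi:
        return math.inf  # the event X >= k is impossible
    log_denom = log_comb(population, draws)
    log_terms = [
        log_comb(successes, i) + log_comb(population - successes, draws - i) - log_denom
        for i in range(lo, hi + 1)
    ]
    # Log-sum-exp keeps the tail sum stable when individual terms are tiny.
    m = max(log_terms)
    log_p = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return max(0.0, -log_p)  # clamp rounding noise when the tail is ~1


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Assumed variant: raw term frequency times ln(n_docs / df)."""
    return tf * math.log(n_docs / df)


if __name__ == "__main__":
    # Hypothetical toy numbers, chosen only to exercise the functions.
    for tf in (1, 2, 5):
        score = tf_idf(tf, df=10, n_docs=1000)
        signif = neg_log_hypergeom_tail(tf, successes=20, draws=100, population=100_000)
        print(tf, round(score, 3), round(signif, 3))
```

Both scores grow as the term becomes more frequent in the document relative to the collection. Whether they track each other as closely as the study reports depends on the variant and test configuration actually used, which this sketch does not attempt to reproduce.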