Term frequency-inverse document frequency (TF-IDF) and its many variants form a class of term weighting functions that are widely used in text analysis applications. Although TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have since been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very nearly to the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance the empirical analyses herein as a first step toward explaining the long-standing effectiveness of TF-IDF through the lens of statistical significance testing. We hope that these results will open the door to the systematic evaluation of term weighting functions derived from significance testing in text analysis applications.
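To make the two quantities being compared concrete, the following is a minimal sketch, not the implementation used in the study. It assumes one common TF-IDF variant (raw term count times the natural log of the inverse document frequency) and one plausible token-level framing of the hypergeometric test: drawing the document's n tokens from a collection of N tokens, of which K are the term in question, and asking how likely it is to see at least k occurrences. The function names, the toy corpus, and this particular framing are all illustrative assumptions.

```python
import math
from collections import Counter


def neg_log_hypergeom_pvalue(k, N, K, n):
    """Return -log P(X >= k) for X ~ Hypergeometric(N, K, n).

    Assumed framing (hypothetical, for illustration):
    N: tokens in the collection, K: occurrences of the term in the collection,
    n: tokens in the document,   k: occurrences of the term in the document.
    Requires k <= min(K, n), which always holds for observed counts.
    """
    upper = min(K, n)
    # Exact tail count using integer binomial coefficients.
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    total = math.comb(N, n)
    # Work with logs of exact integers to avoid floating-point underflow.
    return math.log(total) - math.log(tail)


def score_terms(docs):
    """For each tokenized document, map term -> (tf-idf, -log p-value)."""
    num_docs = len(docs)
    N = sum(len(d) for d in docs)
    collection_counts = Counter(t for d in docs for t in d)
    doc_freq = Counter(t for d in docs for t in set(d))
    results = []
    for doc in docs:
        scores = {}
        for term, k in Counter(doc).items():
            # Assumed variant: raw term count times log inverse document frequency.
            tfidf = k * math.log(num_docs / doc_freq[term])
            nlp = neg_log_hypergeom_pvalue(k, N, collection_counts[term], len(doc))
            scores[term] = (tfidf, nlp)
        results.append(scores)
    return results


# Toy corpus, invented purely for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the hypergeometric test scores the term".split(),
]
results = score_terms(docs)
for term, (tfidf, nlp) in sorted(results[0].items()):
    print(f"{term:>4}  tf-idf={tfidf:.3f}  -log p={nlp:.3f}")
```

A corpus this small is only meant to show how each score is computed; it says nothing about how closely the two agree on real data, which is the empirical question the study addresses.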