Discrete entropy estimation is a classic information theory problem, wherein the average information content of a discrete random variable is estimated from samples alone. Naive approaches, such as the plug-in method, fail to account for the "missing mass": the probability mass associated with members of the random variable's support that are unobserved in a given sample. The resulting systematic underestimation is particularly problematic when data is time-consuming or costly to gather. We propose SENECA, an entropy estimation scheme based on a novel "self-consistent" missing mass calculation. Extensive numerical experiments indicate that our approach outperforms many state-of-the-art alternatives overall in the small-sample setting. We then apply SENECA to two practical use cases, namely biodiversity estimation and the detection of incorrect large language model responses, where our method is competitive with domain-specific approaches. Our work advances SENECA as an effective drop-in replacement for small-sample entropy estimation, with broad utility across several domains.
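To make the underestimation concrete, the following minimal sketch implements the plug-in estimator and, for comparison, the classical Good-Turing estimate of the missing mass (the fraction of the sample made up of symbols seen exactly once). This is not the SENECA procedure, whose self-consistent calculation the abstract does not specify; the function names `plugin_entropy` and `good_turing_missing_mass` and the uniform 100-symbol example are assumptions chosen purely for illustration.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in nats.

    Empirical frequencies are substituted directly into Shannon's formula,
    so unobserved symbols contribute nothing to the estimate.
    """
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_missing_mass(samples):
    """Good-Turing missing mass estimate: (# symbols seen exactly once) / n.

    Illustrative baseline only; this is NOT the self-consistent SENECA estimate.
    """
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


if __name__ == "__main__":
    # Hypothetical small-sample setting: 30 draws from a uniform
    # distribution over 100 symbols (true entropy = log 100 nats).
    rng = random.Random(0)
    sample = [rng.randrange(100) for _ in range(30)]
    print("true entropy     :", math.log(100))
    print("plug-in estimate :", plugin_entropy(sample))
    print("missing mass (GT):", good_turing_missing_mass(sample))
```

Because the plug-in estimate can never exceed the logarithm of the number of distinct observed symbols, which is itself at most the logarithm of the sample size, it necessarily falls below the true entropy whenever the sample is smaller than the support of a uniform distribution, as in this example.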