This paper addresses the challenge of identifying a minimal subset of discrete, independent variables that best predicts a binary class. We propose an efficient iterative method that, at each step, selects the variable providing the most statistically significant reduction in conditional entropy, using confidence bounds to account for finite-sample uncertainty. Tests on simulated data demonstrate that the method correctly identifies influential variables while minimizing spurious selections, even with small sample sizes, offering a computationally tractable approach to this NP-complete problem.
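The iterative selection described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how the confidence bounds are constructed, so this sketch substitutes a permutation-based null threshold, accepting a candidate only if its estimated entropy reduction exceeds the upper (1 - alpha) quantile of reductions obtained after shuffling that candidate within the strata of the variables already selected. The function names (`entropy_given`, `null_gain_bound`, `select_variables`), the parameters `n_perm` and `alpha`, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy_given(keys, labels):
    """Plug-in estimate of H(Y | X) in bits, where keys[i] is the tuple of
    conditioning-variable values for sample i."""
    n = len(labels)
    joint = Counter(zip(keys, labels))
    marg = Counter(keys)
    return -sum(k / n * math.log2(k / marg[key]) for (key, _), k in joint.items())


def null_gain_bound(base_keys, column, labels, h_base, rng, n_perm, alpha):
    """Upper (1 - alpha) quantile of the entropy reduction obtained when
    `column` is shuffled within each stratum of the selected variables.
    This stands in for the paper's confidence bounds (an assumption)."""
    strata = {}
    for i, key in enumerate(base_keys):
        strata.setdefault(key, []).append(i)
    gains = []
    for _ in range(n_perm):
        shuffled = list(column)
        for idx in strata.values():
            vals = [column[i] for i in idx]
            rng.shuffle(vals)
            for i, v in zip(idx, vals):
                shuffled[i] = v
        keys = [bk + (v,) for bk, v in zip(base_keys, shuffled)]
        gains.append(h_base - entropy_given(keys, labels))
    gains.sort()
    return gains[min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)]


def select_variables(rows, labels, n_perm=200, alpha=0.01, seed=0):
    """Greedy forward selection: at each step add the variable whose entropy
    reduction most exceeds its null threshold; stop when none exceeds it."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    selected = []
    base_keys = [() for _ in rows]
    while len(selected) < n_vars:
        h_base = entropy_given(base_keys, labels)
        best = None
        for c in range(n_vars):
            if c in selected:
                continue
            column = [r[c] for r in rows]
            keys = [bk + (v,) for bk, v in zip(base_keys, column)]
            gain = h_base - entropy_given(keys, labels)
            bound = null_gain_bound(base_keys, column, labels, h_base,
                                    rng, n_perm, alpha)
            margin = gain - bound
            if margin > 0 and (best is None or margin > best[0]):
                best = (margin, c, keys)
        if best is None:
            break
        selected.append(best[1])
        base_keys = best[2]
    return selected


# Toy simulated data (hypothetical): five independent binary variables,
# with the class determined by variables 0 and 2 only.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 1) for _ in range(5)] for _ in range(400)]
labels = [int(r[0] or r[2]) for r in rows]
selected = select_variables(rows, labels)
print(sorted(selected))
```

Because the selection is greedy, a sketch like this can miss variables that are informative only jointly (for example, a pure XOR of two inputs), and the permutation threshold is only one of several ways to guard against spurious selections at small sample sizes.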