It might seem counter-intuitive at first: we find that, in expectation, the proportion of data points in an unknown population that belong to classes not appearing in the training data is almost entirely determined by the numbers $f_k$ of classes that appear exactly $k$ times in the training data. In theory, we show that the difference of the induced estimator decays exponentially in the size of the sample; in practice, however, its high variance prevents us from using it directly as an estimator of the sample coverage. Nevertheless, our precise characterization of the dependency between the $f_k$'s induces a large search space of different representations of the expected value, each of which can be deterministically instantiated as an estimator. Hence, we turn to optimization and develop a genetic algorithm that, given only the sample, searches for an estimator with minimal mean-squared error (MSE). In our experiments, the genetic algorithm discovers estimators with a substantially smaller MSE than the state-of-the-art Good-Turing estimator in over 96% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.
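To make the quantities concrete, the following is a minimal sketch of how the frequencies of frequencies $f_k$ and the baseline Good-Turing estimate can be computed from a sample. It uses the standard Good-Turing form, in which the unseen proportion is estimated as $f_1 / n$ for a sample of size $n$. The function names and the toy sample are illustrative assumptions, and the genetic-algorithm search described above is not reproduced here.

```python
from collections import Counter


def frequency_of_frequencies(sample):
    """Return a Counter mapping k -> f_k, the number of classes seen exactly k times."""
    class_counts = Counter(sample)          # class -> number of occurrences
    return Counter(class_counts.values())   # k -> number of classes with k occurrences


def good_turing_missing_mass(sample):
    """Standard Good-Turing estimate of the unseen proportion: f_1 / n."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    f = frequency_of_frequencies(sample)
    return f[1] / n                         # Counter returns 0 if no singletons exist


# Hypothetical toy sample: 'a' appears 3 times, 'b' twice, 'c', 'd', 'e' once each.
sample = ["a", "b", "a", "c", "d", "a", "b", "e"]
f = frequency_of_frequencies(sample)
missing_mass = good_turing_missing_mass(sample)
coverage = 1.0 - missing_mass               # estimated sample coverage
```

In this toy sample, three classes are singletons out of eight observations, so the estimated unseen proportion is 3/8 and the estimated sample coverage is 5/8.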