We study rules for predicting which of two distinct probability distributions generated a point in $\mathbb{R}$ drawn from their 50-50 mixture. The predictor has limited (or no) information about the two distributions and receives, as clues, one point drawn at random from each of them. We prove that nearest-neighbor prediction does better than chance when the two distributions are known to be Gaussian but their parameter values are unknown. We conjecture that this result holds for general probability distributions and, furthermore, that the nearest-neighbor rule is optimal in this setting: no other rule can do better when the distributions, their parameters, or both are unknown.
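The setting described above can be illustrated with a small Monte Carlo sketch. This is not the paper's proof or code, only a hedged simulation under stated assumptions: the function names `nn_predict` and `estimate_accuracy`, the tie-breaking choice, and the Gaussian parameters passed in are all hypothetical and chosen for illustration.

```python
import random


def nn_predict(x, clue_a, clue_b):
    """Nearest-neighbor rule: guess the distribution whose clue is closer to x.

    Ties are broken toward "B" (an arbitrary choice; ties have probability
    zero for continuous distributions).
    """
    return "A" if abs(x - clue_a) < abs(x - clue_b) else "B"


def estimate_accuracy(mu_a, sigma_a, mu_b, sigma_b, trials=100_000, seed=0):
    """Estimate how often the nearest-neighbor rule is right for two Gaussians.

    The rule itself never sees the parameters; they are used only to
    simulate the clues and the query point.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        # One clue drawn from each distribution.
        clue_a = rng.gauss(mu_a, sigma_a)
        clue_b = rng.gauss(mu_b, sigma_b)
        # Query point drawn from the 50-50 mixture.
        label = "A" if rng.random() < 0.5 else "B"
        if label == "A":
            x = rng.gauss(mu_a, sigma_a)
        else:
            x = rng.gauss(mu_b, sigma_b)
        if nn_predict(x, clue_a, clue_b) == label:
            correct += 1
    return correct / trials


if __name__ == "__main__":
    # Illustrative parameters only; not taken from the paper.
    print(estimate_accuracy(0.0, 1.0, 3.0, 1.0))
    print(estimate_accuracy(0.0, 1.0, 0.0, 3.0))
```

A simulated accuracy above 0.5 for a given parameter pair is consistent with the Gaussian result stated above, but a simulation of this kind is only a sanity check for particular parameters, not evidence for the general conjecture.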