We study the problem of testing properties of distributions from systematically mislabeled samples. Our goal is to make decisions about unknown probability distributions using a sample that has been collected by a confused collector, such as a machine-learning classifier that has not learned to distinguish all elements of the domain. The confused collector holds an unknown clustering of the domain and an input distribution $\mu$, and provides two oracles: a sample oracle, which produces a sample from $\mu$ that has been labeled according to the clustering; and a label-query oracle, which returns the label of a query point $x$ according to the clustering. Our first set of results shows that identity, uniformity, and equivalence of distributions can be tested efficiently, under the earth-mover distance, with remarkably weak conditions on the confused collector, even when the unknown clustering is adversarial. This requires defining a variant of the distribution testing task (inspired by the recent testable learning framework of Rubinfeld & Vasilyan), in which the algorithm tests a joint property of the distribution and its clustering. As an example, we get efficient testers when the distribution tester is allowed to reject if it detects that the confused collector's clustering is "far" from being a decision tree. The second set of results shows that we can sometimes do significantly better when the clustering is random instead of adversarial. For certain one-dimensional random clusterings, we show that uniformity can be tested under the total variation (TV) distance using $\widetilde O\left(\frac{\sqrt n}{\rho^{3/2} \epsilon^2}\right)$ samples and zero queries, where $\rho \in (0,1]$ controls the "resolution" of the clustering. We improve this to $O\left(\frac{\sqrt n}{\rho \epsilon^2}\right)$ samples when queries are allowed.
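To make the access model concrete, the following is a minimal Python sketch of a confused collector with its two oracles. It is only an illustration under stated assumptions, not the construction used in the results: the domain is taken to be $\{0, \dots, n-1\}$, the hidden clustering is assumed to be a partition into contiguous intervals, the sample oracle is assumed to reveal only the cluster label of each drawn point, and the names `ConfusedCollector`, `cluster_starts`, `sample`, and `label` are hypothetical.

```python
import bisect
import random
from typing import List, Sequence


class ConfusedCollector:
    """Toy confused collector over the domain {0, ..., n-1}.

    Assumption (not from the source): the hidden clustering is a partition
    into contiguous intervals, described by the sorted start index of each
    cluster. Cluster i covers [cluster_starts[i], cluster_starts[i+1]).
    """

    def __init__(self, mu: Sequence[float], cluster_starts: Sequence[int],
                 seed: int = 0) -> None:
        if abs(sum(mu) - 1.0) > 1e-9:
            raise ValueError("mu must sum to 1")
        starts = list(cluster_starts)
        if not starts or starts[0] != 0 or starts != sorted(set(starts)):
            raise ValueError("cluster_starts must be strictly increasing and begin at 0")
        if starts[-1] >= len(mu):
            raise ValueError("every cluster must start inside the domain")
        self._mu = list(mu)
        self._starts = starts
        self._rng = random.Random(seed)
        self.num_queries = 0  # number of label-query oracle calls so far

    def _label_of(self, x: int) -> int:
        # Index of the last cluster whose start is <= x.
        return bisect.bisect_right(self._starts, x) - 1

    def label(self, x: int) -> int:
        """Label-query oracle: return the cluster label of the point x."""
        if not 0 <= x < len(self._mu):
            raise ValueError("query point outside the domain")
        self.num_queries += 1
        return self._label_of(x)

    def sample(self, m: int) -> List[int]:
        """Sample oracle: draw m points from mu and return only their labels."""
        points = self._rng.choices(range(len(self._mu)), weights=self._mu, k=m)
        return [self._label_of(x) for x in points]


# Usage: uniform mu on 4 points, clustered as {0, 1} and {2, 3}.
collector = ConfusedCollector(mu=[0.25] * 4, cluster_starts=[0, 2])
labels = collector.sample(100)
```

A tester in this model never sees `mu` or `cluster_starts` directly; it only calls `sample` and, when queries are allowed, `label`. The `num_queries` counter is included so that sample-only testers can be checked to make zero queries.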