Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. Recent work discovered a fundamental trade-off between $m$ and $n$: increasing the size $m$ of the experimental sample reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes, a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM-generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.
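To make the labeling task concrete, the following is a minimal sketch of one simple kernel-based decision rule: assign the unlabeled batch to the class whose labeled sample is closer in (biased) squared MMD. This is an illustration under stated assumptions, not necessarily the statistic analyzed in the paper: the fixed Gaussian kernel, the `bandwidth` parameter, and the function names `gaussian_kernel`, `mmd2_biased`, and `label_batch` are all hypothetical choices made here.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (p x d) and rows of b (q x d)."""
    # Pairwise squared Euclidean distances via the expansion |u-v|^2 = |u|^2 + |v|^2 - 2 u.v
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples a and b."""
    return (
        gaussian_kernel(a, a, bandwidth).mean()
        + gaussian_kernel(b, b, bandwidth).mean()
        - 2.0 * gaussian_kernel(a, b, bandwidth).mean()
    )


def label_batch(x0, x1, z, bandwidth=1.0):
    """Label the whole batch z as class 0 or 1 (assumed rule: nearest class in MMD).

    x0, x1: labeled samples (n x d each) from the two classes.
    z:      unlabeled batch (m x d), assumed to come entirely from one class.
    """
    d0 = mmd2_biased(x0, z, bandwidth)
    d1 = mmd2_biased(x1, z, bandwidth)
    return 0 if d0 <= d1 else 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, d = 200, 50, 2
    x0 = rng.normal(0.0, 1.0, size=(n, d))  # simulated class 0
    x1 = rng.normal(3.0, 1.0, size=(n, d))  # simulated class 1
    z = rng.normal(3.0, 1.0, size=(m, d))   # "experimental" batch drawn from class 1
    print(label_batch(x0, x1, z))
```

In this toy setup the roles of $n$ (rows of `x0` and `x1`) and $m$ (rows of `z`) can be varied independently, which is the regime in which the $m$ vs $n$ trade-off is studied; a learned, neural-network-parameterized kernel would replace `gaussian_kernel`.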