Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are all known to belong to \emph{one} of the two classes. Special cases of this problem are well known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. Recent work discovered a fundamental trade-off between $m$ and $n$: increasing the size $m$ of the experimental sample reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes, a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of DDPM-generated images planted among CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.
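
For concreteness, we recall the standard definition of the MMD; the notation here ($P$, $Q$, $k$, $X_i$, $Y_i$, $Z_j$, $T$) is assumed for illustration and is not fixed by the text above. For a positive-definite kernel $k$ and distributions $P$, $Q$, with $X,X'\sim P$ and $Y,Y'\sim Q$ all independent,
\[
\mathrm{MMD}^2(P,Q) \;=\; \mathbb{E}\,k(X,X') \;+\; \mathbb{E}\,k(Y,Y') \;-\; 2\,\mathbb{E}\,k(X,Y).
\]
One natural decision rule, given only as a sketch and not necessarily the test analyzed in this work, compares the unlabeled sample $Z_1,\dots,Z_m$ with the labeled samples $X_1,\dots,X_n\sim P$ and $Y_1,\dots,Y_n\sim Q$ (assuming, for simplicity, $n$ observations per class) through plug-in estimates of the two discrepancies:
\[
T \;=\; \widehat{\mathrm{MMD}}^2(Z,X) \;-\; \widehat{\mathrm{MMD}}^2(Z,Y),
\qquad
\text{label the } Z_j \text{ as class } Q \text{ if } T>0 \text{ and as class } P \text{ otherwise.}
\]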