Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, in which categories are observed only through presence--absence across sampling units. Our inferential target is the \emph{maximum unseen probability}, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (a finite, known number of categories) and unbounded alphabets (a countably infinite number of categories under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and establish matching lower bounds demonstrating their near-optimality. We compare the resulting procedures empirically on both simulated and real datasets. Finally, we use these bounds to construct sequential stopping rules with finite-sample guarantees, and demonstrate their robustness to contamination that introduces spurious low-prevalence categories.
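To make the inferential target concrete, the following minimal Python sketch simulates presence--absence data under a Bernoulli product model and computes the maximum unseen probability from the (normally unknown) prevalences. All function names and the notation ($K$ categories, $n$ sampling units, level $\delta$) are illustrative assumptions, not taken from the paper. The last function is an elementary union-bound threshold for the bounded regime, $K(1-t)^n = \delta$, included only as an example of a data-independent bound; it is not the procedure proposed here.

```python
import random


def simulate_incidence(prevalences, n_units, rng):
    """Draw an n_units x K presence/absence matrix.

    Entry [i][j] is 1 with probability prevalences[j], independently
    across units and categories (Bernoulli product model).
    """
    return [[1 if rng.random() < p else 0 for p in prevalences]
            for _ in range(n_units)]


def max_unseen_probability(prevalences, incidence):
    """Largest prevalence among categories absent from every unit.

    Returns 0.0 when every category has been observed at least once.
    """
    seen = [any(row[j] for row in incidence)
            for j in range(len(prevalences))]
    return max((p for p, s in zip(prevalences, seen) if not s), default=0.0)


def union_bound_threshold(n_categories, n_units, delta):
    """Data-independent t solving K * (1 - t)^n = delta, for 0 < delta < 1.

    A category with prevalence above t is missed by all n units with
    probability below (1 - t)^n, so by the union bound over K categories
    the maximum unseen probability exceeds t with probability at most delta.
    """
    return 1.0 - (delta / n_categories) ** (1.0 / n_units)


if __name__ == "__main__":
    rng = random.Random(0)
    prevalences = [0.5, 0.2, 0.05, 0.01]  # hypothetical values
    data = simulate_incidence(prevalences, n_units=30, rng=rng)
    print("max unseen probability:", max_unseen_probability(prevalences, data))
    print("union-bound threshold:", union_bound_threshold(4, 30, 0.05))
```

The threshold depends on $K$, which is why this kind of argument is only available when the number of categories is finite and known.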