Robust Nonparametric Two-Sample Tests via Mutual Information using Extended Bregman Divergence

We introduce a generalized formulation of mutual information (MI) based on the extended Bregman divergence, a framework that subsumes the generalized S-Bregman (GSB) divergence family. The GSB divergence unifies two important classes of statistical distances, namely the S-divergence and the Bregman exponential divergence (BED), thereby encompassing several widely used subfamilies, including the power divergence (PD), density power divergence (DPD), and S-Hellinger distance (S-HD). In parametric inference, minimum divergence estimators are well known to balance robustness with high asymptotic efficiency relative to the maximum likelihood estimator. Nonparametric tests based on such statistical distances, however, have received comparatively little attention. In this paper, we use the generalized MI to construct a class of consistent and robust nonparametric two-sample tests for the equality of two absolutely continuous distributions. We establish the asymptotic normality of the proposed test statistics under the null and under contiguous alternatives. We study the robustness of the generalized MI rigorously through its influence function and breakdown point, and show that stability of the generalized MI translates into stability of the associated tests. Extensive simulation studies show that divergences beyond the PD family often yield superior robustness under contamination while retaining high asymptotic power. We also propose a data-driven scheme for selecting optimal tuning parameters. Finally, we illustrate the methodology with applications to real data.
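To make the general idea of an MI-based two-sample test concrete, the sketch below pools the two samples, attaches a binary sample label, and tests whether the observation and its label are independent. It is only an illustration under stated assumptions: it uses the classical Shannon MI with a histogram plug-in estimate and a permutation null, not the generalized MI built on the extended Bregman divergence, and not the asymptotic normal calibration that the abstract describes. The function names `plugin_mi` and `mi_permutation_test` and the default `bins=10` are hypothetical choices made for this example.

```python
import numpy as np


def plugin_mi(z, labels, bins=10):
    """Histogram plug-in estimate of the Shannon MI (in nats) between
    a univariate sample z and a binary label (assumed estimator, not the paper's)."""
    edges = np.histogram_bin_edges(z, bins=bins)
    # Joint counts: one row per label (0 or 1), one column per histogram bin.
    joint = np.vstack(
        [np.histogram(z[labels == k], bins=edges)[0] for k in (0, 1)]
    ).astype(float)
    joint /= joint.sum()
    pz = joint.sum(axis=0, keepdims=True)  # marginal over bins, shape (1, bins)
    pl = joint.sum(axis=1, keepdims=True)  # marginal over labels, shape (2, 1)
    mask = joint > 0                       # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pl @ pz)[mask])))


def mi_permutation_test(x, y, n_perm=500, bins=10, seed=0):
    """Two-sample test: under equal distributions the label is independent
    of the pooled observation, so the MI should be close to zero."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    labels = np.concatenate([np.zeros(len(x), dtype=int), np.ones(len(y), dtype=int)])
    observed = plugin_mi(z, labels, bins)
    # Permuting the labels simulates the null hypothesis of equal distributions.
    perm = np.array(
        [plugin_mi(z, rng.permutation(labels), bins) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(perm >= observed)) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = rng.normal(0.0, 1.0, size=200)
    y = rng.normal(1.0, 1.0, size=200)  # location-shifted alternative
    stat, p = mi_permutation_test(x, y)
    print(f"MI estimate = {stat:.4f}, permutation p-value = {p:.4f}")
```

A robust variant in the spirit of the abstract would replace the logarithmic divergence inside `plugin_mi` with a member of the GSB family and select its tuning parameters from the data; those details are not given in the abstract, so they are left out of this sketch.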
