Universal outlier hypothesis testing refers to a hypothesis testing problem where one observes a large number of length-$n$ sequences -- the majority of which are distributed according to the typical distribution $\pi$, while a small number are distributed according to the outlier distribution $\mu$ -- and wishes to decide which of these sequences are outliers without knowledge of $\pi$ and $\mu$. In contrast to previous work, this paper assumes that both the number of observation sequences and the number of outlier sequences grow with the sequence length. In this case, the typical distribution $\pi$ can be estimated by computing the mean over all observation sequences, provided that the number of outlier sequences is sublinear in the total number of sequences. It is shown that a test based on this mean estimate achieves the error exponent of the maximum likelihood test that has access to both $\pi$ and $\mu$. However, this mean-based test performs poorly when the number of outlier sequences is proportional to the total number of sequences. For this case, a median-based test is proposed that estimates $\pi$ as the median of all observation sequences. It is shown that the median-based test again achieves the error exponent of the maximum likelihood test that has access to both $\pi$ and $\mu$, but only with probability approaching one. To formalize this case, the typical error exponent -- similar to the typical random coding exponent introduced in the context of random coding for channel coding -- is proposed.
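The difference between the mean-based and median-based estimates of the typical distribution can be illustrated with a minimal Python sketch. It is not the test analyzed in the paper: the abstract does not specify the decision rule, so the sketch assumes a finite alphabet `{0, ..., alphabet_size - 1}`, assumes the number of outliers is known, and ranks sequences by the Kullback-Leibler divergence between their empirical distribution and the estimate of the typical distribution. All function names, the coordinate-wise median, and its renormalization are hypothetical choices made for illustration.

```python
import math
import random
from statistics import median


def empirical_pmf(seq, alphabet_size):
    """Empirical distribution (type) of one sequence over {0, ..., alphabet_size - 1}."""
    counts = [0] * alphabet_size
    for x in seq:
        counts[x] += 1
    return [c / len(seq) for c in counts]


def mean_estimate(pmfs):
    """Estimate the typical distribution as the average of all empirical distributions."""
    m = len(pmfs)
    return [sum(p[a] for p in pmfs) / m for a in range(len(pmfs[0]))]


def median_estimate(pmfs):
    """Assumed variant: coordinate-wise median, renormalized to sum to one."""
    med = [median(p[a] for p in pmfs) for a in range(len(pmfs[0]))]
    total = sum(med)
    return [v / total for v in med]


def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pa * math.log(pa / max(qa, eps)) for pa, qa in zip(p, q) if pa > 0)


def flag_outliers(seqs, alphabet_size, num_outliers, estimator):
    """Flag the num_outliers sequences whose empirical distribution is farthest
    (in KL divergence) from the estimated typical distribution.
    Assumption: num_outliers is known to the test."""
    pmfs = [empirical_pmf(s, alphabet_size) for s in seqs]
    pi_hat = estimator(pmfs)
    scores = [kl(p, pi_hat) for p in pmfs]
    ranked = sorted(range(len(seqs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_outliers])


def simulate(pi, mu, n, num_seqs, num_outliers, rng):
    """Draw num_seqs i.i.d. length-n sequences; the first num_outliers follow mu."""
    symbols = range(len(pi))
    return [
        rng.choices(symbols, weights=(mu if i < num_outliers else pi), k=n)
        for i in range(num_seqs)
    ]
```

As a usage example, `flag_outliers(simulate(pi, mu, 200, 20, 3, random.Random(0)), 3, 3, median_estimate)` ranks 20 simulated sequences and returns the indices of the three flagged as outliers. Passing `mean_estimate` instead uses the averaged estimate, which is pulled toward $\mu$ as the fraction of outliers grows.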