We revisit outlier hypothesis testing, propose exponentially consistent, low-complexity fixed-length and sequential tests, and show that our tests achieve a better tradeoff between detection performance and computational complexity than existing tests that use exhaustive search. Specifically, in outlier hypothesis testing, one is given a list of observed sequences, most of which are generated i.i.d. from a nominal distribution, while the remaining sequences, called outliers, are generated i.i.d. from a different, anomalous distribution. The task is to identify all outliers when both the nominal and anomalous distributions are unknown. There are two basic settings: fixed-length and sequential. In the fixed-length setting, the sample size of each observed sequence is fixed a priori, while in the sequential setting, the sample size is a random variable that the test designer can choose to ensure reliable decisions. For the fixed-length setting, we strengthen the results of Bu \emph{et al.} (TSP 2019) by i) allowing for scoring functions beyond KL divergence and further simplifying the test design when the number of outliers is known, and ii) proposing a new test, explicitly bounding its detection performance, and characterizing the tradeoff among the exponential decay rates of three error probabilities when the number of outliers is unknown. For the sequential setting, our tests for both cases (known and unknown number of outliers) are novel and enable us to reveal the benefit of sequentiality. Finally, for both fixed-length and sequential settings, we demonstrate the penalty that not knowing the number of outliers imposes on detection performance.
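To make the fixed-length setting with a known number of outliers concrete, the following is a minimal illustrative sketch and not the test proposed in this work. It assumes a finite alphabet, scores each sequence by the KL divergence between its empirical distribution and the pooled empirical distribution of all other sequences, and declares the `s` highest-scoring sequences to be outliers. The function names (`empirical_pmf`, `score_sequences`, `top_s_outliers`) and the particular scoring rule are assumptions made for illustration; the `scoring` argument only hints at how a scoring function other than KL divergence could be plugged in.

```python
import math
from collections import Counter


def empirical_pmf(seq, alphabet):
    """Empirical distribution (type) of a sequence over a finite alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def score_sequences(seqs, alphabet, scoring=kl_divergence):
    """Score each sequence against the average type of all other sequences.

    Illustrative rule (an assumption, not the paper's statistic): a sequence
    whose type is far from the pooled type of the rest gets a large score.
    """
    pmfs = [empirical_pmf(s, alphabet) for s in seqs]
    m = len(pmfs)
    totals = [sum(p[k] for p in pmfs) for k in range(len(alphabet))]
    scores = []
    for p in pmfs:
        # Leave-one-out average of the other m - 1 empirical distributions.
        rest = [(totals[k] - p[k]) / (m - 1) for k in range(len(alphabet))]
        scores.append(scoring(p, rest))
    return scores


def top_s_outliers(seqs, alphabet, s, scoring=kl_divergence):
    """Return the sorted indices of the s sequences with the largest scores."""
    scores = score_sequences(seqs, alphabet, scoring)
    ranked = sorted(range(len(seqs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:s])
```

In this sketch, each sequence is scored once and the scores are sorted, so the work grows roughly linearly in the number of sequences (plus a sort), whereas an exhaustive search would examine every candidate subset of `s` outliers. Calling `top_s_outliers(seqs, alphabet, s)` on a list of equal-length sequences returns the indices declared as outliers.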