Disease control experts inspect public health data streams daily for outliers worth investigating, such as those corresponding to data quality issues or disease outbreaks. However, they can examine only a few of the thousands of maximally tied outliers returned by univariate outlier detection methods applied to large-scale public health data streams. To help experts distinguish the most important outliers among these thousands of ties, we propose a new task: ranking the outputs of any univariate method applied to each of many streams. Our novel algorithm for this task, which leverages hierarchical networks and extreme value analysis, performed best across traditional outlier detection metrics in a human-expert evaluation using public health data streams. Most importantly, experts have used our open-source Python implementation since April 2023 and report identifying outliers worth investigating 9.1x faster than their prior baseline. Other organizations can readily adapt this implementation to create rankings from the outputs of their tailored univariate methods across large-scale streams.
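To make the ranking task concrete, the following minimal sketch shows how a univariate method applied independently to each stream can return maximally tied scores, and how a secondary criterion can order the ties. It is an illustration under stated assumptions, not the hierarchical-network and extreme-value algorithm described above: the smoothed empirical p-value, the `tail_excess` tie-breaker, the function names, and the toy county data are all hypothetical.

```python
from statistics import median


def empirical_p_value(history, latest):
    """Univariate score: smoothed share of past values at least as large as the latest."""
    exceed = sum(1 for v in history if v >= latest)
    return (exceed + 1) / (len(history) + 1)


def tail_excess(history, latest):
    """Hypothetical tie-breaker (an assumption, not the paper's method):
    distance of the latest value beyond the historical maximum,
    in units of the max-to-median spread."""
    top = max(history)
    spread = top - median(history)
    # Fall back to raw units when the history has no spread above its median.
    return (latest - top) / (spread if spread > 0 else 1.0)


def rank_streams(streams):
    """Order stream names by p-value (ascending), then by tail excess (descending)."""
    scored = {
        name: (empirical_p_value(hist, latest), tail_excess(hist, latest))
        for name, (hist, latest) in streams.items()
    }
    return sorted(scored, key=lambda name: (scored[name][0], -scored[name][1]))


# Toy data: each stream maps to (historical values, latest observation).
streams = {
    "county_a": ([10, 12, 11, 13, 12], 14),
    "county_b": ([10, 12, 11, 13, 12], 40),
    "county_c": ([10, 12, 11, 13, 12], 12),
}

ranking = rank_streams(streams)
print(ranking)
```

In this toy data, `county_a` and `county_b` both exceed every historical value, so the univariate score alone cannot separate them; any method for the ranking task must supply the additional ordering, here approximated by a crude tail-distance measure.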