As robustness becomes a vital issue for AI systems, statistical learning techniques that remain reliable even in the presence of partly contaminated data must be developed. Preference data, which in the simplest situations take the form of (complete) rankings, are no exception, and the demand for appropriate concepts and tools is all the more pressing given that technologies fed by or producing this type of data (e.g. search engines, recommender systems) are now massively deployed. However, the lack of a vector space structure on the set of rankings (i.e. the symmetric group $\mathfrak{S}_n$) and the complex nature of the statistics considered in ranking data analysis make the formulation of robustness objectives in this domain challenging. In this paper, we introduce notions of robustness, together with dedicated statistical methods, for Consensus Ranking, the flagship problem in ranking data analysis, which aims to summarize a probability distribution on $\mathfrak{S}_n$ by a median ranking. Precisely, we propose specific extensions of the popular concept of breakdown point, tailored to consensus ranking, and address the related computational issues. Beyond the theoretical contributions, the relevance of the proposed approach is supported by an experimental study.
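To make the notion of a median ranking concrete, the following is a minimal sketch, not the method of the paper. It assumes the Kendall tau distance as the dissimilarity between rankings (the abstract does not name one) and finds an empirical median by brute force over all of $\mathfrak{S}_n$, which is only feasible for very small $n$. The function names `kendall_tau` and `kemeny_median` and the toy samples are hypothetical, chosen for illustration only.

```python
from collections import Counter
from itertools import combinations, permutations


def kendall_tau(sigma, pi):
    """Count the item pairs that sigma and pi order differently.

    A ranking is a tuple in which sigma[i] is the rank given to item i.
    """
    return sum(
        (sigma[i] - sigma[j]) * (pi[i] - pi[j]) < 0
        for i, j in combinations(range(len(sigma)), 2)
    )


def kemeny_median(rankings):
    """Return a ranking minimizing the summed Kendall tau distance to the sample.

    Brute force over all n! permutations (assumption: n is tiny).
    Ties are broken by the lexicographic order of itertools.permutations.
    """
    n = len(rankings[0])
    counts = Counter(rankings)  # group identical rankings to save work

    def cost(candidate):
        return sum(w * kendall_tau(candidate, r) for r, w in counts.items())

    return min(permutations(range(n)), key=cost)


# Toy sample on n = 4 items: mostly the identity ranking, with a few
# rankings that swap the first two items.
clean = [(0, 1, 2, 3)] * 7 + [(1, 0, 2, 3)] * 3

# Hypothetical contamination: copies of the fully reversed ranking.
outlier = (3, 2, 1, 0)
lightly_contaminated = clean + [outlier] * 2
heavily_contaminated = clean + [outlier] * 11

median_clean = kemeny_median(clean)
median_light = kemeny_median(lightly_contaminated)
median_heavy = kemeny_median(heavily_contaminated)
```

In this toy setting the median is unchanged by two outliers but is moved all the way to the reversed ranking once the outliers form a majority of the sample. This is only an informal illustration of why the fraction of contamination a median can tolerate is a natural quantity to study; it does not reproduce the breakdown point extensions proposed in the paper.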