Raising the Ceiling: Better Empirical Fixation Densities for Saliency Benchmarking

Empirical fixation densities, spatial distributions estimated from human eye-tracking data, are foundational to saliency benchmarking. They directly shape benchmark conclusions, leaderboard rankings, failure case analyses, and scientific claims about human visual behavior. Yet the standard estimation method, fixed-bandwidth isotropic Gaussian kernel density estimation (KDE), has gone essentially unchanged for decades. This matters now more than ever: as the field shifts toward sample-level evaluation (failure case analysis, inverse benchmarking, per-image model comparison), reliable per-image density estimates become critical. We propose a principled mixture model that combines an adaptive-bandwidth KDE based on Abramson's method, center-bias and uniform components, and a state-of-the-art saliency model, so that it captures different spatial and semantic types of interobserver consistency. All parameters are optimized per image via leave-one-subject-out cross-validation. Our method yields substantially higher interobserver consistency estimates across multiple benchmarks, with median per-image gains of 5-15% in log-likelihood and up to 2 percentage points in AUC. For the most affected images, which are precisely those most relevant to failure case analysis, improvements exceed 25%. We use these improved estimates to identify and analyze remaining failure cases of state-of-the-art saliency models, demonstrating that significant headroom for model improvement remains. More broadly, our findings highlight that empirical fixation densities should not be treated as fixed ground truths but as evolving estimates that improve with better methodology.
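
To make the estimation procedure concrete, the following is a minimal sketch of two of the ingredients named above: an Abramson-style adaptive-bandwidth KDE mixed with a uniform component, scored by leave-one-subject-out log-likelihood. It is an illustration under stated assumptions, not the authors' implementation. The center-bias and saliency-model components are omitted, the grid search stands in for whatever optimizer is actually used, the KDE is not renormalized to the image boundaries, and all function and parameter names (`abramson_bandwidths`, `h0`, `w_uniform`, and so on) are hypothetical.

```python
import numpy as np

def gaussian_kde_at(points, centers, bandwidths):
    """Evaluate an isotropic 2D Gaussian KDE with one bandwidth per center."""
    diff = points[:, None, :] - centers[None, :, :]       # shape (P, C, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                    # shape (P, C)
    h2 = bandwidths[None, :] ** 2
    kernels = np.exp(-0.5 * sq_dist / h2) / (2.0 * np.pi * h2)
    return kernels.mean(axis=1)

def abramson_bandwidths(centers, h0):
    """Abramson's square-root law: h_i = h0 * sqrt(g / pilot(x_i)).

    The pilot density is a fixed-bandwidth KDE with bandwidth h0, and g is
    the geometric mean of the pilot values at the sample points.
    """
    pilot = gaussian_kde_at(centers, centers, np.full(len(centers), h0))
    g = np.exp(np.mean(np.log(pilot)))
    return h0 * np.sqrt(g / pilot)

def mixture_density(points, centers, h0, w_uniform, area):
    """Mix the adaptive KDE with a uniform density over the image area.

    Simplification: the KDE mass falling outside the image is ignored.
    """
    kde = gaussian_kde_at(points, centers, abramson_bandwidths(centers, h0))
    return (1.0 - w_uniform) * kde + w_uniform / area

def loso_log_likelihood(fixations, subjects, h0, w_uniform, area):
    """Average held-out log-likelihood, leaving out one subject at a time."""
    scores = []
    for s in np.unique(subjects):
        held_out = fixations[subjects == s]
        train = fixations[subjects != s]
        p = mixture_density(held_out, train, h0, w_uniform, area)
        scores.append(np.mean(np.log(p)))
    return float(np.mean(scores))

def select_parameters(fixations, subjects, area, h0_grid, w_grid):
    """Per-image grid search (assumed here) over bandwidth and uniform weight."""
    candidates = [(h0, w) for h0 in h0_grid for w in w_grid]
    return max(
        candidates,
        key=lambda c: loso_log_likelihood(fixations, subjects, c[0], c[1], area),
    )
```

Given an `(N, 2)` array of fixation coordinates and a length-`N` array of subject IDs for one image, `select_parameters(fixations, subjects, width * height, [10, 20, 40], [0.05, 0.1, 0.2])` would return the grid point with the best held-out score. The grid values are placeholders. Because the bandwidth shrinks where the pilot density is high and grows where it is low, isolated fixations are smoothed more strongly than fixations inside dense clusters.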
