Most models of visual attention aim to predict either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT is the new state-of-the-art (SOTA) in predicting the scanpath of fixations made during target-present and target-absent search, and matches or exceeds SOTA in the prediction of taskless free-viewing fixation scanpaths. HAT achieves this by using a novel transformer-based architecture and a simplified foveated retina that together create a spatio-temporal awareness akin to the dynamic visual working memory of humans. Unlike previous methods, which rely on a coarse grid of fixation cells and lose information when fixations are discretized, HAT features a dense-prediction architecture and outputs a dense heatmap for each fixation, thus avoiding fixation discretization. HAT sets a new standard in computational attention, one that emphasizes both effectiveness and generality. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios.
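The information loss attributed to fixation discretization can be illustrated with a small numerical sketch. This is not HAT's implementation: the image size, the grid size, the Gaussian width, and all function names below are hypothetical choices made for illustration, and the dense heatmap is built directly from a known fixation instead of being predicted by a network. The sketch only shows that snapping a fixation to the centre of a coarse grid cell introduces a localization error, whereas a pixel-resolution heatmap can represent the same fixation without that error.

```python
import numpy as np

def to_grid_cell(fix_xy, image_wh, grid_wh):
    """Snap a pixel fixation to the centre of its coarse grid cell."""
    x, y = fix_xy
    w, h = image_wh
    gw, gh = grid_wh
    cell_w, cell_h = w / gw, h / gh
    # Clamp so fixations on the right/bottom border stay inside the grid.
    col = min(int(x // cell_w), gw - 1)
    row = min(int(y // cell_h), gh - 1)
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)

def dense_heatmap(fix_xy, image_wh, sigma=8.0):
    """Pixel-resolution Gaussian heatmap centred on the fixation."""
    x, y = fix_xy
    w, h = image_wh
    xs = np.arange(w)[None, :]   # shape (1, w)
    ys = np.arange(h)[:, None]   # shape (h, 1)
    heat = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat / heat.sum()     # normalise to a probability map

def heatmap_peak(heat):
    """Return the (x, y) location of the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    return (float(col), float(row))

# Illustrative (assumed) sizes: a 512x320 image and a 32x20 grid,
# which gives 16x16-pixel cells.
image_wh = (512, 320)
grid_wh = (32, 20)
fixation = (201, 77)

coarse = to_grid_cell(fixation, image_wh, grid_wh)
dense = heatmap_peak(dense_heatmap(fixation, image_wh))

coarse_err = float(np.hypot(coarse[0] - fixation[0], coarse[1] - fixation[1]))
dense_err = float(np.hypot(dense[0] - fixation[0], dense[1] - fixation[1]))

print("coarse cell centre:", coarse, "error (px):", round(coarse_err, 3))
print("dense heatmap peak:", dense, "error (px):", round(dense_err, 3))
```

With these assumed sizes, the grid representation places the fixation about 5 pixels away from its true location, while the dense representation recovers it exactly. In the worst case the grid error approaches half a cell diagonal, so it grows as the grid becomes coarser.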