Most models of visual attention are aimed at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. We propose Human Attention Transformer (HAT), a single model predicting both forms of attention control. HAT is the new state-of-the-art (SOTA) in predicting the scanpath of fixations made during target-present and target-absent search, and matches or exceeds SOTA in the prediction of taskless free-viewing fixation scanpaths. HAT achieves this new SOTA by using a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a dense-prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes both effectiveness and generality. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios.
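The contrast drawn above between a coarse grid of fixation cells and a dense per-fixation heatmap can be illustrated with a small sketch. This is not HAT's implementation. The image size, grid size, Gaussian width, and all function names below are hypothetical choices made only for illustration. The sketch shows that snapping a fixation to a grid cell discards its sub-cell position, whereas a pixel-level heatmap keeps it.

```python
import numpy as np

# All sizes are illustrative assumptions, not values taken from the paper.
WIDTH, HEIGHT = 512, 320   # image size in pixels
COLS, ROWS = 32, 20        # coarse grid, giving 16x16-pixel cells


def grid_quantize(x, y, width, height, cols, rows):
    """Snap a fixation to the center of the coarse grid cell that contains it."""
    cell_w, cell_h = width / cols, height / rows
    col = min(int(x // cell_w), cols - 1)
    row = min(int(y // cell_h), rows - 1)
    return (col + 0.5) * cell_w, (row + 0.5) * cell_h


def dense_heatmap(x, y, width, height, sigma=8.0):
    """Per-pixel Gaussian heatmap centered on the fixation, normalized to sum to 1."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    heat = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return heat / heat.sum()


def heatmap_argmax(heat):
    """Recover the (x, y) pixel with the highest heatmap value."""
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    return float(col), float(row)


fixation = (203.0, 117.0)  # hypothetical fixation in pixel coordinates

gx, gy = grid_quantize(*fixation, WIDTH, HEIGHT, COLS, ROWS)
heat = dense_heatmap(*fixation, WIDTH, HEIGHT)
dx, dy = heatmap_argmax(heat)

# Euclidean distance between the original fixation and each representation.
grid_error = float(np.hypot(gx - fixation[0], gy - fixation[1]))
dense_error = float(np.hypot(dx - fixation[0], dy - fixation[1]))

print(f"grid cell center: ({gx}, {gy}), error = {grid_error:.2f} px")
print(f"dense heatmap peak: ({dx}, {dy}), error = {dense_error:.2f} px")
```

Under these assumptions the grid representation moves the fixation to the cell center at (200, 120), about 4.2 pixels from its true location, while the peak of the dense heatmap stays at (203, 117). The grid error grows with cell size, which is the information loss that a dense-prediction output avoids.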