Unified Dynamic Scanpath Predictors Outperform Individually Trained Neural Models

Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. This disregard is especially detrimental to social human-robot interaction, where robots commonly emulate human gaze using heuristics or predefined patterns. However, human gaze patterns are heterogeneous, and varying behaviors can significantly affect the outcomes of such interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction to instead predict scanpaths in videos. Our model learned scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. Introducing fixation history into our models makes it possible to train a single unified model rather than taking the resource-intensive approach of training an individual model for each set of scanpaths. We observed that late neural integration surpasses early fusion when models are trained on a large dataset, as compared with a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par with or better than individually trained models. We hypothesize that this outcome arises because the group saliency representations instill universal attention in the model, while the supervisory signal and fixation history guide it to learn personalized attentional behaviors. This implicit representation of universal attention gives the unified model an advantage over individual models.
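As a rough illustration of what recursively integrating fixation history and social cues through a gating mechanism can look like, the following minimal sketch blends two feature vectors with a per-element sigmoid gate and feeds the result back in as the history for the next frame. It is not the architecture from the paper: the function names `gated_fusion` and `rollout`, the per-element gate, and the weight vectors `w_cue`, `w_hist`, and `bias` are all assumptions made for this example, and the sequential attention component is left out.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(cue, history, w_cue, w_hist, bias):
    """Blend two equal-length feature vectors with a per-element gate.

    All parameters are hypothetical: `cue` stands in for social cue
    features of the current frame, `history` for a fixation-history state.
    """
    fused = []
    for c, h, wc, wh, b in zip(cue, history, w_cue, w_hist, bias):
        gate = sigmoid(wc * c + wh * h + b)  # how much of the cue to let through
        fused.append(gate * c + (1.0 - gate) * h)
    return fused


def rollout(cue_sequence, initial_history, w_cue, w_hist, bias):
    """Fold each frame's cue features into the running history, step by step."""
    history = list(initial_history)
    states = []
    for cue in cue_sequence:
        history = gated_fusion(cue, history, w_cue, w_hist, bias)
        states.append(history)
    return states


if __name__ == "__main__":
    # Untrained (all-zero) gate parameters give a gate of 0.5 everywhere,
    # so each step averages the new cue with the previous state.
    zeros = [0.0]
    print(rollout([[1.0], [1.0]], [0.0], zeros, zeros, zeros))
```

In a trained model the gate parameters would be learned, so the balance between the current frame's cues and the accumulated history could differ per feature and per observer. Here they are fixed by hand only to keep the arithmetic easy to follow.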

