Bodily behavioral language is an important social cue, and its automated analysis can improve the understanding capabilities of artificial intelligence systems. Behavioral language cues are also essential for active engagement in user interactions with social agents. Despite progress in computer vision on tasks such as head and body pose estimation, the detection of finer behaviors such as gesturing, grooming, or fumbling remains underexplored. This paper proposes MAGIC-TBR, a multiview attention fusion method that uses a transformer-based approach to combine features extracted from videos with features from their corresponding Discrete Cosine Transform (DCT) coefficients. Experiments on the BBSI dataset demonstrate the effectiveness of the proposed feature fusion with multiview attention. The code is available at: https://github.com/surbhimadan92/MAGIC-TBR
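To make the fusion idea concrete, the following is a minimal sketch of the two ingredients the abstract names: a 2-D DCT of a frame, and attention that lets pixel-domain tokens attend to DCT-domain tokens. It is not the MAGIC-TBR architecture. The function names (`dct_2d`, `cross_attention`), the use of raw frame rows as "tokens", the toy 4x4 frame, and the concatenation step are all illustrative assumptions; the actual model presumably uses learned transformer features, for which the linked repository is the reference.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(frame):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in frame]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    total = sum(e)
    return [a / total for a in e]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy grayscale "frame" (assumption: stands in for real video features).
frame = [[float((r * 4 + c) % 7) for c in range(4)] for r in range(4)]

rgb_tokens = frame            # hypothetical pixel-domain tokens (one per row)
dct_tokens = dct_2d(frame)    # hypothetical frequency-domain tokens

# Pixel-domain tokens query the DCT-domain tokens, then both views are
# concatenated per token (one of several possible fusion choices).
attended = cross_attention(rgb_tokens, dct_tokens, dct_tokens)
fused = [r + a for r, a in zip(rgb_tokens, attended)]
```

Each entry of `fused` holds the original token followed by its DCT-attended counterpart, so a downstream classifier would see both views. In practice a library routine for the DCT and a learned multi-head attention layer would replace these hand-written loops.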