Egocentric temporal action segmentation in videos is a crucial computer vision task with applications in mixed reality, human behavior analysis, and robotics. Although recent research has adopted advanced visual-language frameworks, transformers remain the backbone of action segmentation models, so improving the transformer is a direct way to make these models more robust. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism that adaptively captures hierarchical representations in both local-to-global and global-to-local contexts. Second, we add cross-connections between the encoder and decoder blocks so that the decoder does not lose local context. Additionally, we use state-of-the-art visual-language representation learning techniques to extract richer and more compact features for our transformer. Our approach outperforms other state-of-the-art methods on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools datasets, and ablation studies validate each introduced component. The source code and supplementary materials are publicly available at https://www.sail-nu.com/dxformer.
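To make the local-to-global and global-to-local idea concrete, the following is a minimal sketch, not the released implementation. It assumes that the two contexts correspond to two dilation schedules over the layers, one growing and one shrinking, and that each frame attends to a small set of frames spaced by the layer's dilation. The function names, the power-of-two schedule, and the `half_width` parameter are all hypothetical choices made for illustration.

```python
def dilation_schedules(num_layers):
    """Return two per-layer dilation lists (assumed power-of-two schedule).

    local_to_global grows with depth (1, 2, 4, ...);
    global_to_local is the same list reversed (..., 4, 2, 1).
    """
    local_to_global = [2 ** i for i in range(num_layers)]
    global_to_local = list(reversed(local_to_global))
    return local_to_global, global_to_local


def attended_frames(t, dilation, num_frames, half_width=1):
    """Indices that frame t would attend to under a dilated window.

    The window holds 2 * half_width + 1 positions spaced `dilation`
    frames apart; positions outside the video are dropped.
    """
    offsets = range(-half_width, half_width + 1)
    return [t + k * dilation for k in offsets
            if 0 <= t + k * dilation < num_frames]


if __name__ == "__main__":
    up, down = dilation_schedules(4)
    for layer, (d_up, d_down) in enumerate(zip(up, down)):
        print(layer,
              attended_frames(10, d_up, 100),
              attended_frames(10, d_down, 100))
```

Under these assumptions, the first layer pairs a narrow neighborhood from one schedule with a wide one from the other, and the roles swap by the last layer, so every layer sees both fine and coarse temporal context.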