In emotion recognition from videos, a key improvement has been to model emotions over time rather than from a single frame. Many architectures address this task, including GRUs, LSTMs, self-attention, Transformers, and Temporal Convolutional Networks (TCNs). However, these methods suffer from high memory usage, a large number of operations, or poor gradients. We propose Neighborhood Attention with Convolutions TCN (NAC-TCN), which combines the benefits of attention and Temporal Convolutional Networks while ensuring that causal relationships are respected, and which reduces computation and memory cost. We accomplish this by introducing a causal version of Dilated Neighborhood Attention and combining it with convolutions. On standard emotion recognition datasets, our model achieves comparable, better, or state-of-the-art performance relative to TCNs, TCAN, LSTMs, and GRUs while requiring fewer parameters. We publish our code online for easy reproducibility and use in other projects.
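To make the idea of causal, dilated neighborhood attention concrete, the following is a minimal NumPy sketch and not the NAC-TCN implementation. It assumes a single attention head, inputs of shape `(T, d)`, and a neighborhood made of the current time step plus up to `kernel_size - 1` past steps spaced `dilation` apart; the function name, the parameter names, and the handling of the sequence start (out-of-range neighbors are simply skipped) are illustrative assumptions.

```python
import numpy as np


def causal_dilated_neighborhood_attention(q, k, v, kernel_size=3, dilation=1):
    """Illustrative sketch (assumed formulation, not the published code).

    q, k: arrays of shape (T, d); v: array of shape (T, d_v).
    Each time step t attends only to steps t, t - dilation, t - 2*dilation, ...
    so no information flows from the future to the past.
    """
    T, d = q.shape
    out = np.zeros(v.shape, dtype=float)
    for t in range(T):
        # Causal neighborhood: current step and past steps only.
        # Neighbors that would fall before the sequence start are skipped.
        idx = [t - j * dilation for j in range(kernel_size) if t - j * dilation >= 0]
        # Scaled dot-product scores against the neighborhood keys.
        scores = k[idx] @ q[t] / np.sqrt(d)
        # Numerically stable softmax over the neighborhood.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # Weighted sum of the neighborhood values.
        out[t] = weights @ v[idx]
    return out


# Usage: a random sequence of 8 time steps with 4 features each.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
out = causal_dilated_neighborhood_attention(q, k, v, kernel_size=3, dilation=2)
```

In this sketch each time step scores at most `kernel_size` keys, so the attention cost grows linearly with the sequence length, whereas full self-attention scores every pair of time steps. This is one way to read the claimed reduction in computation and memory.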