Modeling long-term context in videos is crucial for many fine-grained tasks, including temporal action segmentation. An open question is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, doing so becomes computationally prohibitive for long videos. Recent works on temporal action segmentation therefore combine temporal convolutional networks with self-attention that is computed only within a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we address the question of how much long-term temporal context temporal action segmentation requires by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation: 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
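To make the trade-off between local-window attention and sparse attention concrete, the following minimal sketch compares three attention masks over a sequence of T frames: full attention, attention restricted to a local temporal window, and one possible sparse pattern that adds strided long-range connections to the local window. The sparse pattern, the function names, and the parameters `w` and `stride` are illustrative assumptions; the abstract does not specify the sparse attention actually used by the model.

```python
# Illustrative sketch only: the sparse pattern below (local window plus
# strided frames) is an assumed example, not the model's actual design.

def full_mask(T):
    """Every frame attends to every frame: T * T pairs."""
    return [[True] * T for _ in range(T)]

def local_window_mask(T, w):
    """Frame i attends only to frames at most w steps away."""
    return [[abs(i - j) <= w for j in range(T)] for i in range(T)]

def sparse_mask(T, w, stride):
    """Local window plus every frame whose offset from i is a multiple of
    `stride` (hypothetical pattern reaching across the whole sequence)."""
    return [[abs(i - j) <= w or (i - j) % stride == 0 for j in range(T)]
            for i in range(T)]

def count_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)

def reach(mask, i):
    """Largest temporal distance frame i can attend to in a single layer."""
    return max(abs(i - j) for j, allowed in enumerate(mask[i]) if allowed)

T, w, stride = 12, 1, 4
for name, mask in [("full", full_mask(T)),
                   ("local", local_window_mask(T, w)),
                   ("sparse", sparse_mask(T, w, stride))]:
    print(name, "pairs:", count_pairs(mask), "reach of frame 0:", reach(mask, 0))
```

In this toy setting, the local mask computes far fewer pairs than full attention but frame 0 can only see its immediate neighbor, whereas the assumed sparse pattern costs somewhat more than the local mask while letting frame 0 attend to frames far into the sequence.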