Temporal Action Segmentation (TAS) is a frame-level recognition task for long videos that contain multiple action classes. Current methods for this long-video understanding task typically combine multi-modality action recognition models with temporal models to convert feature sequences into label sequences. This approach can only be applied in offline scenarios, which severely limits the application of TAS. Therefore, this paper proposes an end-to-end method, Streaming Video Temporal Action Segmentation with Reinforcement Learning (SVTAS-RL). The end-to-end SVTAS, which regards TAS as an action segment clustering task, expands the application scenarios of TAS, and RL is used to alleviate the problem of inconsistent optimization objective and direction. In extensive experiments, SVTAS-RL achieves performance competitive with state-of-the-art TAS models on multiple datasets and shows greater advantages on the ultra-long video dataset EGTEA. This indicates that our method can replace all current TAS models end-to-end and that SVTAS-RL is more suitable for long-video TAS. Code is available at https://github.com/Thinksky5124/SVTAS.