Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
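To make the idea of turning frame-level probability distributions into a grammar-consistent action sequence concrete, here is a minimal sketch. It is not the induction algorithm or the generalized parser proposed in the paper. It assumes a hand-written toy grammar instead of an induced one, enumerates every action sequence the grammar allows up to a length bound instead of parsing, and scores each candidate with a standard monotonic dynamic-programming alignment. All names (`TOY_GRAMMAR`, `FRAMES`, `expand`, `align`, `decode`) and all probabilities are hypothetical and chosen for illustration only.

```python
import math

# Hypothetical toy grammar (an assumption, not induced from data).
# Keys are nonterminals; any symbol that is not a key is an action label.
TOY_GRAMMAR = {
    "S": [["pour", "M", "serve"]],
    "M": [["stir"], ["stir", "M"]],  # recursive rule: one or more "stir"
}


def expand(symbol, grammar, max_len):
    """Enumerate action sequences derivable from `symbol` with at most
    `max_len` actions. Brute force; assumes no left recursion and no
    empty productions, otherwise the recursion would not terminate."""
    if symbol not in grammar:
        return [[symbol]] if max_len >= 1 else []
    results = []
    for rhs in grammar[symbol]:
        partials = [[]]
        for sym in rhs:
            partials = [
                p + tail
                for p in partials
                for tail in expand(sym, grammar, max_len - len(p))
            ]
        results.extend(partials)
    return results


def align(log_probs, actions):
    """Best split of the frames into len(actions) non-empty, ordered
    segments. Returns (log score, per-frame labels)."""
    T, N = len(log_probs), len(actions)
    if N == 0 or N > T:
        return -math.inf, []
    # best[t][n]: best score of frames 0..t with frame t labeled actions[n]
    best = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]  # 1 if frame t starts segment n
    best[0][0] = log_probs[0][actions[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = best[t - 1][n]
            move = best[t - 1][n - 1] if n > 0 else -math.inf
            if move > stay:
                best[t][n], back[t][n] = move, 1
            else:
                best[t][n], back[t][n] = stay, 0
            best[t][n] += log_probs[t][actions[n]]
    # Walk the back-pointers from the last frame to recover the labels.
    labels, n = [], N - 1
    for t in range(T - 1, -1, -1):
        labels.append(actions[n])
        n -= back[t][n]
    return best[T - 1][N - 1], labels[::-1]


def decode(log_probs, grammar, start="S", max_len=6):
    """Pick the grammar-allowed sequence whose best alignment scores highest."""
    candidates = expand(start, grammar, max_len)
    return max((align(log_probs, seq) for seq in candidates), key=lambda r: r[0])


# Made-up frame-level probabilities for six frames (columns: pour, stir, serve).
LABELS = ["pour", "stir", "serve"]
FRAMES = [
    {a: math.log(p) for a, p in zip(LABELS, row)}
    for row in [
        (0.8, 0.1, 0.1),
        (0.7, 0.2, 0.1),
        (0.1, 0.8, 0.1),
        (0.1, 0.4, 0.5),  # noisy frame: "serve" wins frame-wise
        (0.1, 0.8, 0.1),
        (0.1, 0.1, 0.8),
    ]
]

score, labels = decode(FRAMES, TOY_GRAMMAR)
print(labels)
```

In this toy input, taking the most likely label per frame would produce "serve" in the fourth frame, followed by "stir" and "serve" again, which the grammar does not allow. The constrained decoding relabels that frame as "stir". Enumeration grows quickly with the length bound, which is one reason a real system would use a parser rather than this brute-force search.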