Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that low-level visual and high-level action latent variables evolve at different rates: low-level visual variables change rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (\textbf{HAL}) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process in which high-level latent action variables govern the dynamics of low-level visual features. To model these differing timescales effectively, we introduce deterministic processes that align these latent variables over time. The \textbf{HAL} model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of the high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the \textbf{HAL} model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
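The timescale intuition behind the sparse transition constraint can be illustrated with a toy example. The sketch below is not the paper's implementation; the latent shapes, segment lengths, and the L1-on-differences penalty (`sparse_transition_penalty`) are illustrative assumptions. It shows only why a sparsity penalty on temporal differences is small for slow, piecewise-constant high-level latents and large for fast-varying low-level ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latents over T frames: low-level visual latents change every frame,
# while high-level action latents switch only at a few transition points.
T, d_low, d_high = 100, 16, 4
z_low = rng.normal(size=(T, d_low))                  # fast-varying
z_high = np.repeat(rng.normal(size=(3, d_high)),     # 3 action segments
                   [40, 35, 25], axis=0)             # slow, piecewise-constant

def sparse_transition_penalty(z):
    """Mean L1 norm of first-order temporal differences; near zero when the
    latent trajectory is piecewise-constant (few transitions)."""
    return np.abs(np.diff(z, axis=0)).sum() / (len(z) - 1)

# The high-level latents incur a far smaller penalty than the low-level ones,
# which is the timescale gap a sparse transition constraint exploits.
print(sparse_transition_penalty(z_high) < sparse_transition_penalty(z_low))  # True
```

In a training objective, a penalty of this form would be weighted and added to the reconstruction loss so that gradient descent pushes the high-level latents toward slow, segment-like dynamics.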