Temporal action segmentation is a critical task in video understanding, where the goal is to assign an action label to each frame of a video. Recent methods rely on iterative refinement, but they do not explicitly exploit the hierarchical nature of human actions. In this work, we propose HybridTAS, a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally represents tree-like relationships between embeddings, which lets us guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for temporal action segmentation.
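To make the coarse-to-fine idea concrete, the following is a minimal sketch, not the HybridTAS implementation. It assumes a Poincaré ball of curvature -1 as the hyperbolic model and a linear timestep schedule; the function names `coarse_weight` and `guidance_score`, the prototype points, and the schedule itself are all hypothetical and chosen only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    # Clamp guards against rounding that could push the argument just below 1.
    return math.acosh(max(1.0, arg))


def coarse_weight(t, num_steps):
    """Assumed linear schedule: 1.0 at the highest timestep, 0.0 at t = 0."""
    return t / (num_steps - 1)


def guidance_score(pred, coarse_proto, fine_proto, t, num_steps):
    """Blend distances to a coarse (root-level) and a fine (leaf-level) prototype.

    High timesteps weight the coarse prototype, low timesteps the fine one.
    This blending rule is an illustrative assumption, not the paper's objective.
    """
    w = coarse_weight(t, num_steps)
    d_coarse = poincare_distance(pred, coarse_proto)
    d_fine = poincare_distance(pred, fine_proto)
    return w * d_coarse + (1.0 - w) * d_fine


# Hypothetical prototypes: a coarse category near the origin, a fine class near the boundary.
coarse = [0.1, 0.0]
fine = [0.8, 0.1]
pred = [0.5, 0.0]
for t in (9, 5, 0):
    print(t, guidance_score(pred, coarse, fine, t, num_steps=10))
```

Placing the coarse prototype near the origin and the fine one near the boundary follows a common convention for tree embeddings in the Poincaré ball, where distances grow rapidly toward the boundary; whether HybridTAS uses this layout is not stated in the abstract.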