Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person to be segmented. We introduce RHAS133, the first dataset for Referring Human Action Segmentation, built from 133 movies and comprising 33 hours of video annotated with 137 fine-grained action classes, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 with VLM-based feature extractors reveals limited performance and weak aggregation of visual cues for the target person. To address this, we propose HopaDIFF, a holistic-partial aware Fourier-conditioned diffusion framework that leverages a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to provide finer-grained control over action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 across diverse evaluation settings. The dataset and code are available at https://github.com/KPeng9510/HopaDIFF.
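To make the idea of Fourier conditioning concrete, the following is a minimal PyTorch sketch of how a frequency-domain condition could be derived from per-frame visual features and fed to a frame-level diffusion denoiser. All names here (fourier_condition, Denoiser, low_freq_ratio) are illustrative assumptions for exposition, not the actual HopaDIFF implementation or API.

```python
import torch
import torch.nn as nn

def fourier_condition(frame_feats: torch.Tensor, low_freq_ratio: float = 0.25) -> torch.Tensor:
    """Keep only low-frequency temporal components of per-frame features.

    frame_feats: (B, T, C) visual features for T frames.
    Returns a tensor of the same shape, usable as an extra conditioning signal.
    """
    spec = torch.fft.rfft(frame_feats, dim=1)                 # (B, T//2+1, C), complex
    keep = max(1, int(spec.size(1) * low_freq_ratio))         # number of frequency bins to keep
    spec = torch.cat([spec[:, :keep],
                      torch.zeros_like(spec[:, keep:])], dim=1)  # zero out high frequencies
    return torch.fft.irfft(spec, n=frame_feats.size(1), dim=1)

class Denoiser(nn.Module):
    """Toy denoiser predicting clean per-frame action logits from a noisy label
    sequence, conditioned on visual features plus their Fourier condition."""
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes + 2 * feat_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, noisy_labels, frame_feats, t):
        cond = fourier_condition(frame_feats)                  # frequency-domain control signal
        t_emb = t.view(-1, 1, 1).expand(-1, frame_feats.size(1), 1).float()
        x = torch.cat([noisy_labels, frame_feats, cond, t_emb], dim=-1)
        return self.net(x)                                     # predicted per-frame action logits
```

In this sketch, retaining only low-frequency temporal components yields a smoothed view of the feature sequence, which the denoiser can use as coarse-to-fine guidance alongside the raw features; the actual framework combines such conditioning with holistic and partial (target-person) input streams.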