Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling the hierarchical, compositional structure of long-form audio; a promising direction for future work is the development of methods better suited to capturing such complex, long-range dependencies.