Following step-by-step procedures is an essential part of many everyday activities. Such procedures provide a guiding framework for achieving goals efficiently, whether assembling furniture or preparing a recipe. However, the longer and more complex a procedural activity is, the more likely errors become. Understanding procedural activities from a sequence of frames is challenging: it demands accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. The dataset contains two distinct types of activity: one in which participants follow the provided recipe instructions, and another in which they deviate from them and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations, and we benchmark the dataset on the following tasks: supervised error recognition, multistep localization, and procedure learning.