We introduce LEAP (illustrated in Figure 1), a novel method for generating video-grounded action programs using a Large Language Model (LLM). These action programs represent the motoric, perceptual, and structural aspects of action, and consist of sub-actions, pre- and post-conditions, and control flows. LEAP's action programs are centered on egocentric video and employ recent developments in LLMs both as a source of program knowledge and as an aggregator and assessor of multimodal video information. We apply LEAP to a majority (87%) of the training set of the EPIC Kitchens dataset, and release the resulting action programs as a publicly available dataset (https://drive.google.com/drive/folders/1Cpkw_TI1IIxXdzor0pOXG3rWJWuKU5Ex?usp=drive_link). We employ LEAP as a secondary source of supervision, using its action programs in a loss term applied to action recognition and anticipation networks. We demonstrate sizable improvements in performance on both tasks from training with the LEAP dataset. As of November 17, our method achieves 1st place on the EPIC Kitchens Action Recognition leaderboard among networks restricted to RGB input (see Supplementary Materials).
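To make the structure of an action program concrete, the following is a minimal Python sketch of a container holding the components named above: sub-actions, pre- and post-conditions, and a simple control flow. The class names, field names, textual rendering, and the "wash plate" example are all hypothetical and chosen for illustration only; they are not the format of the released LEAP dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubAction:
    """One step of an action program (all field names are hypothetical)."""
    description: str
    repeat_until: str = ""  # optional loop condition; empty means run once


@dataclass
class ActionProgram:
    """Illustrative container for the components named in the abstract."""
    action: str
    pre_conditions: List[str] = field(default_factory=list)
    sub_actions: List[SubAction] = field(default_factory=list)
    post_conditions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return a readable, line-per-element view of the program."""
        lines = [f"ACTION: {self.action}"]
        lines += [f"  PRE: {c}" for c in self.pre_conditions]
        for step in self.sub_actions:
            if step.repeat_until:
                # A loop is the only control-flow construct in this sketch.
                lines.append(
                    f"  REPEAT {step.description} UNTIL {step.repeat_until}"
                )
            else:
                lines.append(f"  DO {step.description}")
        lines += [f"  POST: {c}" for c in self.post_conditions]
        return "\n".join(lines)


# Made-up example program, not taken from the LEAP dataset.
program = ActionProgram(
    action="wash plate",
    pre_conditions=["plate is dirty", "tap is reachable"],
    sub_actions=[
        SubAction("turn on tap"),
        SubAction("scrub plate", repeat_until="plate is clean"),
        SubAction("turn off tap"),
    ],
    post_conditions=["plate is clean"],
)
print(program.render())
```

Keeping conditions and steps as plain strings reflects the assumption that an LLM produces them as free text; a real pipeline might instead map them onto a fixed vocabulary before using them in a loss term.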