Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address the data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that simultaneously learns disentangled verb, preposition, and noun representations, and then uses them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We then decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for rare classes in a free-form manner, which is not restricted to the rigid form of a verb and a noun. The proposed FFCN directly generates new training samples for rare classes, and hence significantly improves action recognition performance. We evaluate our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method in handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.
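To make the idea of composing samples in the feature space concrete, the following is a minimal sketch and not the FFCN implementation. The abstract does not specify how the disentangled representations are combined, so the concatenation operator, the uniform random sampling, the zero padding for a missing component, and the names `compose_samples` and `random_bank` are all assumptions made for illustration. Random vectors stand in for learned verb, preposition, and noun features.

```python
import random


def compose_samples(verb_bank, prep_bank, noun_bank, target, n_samples, rng):
    """Compose synthetic feature vectors for one (verb, preposition, noun) class.

    Each bank maps a label to a list of equal-length feature vectors gathered
    from any training video containing that label (hypothetical layout).
    """
    composed = []
    for _ in range(n_samples):
        sample = []
        for bank, label in zip((verb_bank, prep_bank, noun_bank), target):
            dim = len(next(iter(bank.values()))[0])
            if label is None:
                # Class has no such component: pad with zeros (assumption).
                sample.extend([0.0] * dim)
            else:
                # Reuse a component feature drawn from any video with this label.
                sample.extend(rng.choice(bank[label]))
        composed.append(sample)
    return composed


def random_bank(labels, count, dim, rng):
    """Build a stand-in feature bank filled with Gaussian noise."""
    return {
        label: [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
        for label in labels
    }


rng = random.Random(0)
DIM = 4
verb_bank = random_bank(["put", "take"], 5, DIM, rng)
prep_bank = random_bank(["into", "from"], 5, DIM, rng)
noun_bank = random_bank(["cup", "box"], 5, DIM, rng)

# "take ... into box" is treated as a rare class whose parts occur elsewhere.
synthetic = compose_samples(
    verb_bank, prep_bank, noun_bank, ("take", "into", "box"), 8, rng
)
print(len(synthetic), len(synthetic[0]))  # 8 12
```

Because each component is drawn independently, a class that is rare as a whole can still receive many synthetic samples as long as its verb, preposition, and noun each appear in more frequent classes.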