Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions in untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are annotated at the frame level. This study targets real-world scenarios in which cameras are distributed to cover a wide area and only weak, video-level labels are available. We propose MultiASL (Multi-view Action Selection Learning), a method that leverages action selection learning to enhance view fusion by selecting the most useful information from the different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder that extracts spatial and temporal features from multi-viewpoint videos. Action selection learning is then applied at the frame level, using pseudo ground truth derived from the weak video-level labels, to identify the frames most relevant for action recognition. Experiments in a real-world office environment on the MM-Office dataset demonstrate that the proposed method outperforms existing methods.
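The core idea of deriving frame-level pseudo ground truth from video-level weak labels can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `select_frames`, the max-pooling view fusion, and the top-k selection rule are simplifying assumptions made for illustration.

```python
import numpy as np

def select_frames(frame_scores, video_labels, k=2):
    """Hypothetical sketch of frame-level action selection from weak labels.

    frame_scores: (views, frames, classes) per-frame class scores from each view.
    video_labels: (classes,) multi-hot weak labels at the video level.
    Returns a (frames,) 0/1 pseudo ground-truth vector.
    """
    # Fuse viewpoints: keep the strongest response per frame across views
    # (an assumed stand-in for the paper's learned view fusion)
    fused = frame_scores.max(axis=0)                # (frames, classes)
    # Mask out classes absent from the weak video-level label
    relevance = (fused * video_labels).sum(axis=1)  # (frames,)
    # Pseudo ground truth: mark the k most relevant frames as action frames
    topk = np.argsort(relevance)[-k:]
    pseudo = np.zeros(frame_scores.shape[1], dtype=int)
    pseudo[topk] = 1
    return pseudo

# Toy example: 2 views, 6 frames, 3 classes; only class 0 is present
scores = np.zeros((2, 6, 3))
scores[0, 2, 0] = 0.9   # view 0 responds strongly at frame 2
scores[1, 5, 0] = 0.8   # view 1 responds strongly at frame 5
labels = np.array([1, 0, 0])
print(select_frames(scores, labels, k=2))  # frames 2 and 5 are selected
```

The 0/1 vector returned here would play the role of the frame-level pseudo labels that supervise action selection learning during training.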