Temporal grounding is crucial in multimodal learning, but it is challenging to apply to animal behavior data because the annotated moments are sparse and uniformly distributed. To address these challenges, we propose a novel Positional Recovery Training framework (Port), which prompts the model with the start and end times of specific animal behaviors during training. Specifically, \port{} augments the baseline model with a Recovering branch that reconstructs corrupted label sequences and aligns distributions via a Dual-alignment method. This allows the model to focus on the specific temporal regions indicated by the ground-truth information. Extensive experiments on the Animal Kingdom dataset demonstrate the effectiveness of \port{}, which achieves an IoU@0.3 of 38.52. \port{} ranks among the top performers in its sub-track of MMVRAC at the ICME 2024 Grand Challenges.
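For readers unfamiliar with the reported metric, the following is a minimal sketch of how a temporal IoU score and a thresholded recall could be computed. It assumes that IoU@0.3 denotes the percentage of queries whose top-1 predicted span overlaps the ground-truth span with temporal IoU of at least 0.3; the abstract does not spell this out, so that reading is an assumption. The names `temporal_iou` and `recall_at_iou` are hypothetical and are not taken from the \port{} codebase.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) of a moment, in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Span], gts: List[Span], threshold: float = 0.3) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: one predicted span and one ground-truth span per query,
    paired by position in the two lists.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

As a usage example, `recall_at_iou([(0.0, 10.0), (0.0, 1.0)], [(5.0, 15.0), (20.0, 30.0)])` counts the first query as a hit (IoU of 1/3) and the second as a miss (no overlap).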