The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thereby facilitate progress in a specific area. Nonetheless, we point out that existing protocols for action recognition can yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which together cover a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained with both supervised and self-supervised learning. We also report transfer performance under standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark that yields insights for building next-generation spatiotemporal learners. Our dataset, code, and models are released at: https://github.com/AndongDeng/BEAR