Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark

The dynamic imbalance between foreground and background is a major challenge in video object counting, and it is usually caused by the sparsity of foreground objects. This imbalance often leads to severe under- and over-prediction, yet it has received little attention in existing work. To tackle this issue, we propose a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework. To capture dynamic variations across frames, we use an optical flow-based temporal collaborative fusion that aligns features to derive multi-frame density residuals, so that the counting accuracy of the current frame is boosted by information from adjacent frames. More importantly, to strengthen the intra-frame representation of dynamic foreground objects, we are the first to take the density map as an auxiliary modality and perform $\mathtt{D}$ensity-$\mathtt{E}$mbedded $\mathtt{M}$asked m$\mathtt{O}$deling ($\mathtt{DEMO}$), a form of multimodal self-representation learning used to regress the density map. However, although $\mathtt{DEMO}$ provides effective cross-modal regression guidance, it also brings in redundant background information and makes it hard to focus on foreground regions. To resolve this dilemma, we further propose an efficient spatial adaptive masking derived from density maps to boost efficiency. In addition, since most existing datasets are limited to human-centric scenarios, we introduce $\textit{DroneBird}$, the first large-scale video bird counting dataset, set in natural scenarios for migratory bird protection. Extensive experiments on three crowd datasets and on $\textit{DroneBird}$ validate the superiority of our method over its counterparts.
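
To make the idea of a mask derived from a density map more concrete, the following is a minimal sketch and not the E-MAC implementation. The abstract does not specify the masking rule, so the sketch assumes one plausible choice: patches are kept visible with probability proportional to the density they contain, so that sparse foreground regions are favored over empty background. The function name `density_guided_mask` and the parameters `patch`, `keep_ratio`, `eps`, and `seed` are hypothetical.

```python
import numpy as np


def density_guided_mask(density, patch=4, keep_ratio=0.25, eps=1e-6, seed=0):
    """Return a boolean (H // patch, W // patch) grid; True marks a visible patch.

    Assumption (not taken from the paper): visible patches are sampled
    without replacement, with probability proportional to per-patch density.
    """
    h, w = density.shape
    gh, gw = h // patch, w // patch

    # Sum the density inside each non-overlapping patch.
    cropped = density[: gh * patch, : gw * patch]
    per_patch = cropped.reshape(gh, patch, gw, patch).sum(axis=(1, 3))

    # eps keeps every weight positive, so background patches can still be drawn.
    weights = per_patch.ravel() + eps
    probs = weights / weights.sum()

    # Number of patches left visible; all others are masked.
    n_keep = max(1, int(round(keep_ratio * gh * gw)))

    rng = np.random.default_rng(seed)
    keep_idx = rng.choice(gh * gw, size=n_keep, replace=False, p=probs)

    visible = np.zeros(gh * gw, dtype=bool)
    visible[keep_idx] = True
    return visible.reshape(gh, gw)
```

Under this assumption, a patch that contains most of the density mass is almost always kept, while the remaining visible slots are filled from low-density patches. A uniform random mask, by contrast, would spend most visible patches on background whenever foreground objects are sparse.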
