Despite recent advances in video-based action recognition and robust spatio-temporal modeling, most proposed approaches rely on abundant computational resources to run large, computation-intensive convolutional or transformer-based neural networks and obtain satisfactory results. This limits the deployment of such models on edge devices with limited power and computing resources. In this work, we investigate an important smart home application, video-based delivery detection, and present a simple and lightweight pipeline for this task that can run on resource-constrained doorbell cameras. Our method relies on motion cues to generate a set of coarse activity proposals, which are then classified by a mobile-friendly 3D CNN. For training, we design a novel semi-supervised attention module that helps the network learn robust spatio-temporal features, and we adopt an evidence-based optimization objective that allows the uncertainty of the network's predictions to be quantified. Experimental results on our curated delivery dataset show that our pipeline is effective and highlight the benefits of our training-phase novelties, which yield considerable performance gains at no additional inference-time cost.
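To make the two main ingredients concrete, the following is a minimal sketch and not the pipeline evaluated in this work. It assumes that motion cues are obtained by simple frame differencing and that the evidence-based objective follows the common Dirichlet formulation of evidential deep learning, where non-negative evidence per class gives concentration parameters alpha = evidence + 1 and the uncertainty is K divided by the sum of alpha. The function names `motion_proposals` and `evidential_uncertainty`, the thresholds, and the use of ReLU to obtain evidence are all hypothetical choices made for illustration.

```python
import numpy as np


def motion_proposals(frames, diff_threshold=10.0, min_length=3):
    """Return coarse (start, end) frame ranges, end exclusive, that contain motion.

    frames: grayscale video of shape (T, H, W).
    min_length: minimum number of consecutive active frame transitions.
    Assumption: motion is scored by mean absolute frame difference.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # One motion score per transition between frame i and frame i + 1.
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    active = motion > diff_threshold

    proposals = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # a run of active transitions begins
        elif not is_active and start is not None:
            if i - start >= min_length:
                # Transitions start..i-1 involve frames start..i.
                proposals.append((start, i + 1))
            start = None
    # Close a run that lasts until the end of the clip.
    if start is not None and len(active) - start >= min_length:
        proposals.append((start, len(active) + 1))
    return proposals


def evidential_uncertainty(logits):
    """Map classifier outputs to expected class probabilities and uncertainty.

    Assumption: evidence = ReLU(logits), alpha = evidence + 1 (Dirichlet),
    expected probability = alpha / sum(alpha), uncertainty = K / sum(alpha).
    """
    logits = np.asarray(logits, dtype=np.float64)
    evidence = np.maximum(logits, 0.0)
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)
    probs = alpha / strength
    num_classes = logits.shape[-1]
    uncertainty = num_classes / strength.squeeze(-1)
    return probs, uncertainty
```

In such a setup, each proposed frame range would be cropped and passed to the 3D CNN, and proposals whose uncertainty is close to 1 (little accumulated evidence for any class) could be rejected or deferred rather than reported as deliveries.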