Objects in videos typically move continuously and smoothly. We exploit this property in three ways. 1) Improved accuracy: we use object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency: we run the expensive feature computation on only a small subset of all frames. Because neighboring video frames are often redundant, we compute features for a single static keyframe and predict object locations in the subsequent frames. 3) Reduced annotation cost: we annotate only the keyframes and use smooth pseudo-motion between them. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state of the art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes, and the Waymo Open dataset. Our source code is available at https://github.com/L-KID/Videoobject-detection-by-location-anticipation.
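To make the idea of smooth pseudo-motion between annotated keyframes concrete, the following is a minimal sketch that assumes the pseudo-motion is a linear interpolation of box coordinates. The abstract does not specify the interpolation scheme, so the linear form, the `(x1, y1, x2, y2)` box layout, and the function name `interpolate_boxes` are illustrative assumptions rather than the released implementation.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def interpolate_boxes(box_start: Box, box_end: Box, num_between: int) -> List[Box]:
    """Return pseudo-labels for the frames strictly between two keyframes.

    Linear interpolation is an assumption made for illustration; any smooth
    trajectory model could be substituted here.
    """
    if num_between < 0:
        raise ValueError("num_between must be non-negative")

    boxes: List[Box] = []
    for i in range(1, num_between + 1):
        # t runs over (0, 1), excluding the two annotated keyframes.
        t = i / (num_between + 1)
        x1, y1, x2, y2 = (
            (1.0 - t) * a + t * b for a, b in zip(box_start, box_end)
        )
        boxes.append((x1, y1, x2, y2))
    return boxes


if __name__ == "__main__":
    # Two annotated keyframes with three unannotated frames in between.
    for box in interpolate_boxes((0, 0, 10, 10), (4, 8, 14, 18), 3):
        print(box)
```

Under this assumption, only the two keyframe boxes need manual annotation, and the intermediate frames receive interpolated pseudo-labels.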