Video accounts for the vast majority of bits generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud, which gives access to GPUs capable of processing high-resolution video in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication rules out cloud offloading for certain real-time applications. The alternative is fully local inference, but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution and leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round-trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, an improvement equivalent to using a model ten times larger. We release our training code, pretrained models, and Python library at https://github.com/InterDigitalInc/dedelayed.
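The timing relationship between the delayed remote model and the current-frame local model can be illustrated with a toy simulation. This is only a sketch under stated assumptions and is not the Dedelayed library API: the function names (`delay_in_frames`, `remote_model`, `local_model`, `run_stream`), the 30 fps frame rate, the use of plain numbers as stand-ins for frames, and the averaging "fusion" are all hypothetical, and the rate-limiting autoencoder is omitted entirely.

```python
import math
from collections import deque


def delay_in_frames(round_trip_ms: float, fps: float) -> int:
    """Whole frames by which a remote result lags the frame it was computed from."""
    return math.ceil(round_trip_ms * fps / 1000.0)


def remote_model(frame: float, horizon: int, anticipate: bool) -> float:
    # Hypothetical stand-in for the remote model. Frames in this toy are
    # numbers that grow by 1 per step, so "anticipating" the frame that
    # will be current on arrival is simply frame + horizon.
    return frame + horizon if anticipate else frame


def local_model(frame: float, remote_hint) -> float:
    # Hypothetical stand-in for the local model: average the current
    # frame with the remote hint, or fall back to the frame alone.
    if remote_hint is None:
        return float(frame)
    return 0.5 * frame + 0.5 * remote_hint


def run_stream(frames, round_trip_ms=100.0, fps=30.0, anticipate=True):
    k = delay_in_frames(round_trip_ms, fps)
    in_flight = deque()  # (arrival frame index, remote prediction)
    outputs = []
    for t, frame in enumerate(frames):
        # The result for the frame sent now arrives k frames later.
        in_flight.append((t + k, remote_model(frame, k, anticipate)))
        hint = None
        # Take the most recent remote result that has already arrived.
        while in_flight and in_flight[0][0] <= t:
            _, hint = in_flight.popleft()
        outputs.append(local_model(frame, hint))
    return outputs


if __name__ == "__main__":
    frames = [0, 1, 2, 3, 4, 5]
    print(run_stream(frames, anticipate=True))
    print(run_stream(frames, anticipate=False))
```

In this toy, a 100 ms round trip at the assumed 30 fps corresponds to a lag of 3 frames. With `anticipate=False` the remote hint describes a frame that is 3 steps stale when it arrives, which pulls the fused output away from the current frame; with `anticipate=True` the hint targets the frame that will be current on arrival, so the fused output stays aligned.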