Always-on edge cameras generate continuous video streams in which redundant frames degrade cross-modal retrieval by crowding correct results out of the top-k search results. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index, while a cross-modal adapter and a cloud re-ranker compensate for the compact encoder's weak alignment. The single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, and random sampling) across eight vision-language models (8M to 632M parameters) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M-parameter on-device encoder at an estimated power draw of 2.7 mW.
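The single-pass filtering idea can be illustrated with a minimal sketch. It assumes that novelty is measured by cosine distance between frame embeddings and that a frame is retained only when it lies farther than a threshold from every frame already kept. The function name `epsilon_net_filter`, the helper `_normalize`, the distance measure, and the threshold value are all illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def epsilon_net_filter(
    embeddings: Sequence[Sequence[float]], epsilon: float
) -> List[int]:
    """Greedy single-pass filter (hypothetical sketch).

    Keeps frame i only if its cosine distance to every previously kept
    frame exceeds epsilon. Returns the indices of retained frames.
    """
    kept_indices: List[int] = []
    kept_vectors: List[List[float]] = []
    for index, raw in enumerate(embeddings):
        vec = _normalize(raw)
        # all() over an empty list is True, so the first frame is always kept.
        is_novel = all(
            1.0 - sum(a * b for a, b in zip(vec, kept)) > epsilon
            for kept in kept_vectors
        )
        if is_novel:
            kept_indices.append(index)
            kept_vectors.append(vec)
    return kept_indices


# Toy stream: frames 1, 3, and 4 are near-duplicates of frames 0 or 2.
frames = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999], [1.0, 0.0]]
print(epsilon_net_filter(frames, epsilon=0.1))  # [0, 2]
```

This sketch compares each incoming frame against all retained frames, so its cost grows with the size of the index. An on-device implementation would likely need a bounded or approximate comparison, which is outside the scope of this illustration.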