The explosive growth of user-generated video content on online platforms has been accompanied by a flood of near-duplicate videos: videos that are identical or highly similar to one another and differ only by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks struggle to retrieve enough high-quality candidates under a limited index budget, and they must trade efficiency against precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework built on Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level embeddings and sparse clip-level embeddings: the sparse embeddings support efficient candidate retrieval, while the fine-grained embeddings are loaded only for precise pairwise matching. For the matching stage, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module that locates duplicated temporal segments and provides reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces the online repetition rate by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
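The two-stage flow described above (coarse retrieval over clip-level embeddings, then fine matching over frame-level embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: mean pooling stands in for the learned ML-VE clip embedding, and a thresholded diagonal-run search over the frame similarity matrix stands in for DiF-SiM. All function names, the threshold, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def retrieve_candidates(query_clip, index_clips, top_k=2):
    """Stage 1 (hypothetical): rank indexed videos by cosine similarity
    between clip-level embeddings and return the top_k video ids."""
    sims = l2_normalize(index_clips) @ l2_normalize(query_clip)
    order = np.argsort(-sims)[:top_k]
    return [int(i) for i in order]


def longest_matching_segment(query_frames, cand_frames, threshold=0.9):
    """Stage 2 (hypothetical stand-in for DiF-SiM): find the longest run of
    temporally aligned frame pairs whose cosine similarity exceeds threshold.

    Returns (query_start, cand_start, length); length 0 means no match.
    """
    sim = l2_normalize(query_frames) @ l2_normalize(cand_frames).T
    match = sim > threshold
    n, m = match.shape
    # run[i + 1, j + 1] holds the length of the diagonal run ending at (i, j).
    run = np.zeros((n + 1, m + 1), dtype=int)
    best = (0, 0, 0)
    for i in range(n):
        for j in range(m):
            if match[i, j]:
                length = int(run[i, j]) + 1
                run[i + 1, j + 1] = length
                if length > best[2]:
                    best = (i - length + 1, j - length + 1, length)
    return best


# Synthetic data (assumption): 3 indexed videos, 20 frames each, 64-dim embeddings.
rng = np.random.default_rng(0)
index_frames = rng.standard_normal((3, 20, 64))

# The query reuses frames 2..16 of video 1 with slight noise, padded with
# unrelated frames before and after, mimicking a partial edit.
shared = index_frames[1, 2:17]
query_frames = np.concatenate([
    rng.standard_normal((3, 64)),
    shared + 0.05 * rng.standard_normal(shared.shape),
    rng.standard_normal((2, 64)),
])

# Mean pooling is a placeholder for a learned clip-level embedding.
index_clips = index_frames.mean(axis=1)
query_clip = query_frames.mean(axis=0)

candidates = retrieve_candidates(query_clip, index_clips, top_k=2)
segment = longest_matching_segment(query_frames, index_frames[candidates[0]])
print("candidates:", candidates)
print("segment (query_start, cand_start, length):", segment)
```

Only the compact clip-level vectors are touched in the first stage, which is what lets a fixed index budget cover more videos; the heavier frame-level embeddings are read only for the few candidates that survive retrieval. The returned segment is the kind of localized evidence a downstream deduplication policy could consume, for example by requiring a minimum duplicated length before taking action.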