Global-scale video moderation faces a dual challenge: it requires fine-grained multi-modal reasoning, and it must produce interpretable outputs that support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and reuse across multiple tasks. Because existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. With UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and the overkill rate by 37.0%, both in relative terms. In addition, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we reclaimed substantial computational resources and reduced engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business applications.