ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising

Contextual advertising serves ads that are aligned with the content the user is viewing. The rapid growth of video content on social platforms and streaming services, along with privacy concerns, has increased the need for contextual advertising. Placing the right ad in the right context creates a seamless and pleasant ad viewing experience, resulting in higher audience engagement and, ultimately, better ad monetization. From a technology standpoint, effective contextual advertising requires a video retrieval system capable of understanding complex video content at a very granular level. Current text-to-video retrieval models based on joint multimodal training demand large datasets and computational resources, which limits their practicality, and they lack the key functionalities required for ad ecosystem integration. We introduce ContextIQ, a multimodal expert-based video retrieval system designed specifically for contextual advertising. ContextIQ utilizes modality-specific experts for video, audio, transcripts (captions), and metadata such as objects, actions, and emotions to create semantically rich video representations. We show that our system, without joint training, achieves better or comparable results to state-of-the-art models and commercial solutions on multiple text-to-video retrieval benchmarks. Our ablation studies highlight the benefits of leveraging multiple modalities for enhanced video retrieval accuracy instead of using a vision-language model alone. Furthermore, we show how video retrieval systems such as ContextIQ can be used for contextual advertising in an ad ecosystem while also addressing concerns related to brand safety and filtering inappropriate content.
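As a rough illustration of retrieval with modality-specific experts and no joint training, the sketch below scores each video separately in every modality's embedding space and then combines the scores. The abstract does not say how ContextIQ aggregates its experts, so the weighted sum used here is an assumption, and the names `cosine`, `retrieve`, and `weights`, along with the toy two-dimensional embeddings, are hypothetical stand-ins for real expert outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vecs, videos, weights, top_k=2):
    """Rank videos by a weighted sum of per-modality similarities.

    query_vecs: modality -> query embedding in that expert's space.
    videos:     video id -> {modality -> video embedding}.
    weights:    modality -> fusion weight (an assumption, not from the paper).
    A modality missing from a video simply contributes nothing to its score.
    """
    scored = []
    for video_id, modal_vecs in videos.items():
        score = 0.0
        for modality, weight in weights.items():
            if modality in modal_vecs and modality in query_vecs:
                score += weight * cosine(query_vecs[modality], modal_vecs[modality])
        scored.append((score, video_id))
    # Highest fused score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Toy data: hypothetical embeddings standing in for real expert outputs.
videos = {
    "cooking_clip": {"video": [1.0, 0.0], "audio": [0.9, 0.1], "transcript": [1.0, 0.0]},
    "car_review": {"video": [0.0, 1.0], "audio": [0.1, 0.9], "transcript": [0.0, 1.0]},
    "silent_recipe": {"video": [0.8, 0.2]},  # no audio or transcript available
}
query = {"video": [1.0, 0.0], "audio": [1.0, 0.0], "transcript": [1.0, 0.0]}
weights = {"video": 0.5, "audio": 0.2, "transcript": 0.3}

ranked = retrieve(query, videos, weights, top_k=2)
print(ranked)
```

Because each expert is scored independently, a video with a missing modality (here `silent_recipe`) can still be ranked, and an expert can be swapped or reweighted without retraining the others.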
