When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but not seen, seen but not heard, or both seen and heard. Fusing scores from an absent modality injects noise, degrading precision below that of the best unimodal system. We propose a query-adaptive framework that detects which modalities are active via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other, and this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (over 12,000 broadcast videos), the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap between fixed fusion and an oracle with ground-truth modality labels (96.6%).
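To make the cross-modal consistency idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each query yields one speaker score and one face score per archive file. The function names, the top-k agreement feature, the fixed threshold (a stand-in for the trained classifiers), the equal-weight fusion, and the peak-score fallback rule are all hypothetical choices made for illustration.

```python
from statistics import mean


def cross_modal_agreement(primary, secondary, k=3):
    """Hypothetical consistency feature: mean secondary-modality score of the
    top-k files ranked by the primary modality, minus the mean secondary
    score over all files. A large positive value means the two agree."""
    top = sorted(range(len(primary)), key=lambda i: primary[i], reverse=True)[:k]
    return mean(secondary[i] for i in top) - mean(secondary)


def adaptive_scores(speaker, face, k=3, threshold=0.1):
    """Pick fusion or a single modality for one query.

    The threshold test stands in for a trained classifier (assumption).
    Returns a (mode, per-file scores) pair.
    """
    speaker_to_face = cross_modal_agreement(speaker, face, k)
    face_to_speaker = cross_modal_agreement(face, speaker, k)

    if speaker_to_face >= threshold and face_to_speaker >= threshold:
        # Both modalities look active: equal-weight score fusion (assumption).
        return "fusion", [(s + f) / 2 for s, f in zip(speaker, face)]

    # Agreement broke down: fall back to one modality. Choosing the one with
    # the higher peak score is an arbitrary illustrative rule, not from the source.
    if max(speaker) >= max(face):
        return "speaker", list(speaker)
    return "face", list(face)


# Target both heard and seen in file 0: the two modalities agree.
mode_both, _ = adaptive_scores(
    [0.9, 0.2, 0.1, 0.15, 0.1], [0.85, 0.1, 0.2, 0.1, 0.15], k=1
)

# Target heard but not seen: face scores are flat noise, agreement is lost.
mode_voice, _ = adaptive_scores(
    [0.9, 0.2, 0.1, 0.15, 0.1], [0.2, 0.25, 0.3, 0.22, 0.28], k=1
)
```

In this toy setup the first query is routed to fusion and the second to the speaker scores alone, which mirrors the behaviour the abstract describes; a real system would replace the threshold with a classifier trained on such features.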