Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.