Detecting YouTube Scam Videos via Multimodal Signals and Policy Reasoning

YouTube has become a dominant platform for both information dissemination and entertainment. However, its broad accessibility has also made it a target for scammers, who frequently upload deceptive or malicious content. Prior research has documented a range of scam types, and existing detection approaches rely primarily on textual or statistical metadata. Although effective to some extent, approaches built on these signals are easy to evade and may overlook other modalities, such as visual cues. In this study, we present the first systematic investigation of multimodal approaches for YouTube scam detection. Our dataset consolidates established scam categories and augments them with full-length video content and policy-grounded reasoning annotations. Our experimental evaluation shows that a text-only model using video titles and descriptions (fine-tuned BERT) achieves moderate effectiveness (76.61% F1), with a modest improvement when audio transcripts are incorporated (77.98% F1). In contrast, visual analysis using a fine-tuned LLaVA-Video model yields stronger results (79.61% F1). Finally, a multimodal framework that integrates titles, descriptions, and video frames achieves the highest performance (80.53% F1). Beyond improving detection accuracy, our multimodal framework produces interpretable reasoning grounded in YouTube content policies, which enhances transparency and supports potential applications in automated moderation. Moreover, we validate our approach on in-the-wild YouTube data by analyzing 6,374 videos, thereby contributing a valuable resource for future research on scam detection.
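For readers less familiar with the metric reported above, the following minimal Python sketch shows how a binary F1 score is computed from labels and predictions. It is an illustration only: the abstract does not state whether the reported F1 is binary, macro-averaged, or weighted, so treating "scam" as the single positive class is an assumption, and the function name `f1_score` and the toy labels are hypothetical rather than taken from the study.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Binary F1 for the `positive` class (assumed here to mean 'scam')."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration (1 = scam, 0 = benign).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
print(f"F1 = {f1_score(y_true, y_pred):.2%}")
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by flagging every video as a scam or by flagging almost none, which makes it a common choice for imbalanced detection tasks of this kind.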
