Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal framework that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
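For readers unfamiliar with the metric, the equal error rate (EER) reported above is the operating point at which the false acceptance rate (forged audio accepted) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how an EER can be computed from per-sample scores. It is not VoxAnchor's evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, forged_scores):
    """Return (eer, threshold), assuming higher scores mean 'more likely genuine'.

    Hypothetical helper for illustration only; it sweeps every observed score
    as a candidate threshold and picks the one where FAR and FRR are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(forged_scores))
    best = None
    for t in thresholds:
        # FRR: fraction of genuine samples rejected (score below threshold).
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # FAR: fraction of forged samples accepted (score at or above threshold).
        far = sum(s >= t for s in forged_scores) / len(forged_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy, made-up scores (not data from the paper).
genuine = [0.9, 0.8, 0.7, 0.6]
forged = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(genuine, forged)
print(eer, threshold)  # prints: 0.25 0.65
```

With finite score sets the two error curves rarely cross exactly, so this sketch reports the midpoint of FAR and FRR at the threshold where they are closest. An EER of 0.017 would correspond to roughly 1.7% of samples misclassified in each direction at that threshold.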