We present CallShield, the first caller identity authentication system that operates entirely at the audio layer, without relying on speech transcription, internet connectivity, or trusted infrastructure. CallShield introduces a real-time neural watermarking technique that embeds and recovers one bit per 40-millisecond frame of live 8 kHz speech. This capability turns the real-time audio channel into a noisy serial communication medium. To transmit data reliably over that medium, CallShield implements a low-bitrate data link protocol that provides basic frame synchronization along with error detection, correction, and recovery. For caller authentication, CallShield adopts a secure and lightweight symmetric-key protocol that relies on pairwise shared secrets among trusted contacts. The system completes the full authentication process in 63 seconds on average, including up to three retransmission attempts, making it suitable for real-time deployment. Extensive experiments under realistic telephony conditions demonstrate that CallShield achieves overall authentication success rates exceeding 99.2% on clean audio and 95% under common distortions, aided by selective retransmission of failed messages. Additionally, CallShield maintains high audio quality, achieving PESQ scores above 4.2 and STOI scores above 0.94 on clean speech, and remains robust across a wide range of channel distortions, validating its practical viability for secure, real-time caller authentication.
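The channel arithmetic implied by these figures, and the general shape of a symmetric-key exchange over such a channel, can be sketched as follows. This is a minimal illustration and not CallShield's actual protocol: the abstract gives only the 8 kHz sample rate, the 40 ms frame, and the use of pairwise shared secrets. The choice of HMAC-SHA256, the 8-byte challenge and tag lengths, and all function names below are assumptions made for the example.

```python
import hashlib
import hmac
import os

SAMPLE_RATE_HZ = 8000   # from the abstract
FRAME_MS = 40           # from the abstract: one watermark bit per frame

# Derived channel parameters.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples carrying one bit
RAW_BITS_PER_SECOND = 1000 // FRAME_MS                   # raw bitrate before any coding


def seconds_on_air(n_bits: int) -> float:
    """Time needed to send n_bits at one bit per frame, ignoring overhead."""
    return n_bits / RAW_BITS_PER_SECOND


def make_response(shared_key: bytes, challenge: bytes, tag_bytes: int = 8) -> bytes:
    """Hypothetical responder step: truncated HMAC-SHA256 over the challenge.

    Truncation is assumed here because every bit costs 40 ms of call time.
    """
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()[:tag_bytes]


def verify_response(shared_key: bytes, challenge: bytes, response: bytes) -> bool:
    """Hypothetical verifier step: recompute the tag and compare in constant time."""
    expected = make_response(shared_key, challenge, len(response))
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    key = os.urandom(32)        # pairwise secret shared between two contacts
    challenge = os.urandom(8)   # fresh nonce chosen by the verifier
    tag = make_response(key, challenge)
    print(SAMPLES_PER_FRAME, RAW_BITS_PER_SECOND)
    print(verify_response(key, challenge, tag))
    print(seconds_on_air(8 * (len(challenge) + len(tag))))
```

Under these assumed sizes, a 64-bit challenge plus a 64-bit tag occupies 128 frames, or 5.12 seconds of raw air time, before any synchronization, error-correction, or retransmission overhead is added. That overhead is one plausible reason a full authentication takes far longer than the payload alone would suggest.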