MerkleSpeech: Public-Key Verifiable, Chunk-Localised Speech Provenance via Perceptual Fingerprints and Merkle Commitments

Speech provenance goes beyond detecting whether a watermark is present. Real workflows involve splicing, quoting, trimming, and platform-level transforms that may preserve some regions while altering others. Neural watermarking systems have made strides in robustness and localised detection, but most deployments produce outputs with no third-party verifiable cryptographic proof tying a time segment to an issuer-signed original. Provenance standards like C2PA adopt signed manifests and Merkle-based fragment validation, yet their bindings target encoded assets and break under re-encoding or routine processing. We propose MerkleSpeech, a system for public-key verifiable, chunk-localised speech provenance offering two tiers of assurance. The first, a robust watermark attribution layer (WM-only), survives common distribution transforms and answers "was this chunk issued by a known party?". The second, a strict cryptographic integrity layer (MSv1), verifies Merkle inclusion of the chunk's fingerprint under an issuer signature. The system computes perceptual fingerprints over short speech chunks, commits them in a Merkle tree whose root is signed with an issuer key, and embeds a compact in-band watermark payload carrying a random content identifier and chunk metadata sufficient to retrieve Merkle inclusion proofs from a repository. Once the payload is extracted, all subsequent verification steps (signature check, fingerprint recomputation, Merkle inclusion) use only public information. The result is a splice-aware timeline indicating which regions pass each tier and why any given region fails. We describe the protocol, provide pseudocode, and present experiments targeting very low false positive rates under resampling, bandpass filtering, and additive noise, informed by recent audits identifying neural codecs as a major stressor for post-hoc audio watermarks.
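To make the commitment step concrete, the following is a minimal sketch of Merkle commitment and inclusion checking over per-chunk fingerprints. It is illustrative only and is not the MSv1 specification: SHA-256, the one-byte domain-separation prefixes, the rule of promoting an unpaired node to the next level, and all function names (`leaf_hash`, `node_hash`, `build_levels`, `inclusion_proof`, `verify_inclusion`) are assumptions made for this example. Perceptual fingerprinting, watermark extraction, and the issuer signature over the root are outside its scope; fingerprints are represented by placeholder byte strings.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(fingerprint: bytes) -> bytes:
    # Assumed domain separation: 0x00 prefix for leaves.
    return _sha256(b"\x00" + fingerprint)


def node_hash(left: bytes, right: bytes) -> bytes:
    # Assumed domain separation: 0x01 prefix for internal nodes.
    return _sha256(b"\x01" + left + right)


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, from the leaf hashes up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur) - 1, 2)]
        if len(cur) % 2:
            nxt.append(cur[-1])  # assumed rule: promote an unpaired node unchanged
        levels.append(nxt)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):  # no sibling when the node was promoted
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(fingerprint: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a chunk fingerprint and its proof."""
    acc = leaf_hash(fingerprint)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Placeholder fingerprints standing in for five speech chunks.
fingerprints = [bytes([i]) * 8 for i in range(5)]
levels = build_levels([leaf_hash(f) for f in fingerprints])
root = levels[-1][0]  # in the described system, this root is what the issuer signs

proof = inclusion_proof(levels, 2)
print(verify_inclusion(fingerprints[2], proof, root))
print(verify_inclusion(b"altered chunk", proof, root))
```

In this sketch the verifier needs only the recomputed fingerprint, the proof, and the root, which mirrors the claim that verification after payload extraction relies on public information alone. A real verifier would also check the issuer signature over the root before trusting it.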
