Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual reasoning process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants also rated each utterance on mechanicalness, expressiveness, intelligibility, clarity, and calmness, and reported confidence in their evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but did influence detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings nonetheless tracked utterance type, indicating implicit discrimination where overt detection failed.