While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step toward investigating whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces $\textit{HearSay}$, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline combining automated profiling with human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on $\textit{HearSay}$ yield three critical findings. $\textbf{Significant Privacy Leakage}$: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender recognition and effectively profiling social attributes. $\textbf{Insufficient Safety Mechanisms}$: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. $\textbf{Reasoning Amplifies Risk}$: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs and underscore the urgent need for targeted privacy alignment. The code and dataset are available at https://github.com/JinWang79/HearSay_Benchmark.