Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the "-1M" suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training increases VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.
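The two headline numbers above are an equal error rate (EER) and a two-way forced-choice accuracy. As a minimal sketch of how such metrics are commonly computed (not the paper's evaluation code), the snippet below assumes that each verification trial yields one similarity score, and that each AC-Verify trial yields one score for the matching description and one for its counterfactual. The function names `equal_error_rate` and `forced_choice_accuracy` are hypothetical and chosen only for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Return (eer, threshold) for verification trials.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    Assumption: a trial is accepted when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted candidate thresholds
    # False acceptance rate: non-target trials that are accepted.
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    # EER is taken at the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2), float(thresholds[idx])


def forced_choice_accuracy(match_scores, counterfactual_scores):
    """Fraction of trials where the matching description outscores the counterfactual.

    Assumption: ties count as errors. Random guessing gives 0.5 in expectation.
    """
    match = np.asarray(match_scores, dtype=float)
    counter = np.asarray(counterfactual_scores, dtype=float)
    return float((match > counter).mean())


# Toy data, invented purely for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)

toy_accuracy = forced_choice_accuracy([0.8, 0.3, 0.9, 0.6], [0.5, 0.4, 0.2, 0.1])
```

Because lower EER is better, an absolute change of 0.31% means a shift of 0.0031 in the value returned by `equal_error_rate`. Under two-way forced choice the chance level is 50%, which is the reference point for reading the 49-77% range reported for the audio language models.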