As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
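As a rough illustration of the structured trace described above (recording condition, profile evidence, then the same-or-different decision), the following Python sketch shows one way such a trace could be represented. It is not the SpeakerLLM schema or policy: the class names, field names, attribute labels, the `score` input, and the threshold rule in `build_trace` are all hypothetical, chosen only to show profile-level evidence being reported separately from the final decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProfileEvidence:
    """One profile-level cue compared across two utterances (hypothetical schema)."""
    attribute: str    # illustrative label, e.g. "pitch range"
    utterance_a: str  # description for the first utterance
    utterance_b: str  # description for the second utterance

    @property
    def consistent(self) -> bool:
        # Toy rule: the cue is consistent when both descriptions match exactly.
        return self.utterance_a == self.utterance_b


@dataclass
class VerificationTrace:
    """Recording condition, profile evidence, and decision kept as separate fields."""
    recording_condition: str
    profile_evidence: List[ProfileEvidence]
    same_speaker: bool  # final decision, stored apart from the evidence

    def render(self) -> str:
        # Emit the trace in a fixed order: condition, evidence, decision.
        lines = [f"Recording condition: {self.recording_condition}"]
        for ev in self.profile_evidence:
            tag = "consistent" if ev.consistent else "inconsistent"
            lines.append(
                f"Profile evidence: {ev.attribute}: "
                f"{ev.utterance_a} vs {ev.utterance_b} ({tag})"
            )
        verdict = "same speaker" if self.same_speaker else "different speakers"
        lines.append(f"Decision: {verdict}")
        return "\n".join(lines)


def build_trace(
    recording_condition: str,
    evidence: List[ProfileEvidence],
    score: float,
    threshold: float = 0.5,
) -> VerificationTrace:
    """Assumed policy: the decision comes from a scalar score and a threshold.

    Profile evidence is reported in the trace but does not override the decision.
    """
    return VerificationTrace(recording_condition, evidence, score >= threshold)


if __name__ == "__main__":
    trace = build_trace(
        "clean, close-talk",
        [
            ProfileEvidence("pitch range", "low", "low"),
            ProfileEvidence("speaking rate", "fast", "slow"),
        ],
        score=0.8,
    )
    print(trace.render())
```

Keeping the decision in its own field means that a trace can contain an inconsistent profile cue, as in the usage example, while still reporting a same-speaker verdict. This mirrors the separation between profile-level evidence and the final decision, under the assumptions stated above.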