Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and competitive diarization accuracy across both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available.
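To make the forced-decoding idea concrete, the following is a minimal sketch of a state machine that constrains generation to the serialized output grammar (speaker tag, then start timestamp, then text, then end timestamp, repeating). All token names, tag strings, and state names here are illustrative assumptions, not the paper's actual vocabulary or implementation; in practice such a machine would be used to mask disallowed logits at each decoding step.

```python
# Hypothetical sketch of state-machine forced decoding for speaker-attributed
# output. Speaker tags "<child>"/"<adult>", timestamp tokens "<|0.00|>", and
# the "<eos>" token are assumed placeholders, not the paper's real tokens.
from enum import Enum, auto

class State(Enum):
    EXPECT_SPEAKER = auto()   # next token must be a speaker tag (or end of sequence)
    EXPECT_START_TS = auto()  # next token must be a start timestamp
    IN_TEXT = auto()          # text tokens, closed by an end timestamp
    DONE = auto()

SPEAKER_TAGS = {"<child>", "<adult>"}

def is_timestamp(tok: str) -> bool:
    return tok.startswith("<|") and tok.endswith("|>")

def allowed(state: State, tok: str) -> bool:
    """Return True if `tok` is structurally valid in `state`."""
    if state is State.EXPECT_SPEAKER:
        return tok in SPEAKER_TAGS or tok == "<eos>"
    if state is State.EXPECT_START_TS:
        return is_timestamp(tok)
    if state is State.IN_TEXT:
        # any text token, or an end timestamp that closes the segment
        return tok not in SPEAKER_TAGS and tok != "<eos>"
    return False

def step(state: State, tok: str) -> State:
    """Advance the state machine after emitting `tok`."""
    if state is State.EXPECT_SPEAKER:
        return State.DONE if tok == "<eos>" else State.EXPECT_START_TS
    if state is State.EXPECT_START_TS:
        return State.IN_TEXT
    if state is State.IN_TEXT:
        return State.EXPECT_SPEAKER if is_timestamp(tok) else State.IN_TEXT
    return state

def validate(tokens: list[str]) -> bool:
    """Check that a full token sequence follows the output grammar."""
    state = State.EXPECT_SPEAKER
    for tok in tokens:
        if not allowed(state, tok):
            return False
        state = step(state, tok)
    return state is State.DONE
```

During beam or greedy decoding, `allowed` would be evaluated against the whole vocabulary to zero out the probability of any token that would break the grammar, which is what guarantees structurally valid speaker-attributed transcripts.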