Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on either acoustic or semantic cues, which leads to suboptimal accuracy and stability. Recent attempts to endow large language models with full-duplex capabilities, in turn, require costly full-duplex data and incur substantial training and deployment overhead, which limits real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs. shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.
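To make the joint acoustic-linguistic fusion concrete, the following is a minimal illustrative sketch of cross-attention followed by a binary hold/shift prediction. It is not the JAL-Turn implementation: the function names (`cross_attention_fuse`, `shift_probability`), the choice of acoustic frames as queries and linguistic features as keys and values, the residual connection, the omission of learned projection matrices, the mean pooling, and all toy numbers are assumptions introduced here for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention_fuse(acoustic, linguistic):
    """Let each acoustic frame (query) attend over linguistic vectors (keys/values).

    Returns one fused vector per acoustic frame: the frame plus its attended
    linguistic context (a residual connection). Learned query/key/value
    projections are omitted (assumption: identity projections) for brevity.
    """
    d = len(acoustic[0])
    scale = math.sqrt(d)  # scaled dot-product attention
    fused = []
    for q in acoustic:
        weights = softmax([dot(q, k) / scale for k in linguistic])
        context = [sum(w * v[i] for w, v in zip(weights, linguistic))
                   for i in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def shift_probability(fused, weight, bias=0.0):
    """Mean-pool fused frames and apply a logistic layer to obtain P(shift)."""
    d = len(fused[0])
    pooled = [sum(f[i] for f in fused) / len(fused) for i in range(d)]
    return 1.0 / (1.0 + math.exp(-(dot(pooled, weight) + bias)))


# Toy inputs (hypothetical values): 2 acoustic frames, 3 linguistic vectors, dim 3.
acoustic = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
linguistic = [[0.1, 0.0, 0.5], [-0.3, 0.2, 0.1], [0.4, 0.4, 0.0]]

fused = cross_attention_fuse(acoustic, linguistic)
p_shift = shift_probability(fused, weight=[0.5, -0.25, 1.0])
print("shift" if p_shift >= 0.5 else "hold", round(p_shift, 3))
```

In a setting like the one described above, both feature streams would presumably be derived from the shared frozen ASR encoder, so a small fusion head of this kind could run alongside recognition; how the actual system extracts and aligns the two streams is not specified in the abstract.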