Detecting dialogue breakdown in real time is critical for conversational AI systems because it allows the system to take corrective action and successfully complete a task. In spoken dialogue systems, breakdown can be caused by a variety of unexpected situations, including high levels of background noise that cause speech-to-text (STT) mistranscriptions, and unexpected user flows. Industry settings such as healthcare in particular require high precision and high flexibility to navigate differently based on the conversation history and dialogue states, which makes accurate breakdown detection both more challenging and more critical. We found that accurate detection requires processing audio inputs together with downstream NLP model inferences on the transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models, achieving an F1 of 69.27.