Most voice-based conversational agents still rely on pause-and-respond turn-taking, which leaves interactions sounding stiff and robotic. We present RESPOND (Responsive Engagement Strategy for Predictive Orchestration and Dialogue), a framework that brings two staples of human conversation to agents: timely backchannels ("mm-hmm," "right") and proactive turn claims that contribute relevant content before the speaker yields the conversational floor. Built on streaming automatic speech recognition (ASR) and incremental semantics, RESPOND continuously predicts both when and how to interject, enabling fluid, listener-aware dialogue. A defining feature is its designer-facing controllability: two orthogonal dials, Backchannel Intensity (the frequency of acknowledgments) and Turn Claim Aggressiveness (the depth and assertiveness of early contributions), can be tuned to match the etiquette of contexts ranging from rapid ideation to reflective counseling. By coupling predictive orchestration with explicit control, RESPOND offers a practical path toward agents that adapt their conversational footprint to social expectations, advancing the design of more natural and engaging voice interfaces.
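To make the idea of two orthogonal dials concrete, the following is a minimal sketch and not the RESPOND implementation. It assumes that each dial is normalized to [0, 1], that some upstream predictor supplies per-frame scores `p_backchannel` and `p_turn_claim` in [0, 1], and that a dial acts by lowering the firing threshold of its own action. The names `Dials` and `decide`, the threshold rule, and the preset values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dials:
    """Two designer-facing controls; the [0, 1] range is an assumption."""
    backchannel_intensity: float
    turn_claim_aggressiveness: float

    def __post_init__(self):
        for name in ("backchannel_intensity", "turn_claim_aggressiveness"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def decide(p_backchannel: float, p_turn_claim: float, dials: Dials) -> str:
    """Map hypothetical predictor scores in [0, 1] to one of three actions.

    Each dial lowers the threshold of its own action only, so the two
    controls stay independent. A turn claim takes priority over a backchannel.
    """
    claim_threshold = 1.0 - dials.turn_claim_aggressiveness
    backchannel_threshold = 1.0 - dials.backchannel_intensity
    if p_turn_claim > claim_threshold:
        return "claim_turn"
    if p_backchannel > backchannel_threshold:
        return "backchannel"
    return "listen"


# Hypothetical presets for the two contexts named in the abstract.
IDEATION = Dials(backchannel_intensity=0.8, turn_claim_aggressiveness=0.7)
COUNSELING = Dials(backchannel_intensity=0.6, turn_claim_aggressiveness=0.1)

if __name__ == "__main__":
    # Same predictor scores, different etiquette settings.
    print(decide(0.5, 0.5, IDEATION))
    print(decide(0.5, 0.5, COUNSELING))
```

Under these assumptions, identical predictor scores lead to an early turn claim in the ideation preset but only an acknowledgment in the counseling preset, which is the kind of context-dependent behavior the two dials are meant to expose.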