Conversational assistants are increasingly deployed across diverse real-world applications, underscoring the need for advanced multimodal speech modeling. As a natural mode of communication, speech encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. We introduce a data-centric customization approach that efficiently enhances multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that designs auxiliary tasks to exploit a small amount of speech data. Using only 10% of the training data and open-weight models, our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs. Code and data will be released.