The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-use capabilities, modeling real-world IoT devices remains a difficult, understudied challenge that combines spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic, multi-turn, voice-driven code generation task that operates over IoT devices. We find a significant gap between open- and closed-weight multimodal LLMs on MIST, and even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework for building related datasets, to facilitate research on mixed-initiative voice assistants that reason about physical-world constraints.