Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges such as egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks: Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech large language models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, respectively, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.