Voice assistants help users make phone calls, send messages, create events, navigate, and much more. However, assistants have limited capacity to understand their users' context. In this work, we take a step toward addressing this limitation. We explore a new experience that lets users refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. Our focus is reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, a setting similar to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Because consuming pixels directly is costly, our system relies on text extracted from the UI. Our model is modular, which offers flexibility, improved interpretability, and efficient runtime memory utilization.