Smartphone users often find it difficult to navigate the many menus needed to perform common tasks, and turn to queries such as "How to block calls from unknown numbers?". Currently, help documents with step-by-step instructions are written manually to aid the user. The user experience can be further enhanced by grounding the instructions in the help document to the UI and overlaying a tutorial on the phone UI. Building such tutorials requires several natural language processing components, including retrieval, parsing, and grounding, but no relevant dataset exists for this task. We therefore introduce UGIF-DataSet, a multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone, containing 4,184 tasks across 8 languages. As an initial approach to this problem, we propose retrieving the relevant instruction steps based on the user's query and parsing the steps using Large Language Models (LLMs) to generate macros that can be executed on-device. The instruction steps are often available only in English, so the challenge includes cross-modal, cross-lingual retrieval of English how-to pages from user queries in many languages, and mapping English instruction steps to a UI in a potentially different language. We compare the performance of different LLMs, including PaLM and GPT-3, and find that the end-to-end task completion rate is 48% for English UI but drops to 32% for other languages. We analyze the common failure modes of existing models on this task and point out areas for improvement.
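The retrieve, parse, and ground stages described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the method evaluated here: retrieval uses simple token overlap rather than a learned cross-lingual retriever, a regular expression stands in for the LLM parser, the `("tap", label)` macro format is hypothetical, and grounding is exact string matching against on-screen labels, so it does not handle a UI in a different language.

```python
import re
from dataclasses import dataclass


@dataclass
class HowTo:
    """A help page: a title plus its ordered instruction steps."""
    title: str
    steps: list[str]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, pages: list[HowTo]) -> HowTo:
    """Toy lexical retrieval: pick the page whose title has the highest
    Jaccard overlap with the query (stand-in for a learned retriever)."""
    q = tokens(query)

    def score(page: HowTo) -> float:
        t = tokens(page.title)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return max(pages, key=score)


def parse_step(step: str):
    """Stand-in for the LLM parser: turn a step containing a quoted UI
    label into a hypothetical ("tap", label) macro, else return None."""
    match = re.search(r'"([^"]+)"', step)
    return ("tap", match.group(1)) if match else None


def ground(macro, ui_labels: list[str]):
    """Return the index of the on-screen element whose label matches the
    macro target (case-insensitive exact match), or None if absent."""
    _action, target = macro
    for index, label in enumerate(ui_labels):
        if label.lower() == target.lower():
            return index
    return None


# Hypothetical example data, not taken from UGIF-DataSet.
pages = [
    HowTo("Block calls from unknown numbers",
          ['Open the Phone app', 'Tap "Settings"', 'Tap "Blocked numbers"']),
    HowTo("Change wallpaper", ['Tap "Wallpaper"']),
]
page = retrieve("How to block calls from unknown numbers?", pages)
macros = [m for m in map(parse_step, page.steps) if m is not None]
element_index = ground(macros[0], ["Search", "settings", "Help"])
```

A step with no quoted label, such as opening an app, yields no macro in this sketch; a real parser would need a richer action vocabulary, and grounding to a non-English UI would need translation or multilingual matching.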