Successfully handling context is essential for any dialog understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system that is responsible for handling conversational, visual, and background context. In particular, we present different machine learning models to enable handling contextual queries: one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.