While most conversational agents are grounded in either free-text or structured knowledge, many knowledge corpora consist of hybrid sources. This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Specifically, SUQL extends SQL with two free-text primitives, summary and answer, so that information retrieval can be composed arbitrarily with structured data access in a formal, succinct, precise, and interpretable notation. With SUQL, we propose the first semantic parser that can handle hybrid data sources: an LLM with in-context learning. When applied to the HybridQA dataset, our in-context learning-based approach comes within 8.9% exact match and 7.1% F1 of the state of the art (SOTA), which was trained on 62K data samples. More significantly, unlike previous approaches, our technique is applicable to large databases and free-text corpora. We introduce a dataset consisting of crowdsourced questions and conversations on Yelp, a large, real restaurant knowledge base with structured and unstructured data. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a baseline based on linearization.
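To make the composition of free-text primitives with structured filters concrete, the following is a minimal sketch and not the SUQL implementation. It registers two toy functions named `answer` and `summary` as SQLite user-defined functions so they can appear inside an ordinary SQL query. The table schema, the sample rows, and the keyword-matching and first-sentence logic are all assumptions made for illustration; in a real system these primitives would be backed by an LLM or a retrieval model.

```python
import sqlite3


def answer(text: str, question: str) -> str:
    """Toy stand-in for an LLM reader (assumption, not SUQL's logic):
    returns 'Yes' if the last word of the question occurs in the text."""
    keyword = question.rstrip("?").split()[-1].lower()
    return "Yes" if keyword in text.lower() else "No"


def summary(text: str) -> str:
    """Toy stand-in for an LLM summarizer: returns the first sentence."""
    return text.split(".")[0].strip() + "."


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical restaurants table."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python functions to SQL under the primitive names.
    conn.create_function("answer", 2, answer)
    conn.create_function("summary", 1, summary)
    conn.execute(
        "CREATE TABLE restaurants "
        "(name TEXT, cuisine TEXT, rating REAL, reviews TEXT)"
    )
    conn.executemany(
        "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
        [
            ("Luigi's", "italian", 4.5, "Great pasta. Very family-friendly staff."),
            ("Roma Bar", "italian", 4.7, "Loud late-night spot. Strong cocktails."),
            ("Taco Hut", "mexican", 4.2, "Cheap and family-friendly. Fast service."),
        ],
    )
    return conn


# Structured predicates (cuisine, rating) and a free-text predicate
# (answer over reviews) are combined in a single WHERE clause.
QUERY = """
SELECT name, summary(reviews)
FROM restaurants
WHERE cuisine = 'italian'
  AND rating >= 4.0
  AND answer(reviews, 'is this restaurant family-friendly?') = 'Yes'
"""


def run_demo() -> list:
    conn = build_demo_db()
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is only the shape of the query: the free-text call sits next to ordinary column predicates, so the database engine can apply the cheap structured filters while the free-text primitive handles the requirement that no column captures.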