Real-world data often follow an open long-tailed distribution, and building a unified QA model that supports various tasks is vital for practical QA applications. However, extending previous QA approaches is non-trivial, since they either require access to seen tasks with adequate samples or do not explicitly model samples from unseen tasks. In this paper, we define Open Long-Tailed QA (OLTQA) as learning from data with a long-tailed distribution while optimizing performance over both seen and unseen QA tasks. We propose an OLTQA model that encourages knowledge sharing between head, tail, and unseen tasks, and explicitly mines knowledge from a large pre-trained language model (LM). Specifically, we organize our model as a pool of fine-grained components and dynamically combine these components for each input to facilitate knowledge sharing. A retrieve-then-rerank framework is further introduced to select in-context examples, which guide the LM to generate text that expresses knowledge for QA tasks. Moreover, a two-stage training approach is introduced: the framework is first pre-trained by knowledge distillation (KD) from the LM, and the framework and a QA model are then jointly trained through an adaptive mutual KD method. On a large-scale OLTQA dataset that we curate from 43 existing QA datasets, our model consistently outperforms the state-of-the-art. We release the code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/oltqa.
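For illustration only, the retrieve-then-rerank selection of in-context examples can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the released implementation: the bag-of-words cosine retriever and the token-overlap reranker are simple stand-ins for the learned components, and all names (`retrieve`, `rerank`, `build_prompt`, the example pool) are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector as a Counter."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool, k):
    """Stage 1: cheap retrieval of the k pool examples closest to the query.

    Assumption: lexical cosine similarity stands in for a learned retriever.
    """
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex["question"])), reverse=True)
    return ranked[:k]


def rerank(query, candidates, n):
    """Stage 2: rescore the retrieved candidates and keep the top n.

    Assumption: Jaccard token overlap stands in for a learned reranker.
    """
    q_tokens = set(query.lower().split())

    def score(ex):
        ex_tokens = set(ex["question"].lower().split())
        union = q_tokens | ex_tokens
        return len(q_tokens & ex_tokens) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:n]


def build_prompt(query, examples):
    """Concatenate the selected examples and the query into an LM prompt."""
    shots = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])


# Hypothetical toy pool of annotated QA examples.
pool = [
    {"question": "who wrote hamlet", "answer": "Shakespeare"},
    {"question": "who wrote the novel emma", "answer": "Jane Austen"},
    {"question": "what is the capital of france", "answer": "Paris"},
    {"question": "how many legs does a spider have", "answer": "eight"},
]
query = "who wrote macbeth"
selected = rerank(query, retrieve(query, pool, k=3), n=2)
prompt = build_prompt(query, selected)
```

The two stages trade cost for precision: the first pass narrows a large pool with an inexpensive score, and the second applies a finer score to the few survivors before the prompt is assembled for the LM.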