Large language models (LLMs), typically formulated as next-word predictors, have excelled across a wide range of NLP tasks. Despite its generality, next-word prediction is often an inefficient formulation for many of these tasks: it demands an extreme scale of model parameters (tens or hundreds of billions) and sometimes yields suboptimal performance. In practice, it is often desirable to build more efficient models. Although less versatile, they still apply to a substantial subset of problems and deliver on-par or even superior performance at much smaller model sizes. In this paper, we propose text alignment as an efficient unified model for a wide range of crucial tasks, including textual entailment, similarity, question answering (and answerability), and factual consistency. Given a pair of texts, the model measures the degree of alignment between their information. We instantiate an alignment model (Align) through lightweight finetuning of RoBERTa (355M parameters) on 5.9M examples from 28 datasets. Despite its compact size, extensive experiments demonstrate the model's efficiency and strong performance: (1) on over 20 datasets spanning the aforementioned tasks, the model matches or surpasses FLAN-T5 models that have around 2x or 10x more parameters, and the single unified model also outperforms task-specific models finetuned on individual datasets; (2) when applied to evaluate the factual consistency of language generation on 23 datasets, our model improves over various baselines, including the much larger GPT-3.5 (ChatGPT) and sometimes even GPT-4; (3) the lightweight model can also serve as an add-on component for LLMs such as GPT-3.5 in question answering tasks, improving the average exact match (EM) score by 17.94 and F1 score by 15.05 by identifying unanswerable questions.
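The add-on use described in point (3) can be pictured as a gate placed in front of an LLM's answer. The sketch below is only an illustration under stated assumptions, not the paper's implementation: `toy_alignment_score` is a hypothetical token-overlap stand-in for the Align model, `gate_answer` is a hypothetical helper name, and the threshold of 0.5 is an assumed value.

```python
from typing import Callable


def toy_alignment_score(context: str, claim: str) -> float:
    """Hypothetical stand-in for an alignment model.

    Returns the fraction of claim tokens that also appear in the context.
    A real alignment model would be a finetuned neural scorer instead.
    """
    context_tokens = set(context.lower().split())
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    hits = sum(1 for tok in claim_tokens if tok in context_tokens)
    return hits / len(claim_tokens)


def gate_answer(
    context: str,
    candidate_answer: str,
    score_fn: Callable[[str, str], float] = toy_alignment_score,
    threshold: float = 0.5,  # assumed value, not taken from the paper
    abstain: str = "unanswerable",
) -> str:
    """Keep the candidate answer only if it aligns with the context."""
    score = score_fn(context, candidate_answer)
    return candidate_answer if score >= threshold else abstain


context = "RoBERTa has 355M parameters"
supported = gate_answer(context, "355M parameters")
unsupported = gate_answer(context, "Alice Smith")
```

Any scorer that maps a pair of texts to a value in [0, 1] can be passed as `score_fn`, so the gate itself is independent of how alignment is computed.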