Selecting a model for a given target task can be costly, as it may require extensive annotation of the quality of different models' outputs. We introduce DiffUse, an efficient method for making an informed choice between candidate text generation models based on preference annotations. DiffUse reduces the number of annotations required, saving valuable time and resources during evaluation. It selects instances to annotate by clustering embeddings that represent the semantic differences between model outputs, and thereby identifies a subset of examples that are more informative for preference decisions. The method is model-agnostic and can be applied to any text generation model to select between models, prompts, and configurations. We also propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations, by up to 75%, while maintaining high evaluation reliability.
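The selection step described in the abstract (clustering embeddings that represent the differences between two models' outputs, then annotating a subset of instances) can be illustrated with a minimal sketch. The abstract does not specify the embedding model, the clustering algorithm, or how an instance is picked from each cluster, so everything below is an assumption made for illustration: the difference is taken as a plain vector subtraction, clustering is a small NumPy k-means, and the representative is the member closest to its cluster centroid. The function name `select_by_difference_clusters` and its parameters are hypothetical and are not taken from the DiffUse implementation.

```python
import numpy as np


def select_by_difference_clusters(emb_a, emb_b, n_select, n_iter=50, seed=0):
    """Return indices of instances to annotate, one per difference cluster.

    emb_a, emb_b: arrays of shape (n_instances, dim) holding embeddings of
    the outputs of model A and model B for the same inputs (assumed given).
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    # Assumption: the "semantic difference" is the vector difference.
    diffs = emb_a - emb_b

    rng = np.random.default_rng(seed)
    # Assumption: plain k-means, initialised from n_select distinct instances.
    init = rng.choice(len(diffs), size=n_select, replace=False)
    centroids = diffs[init].copy()

    for _ in range(n_iter):
        # Distance of every difference vector to every centroid.
        dists = np.linalg.norm(diffs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(n_select):
            members = diffs[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the converged centroids.
    dists = np.linalg.norm(diffs[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Assumption: annotate the member nearest to each cluster centroid.
    selected = []
    for k in range(n_select):
        members = np.flatnonzero(labels == k)
        if len(members):  # empty clusters contribute no instance
            selected.append(int(members[dists[members, k].argmin()]))
    return sorted(selected)
```

Called as `select_by_difference_clusters(emb_a, emb_b, n_select=5)`, the function returns at most five distinct instance indices; the corresponding output pairs would then be sent for preference annotation. Because empty clusters are skipped, fewer than `n_select` indices may be returned.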