Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models. DiffUse reduces the required amount of preference annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.
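The selection step described in the abstract (cluster the difference embeddings, then annotate a small representative subset) can be illustrated with a minimal sketch. Several things here are assumptions rather than details given in the abstract: the output embeddings are taken as precomputed vectors, the "semantic difference" is represented by plain vector subtraction, k-means with a deterministic farthest-point initialisation stands in for whatever clustering algorithm the method actually uses, and the names `difference_vectors` and `select_instances` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def difference_vectors(emb_a: List[Vector], emb_b: List[Vector]) -> List[List[float]]:
    """Per instance, subtract model B's output embedding from model A's.

    Assumption: subtraction is one plausible way to represent the
    semantic difference between two outputs.
    """
    return [[x - y for x, y in zip(a, b)] for a, b in zip(emb_a, emb_b)]


def _dist(u: Vector, v: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def _mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _assign(diffs: List[Vector], centroids: List[Vector]) -> List[int]:
    """Index of the nearest centroid for every difference vector."""
    k = len(centroids)
    return [min(range(k), key=lambda j: _dist(d, centroids[j])) for d in diffs]


def select_instances(diffs: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Cluster difference vectors into k groups and return one index per cluster.

    Assumption: k-means is used purely for illustration; the instance
    closest to each centroid is taken as that cluster's representative.
    """
    if not 1 <= k <= len(diffs):
        raise ValueError("k must be between 1 and the number of instances")

    # Deterministic farthest-point initialisation: start from instance 0,
    # then repeatedly add the instance farthest from all chosen centroids.
    centroids = [list(diffs[0])]
    while len(centroids) < k:
        far = max(
            range(len(diffs)),
            key=lambda i: min(_dist(diffs[i], c) for c in centroids),
        )
        centroids.append(list(diffs[far]))

    # Standard k-means refinement; an empty cluster keeps its old centroid.
    for _ in range(iters):
        labels = _assign(diffs, centroids)
        updated = []
        for j in range(k):
            members = [diffs[i] for i, lab in enumerate(labels) if lab == j]
            updated.append(_mean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated

    # Pick the member closest to each final centroid.
    labels = _assign(diffs, centroids)
    selected = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            selected.append(min(members, key=lambda i: _dist(diffs[i], centroids[j])))
    return sorted(selected)


# Toy data: six instances with 2-D output embeddings for models A and B.
emb_a = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [0.0, 0.9]]
emb_b = [[1.0, 0.0]] * 6

diffs = difference_vectors(emb_a, emb_b)
to_annotate = select_instances(diffs, k=2)
print(to_annotate)
```

In this toy setup, only the instances whose indices appear in `to_annotate` would be sent for preference annotation, one from the group where the two models behave alike and one from the group where they diverge.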