As general-purpose vision models become increasingly effective at a wide range of tasks, it is imperative that they be consistent across the tasks they support. Human users consider inconsistent AI models brittle and untrustworthy, and such models are harder to incorporate into larger systems that depend on their outputs. Measuring consistency across very heterogeneous tasks, whose outputs may be in different modalities, is challenging because it is difficult to determine whether the predictions agree with one another. As a solution, we introduce a benchmark dataset, COCOCON, in which contrast sets are created by modifying test instances for multiple tasks in small but semantically meaningful ways that change the gold label, and we outline metrics for measuring whether a model is consistent by ranking the original and perturbed instances across tasks. We find that state-of-the-art systems suffer from a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. Finally, we propose a rank correlation-based auxiliary objective, computed over large automatically created cross-task contrast sets, to improve the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks. The project website is available at https://adymaharana.github.io/cococon/
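The ranking-based consistency check described above can be illustrated with a minimal sketch. The abstract does not spell out the exact metric, so this is one plausible reading and not the authors' implementation: for each contrast-set example, we assume a model assigns a score (for instance a log-likelihood) to the original output and to the perturbed output for every task, and we call two tasks consistent on that example when they rank the pair the same way. The task names, the score format, and the functions `prefers_original`, `is_consistent`, and `consistency_rate` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical format: (score of original output, score of perturbed output),
# e.g. log-likelihoods from a unified model. Higher means more preferred.
ScorePair = Tuple[float, float]


def prefers_original(pair: ScorePair) -> bool:
    """True if the model ranks the original output above the perturbed one."""
    original_score, perturbed_score = pair
    return original_score > perturbed_score


def is_consistent(anchor: ScorePair, target: ScorePair) -> bool:
    """Two tasks agree when they rank original vs. perturbed the same way."""
    return prefers_original(anchor) == prefers_original(target)


def consistency_rate(
    examples: List[Dict[str, ScorePair]],
    anchor_task: str,
    target_task: str,
) -> float:
    """Fraction of examples on which the two tasks produce the same ranking."""
    pairs = [
        (ex[anchor_task], ex[target_task])
        for ex in examples
        if anchor_task in ex and target_task in ex
    ]
    if not pairs:
        raise ValueError("no example contains scores for both tasks")
    agreements = sum(is_consistent(a, t) for a, t in pairs)
    return agreements / len(pairs)


# Made-up scores for illustration only; these are not results from the paper.
toy_examples = [
    {"captioning": (-1.0, -2.0), "vqa": (-0.5, -0.9)},  # both prefer original
    {"captioning": (-1.0, -2.0), "vqa": (-1.5, -0.9)},  # rankings disagree
]
rate = consistency_rate(toy_examples, "captioning", "vqa")
```

Comparing rankings rather than raw outputs is what makes the check applicable across modalities: a caption and a VQA answer cannot be compared directly, but each task's preference between the original and perturbed instance can.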