Learning to predict masked tokens in a sequence has been shown to be a helpful pretraining objective for powerful language models such as PaLM2. After training, such masked language models (MLMs) can provide distributions over tokens at the masked positions in a sequence. However, this paper shows that the distributions corresponding to different masking patterns can be considerably inconsistent, i.e., taken together they cannot be derived from a single coherent joint distribution. This fundamental flaw in MLMs can lead to self-contradictory behaviors during inference. On various benchmark datasets, including MMLU, MLMs can give different predictions for the same input question. From BERT-base to UL2-20B, we show that such inconsistencies exist ubiquitously in MLMs of diverse sizes and configurations. In light of these observations, we further propose an inference-time strategy for MLMs called Ensemble of Conditionals. It jointly considers a selected range of inconsistent conditionals directly produced by the MLM for the final prediction, which often leads to considerable accuracy improvements.
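To make the notion of inconsistent conditionals concrete, the following is a minimal toy sketch and not the measurement procedure used in the paper. It assumes two masked positions A and B over a tiny vocabulary, with strictly positive conditional tables `p_a_given_b` and `p_b_given_a`; the function name and the example arrays are hypothetical. For strictly positive conditionals, a joint distribution exists exactly when the ratio P(a|b) / P(b|a) factorizes as a function of a times a function of b, i.e., the ratio matrix has rank one.

```python
import numpy as np


def conditionals_are_compatible(p_a_given_b, p_b_given_a, tol=1e-9):
    """Check whether two conditional tables can come from one joint distribution.

    p_a_given_b[a, b] = P(A=a | B=b), so each column sums to 1.
    p_b_given_a[a, b] = P(B=b | A=a), so each row sums to 1.
    Both tables are assumed to be strictly positive.
    """
    ratio = np.asarray(p_a_given_b, dtype=float) / np.asarray(p_b_given_a, dtype=float)
    # If a joint exists, ratio[a, b] = P(a) / P(b), which is a rank-one matrix.
    # A positive matrix is rank one iff ratio[a, b] * ratio[0, 0] == ratio[a, 0] * ratio[0, b].
    lhs = ratio * ratio[0, 0]
    rhs = np.outer(ratio[:, 0], ratio[0, :])
    return bool(np.allclose(lhs, rhs, atol=tol))


# Conditionals derived from a genuine joint distribution are compatible.
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])
consistent_a_given_b = joint / joint.sum(axis=0, keepdims=True)
consistent_b_given_a = joint / joint.sum(axis=1, keepdims=True)

# Hypothetical conditionals that contradict each other: P(B | A) says B is
# uniform regardless of A, yet P(A | B) changes with B.
inconsistent_a_given_b = np.array([[0.9, 0.5],
                                   [0.1, 0.5]])
inconsistent_b_given_a = np.array([[0.5, 0.5],
                                   [0.5, 0.5]])

consistent_result = conditionals_are_compatible(consistent_a_given_b, consistent_b_given_a)
inconsistent_result = conditionals_are_compatible(inconsistent_a_given_b, inconsistent_b_given_a)
```

In an actual MLM, each table would be filled by querying the model with one position masked and the other fixed; a failed check of this kind is the type of inconsistency the abstract refers to.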