Factual probing uses prompts to test whether a language model "knows" particular facts about the world. A known problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose test-time augmentation (TTA) as a relation-agnostic method that reduces sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration: with TTA, model confidence better reflects prediction accuracy. Prediction accuracy improves for some models but degrades for others. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
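The ensembling step can be illustrated with a minimal sketch. It assumes that prompt variants are already available and that answer probabilities are aggregated by simple averaging; the abstract does not specify the aggregation rule, so this is one plausible choice rather than the method used in the paper. The names `ensemble_predict`, `toy_predict`, and `TOY_OUTPUTS` are hypothetical, and the probabilities are made-up numbers standing in for a real language model.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps a prompt to a probability distribution
# over candidate answers. A real setup would query a language model here.
Predictor = Callable[[str], Dict[str, float]]


def ensemble_predict(prompts: List[str], predict: Predictor) -> Tuple[str, float]:
    """Average answer probabilities over prompt variants.

    Returns the answer with the highest mean probability and that mean,
    which serves as the ensemble's confidence.
    Assumption: plain averaging; other aggregation rules are possible.
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    totals: Dict[str, float] = defaultdict(float)
    for prompt in prompts:
        for answer, prob in predict(prompt).items():
            totals[answer] += prob
    best = max(totals, key=totals.get)
    return best, totals[best] / len(prompts)


# Toy stand-in for a model: made-up outputs for an original prompt
# and two paraphrases (illustration only, not real model behaviour).
TOY_OUTPUTS: Dict[str, Dict[str, float]] = {
    "The capital of France is [MASK].": {"Paris": 0.6, "Lyon": 0.4},
    "France's capital is [MASK].": {"Paris": 0.3, "Lyon": 0.7},
    "The capital city of France is [MASK].": {"Paris": 0.9, "Lyon": 0.1},
}


def toy_predict(prompt: str) -> Dict[str, float]:
    return TOY_OUTPUTS[prompt]


answer, confidence = ensemble_predict(list(TOY_OUTPUTS), toy_predict)
print(answer, round(confidence, 3))
```

In this toy setting, the second paraphrase alone would yield the wrong answer, while the average over all three variants recovers the correct one with a more moderate confidence. The same mechanism can also hurt: if most generated variants are poor, the ensemble inherits their errors, which matches the error analysis above.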