In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system's predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator's calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model's representation of uncertainty.
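As a rough illustration of the instance-level comparison described above, the sketch below contrasts how much a set of human references varies lexically with how much a set of model samples varies for the same input. It is a minimal sketch under stated assumptions, not the measurement procedure of the paper: the token-set Jaccard distance, the function names `lexical_distance`, `variability`, and `variability_gap`, and the toy strings are all hypothetical choices made for illustration.

```python
from itertools import combinations
from statistics import mean


def lexical_distance(a: str, b: str) -> float:
    """1 - Jaccard overlap of lowercased whitespace tokens.

    Assumption: an illustrative stand-in for a lexical distance,
    not necessarily the one used in the paper.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def variability(strings: list[str]) -> float:
    """Mean pairwise lexical distance within one set of productions."""
    pairs = list(combinations(strings, 2))
    if not pairs:
        raise ValueError("need at least two strings to measure variability")
    return mean(lexical_distance(a, b) for a, b in pairs)


def variability_gap(samples: list[str], references: list[str]) -> float:
    """Sample variability minus reference variability for one input.

    Negative: the generator varies less than humans on this input.
    Positive: the generator varies more than humans on this input.
    """
    return variability(samples) - variability(references)


# Toy data for a single test input (hypothetical, for illustration only).
references = ["a dog runs", "the dog runs fast", "a puppy is running"]
samples = ["a dog runs", "a dog runs", "a dog runs fast"]

gap = variability_gap(samples, references)
print(f"human variability:  {variability(references):.3f}")
print(f"model variability:  {variability(samples):.3f}")
print(f"gap (model - human): {gap:.3f}")
```

Computing the gap per input, rather than averaging over a test set first, is what makes the comparison instance-level. The same structure would apply to syntactic or semantic variability by swapping in a different distance function.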