Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and that pretraining with them is still better than using only clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.