Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But does that noise matter? We find that pretraining with nonsensical or irrelevant language instructions can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and is still better than using only clean, human-written data. To underscore these results, we devise an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.