We introduce a simple method that uses pre-trained CLIP encoders to improve model generalization on the ALFRED task. Unlike prior work, in which CLIP replaces the visual encoder, we propose using CLIP as an additional module via an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Our analysis further indicates that CLIP is especially helpful for leveraging object descriptions, detecting small objects, and interpreting rare words.