Most existing work on one-stage referring expression comprehension (REC) focuses on multi-modal fusion and reasoning, while the influence of other factors on this task has not been explored in depth. To fill this gap, we conduct an empirical study. Concretely, we first build a very simple REC network called SimREC and ablate 42 candidate designs/settings, which cover the entire process of one-stage REC from network design to model training. We then conduct over 100 experimental trials on three REC benchmark datasets. The extensive experimental results not only reveal key factors that affect REC performance beyond multi-modal fusion, e.g., multi-scale features and data augmentation, but also yield some findings that run counter to conventional understanding. For example, although REC is a vision and language (V&L) task, it is less impacted by language prior. In addition, with a proper combination of these findings, we can improve the performance of SimREC by a large margin, e.g., +27.12% on RefCOCO+, which outperforms all existing REC methods. The most encouraging finding, however, is that with much less training overhead and far fewer parameters, SimREC can still achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, highlighting the special role of REC in existing V&L research.