With results in high-resource settings promising yet saturated, low-resource datasets have gradually become popular benchmarks (e.g., BigBench, superGLUE) for evaluating the learning ability of advanced neural networks. Some models even surpass humans according to benchmark test results. However, we find that low-resource settings contain a set of hard examples that challenge neural networks but are not well evaluated, which leads to over-estimated performance. We first give a theoretical analysis of the factors that make low-resource learning difficult. This analysis then motivates us to propose a challenging benchmark, hardBench, to better evaluate learning ability. It covers 11 datasets, including 3 computer vision (CV) datasets and 8 natural language processing (NLP) datasets. Experiments on a wide range of models show that neural networks, even pre-trained language models, suffer sharp performance drops on our benchmark, demonstrating its effectiveness in exposing the weaknesses of neural networks. On NLP tasks, we surprisingly find that, despite better results on traditional low-resource benchmarks, pre-trained networks do not show performance improvements on our benchmark. These results demonstrate that there is still a large robustness gap between existing models and human-level performance.