As results in high-resource settings have become promising yet saturated, low-resource datasets (e.g., BigBench, superGLUE) have gradually become popular benchmarks for evaluating the learning ability of advanced neural networks. Some models even surpass humans according to benchmark test results. However, we find that low-resource settings contain a set of hard examples that challenge neural networks but are not well evaluated, which leads to over-estimated performance. We first give a theoretical analysis of the factors that make low-resource learning difficult. This analysis then motivates us to propose a challenging benchmark, hardBench, to better evaluate learning ability. It covers 11 datasets, including 3 computer vision (CV) datasets and 8 natural language processing (NLP) datasets. Experiments on a wide range of models show that neural networks, even pre-trained language models, suffer sharp performance drops on our benchmark, demonstrating its effectiveness in exposing the weaknesses of neural networks. On NLP tasks, we surprisingly find that, despite better results on traditional low-resource benchmarks, pre-trained networks do not show performance improvements on our benchmark. These results demonstrate that there is still a large robustness gap between existing models and human-level performance.