We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. The total set of eight tasks includes three tasks that were previously not available in Dutch. Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the DUMB performance of models to a strong baseline that can still be referred to in the future, even when different sets of models are assessed. Through a comparison of 14 pre-trained models (mono- and multilingual, of varying sizes), we assess the internal consistency of the benchmark tasks, as well as the factors that likely enable high performance. Our results indicate that current Dutch monolingual models under-perform, and they suggest training larger Dutch models with other architectures and pre-training objectives. At present, the highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). In addition to highlighting the best strategies for training larger Dutch models, DUMB will foster further research on Dutch. A public leaderboard is available at https://dumbench.nl.
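The idea behind Relative Error Reduction can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the sketch assumes that scores lie on a 0-100 scale where higher is better, that the error is 100 minus the score, and that RER is the share of the baseline's error that a model removes, averaged over tasks. The function names, task names and scores are hypothetical.

```python
from statistics import mean
from typing import Mapping


def relative_error_reduction(model_score: float, baseline_score: float) -> float:
    """Return the percentage of the baseline's error that the model removes.

    Assumption: scores lie in [0, 100], higher is better, error = 100 - score.
    A negative result means the model makes more errors than the baseline.
    """
    baseline_error = 100.0 - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    model_error = 100.0 - model_score
    return 100.0 * (baseline_error - model_error) / baseline_error


def mean_rer(model_scores: Mapping[str, float],
             baseline_scores: Mapping[str, float]) -> float:
    """Average the per-task RER over every task the baseline was scored on."""
    return mean(
        relative_error_reduction(model_scores[task], baseline_scores[task])
        for task in baseline_scores
    )


# Made-up scores for two hypothetical tasks, for illustration only.
baseline = {"task_a": 90.0, "task_b": 60.0}
model = {"task_a": 95.0, "task_b": 70.0}
print(mean_rer(model, baseline))
```

Under these assumptions, a 5-point gain on a task where the baseline already scores 90 halves the remaining error, while a 10-point gain on a task where the baseline scores 60 removes only a quarter of it. A plain mean of raw scores would rank these two gains the other way around.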