We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. Of its nine tasks, four were previously not available in Dutch. Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the DUMB performance of language models to a strong baseline that can still be referred to in the future, even when different sets of language models are assessed. Through a comparison of 14 pre-trained language models (mono- and multilingual, of varying sizes), we assess the internal consistency of the benchmark tasks, as well as the factors that likely enable high performance. Our results indicate that current Dutch monolingual models under-perform, and they suggest training larger Dutch models with other architectures and pre-training objectives. At present, the highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). In addition to highlighting the best strategies for training larger Dutch models, DUMB will foster further research on Dutch. A public leaderboard is available at https://dumbench.nl.
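The abstract names Relative Error Reduction but does not spell out a formula. As a rough illustration only, the sketch below assumes the conventional definition: a task score on a 0-100 scale is turned into an error (100 minus the score), and RER is the percentage by which a model shrinks the baseline's error. The function names, the `max_score` parameter, and the task names and scores in the usage note are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def relative_error_reduction(model_score, baseline_score, max_score=100.0):
    """Percentage of the baseline's error that the model removes.

    Assumed definition (not stated in the abstract):
        error = max_score - score
        RER   = (baseline_error - model_error) / baseline_error * 100
    A negative value means the model makes more errors than the baseline.
    """
    baseline_error = max_score - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    model_error = max_score - model_score
    return (baseline_error - model_error) / baseline_error * 100.0


def mean_rer(model_scores, baseline_scores):
    """Average the per-task RER over the tasks listed for the baseline.

    Both arguments map a task name to a score on the same 0-100 scale.
    """
    return mean(
        relative_error_reduction(model_scores[task], baseline_scores[task])
        for task in baseline_scores
    )
```

With made-up scores, `relative_error_reduction(90, 80)` gives 50.0 (the error drops from 20 to 10), and `mean_rer({"pos": 98, "ner": 85}, {"pos": 96, "ner": 80})` gives 37.5. Under this assumed definition, a two-point gain on a task where the baseline is already near the ceiling counts for more than the same gain on a task with a large remaining error, which is one plausible reason to prefer it over a plain mean of task scores.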