Tobacco3482 is a widely used document classification benchmark dataset. However, our manual inspection of the entire dataset uncovers widespread ontological issues, in particular a large number of annotation label problems. We establish data labeling guidelines and find that 11.7% of the dataset is improperly annotated and should receive either an unknown label or a corrected label, and that 16.7% of samples have multiple valid labels. We then analyze the mistakes of a top-performing model and find that 35% of them can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.