MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalance, label dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
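As a rough illustration of what an entropy-based uncertainty score can look like in a multi-label setting, the sketch below averages the binary entropy of each label's predicted probability. It is only a minimal example under stated assumptions, not necessarily the formulation used in MADE: the function names `binary_entropy` and `mean_label_entropy` are hypothetical, the labels are treated as independent Bernoulli variables, and the probabilities are made up for demonstration.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no entropy
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean_label_entropy(probs: list[float]) -> float:
    """Average per-label binary entropy, scaled to [0, 1] by dividing by ln 2.

    Assumption: each label is scored independently, which ignores the label
    dependencies present in hierarchical label sets.
    """
    if not probs:
        raise ValueError("probs must contain at least one label probability")
    return sum(binary_entropy(p) for p in probs) / (len(probs) * math.log(2))


# Illustrative (made-up) per-label probabilities for two reports.
confident = mean_label_entropy([0.99, 0.01, 0.98])
uncertain = mean_label_entropy([0.55, 0.45, 0.50])
print(f"confident report: {confident:.3f}, uncertain report: {uncertain:.3f}")
```

A higher score marks a prediction as a candidate for human review; probabilities near 0.5 push the score toward 1, while probabilities near 0 or 1 push it toward 0.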
