BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Source: arXiv. Accepted at ICLR 2026 (Proceedings of the Fourteenth International Conference on Learning Representations, 2026). 31 pages, 5 figures, 9 tables.
Code: https://github.com/IliasAarab/btzsc
Dataset: https://huggingface.co/datasets/btzsc/btzsc
Leaderboard: https://huggingface.co/spaces/btzsc/btzsc-leaderboard

Zero-shot text classification (ZSC) promises to eliminate costly task-specific annotation by matching texts directly to human-readable label descriptions. Early approaches relied predominantly on cross-encoder models fine-tuned for natural language inference (NLI), but recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, covering diverse domains, class cardinalities, and document lengths. Using BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state of the art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs with 4B to 12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs rather than embedding models. BTZSC and the accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
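To make the setup concrete, the following is a minimal sketch of the embedding-style variant of ZSC described above: each text is assigned the label whose description it most resembles, and predictions are scored with macro F1. It is not the BTZSC evaluation code. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the label descriptions and example texts are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in (assumption) for a real embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_predict(text, label_descriptions):
    """Pick the label whose description is most similar to the text.

    No labeled training examples are used, only the label descriptions.
    """
    text_vec = embed(text)
    return max(
        label_descriptions,
        key=lambda label: cosine(text_vec, embed(label_descriptions[label])),
    )


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label descriptions and texts, for illustration only.
label_descriptions = {
    "positive": "this review is positive good great",
    "negative": "this review is negative bad terrible",
}
texts = ["the movie was great", "the food was terrible", "a good and great film"]
gold = ["positive", "negative", "positive"]

predictions = [zero_shot_predict(t, label_descriptions) for t in texts]
score = macro_f1(gold, predictions)
```

Macro F1 averages per-class F1 without weighting by class frequency, so a rare class counts as much as a common one. Swapping `embed` and `cosine` for a reranker or NLI cross-encoder score over (text, label description) pairs would leave the rest of the sketch unchanged.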

