Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production

From arXiv; 26 pages, 12 figures. An empirical benchmark comparing fine-tuned encoders and LLM prompting for text classification under cost and latency constraints.

Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
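To make the multi-objective framing concrete, the following is a minimal sketch of how a Pareto frontier and a weighted utility score could be computed over macro F1, latency, and cost. It is not the authors' released code. The `Candidate` class, the linear form of `utility`, the weight names, and the normalization by the worst observed latency and cost are all assumptions made for illustration, since the abstract does not specify the utility function. The three candidates and their numbers are invented placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One model configuration (hypothetical container, not from the paper)."""
    name: str
    macro_f1: float    # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # lower is better, e.g. per 1,000 requests


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    no_worse = (a.macro_f1 >= b.macro_f1
                and a.latency_ms <= b.latency_ms
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.macro_f1 > b.macro_f1
                       or a.latency_ms < b.latency_ms
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def utility(c: Candidate, w_quality: float, w_latency: float, w_cost: float,
            max_latency_ms: float, max_cost_usd: float) -> float:
    """Assumed linear utility: reward quality, penalize normalized latency and cost."""
    return (w_quality * c.macro_f1
            - w_latency * c.latency_ms / max_latency_ms
            - w_cost * c.cost_usd / max_cost_usd)


# Placeholder values for illustration only; NOT measurements from the paper.
candidates = [
    Candidate("model_a", macro_f1=0.90, latency_ms=10.0, cost_usd=0.01),
    Candidate("model_b", macro_f1=0.92, latency_ms=800.0, cost_usd=2.0),
    Candidate("model_c", macro_f1=0.88, latency_ms=900.0, cost_usd=3.0),
]

front = pareto_front(candidates)
max_latency = max(c.latency_ms for c in candidates)
max_cost = max(c.cost_usd for c in candidates)

# Two hypothetical deployment regimes: cost-sensitive vs. quality-only.
best_cost_sensitive = max(
    front, key=lambda c: utility(c, 1.0, 0.5, 0.5, max_latency, max_cost))
best_quality_only = max(
    front, key=lambda c: utility(c, 1.0, 0.0, 0.0, max_latency, max_cost))

print([c.name for c in front])
print(best_cost_sensitive.name, best_quality_only.name)
```

Under these placeholder numbers, `model_c` is dominated and drops off the frontier, and the preferred model changes with the weights. That is the kind of regime-dependent choice a parameterized utility function is meant to expose.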
