In the pursuit of NLP systems that maximize accuracy, other important metrics of system performance are often overlooked. Prior models are easily forgotten despite their possible suitability in settings where large computing resources are unavailable or relatively more costly. In this paper, we perform a broad comparative evaluation of document-level sentiment analysis models with a focus on the resource costs that matter for the feasibility of model deployment and for general climate consciousness. Our experiments consider different feature extraction techniques, the effect of ensembling, task-specific deep learning modeling, and domain-independent large language models (LLMs). We find that while a fine-tuned LLM achieves the best accuracy, some alternate configurations provide huge (up to 24,283×) resource savings for a marginal (<1%) loss in accuracy. Furthermore, we find that for smaller datasets, the differences in accuracy shrink while the differences in resource consumption grow further.