While reaching for NLP systems that maximize accuracy, other important metrics of system performance are often overlooked. Prior models are easily forgotten despite their possible suitability in settings where large computing resources are unavailable or comparatively costly. In this paper, we perform a broad comparative evaluation of document-level sentiment analysis models with a focus on resource costs that are important for the feasibility of model deployment and general climate consciousness. Our experiments consider different feature extraction techniques, the effect of ensembling, task-specific deep learning modeling, and domain-independent large language models (LLMs). We find that while a fine-tuned LLM achieves the best accuracy, some alternate configurations provide huge (up to 24,283×) resource savings for a marginal (<1%) loss in accuracy. Furthermore, we find that for smaller datasets, the differences in accuracy shrink while the difference in resource consumption grows further.