The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models, using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis revealed that the full 3.3B FP32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (approximately 0.007-0.008 kg CO2 per run). The distilled 600M FP32 model reduced inference time by 71-78% and carbon emissions by 63-65% compared with the full model, with only minimal reductions in BLEU scores. Human evaluations further showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with generally minor differences between models. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though the trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that treat efficiency and sustainability, alongside accuracy, as central dimensions of progress in NLP.
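The following is a minimal sketch of the kind of measurement loop the abstract describes. It assumes, hypothetically, that the full and distilled models are NLLB-200 checkpoints ("facebook/nllb-200-3.3B", "facebook/nllb-200-distilled-600M"), that per-run emissions are tracked with codecarbon, and that BLEU is computed with sacrebleu; none of these tool or model names are stated in the abstract itself.

```python
# Sketch: measure BLEU, wall-clock inference time, and kg CO2eq per run
# for a translation model. Model names and tooling are assumptions, not
# details given in the paper's abstract.
import time

import sacrebleu
import torch
from codecarbon import EmissionsTracker
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def translate_and_measure(model_name, sources, references,
                          src_lang="eng_Latn", tgt_lang="fra_Latn"):
    """Translate `sources`, returning (BLEU, seconds, kg CO2eq)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    tracker = EmissionsTracker(log_level="error")  # reports kg CO2eq
    tracker.start()
    start = time.perf_counter()

    hypotheses = []
    with torch.no_grad():
        for src in sources:
            inputs = tokenizer(src, return_tensors="pt")
            # NLLB requires the target language code as the forced first token.
            out = model.generate(
                **inputs,
                forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
                max_new_tokens=128,
            )
            hypotheses.append(tokenizer.decode(out[0], skip_special_tokens=True))

    elapsed = time.perf_counter() - start
    emissions_kg = tracker.stop()
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    return bleu, elapsed, emissions_kg
```

For the INT4 condition, the same loop would apply with the model loaded through transformers' bitsandbytes integration, e.g. `AutoModelForSeq2SeqLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto")`; this too is an assumption about tooling rather than a detail reported in the abstract.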