Running LLMs locally has become increasingly common, but users face a complex design space spanning models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in a single place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy consumption, startup time). We demonstrate it on four NLP tasks across three GPUs and four inference engines, showing how design choices shape both efficiency and output quality. The results confirm that the tradeoffs are substantial and that the best configuration depends on the specific workload and constraints: there is no universally best option, underscoring the need for comprehensive, deployment-oriented benchmarks.