The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. However, objectively measuring the theoretical performance ceilings of diverse architectures across heterogeneous platforms remains a formidable challenge. In this work, we propose a systematic framework based on the Roofline model that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce Relative Inference Potential, a novel metric for comparing the efficiency of Large Language Models (LLMs) on the same hardware substrate. Extensive empirical analysis across diverse compute tiers reveals that both performance and OI vary significantly with sequence length. We further identify a critical regression in OI as model depth increases. Additionally, our findings highlight an efficiency trap induced by hardware heterogeneity and demonstrate how structural refinements, such as Multi-head Latent Attention (MLA), can effectively unlock latent inference potential across various hardware substrates. These insights provide actionable directions for hardware-software co-design that aligns neural structures with physical constraints in on-device intelligence. The released code is available in Appendix C.
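As background, the Roofline model on which the framework builds bounds attainable throughput by the minimum of peak compute and bandwidth-scaled operational intensity. A standard textbook formulation is sketched below; the symbols $P_{\text{peak}}$, $\beta$, $W$, and $Q$ are our notation for illustration and may differ from the paper's:

$$
P_{\text{attainable}} = \min\left(P_{\text{peak}},\ \beta \cdot \mathrm{OI}\right),
\qquad
\mathrm{OI} = \frac{W}{Q},
$$

where $W$ is the total floating-point work (FLOPs), $Q$ the bytes moved to and from memory, $\beta$ the peak memory bandwidth, and $P_{\text{peak}}$ the peak compute throughput. A workload with $\mathrm{OI} < P_{\text{peak}}/\beta$ sits on the bandwidth-limited slope of the roofline (memory-bound); beyond that ridge point it is compute-bound.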