High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Tightly-coupled clusters of programmable processors (CMPs) have recently emerged as a suitable solution to this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, which is critical in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power (ULP) tightly-coupled processor clusters, in which the private L1 caches share a relatively large cache (L1.5) through a two-cycle latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We also optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of the performance and energy efficiency of instruction cache architectures for parallel ULP clusters. In terms of implementation, our two-level instruction cache scales better than existing shared caches, delivering up to 20\% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17\% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.
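The idea behind next-line prefetching with cache probe filtering can be illustrated with a toy software model. The sketch below is not the proposed hardware: it assumes a fully associative LRU L1, treats every L1.5 access as a simple request counter, and takes CPF to mean that the L1 is probed before a prefetch is issued so that requests for lines already present are dropped. The class name `L1WithNextLinePrefetch` and its fields are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class L1WithNextLinePrefetch:
    """Toy fully associative LRU instruction cache with next-line prefetch."""

    def __init__(self, num_lines, probe_filter=True):
        self.num_lines = num_lines
        self.probe_filter = probe_filter  # assumed meaning of CPF, see lead-in
        self.lines = OrderedDict()        # line address -> None, in LRU order
        self.demand_misses = 0
        self.l15_requests = 0             # demand refills plus prefetches sent to L1.5

    def _fill(self, line):
        # Insert or refresh a line, evicting the least recently used one if full.
        self.lines[line] = None
        self.lines.move_to_end(line)
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)

    def fetch(self, line):
        # Demand access: a miss is refilled from the shared L1.5.
        if line in self.lines:
            self.lines.move_to_end(line)
        else:
            self.demand_misses += 1
            self.l15_requests += 1
            self._fill(line)

        # Next-line prefetch: with filtering enabled, probe the L1 first and
        # skip the request when the next line is already cached.
        nxt = line + 1
        if self.probe_filter and nxt in self.lines:
            return
        self.l15_requests += 1
        self._fill(nxt)


if __name__ == "__main__":
    trace = [0, 1, 2, 3] * 3  # a small loop body executed three times
    for use_filter in (True, False):
        cache = L1WithNextLinePrefetch(num_lines=8, probe_filter=use_filter)
        for line in trace:
            cache.fetch(line)
        print(use_filter, cache.demand_misses, cache.l15_requests)
```

On this loop trace both configurations incur a single demand miss, but the filtered prefetcher sends 5 requests to the L1.5 while the unfiltered one sends 13, which is the kind of redundant interconnect traffic that probe filtering is meant to avoid in this simplified model.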