High-performance, large-scale graph analytics is essential for analyzing relationships in big data sets in a timely manner. Conventional processor architectures suffer from inefficient resource usage and poor scaling on these workloads. To enable efficient and scalable graph analysis, Intel developed the Programmable Integrated Unified Memory Architecture (PIUMA) as part of the DARPA Hierarchical Identify Verify Exploit (HIVE) program. PIUMA consists of many multi-threaded cores, fine-grained memory and network accesses, a globally shared address space, powerful offload engines, and a tightly integrated optical interconnection network. Co-packaged silicon photonics, together with an on-chip mesh protocol extended directly to the optical fabric, glue all PIUMA chips in a system into one large virtual die, which allows for extremely low socket-to-socket latencies even as the system scales to thousands of sockets. Performance estimates project that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude. Furthermore, PIUMA continues to scale across multiple nodes, which is a challenge in conventional multi-node setups. This paper presents the PIUMA architecture and documents our experience in designing, building, and bringing up a prototype chip. We summarize our methodology for co-designing the architecture together with the software stack using simulation tools and FPGA emulation. These tools provided early performance estimates for realistic applications and allowed us to implement many optimizations across the hardware, compilers, libraries, and applications. We built the PIUMA chip as a 316 mm² 7 nm FinFET CMOS die and constructed a 16-node system. PIUMA silicon has successfully powered on, demonstrating key aspects of the architecture, some of which will be incorporated into future Intel products.