Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to remove this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting because they efficiently support the search-based operations that form the foundation of many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning. Today, these platforms are designed by hand and can be programmed only with low-level code, which makes them accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and to seamlessly generate code from high-level TorchScript code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on their type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework makes it possible to analyze the impact of such differences on system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.
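To make the notion of a search-based operation concrete, the following is a minimal software sketch of what a CAM lookup computes: the query is compared against every stored row, and either the exactly matching rows or the rows closest in Hamming distance are returned. This is an illustration only and not C4CAM's interface; the names `ToyCAM`, `exact_match`, and `k_nearest` are hypothetical, and the Python loop stands in for a comparison that CAM hardware performs across all rows in parallel.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two words stored as integers."""
    return bin(a ^ b).count("1")


class ToyCAM:
    """Hypothetical software stand-in for a CAM array (not a C4CAM API)."""

    def __init__(self, rows: List[int]):
        # Each row is one stored binary word, encoded as an integer.
        self.rows = list(rows)

    def exact_match(self, query: int) -> List[int]:
        """Indices of all rows equal to the query."""
        return [i for i, row in enumerate(self.rows) if row == query]

    def k_nearest(self, query: int, k: int) -> List[Tuple[int, int]]:
        """The k rows closest to the query, as (distance, row index) pairs.

        Ties in distance are broken by the lower row index.
        """
        scored = [(hamming(row, query), i) for i, row in enumerate(self.rows)]
        return sorted(scored)[:k]


cam = ToyCAM([0b1010, 0b1111, 0b0000, 0b1011])
matches = cam.exact_match(0b1111)
nearest = cam.k_nearest(0b1110, k=2)
```

The `k_nearest` call corresponds to the KNN use case named above: an application would store its reference vectors as rows and issue one search per query.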