Structured, or tabular, data is the most common format in data science. While deep learning models have proven formidable at learning from unstructured data such as images or speech, they are less accurate than simpler approaches on tabular data. In contrast, modern tree-based machine learning (ML) models excel at extracting relevant information from structured data. An essential requirement in data science is reducing model inference latency, for example when models are used in a closed loop with simulation to accelerate scientific discovery. However, the hardware acceleration community has mostly focused on deep neural networks and largely ignored other forms of machine learning. Previous work has described the use of an analog content-addressable memory (CAM) component for efficiently mapping random forests. In this work, we focus on an overall analog-digital architecture that implements a novel increased-precision analog CAM and a programmable network-on-chip, allowing inference of state-of-the-art tree-based ML models such as XGBoost and CatBoost. Results evaluated for a single chip in 16 nm technology show 119x lower latency at 9740x higher throughput compared with a state-of-the-art GPU, with a 19 W peak power consumption.
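To make the idea of mapping a tree onto a content-addressable memory more concrete, the following is a minimal software sketch, not the architecture described above. It assumes one common way to picture such a mapping: every root-to-leaf path becomes one row of per-feature [low, high) ranges, and a query matches the single row whose ranges contain all of its feature values. The names `Node`, `tree_to_rows`, `cam_lookup`, and `ensemble_predict` are hypothetical, and the summation of per-tree outputs follows the usual gradient-boosting convention rather than anything stated in the abstract.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical binary decision-tree node (illustration only)."""
    feature: Optional[int] = None      # index of the feature tested here
    threshold: Optional[float] = None  # go left when x[feature] < threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None      # set only on leaves


# One CAM-like row: a [low, high) range per feature plus the leaf value.
Row = Tuple[List[Tuple[float, float]], float]


def tree_to_rows(node: Node, n_features: int, bounds=None) -> List[Row]:
    """Flatten a tree into one range row per root-to-leaf path."""
    if bounds is None:
        bounds = [(-math.inf, math.inf)] * n_features
    if node.value is not None:
        return [(list(bounds), node.value)]
    lo, hi = bounds[node.feature]
    left_bounds = list(bounds)
    left_bounds[node.feature] = (lo, min(hi, node.threshold))
    right_bounds = list(bounds)
    right_bounds[node.feature] = (max(lo, node.threshold), hi)
    return (tree_to_rows(node.left, n_features, left_bounds)
            + tree_to_rows(node.right, n_features, right_bounds))


def cam_lookup(rows: List[Row], x: Sequence[float]) -> float:
    """Return the value of the row whose ranges all contain the query.

    Hardware would compare all rows in parallel; this loop is sequential.
    """
    for ranges, value in rows:
        if all(lo <= xi < hi for (lo, hi), xi in zip(ranges, x)):
            return value
    raise ValueError("query matched no row")


def ensemble_predict(forest_rows: List[List[Row]], x: Sequence[float]) -> float:
    """Sum per-tree outputs, as in gradient-boosted ensembles (assumption)."""
    return sum(cam_lookup(rows, x) for rows in forest_rows)


def walk(node: Node, x: Sequence[float]) -> float:
    """Reference pointer-chasing traversal used to check the mapping."""
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value


# Toy tree with two features and three leaves.
tree = Node(feature=0, threshold=0.5,
            left=Node(value=-1.0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=0.25),
                       right=Node(value=1.0)))
rows = tree_to_rows(tree, n_features=2)
```

In this picture the sequence of dependent branch decisions in `walk` is replaced by a single range match over all rows, which is the property that makes a parallel match structure attractive for low-latency tree inference.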