Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, which in principle eliminates the need for floating-point multiplication. However, existing frameworks fail to exploit this structure and treat ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple additions and subtractions, targeting the integer dot-product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face. Compared to standard PyTorch inference on Apple Silicon, it achieves 18.15x higher throughput, 7.15x faster time-to-first-token, and a 6.03x memory reduction, with comparable or higher throughput speedups, up to 95.81x, on Intel and AMD processors.
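
To make the arithmetic idea concrete, the following is a minimal pure-Python sketch of a matrix-vector product in which every weight is -1, 0, or +1, so each output element is accumulated using only additions and subtractions. It is an illustration under stated assumptions, not the Litespark-Inference kernels, which the abstract describes as SIMD code built on integer dot-product instructions. The function name `ternary_matvec` and the sample values are hypothetical.

```python
def ternary_matvec(weights, x):
    """Compute weights @ x for a ternary matrix without any multiplication.

    weights: list of rows, each entry in {-1, 0, +1} (hypothetical layout).
    x: list of numbers with the same length as each row.
    """
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # 0 weight: the activation is skipped entirely
        out.append(acc)
    return out


# Illustrative values only.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]
y = ternary_matvec(W, x)
```

The result matches an ordinary multiply-accumulate over the same values. A vectorized kernel would apply the same add/subtract/skip logic to many elements per instruction rather than one at a time.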