ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems

An OS kernel that runs LLM inference internally can read logit distributions before any text is generated -- and act on them as a governance primitive. I present ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. On a 260-prompt OS action benchmark (9 categories including adversarial attacks), ProbeLogits achieves F1=0.980, Precision=1.000, and Recall=0.960 using a general-purpose 7B model at 4-bit quantization. On ToxicChat (1,000 human-annotated real conversations), it achieves F1=0.790 at the default calibration strength $\alpha$=1.0, improving to F1=0.837 at $\alpha$=0.5 -- 89% of Llama Guard 3's F1 of roughly 0.939, with zero learned parameters. A key design contribution is the calibration strength $\alpha$, which serves as a deployment-time policy knob rather than a learned hyperparameter. By adjusting $\alpha$, the OS can enforce strict policies for privileged operations ($\alpha \geq 0.8$, maximizing recall) or relaxed policies for conversational agents ($\alpha$=0.5, maximizing precision). Contextual calibration improves accuracy from 64.8% to 97.3% on the custom benchmark. I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in 80,400 lines of Rust. Because agent actions must pass through kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers. Each classification costs 65 ms on the 7B model -- fast enough for per-action governance. I also show that treating the KV cache as process state enables checkpoint, restore, and fork operations analogous to traditional process management. To my knowledge, no prior system exposes LLM logit vectors as OS-level governance primitives.
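To illustrate how a calibration strength can act as a policy knob, the following Rust sketch classifies an action from two probed logits. It is a minimal sketch under stated assumptions, not the Anima OS implementation: the abstract does not give the calibration formula, so the code assumes the common form of contextual calibration in which $\alpha$ times a content-free baseline logit is subtracted from each observed logit. The names `ProbePair`, `Policy`, `calibrated_margin`, and `classify` are hypothetical, and the logit values in the usage note are made up for illustration.

```rust
/// Logits read for the two verdict tokens after a single forward pass.
/// (Hypothetical type; the real kernel interface is not shown in the abstract.)
#[derive(Clone, Copy, Debug)]
struct ProbePair {
    safe: f32,
    dangerous: f32,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Safe,
    Dangerous,
}

/// Deployment-time policies mapped to a calibration strength.
/// The values 0.8 and 0.5 come from the abstract; the mapping itself is illustrative.
enum Policy {
    Privileged,
    Conversational,
}

impl Policy {
    fn alpha(&self) -> f32 {
        match self {
            Policy::Privileged => 0.8,
            Policy::Conversational => 0.5,
        }
    }
}

/// ASSUMPTION: calibration subtracts `alpha` times the baseline logits
/// (measured on a content-free prompt) from the observed logits.
/// Returns the calibrated margin: dangerous minus safe.
fn calibrated_margin(observed: ProbePair, baseline: ProbePair, alpha: f32) -> f32 {
    let safe = observed.safe - alpha * baseline.safe;
    let dangerous = observed.dangerous - alpha * baseline.dangerous;
    dangerous - safe
}

/// Flags the action as dangerous when the calibrated margin is positive.
fn classify(observed: ProbePair, baseline: ProbePair, alpha: f32) -> Verdict {
    if calibrated_margin(observed, baseline, alpha) > 0.0 {
        Verdict::Dangerous
    } else {
        Verdict::Safe
    }
}
```

With illustrative logits `observed = {safe: 2.0, dangerous: 1.5}` and a baseline biased toward the safe token, `{safe: 1.5, dangerous: 0.75}`, the verdict is `Safe` at `Policy::Conversational.alpha()` and `Dangerous` at `Policy::Privileged.alpha()`. Under this assumed formula, a larger $\alpha$ removes more of the model's prior toward "safe", so more actions are flagged, which matches the abstract's description of the strict setting as favoring recall.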
