Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, which increases generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, we compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known, fixed receiver architecture. For each query, frozen, role-prompted sender agents process the input, and a learned parameter generator maps their internal activations to low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of the five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
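The flow described above (sender hidden states → low-rank perturbation ΔW = BA → transient W + ΔW during the receiver's generation → original W restored) can be illustrated with a minimal pure-Python sketch. It rests on assumptions that the abstract does not state: each sender's per-token hidden states are mean-pooled, a single hypothetical linear hypernetwork (`LoRAGenerator`) emits the LoRA factors A and B for one receiver layer, per-sender perturbations are fused by plain averaging, and the receiver layer is a toy weight matrix. It is not TFlow's actual generator or fusion rule, only the shape of the computation.

```python
import random
from contextlib import contextmanager

Vector = list[float]
Matrix = list[list[float]]


def linear(weight: Matrix, x: Vector) -> Vector:
    """y = W x for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mean_pool(hidden: list[Vector]) -> Vector:
    """Average one sender's per-token hidden states into a summary vector (assumption)."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]


class LoRAGenerator:
    """Hypothetical hypernetwork: pooled sender state -> low-rank delta B @ A."""

    def __init__(self, d_hidden: int, d_in: int, d_out: int, rank: int, seed: int = 0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Two linear heads emit the flattened LoRA factors A (rank x d_in) and B (d_out x rank).
        self.head_a = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(rank * d_in)]
        self.head_b = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_out * rank)]

    def __call__(self, hidden: list[Vector]) -> Matrix:
        z = mean_pool(hidden)
        a_flat, b_flat = linear(self.head_a, z), linear(self.head_b, z)
        a = [a_flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return matmul(b, a)  # d_out x d_in, with rank at most self.rank


def fuse(deltas: list[Matrix]) -> Matrix:
    """Fuse per-sender perturbations by elementwise averaging (assumed fusion rule)."""
    n = len(deltas)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*deltas)]


@contextmanager
def transient_perturbation(layer: dict, delta: Matrix, scale: float = 1.0):
    """Swap in W + scale * delta for the duration of the block, then restore W."""
    original = layer["weight"]
    layer["weight"] = [[w + scale * d for w, d in zip(w_row, d_row)]
                       for w_row, d_row in zip(original, delta)]
    try:
        yield layer
    finally:
        layer["weight"] = original  # the base weight is never permanently changed


# Usage: three senders, one toy receiver layer of shape d_out x d_in.
d_hidden, d_in, d_out, rank = 8, 4, 3, 2
generator = LoRAGenerator(d_hidden, d_in, d_out, rank, seed=0)
rng = random.Random(1)
sender_states = [[[rng.gauss(0.0, 1.0) for _ in range(d_hidden)] for _ in range(5)]
                 for _ in range(3)]  # 3 senders x 5 tokens x d_hidden
deltas = [generator(h) for h in sender_states]
delta = fuse(deltas)

layer = {"weight": [[0.0] * d_in for _ in range(d_out)]}  # toy receiver weight W = 0
x = [1.0, 0.0, 0.0, 0.0]
with transient_perturbation(layer, delta) as patched:
    y = linear(patched["weight"], x)  # the receiver's forward pass sees W + delta
print(layer["weight"] == [[0.0] * d_in for _ in range(d_out)])  # True: W restored
```

Averaging is only one plausible fusion rule; weighted or learned combinations would slot into `fuse` unchanged. A practical implementation would more likely keep A and B factored and add B(Ax) inside the forward pass rather than materializing the full d_out × d_in matrix ΔW, which is what makes low-rank perturbations cheap to generate and apply per query.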