Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.