DEMON: Diffusion Engine for Musical Orchestrated Noise

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
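Mechanisms (1) and (2) can be illustrated with a toy ring buffer. This is a minimal sketch under stated assumptions, not DEMON's implementation: the names `Slot`, `Ring`, `make_schedule`, the linear timestep schedule, and the `"guidance"` parameter are all hypothetical, and the actual model call is replaced by a trace entry. It shows only the propagation behaviour described above: a slot keeps the schedule it was admitted with, while a value read from shared state at every step reaches in-flight slots on the next tick.

```python
from dataclasses import dataclass, field


def make_schedule(denoise: float, steps: int) -> list[float]:
    """Hypothetical linear schedule: `steps` timesteps from `denoise` down toward 0."""
    return [denoise * (1 - i / steps) for i in range(steps)]


@dataclass
class Slot:
    request_id: int
    schedule: list[float]          # owned by this slot, fixed at admission
    step: int = 0
    # (timestep, guidance) pairs recorded in place of a real denoiser call
    trace: list[tuple[float, float]] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return self.step >= len(self.schedule)


class Ring:
    def __init__(self, depth: int, steps: int) -> None:
        self.depth = depth
        self.steps = steps
        self.denoise = 1.0                  # slider: read only when a slot is admitted
        self.shared = {"guidance": 1.0}     # shared mutable state: read at every step
        self.slots: list[Slot] = []
        self.next_id = 0

    def tick(self) -> list[Slot]:
        """Admit at most one request, advance every slot one step, return finished slots."""
        if len(self.slots) < self.depth:
            schedule = make_schedule(self.denoise, self.steps)
            self.slots.append(Slot(self.next_id, schedule))
            self.next_id += 1
        for slot in self.slots:
            t = slot.schedule[slot.step]    # per-slot timestep, not a global one
            slot.trace.append((t, self.shared["guidance"]))
            slot.step += 1
        finished = [s for s in self.slots if s.done]
        self.slots = [s for s in self.slots if not s.done]
        return finished


if __name__ == "__main__":
    ring = Ring(depth=4, steps=4)
    for _ in range(3):
        ring.tick()
    ring.denoise = 0.5                # affects only slots admitted from now on
    ring.shared["guidance"] = 3.0     # affects every in-flight slot on the next tick
    for slot in ring.tick():
        print(slot.request_id, slot.schedule, slot.trace)
```

In this toy, moving the slider never rebuilds the queue: the oldest slot finishes on its original schedule while the newly admitted slot starts from the new value. The shared `guidance` change, by contrast, shows up in the very next step of every slot, which is the distinction between drain-rate and next-tick propagation drawn in the abstract.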
