Generative models often treat continuous data and discrete events as separate processes, leaving a gap when modeling complex systems in which the two interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework that unifies the two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by jointly modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable conditioning on these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also release a new unified sports benchmark in which the soccer and football datasets are enhanced with textual descriptions. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
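To make the joint continuous-discrete formulation concrete, the sketch below shows one possible shape for a shared denoiser: Gaussian-noised trajectories and categorically noised event tokens are fused into a single representation, from which the model predicts both the continuous noise and per-frame event logits. This is a minimal PyTorch illustration under assumed conventions (class `JointDenoiser`, a D3PM-style mask token for events, 22 agents in 2D), not the authors' implementation.

```python
# Minimal sketch of a joint denoiser for continuous trajectories plus
# synchronous discrete events. All names and dimensions are illustrative
# assumptions, not the JointDiff API.
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Shared backbone that denoises agent trajectories (Gaussian diffusion)
    and predicts logits for the synchronous discrete event at each frame
    (categorical diffusion with an extra mask/absorbing token)."""
    def __init__(self, num_agents=22, coord_dim=2, num_event_types=8, d_model=128):
        super().__init__()
        self.traj_in = nn.Linear(num_agents * coord_dim, d_model)
        self.event_in = nn.Embedding(num_event_types + 1, d_model)  # +1 = mask token
        self.time_in = nn.Embedding(1000, d_model)                  # diffusion-step embedding
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, num_layers=2)
        self.eps_head = nn.Linear(d_model, num_agents * coord_dim)  # Gaussian noise prediction
        self.event_head = nn.Linear(d_model, num_event_types)       # event logits per frame

    def forward(self, noisy_traj, noisy_events, t):
        # noisy_traj: (B, T, num_agents*coord_dim); noisy_events: (B, T) int64; t: (B,)
        h = self.traj_in(noisy_traj) + self.event_in(noisy_events)
        h = h + self.time_in(t)[:, None, :]          # broadcast step embedding over frames
        h = self.backbone(h)
        return self.eps_head(h), self.event_head(h)  # both heads share one representation

# Usage: one forward pass on random noised inputs.
model = JointDenoiser()
B, T = 4, 50
eps_pred, event_logits = model(
    torch.randn(B, T, 22 * 2),           # noised trajectories
    torch.randint(0, 9, (B, T)),         # noised/masked event tokens (0..8, 8 = mask)
    torch.randint(0, 1000, (B,)),        # diffusion timestep
)
print(eps_pred.shape, event_logits.shape)  # torch.Size([4, 50, 44]) torch.Size([4, 50, 8])
```

The key design choice this illustrates is that the two modalities are denoised from a single fused representation at every diffusion step, so event predictions and trajectory updates can inform each other rather than being generated by separate models.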