Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g., basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they struggle to strike an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which exploit a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which confine the customization of the codebook to a lower-dimensional subspace. The resulting discrete space serves as the basis of the subsequent step, which concerns the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
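The instance-level, low-rank codebook adaptation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the projection matrices `W_a` and `W_b`, the context summary, and all dimensions are hypothetical placeholders (in practice the factors would be produced by learned networks), and the sketch uses plain NumPy rather than a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, r, c = 8, 4, 2, 6  # codes, code dim, low rank, context dim

# Shared base codebook (K x d), as in a standard VQ-VAE.
base_codebook = rng.normal(size=(K, d))

# Hypothetical linear maps from a context vector to the low-rank
# factors A (K x r) and B (r x d); learned in the real model,
# fixed random matrices here purely for illustration.
W_a = rng.normal(size=(c, K * r)) * 0.1
W_b = rng.normal(size=(c, r * d)) * 0.1

def instance_codebook(context):
    """Adapt the shared codebook with a rank-r update A @ B
    computed from the instance's context (e.g. past motion)."""
    A = (context @ W_a).reshape(K, r)
    B = (context @ W_b).reshape(r, d)
    return base_codebook + A @ B  # customization confined to rank r

def quantize(z, codebook):
    """Nearest-neighbour assignment of a latent z to a codebook row."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

context = rng.normal(size=c)  # summary of the observed trajectory
z = rng.normal(size=d)        # encoder output for one agent
idx, z_q = quantize(z, instance_codebook(context))
```

Because the update `A @ B` has rank at most `r`, the per-instance deviation from the shared codebook is restricted to a low-dimensional subspace, which matches the abstract's point that customization is kept cheap and constrained while still letting each example reshape the discrete latent space.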