Another Vertical View: A Hierarchical Network for Heterogeneous Trajectory Prediction via Spectrums

With the rapid development of AI-related techniques, applications of trajectory prediction are no longer limited to simple scenes and trajectories. A growing number of heterogeneous trajectories with different representation forms, such as 2D or 3D coordinates, 2D or 3D bounding boxes, and even high-dimensional human skeletons, need to be analyzed and forecasted. In these heterogeneous trajectories, the interactions between different elements within a single trajectory frame, which we call ``Dimension-Wise Interactions'', are more complex and more challenging to model. However, most previous approaches focus on one specific form of trajectory, so they cannot be applied to heterogeneous trajectories, let alone model dimension-wise interactions. Moreover, previous methods mostly treat trajectory prediction as an ordinary time-sequence generation task, which makes it difficult for them to directly analyze agents' behaviors and social interactions at different temporal scales. In this paper, we bring a new ``view'' to trajectory prediction: we model and forecast trajectories hierarchically in the spectral domain, according to their different frequency portions, so that the model learns to forecast trajectories from their frequency responses. In addition, we expand the current trajectory prediction task by introducing the dimension $M$ from ``another view'', thus extending its application scenarios vertically to heterogeneous trajectories. Finally, we adopt a bilinear structure to fuse the two factors, the frequency response and the dimension-wise interaction, and forecast heterogeneous trajectories via spectrums hierarchically in a generic way. Experiments show that the proposed model outperforms most state-of-the-art methods on ETH-UCY, the Stanford Drone Dataset, and nuScenes with heterogeneous trajectories, including 2D coordinates and 2D and 3D bounding boxes.
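
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a trajectory is stored as $T$ frames of $M$ values each (for example, $M=4$ for a 2D bounding box), uses a plain discrete Fourier transform as the spectral transform, and uses a simple Gram-style bilinear product of spectrum amplitudes as a stand-in for the dimension-wise interaction. The function names (`trajectory_spectrum`, `low_pass`, `dimension_interaction`) and these modeling choices are illustrative assumptions; the abstract does not specify the actual transform or the exact form of the bilinear structure.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of one sequence of length n."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples (imaginary part ~0 for real input)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def trajectory_spectrum(traj):
    """traj: T frames, each a list of M values. Returns M spectra of length T.

    Assumption: each of the M dimensions is transformed independently over time.
    """
    m = len(traj[0])
    return [dft([frame[d] for frame in traj]) for d in range(m)]


def low_pass(spectrum, keep):
    """Keep the `keep` lowest frequency bins (plus their conjugate mirrors).

    The result is the coarse, low-frequency portion of the sequence; the
    zeroed bins correspond to the finer, high-frequency portion.
    """
    n = len(spectrum)
    return [c if (k < keep or k > n - keep) else 0j for k, c in enumerate(spectrum)]


def dimension_interaction(spectra):
    """Hypothetical bilinear (Gram) matrix over the M dimensions.

    Entry (i, j) sums |S_i[k]| * |S_j[k]| over all frequency bins k, so it
    couples every pair of dimensions through their frequency responses.
    """
    amps = [[abs(c) for c in s] for s in spectra]
    return [[sum(a * b for a, b in zip(ai, aj)) for aj in amps] for ai in amps]


# Toy 2D bounding-box trajectory: T = 8 frames, M = 4 values (x1, y1, x2, y2).
boxes = [[float(t), 0.5 * t, t + 2.0, 0.5 * t + 1.0] for t in range(8)]
spectra = trajectory_spectrum(boxes)
coarse_x1 = [v.real for v in idft(low_pass(spectra[0], 2))]  # low-frequency trend of x1
interaction = dimension_interaction(spectra)                 # M x M matrix
```

Here `coarse_x1` is the low-frequency portion of one dimension, and `interaction` is an $M \times M$ matrix relating the dimensions to each other. A hierarchical model in the spirit of the abstract would presumably predict the coarse portion first and then refine it with higher frequencies, but that ordering is likewise an assumption of this sketch.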
