Language models have demonstrated impressive ability in context understanding and generation. Inspired by the recent success of language foundation models, in this paper, we propose LMTraj (Language-based Multimodal Trajectory predictor), which recasts the trajectory prediction task as a question-answering problem. Departing from traditional numerical regression models, which treat trajectory coordinate sequences as continuous signals, we consider them as discrete signals, like text prompts. Specifically, we first transform the trajectory coordinate input space into the natural language space. Here, the entire time-series trajectories of pedestrians are converted into a text prompt, and scene images are described as text through image captioning. The transformed numerical and image data are then wrapped into a question-answering template for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task question answering. We then train a numerical tokenizer on the prompt data. We encourage the tokenizer to separate the integer and decimal parts well, and leverage it to capture correlations between consecutive numbers in the language model. Lastly, we train the language model using the numerical tokenizer and all of the question-answer prompts. Here, we propose a beam-search-based most-likely prediction and a temperature-based multimodal prediction to implement both deterministic and stochastic inference. Applying our LMTraj, we show that a language-based model can be a powerful pedestrian trajectory predictor, outperforming existing numerical-based predictors. Code is publicly available at https://github.com/inhwanbae/LMTrajectory.
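To make the coordinate-to-language conversion concrete, the following is a minimal sketch of how a pedestrian trajectory might be serialized into a question-answering text prompt. The exact template, rounding precision, and question wording here are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch: turn a sequence of (x, y) coordinates into a
# question-style text prompt, as an illustration of treating trajectories
# as discrete text rather than continuous numerical signals.

def trajectory_to_prompt(coords, n_future=12):
    """Serialize observed (x, y) coordinates into a QA-style prompt.

    Coordinates are rounded to two decimals so the integer and decimal
    parts appear as explicit characters that a numerical tokenizer could
    learn to separate. The template below is an assumption for
    illustration only.
    """
    obs = " ".join(f"({x:.2f}, {y:.2f})" for x, y in coords)
    return (
        f"The pedestrian's observed trajectory is: {obs}. "
        f"What are the next {n_future} coordinates?"
    )

prompt = trajectory_to_prompt([(1.0, 2.5), (1.2, 2.7), (1.4, 2.9)])
print(prompt)
# The model's text answer would then be parsed back into numerical
# coordinates for evaluation against ground truth.
```

In this framing, deterministic inference corresponds to decoding the single most likely answer (e.g., via beam search), while stochastic multimodal inference corresponds to sampling several answers at a nonzero temperature.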