We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multimodal fusion: naively concatenating racket pose features degrades performance, whereas a CrossAttention mechanism is essential to unlock their value, yielding trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page: https://github.com/OrcustD/RacketVision
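To make the contrast between concatenation and cross-attention fusion concrete, the following is a minimal sketch in plain Python, not the authors' model. It assumes a generic single-head scaled dot-product cross-attention in which ball-trajectory frames act as queries and racket-pose frames act as keys and values. All names, the feature sizes (2 ball coordinates, 5 racket keypoints), the random weights, and the residual connection are hypothetical choices made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    """Stand-in for learned projection weights (hypothetical, untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def concat_fusion(ball, racket):
    """Naive fusion: append each frame's racket features to its ball features."""
    return [b + r for b, r in zip(ball, racket)]


def cross_attention_fusion(ball, racket, w_q, w_k, w_v):
    """Ball frames query racket frames; returns fused features and weights."""
    q = matmul(ball, w_q)      # (T_ball, d)
    k = matmul(racket, w_k)    # (T_racket, d)
    v = matmul(racket, w_v)    # (T_racket, d)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # (T_ball, T_racket)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)     # (T_ball, d)
    # Residual connection on the projected ball features (an assumption).
    fused = [[qi + ai for qi, ai in zip(q_row, a_row)]
             for q_row, a_row in zip(q, attended)]
    return fused, weights


rng = random.Random(0)
T, D_BALL, D_RACKET, D_MODEL = 6, 2, 10, 4  # hypothetical sizes

ball = rand_matrix(rng, T, D_BALL)      # (x, y) ball position per frame
racket = rand_matrix(rng, T, D_RACKET)  # 5 keypoints x (x, y) per frame

concatenated = concat_fusion(ball, racket)
fused, weights = cross_attention_fusion(
    ball,
    racket,
    rand_matrix(rng, D_BALL, D_MODEL),
    rand_matrix(rng, D_RACKET, D_MODEL),
    rand_matrix(rng, D_RACKET, D_MODEL),
)
```

In this sketch, concatenation only pairs each ball frame with the racket features of the same frame, while cross-attention lets every ball frame weight racket poses from all frames. Whether this mirrors the mechanism evaluated in the benchmark is not stated in the abstract.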