In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatiotemporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet dataset, where it achieves significant improvements over state-of-the-art radar-only baselines.
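The core idea behind PTF, flattening each pyramid level into tokens tagged with a per-scale embedding and concatenating them into one sequence for the decoder, can be sketched as follows. This is an illustrative NumPy version under our own assumptions (function name, shapes, and the additive scale embedding are hypothetical; the actual module may also include projections or normalization):

```python
import numpy as np

def pyramid_token_fusion(pyramid, scale_embed):
    """Illustrative sketch of a PTF-style module: flatten each
    pyramid level (C, H, W) into H*W tokens of dimension C, add a
    learnable per-scale embedding so tokens remain scale-aware,
    and concatenate all levels into one unified sequence."""
    tokens = []
    for level, feat in enumerate(pyramid):
        c, h, w = feat.shape
        t = feat.reshape(c, h * w).T   # (H*W, C): one token per cell
        t = t + scale_embed[level]     # scale-aware tag, broadcast over tokens
        tokens.append(t)
    return np.concatenate(tokens, axis=0)  # (sum of H*W over levels, C)

# Toy three-level pyramid with channel dimension 8.
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((8, s, s)) for s in (16, 8, 4)]
scale_embed = rng.standard_normal((3, 8))  # one embedding per level

seq = pyramid_token_fusion(pyramid, scale_embed)
print(seq.shape)  # (16*16 + 8*8 + 4*4, 8) = (336, 8)
```

The resulting sequence can then be consumed directly by cross-attention in a Transformer decoder, with the learnable object queries attending over all scales at once.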