Dense 3D convolutions deliver high accuracy for perception but are too computationally expensive for real-time robotic systems. Existing tri-plane methods rely on 2D image features with interpolation, point-wise queries, and implicit MLPs, which makes them computationally heavy and unsuitable for embedded 3D inference. As an alternative, we propose TriLift, a novel interpolation-free tri-plane lifting and volumetric fusion framework that projects 3D voxels directly into plane features and reconstructs a feature volume through broadcast and summation. This design shifts nonlinearity to 2D convolutions, reducing complexity while remaining fully parallelizable. To mitigate the spatial information loss inherent in projection, we incorporate a lightweight adaptive positional encoding module that dynamically recovers fine geometric details with negligible overhead. To capture global context, we add a low-resolution volumetric branch and fuse it with the lifted features through a lightweight integration layer, yielding a design that is both efficient and end-to-end GPU-accelerated. We evaluate TriLift on classification, completion, segmentation, and detection, and map the trade-off between efficiency and accuracy across tasks. Classification and completion retain or improve accuracy, while segmentation and detection trade a slight decrease in accuracy for a significant reduction in computational demand. On-device benchmarks on an NVIDIA Jetson Orin Nano confirm robust real-time throughput, demonstrating the suitability of the approach for embedded robotic perception.
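The projection and broadcast-and-sum steps named above can be illustrated with a minimal NumPy sketch. It is not the TriLift implementation: the function names `project_to_planes` and `lift_to_volume`, the `(C, D, H, W)` tensor layout, and the use of mean pooling as the projection operator are all assumptions made for illustration, and the 2D convolutions, positional encoding, and low-resolution volumetric branch are omitted.

```python
import numpy as np


def project_to_planes(vol):
    """Collapse a (C, D, H, W) feature volume onto three axis-aligned planes.

    Mean pooling along each spatial axis is an assumption of this sketch;
    the abstract does not state which reduction is used.
    """
    plane_hw = vol.mean(axis=1)  # reduce depth  -> (C, H, W)
    plane_dw = vol.mean(axis=2)  # reduce height -> (C, D, W)
    plane_dh = vol.mean(axis=3)  # reduce width  -> (C, D, H)
    return plane_hw, plane_dw, plane_dh


def lift_to_volume(plane_hw, plane_dw, plane_dh):
    """Rebuild a (C, D, H, W) volume by broadcasting each plane along its
    missing axis and summing. No interpolation or per-point query is needed."""
    return (
        plane_hw[:, None, :, :]    # broadcast over depth
        + plane_dw[:, :, None, :]  # broadcast over height
        + plane_dh[:, :, :, None]  # broadcast over width
    )


# Hypothetical input: 4 channels on an 8 x 16 x 16 voxel grid.
rng = np.random.default_rng(0)
vol = rng.standard_normal((4, 8, 16, 16))

plane_hw, plane_dw, plane_dh = project_to_planes(vol)
# In a full model, 2D convolutions would process each plane at this point.
lifted = lift_to_volume(plane_hw, plane_dw, plane_dh)
```

Under these assumptions, every lifted voxel is the sum of the three plane features it projects onto, so the per-voxel cost is two additions and any learned nonlinearity can live entirely in the 2D plane processing.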