Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask R-CNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, the grasp planner ranks these candidates by utility and filters them with fast collision checking. Deployed on a UR5e with a parallel-jaw gripper and an Intel RealSense D435i, Pickalo achieves a mean throughput of up to 600 picks per hour with 96-99% grasp success and robust performance over 30-minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long-term stability and throughput in realistic industrial conditions. Videos are available at https://mesh-iit.github.io/project-jl2-camozzi/
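To make the idea of a symmetry-aware pose buffer concrete, the following is a minimal sketch and not the Pickalo implementation. It assumes a strongly simplified setting: each observation is a position plus a single yaw angle, and the object has an n-fold discrete symmetry about that axis. The class name `YawPoseBuffer`, its parameters, and the sliding-window averaging rule are all hypothetical choices made for illustration; the abstract does not specify how the actual buffer fuses full 6D poses.

```python
import math
from collections import deque


class YawPoseBuffer:
    """Sliding-window fusion of (x, y, z, yaw) observations for one object.

    Hypothetical simplification: rotation is reduced to one yaw angle,
    and the object is assumed to look identical after a rotation of
    2*pi / symmetry_order about that axis.
    """

    def __init__(self, symmetry_order=1, window=10):
        self.n = symmetry_order
        self.obs = deque(maxlen=window)  # oldest observations drop out

    def add(self, x, y, z, yaw):
        self.obs.append((x, y, z, yaw))

    def fused(self):
        if not self.obs:
            raise ValueError("no observations in buffer")
        k = len(self.obs)
        # Position: plain arithmetic mean over the window.
        x = sum(o[0] for o in self.obs) / k
        y = sum(o[1] for o in self.obs) / k
        z = sum(o[2] for o in self.obs) / k
        # Yaw: scaling by n maps symmetry-equivalent angles onto the same
        # point of the unit circle; then take the circular mean and undo
        # the scaling. The result lies in (-pi/n, pi/n].
        s = sum(math.sin(self.n * o[3]) for o in self.obs)
        c = sum(math.cos(self.n * o[3]) for o in self.obs)
        yaw = math.atan2(s, c) / self.n
        return x, y, z, yaw


# Three noisy views of a part with 2-fold symmetry: the second and third
# yaw estimates are flipped by roughly 180 degrees but describe the same pose.
buf = YawPoseBuffer(symmetry_order=2, window=5)
buf.add(0.10, 0.20, 0.05, 0.05)
buf.add(0.12, 0.20, 0.05, math.pi + 0.03)
buf.add(0.11, 0.20, 0.05, math.pi - 0.02)
x, y, z, yaw = buf.fused()
```

Without the symmetry scaling, a naive average of these three yaw values would land near 120 degrees, far from any of the equivalent poses; with it, the flipped estimates reinforce rather than cancel each other. A full 6D version would need rotation averaging over the object's symmetry group, which this sketch does not attempt.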