Despite their impressive performance, vision-based pose estimators generally perform poorly under adverse visual conditions and often do not satisfy customers' privacy demands. As a result, researchers have begun to study tactile sensing systems as an alternative. However, these systems suffer from noisy and ambiguous recordings. To tackle this problem, we propose a novel solution for pose estimation from ambiguous pressure data. Our method comprises a spatio-temporal vision transformer with an encoder-decoder architecture. Detailed experiments on two popular public datasets reveal that our model outperforms existing solutions in the area. Moreover, we observe that increasing the number of temporal crops in the early stages of the network improves performance, and that pre-training the network in a self-supervised setting with a masked auto-encoder approach improves the results further.
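To make the masked auto-encoder pre-training idea more concrete, the following is a minimal sketch of how a pressure sequence could be split into spatio-temporal patches and randomly masked before being passed to an encoder. It is not the implementation used in this work: the function names `patchify` and `random_mask`, the input shape, the patch sizes, and the 75% mask ratio are all assumptions chosen for illustration. Reading a smaller `t_size` as "more temporal crops" is likewise an assumption about the tokenization.

```python
import numpy as np

def patchify(frames, t_size, p_size):
    """Split a (T, H, W) pressure sequence into flattened spatio-temporal patches.

    Hypothetical helper: each token covers t_size frames and a p_size x p_size
    spatial window. A smaller t_size yields more tokens along the time axis.
    """
    T, H, W = frames.shape
    assert T % t_size == 0 and H % p_size == 0 and W % p_size == 0
    nt, nh, nw = T // t_size, H // p_size, W // p_size
    x = frames.reshape(nt, t_size, nh, p_size, nw, p_size)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (nt, nh, nw, t_size, p_size, p_size)
    return x.reshape(nt * nh * nw, t_size * p_size * p_size)

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens, as in masked auto-encoder pre-training.

    Returns the visible tokens, their indices, and a boolean mask that is
    True for the tokens hidden from the encoder.
    """
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

# Illustrative input: 8 frames from an assumed 64 x 32 pressure mat.
rng = np.random.default_rng(0)
frames = rng.random((8, 64, 32))
tokens = patchify(frames, t_size=2, p_size=8)
visible, keep_idx, mask = random_mask(tokens, mask_ratio=0.75, rng=rng)
print(tokens.shape, visible.shape, int(mask.sum()))
```

In a setup of this kind, the encoder would see only `visible`, and a decoder would be trained to reconstruct the tokens where `mask` is True. The pose-regression head would then be fine-tuned on top of the pre-trained encoder.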