Robust Single-view Cone-beam X-ray Pose Estimation with Neural Tuned Tomography (NeTT) and Masked Neural Radiance Fields (mNeRF)

Chaochao Zhou, Syed Hasib Akhter Faruqui, Abhinav Patel, Ramez N. Abdalla, Michael C. Hurley, Ali Shaibani, Matthew B. Potts, Babak S. Jahromi, Leon Cho, Sameer A. Ansari, Donald R. Cantrell

Many tasks performed in image-guided, minimally invasive medical procedures can be cast as pose estimation problems, where an X-ray projection is utilized to reach a target in 3D space. Expanding on recent advances in the differentiable rendering of optically reflective materials, we introduce new methods for pose estimation of radiolucent objects using X-ray projections, and we demonstrate the critical role of optimal view synthesis in performing this task. We first develop an algorithm (DiffDRR) that efficiently computes Digitally Reconstructed Radiographs (DRRs) and leverages automatic differentiation within TensorFlow. Pose estimation is performed by iterative gradient descent using a loss function that quantifies the similarity between the DRR synthesized from a randomly initialized pose and the true fluoroscopic image at the target pose. We propose two novel methods for high-fidelity view synthesis, Neural Tuned Tomography (NeTT) and masked Neural Radiance Fields (mNeRF). Both methods rely on classic Cone-Beam Computerized Tomography (CBCT); NeTT directly optimizes the CBCT densities, while the non-zero values of mNeRF are constrained by a 3D mask of the anatomic region segmented from CBCT. We demonstrate that both NeTT and mNeRF distinctly improve pose estimation within our framework. Defining a successful pose estimate as one with a 3D angle error of less than 3°, we find that NeTT and mNeRF achieve similar results, both with overall success rates above 93%. However, the computational cost of NeTT is significantly lower than that of mNeRF in both training and pose estimation. Furthermore, we show that a NeTT trained for a single subject can generalize to synthesize high-fidelity DRRs and enable robust pose estimation for all other subjects. Therefore, we suggest that NeTT is an attractive option for robust pose estimation using fluoroscopic projections.
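The success criterion above (a 3D angle error below 3°) can be illustrated with a short sketch. It assumes that the "3D angle error" is the geodesic angle of the relative rotation between the estimated and true poses, which is one common convention; the abstract does not state the exact definition used. The function names `rotation_z`, `angle_error_deg`, and `success_rate` are hypothetical and are not taken from the paper's code.

```python
import math

def rotation_z(deg):
    """3x3 rotation about the z-axis by `deg` degrees (nested lists, row-major)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def angle_error_deg(r_est, r_true):
    """Geodesic angle in degrees of the relative rotation r_est^T r_true.

    Assumption: this is how the 3D angle error is measured; the abstract
    does not give the definition.
    """
    # trace(A^T B) equals the sum of element-wise products of A and B.
    trace = sum(r_est[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def success_rate(pairs, threshold_deg=3.0):
    """Fraction of (estimate, truth) pairs with angle error below the threshold."""
    hits = sum(1 for est, true in pairs if angle_error_deg(est, true) < threshold_deg)
    return hits / len(pairs)
```

For example, `success_rate([(rotation_z(0), rotation_z(2)), (rotation_z(0), rotation_z(5))])` counts the first pair as a success and the second as a failure. The rotation values here are purely illustrative and are not results from the study.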
