Deep-Learning-based Fast and Accurate 3D CT Deformable Image Registration in Lung Cancer

Yuzhen Ding, Hongying Feng, Yunze Yang, Jason Holmes, Zhengliang Liu, David Liu, William W. Wong, Nathan Y. Yu, Terence T. Sio, Steven E. Schild, Baoxin Li, Wei Liu

From arXiv; 9 figures

Purpose: In some proton therapy facilities, patient alignment relies on two 2D orthogonal kV images taken at fixed, oblique angles, because no 3D on-the-bed imaging is available. Tumor visibility in kV images is limited because the patient's 3D anatomy is projected onto a 2D plane, especially when the tumor lies behind high-density structures such as bone. This can lead to large patient setup errors. One solution is to reconstruct the 3D CT image from the kV images acquired at the treatment isocenter in the treatment position. Methods: An asymmetric autoencoder-like network built with vision-transformer blocks was developed. The data were collected from one head and neck patient: two orthogonal kV images (1024x1024 pixels), one padded 3D CT (512x512x512 voxels) acquired with the in-room CT-on-rails before the kV images were taken, and two digitally reconstructed radiograph (DRR) images (512x512 pixels) generated from the CT. The kV images were resampled every 8th pixel, and the DRR and CT images every 4th pixel or voxel, yielding a dataset of 262,144 samples in which every image measures 128 along each dimension. During training, both kV and DRR images were used, and the encoder was encouraged to learn a joint feature map from the two. During testing, only independent kV images were used. The full-size synthetic CT (sCT) was obtained by concatenating the sCTs generated by the model according to their spatial information. The image quality of the sCT was evaluated using the mean absolute error (MAE) and the per-voxel-absolute-CT-number-difference volume histogram (CDVH). Results: The model achieved a runtime of 2.1 s and an MAE of <40 HU. The CDVH showed that <5% of the voxels had a per-voxel absolute CT number difference larger than 185 HU. Conclusion: A patient-specific vision-transformer-based network was developed and shown to reconstruct 3D CT images from kV images accurately and efficiently.
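The resampling, reassembly, and evaluation steps in the Methods can be illustrated with a minimal NumPy sketch. It is not the authors' code: it assumes that "resampling every 4 voxels" means taking interleaved strided sub-volumes (one per starting offset), that the full-size sCT is rebuilt by writing each sub-volume back at its offset, and that the CDVH reports the fraction of voxels whose absolute CT number difference exceeds a threshold. The function names `split_strided`, `merge_strided`, `mae`, and `cdvh` are hypothetical. Under this reading, a stride of 4 on a 512x512x512 CT gives 4^3 = 64 sub-volumes of 128^3, and a stride of 8 on a 1024x1024 kV image gives 8^2 = 64 sub-images of 128x128; pairing 64 x 64 x 64 of these would match the reported 262,144 samples, but the abstract does not state the pairing, so that is only a guess.

```python
import numpy as np


def split_strided(arr, stride):
    """Split an array into stride**ndim interleaved sub-arrays.

    The sub-array for offset (i, j, ...) keeps every `stride`-th sample
    along each axis, starting at that offset.
    """
    shape = (stride,) * arr.ndim
    offsets = list(np.ndindex(*shape))
    subs = [arr[tuple(slice(o, None, stride) for o in off)] for off in offsets]
    return offsets, subs


def merge_strided(offsets, subs, stride):
    """Inverse of split_strided (assumes every axis length is divisible by stride)."""
    full_shape = tuple(n * stride for n in subs[0].shape)
    full = np.empty(full_shape, dtype=subs[0].dtype)
    for off, sub in zip(offsets, subs):
        full[tuple(slice(o, None, stride) for o in off)] = sub
    return full


def mae(sct, ct):
    """Mean absolute error between synthetic and reference CT (in HU)."""
    return float(np.mean(np.abs(sct - ct)))


def cdvh(sct, ct, thresholds):
    """Assumed CDVH: fraction of voxels with |sCT - CT| above each threshold."""
    diff = np.abs(sct - ct).ravel()
    return np.array([(diff > t).mean() for t in thresholds])


# Toy demo on a 16^3 volume (a stand-in for the 512^3 CT in the abstract).
rng = np.random.default_rng(0)
ct_demo = rng.normal(0.0, 300.0, size=(16, 16, 16))
demo_offsets, demo_subs = split_strided(ct_demo, 4)      # 64 sub-volumes of 4^3
# Pretend each sub-volume was predicted with some noise, then reassemble.
pred_subs = [s + rng.normal(0.0, 30.0, size=s.shape) for s in demo_subs]
sct_demo = merge_strided(demo_offsets, pred_subs, 4)
print("sub-volumes:", len(demo_subs), "shape:", demo_subs[0].shape)
print("MAE (HU):", mae(sct_demo, ct_demo))
print("fraction of voxels above 185 HU:", cdvh(sct_demo, ct_demo, [185.0])[0])
```

Interleaved splitting keeps every sub-volume at the full field of view, which is one plausible reason to prefer it over cropping patches; whether the authors made the same choice is not stated in the abstract.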
