When capturing images of occluded content, one often faces the problem that every individual frame contains unwanted artifacts, whereas the collection of frames contains all relevant information once it is properly aligned and aggregated. In this paper, we attempt to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs them. We create a dataset of images with distortions such as lighting variations, specularities, shadows, and occlusions. We additionally apply perspective distortions and use the corresponding ground-truth homographies as labels. We use this dataset to train Swin transformer models to analyze sequential image data. The attention maps enable the model to detect relevant image content and to differentiate it from outliers and artifacts. We further explore neural feature maps as alternatives to classical keypoint detectors. The feature maps of trained convolutional layers provide dense image descriptors that can be used to find point correspondences between images. We use these correspondences to compute coarse image alignments and explore the limitations of this approach.
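To make the notion of a perspective distortion with a ground-truth homography label concrete, the following is a minimal sketch of one common way to produce such a pair. It assumes the distortion is sampled by randomly shifting the four image corners and solving for the homography that maps the original corners to the shifted ones; the abstract does not state how the homographies are actually sampled. All names (`random_homography`, `max_shift`, and the helper functions) are hypothetical and chosen for illustration only.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the caller's inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_points(src, dst):
    """Return the 3x3 homography H (with H[2][2] = 1) mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence yields two linear equations in the 8 unknowns.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def random_homography(width, height, max_shift=0.2, rng=None):
    """Sample a homography by shifting each image corner by up to max_shift of the image size.

    Assumption: corner perturbation is an illustrative sampling scheme,
    not necessarily the one used to build the dataset described above.
    """
    rng = rng or random.Random()
    corners = [(0.0, 0.0), (float(width), 0.0),
               (float(width), float(height)), (0.0, float(height))]
    dx, dy = max_shift * width, max_shift * height
    warped = [(x + rng.uniform(-dx, dx), y + rng.uniform(-dy, dy))
              for x, y in corners]
    return homography_from_points(corners, warped), corners, warped
```

In such a setup, the returned matrix would serve as the label, and the image itself would be warped with the same matrix by an image library of choice. Keeping `max_shift` small keeps the warped quadrilateral convex, so the linear system stays well conditioned.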