Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide range of applications. The audio and visual modalities provide complementary information for localization and tracking. Given audio and visual information, Bayesian filters can address the problems of data association, audio-visual fusion, and track management. In this paper, we present a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey of the topic in the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. We also summarize existing trackers and their performance on the AV16.3 dataset. Deep learning techniques have thrived in recent years, which has also advanced audio-visual speaker tracking, and we discuss their influence on measurement extraction and state estimation. Finally, we discuss the connections between audio-visual speaker tracking and related areas such as speech separation and distributed speaker tracking.