Low-light image enhancement is a crucial visual task. Many unsupervised methods overlook the degradation of visible information in low-light scenes, which adversely affects the fusion of complementary information and hinders the generation of satisfactory results. To address this, we introduce "Enlighten-Your-Voice", a multimodal enhancement framework that lets users guide the enhancement through voice and textual commands. Beyond its technical contribution, this approach changes how users engage with the enhancement process. Our model is equipped with a Dual Collaborative Attention Module (DCAM) that attends to distinct content and color discrepancies, enabling fine-grained enhancement. In addition, we introduce a plug-and-play Semantic Feature Fusion (SFM) module that combines semantic context with low-light enhancement operations, improving the algorithm's effectiveness. "Enlighten-Your-Voice" also generalizes well in unsupervised zero-shot scenarios. The source code is available at https://github.com/zhangbaijin/Enlighten-Your-Voice.