Recently, transformers have become highly popular in computer vision and vision-language tasks. This rise can be attributed primarily to attention mechanisms and to the ability of transformers to adapt to a variety of tasks and domains. Their versatility and state-of-the-art performance have made them indispensable tools for a wide range of applications. However, in the rapidly changing landscape of machine learning, ensuring the trustworthiness of transformers is of utmost importance. This paper examines vision-language transformers through three fundamental principles of responsible AI: bias, robustness, and interpretability. Its primary objective is to explore the complexities associated with the practical use of transformers, with the goal of advancing our understanding of how to enhance their reliability and accountability.