Transformer-based architectures have shown remarkable performance in vision and language tasks but pose unique challenges for safety-critical applications. This paper presents a conceptual framework for integrating Transformers into automotive systems from a safety perspective. We outline how multimodal Foundation Models can leverage sensor diversity and redundancy to improve fault tolerance and robustness. Our proposed architecture combines multiple independent modality-specific encoders whose representations are fused into a shared latent space, supporting fail-operational behavior when one modality degrades. We illustrate how different input modalities can be fused to maintain consistent scene understanding. By structurally embedding redundancy and diversity at the representational level, this approach bridges the gap between modern deep learning and established functional safety practices, paving the way for certifiable AI systems in autonomous driving.
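The fail-operational fusion idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the encoders are stand-in random projections into a shared latent space, and the fusion rule (averaging over only the modalities that delivered a valid latent) is one simple way to let the representation degrade gracefully when a sensor drops out. All names and dimensions here are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality-specific encoders (camera, lidar, radar),
# modeled here as fixed random projections into a shared 8-dim latent space.
LATENT_DIM = 8
encoders = {
    "camera": rng.normal(size=(16, LATENT_DIM)),
    "lidar":  rng.normal(size=(12, LATENT_DIM)),
    "radar":  rng.normal(size=(4,  LATENT_DIM)),
}

def encode(modality, x):
    """Project a raw sensor reading into the shared latent space."""
    return x @ encoders[modality]

def fuse(latents):
    """Availability-weighted fusion: average only over modalities that
    produced a valid latent, so the fused representation degrades
    gracefully instead of failing outright when a sensor drops out."""
    valid = [z for z in latents.values() if z is not None]
    if not valid:
        raise RuntimeError("all modalities failed: no fail-operational path")
    return np.mean(valid, axis=0)

# Nominal operation: all three modalities available.
inputs = {
    "camera": rng.normal(size=16),
    "lidar":  rng.normal(size=12),
    "radar":  rng.normal(size=4),
}
latents = {m: encode(m, x) for m, x in inputs.items()}
fused_full = fuse(latents)

# Degraded operation: camera fails, fusion still yields a usable latent.
latents["camera"] = None
fused_degraded = fuse(latents)
print(fused_full.shape, fused_degraded.shape)  # (8,) (8,)
```

The key design choice is that downstream consumers of the shared latent space see the same interface in nominal and degraded operation, which is what makes redundancy at the representational level useful for fail-operational behavior.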