Foundation models (FMs) have achieved significant success across various tasks, spurring research on benchmarks for their reasoning abilities. However, FM performance in exceptional scenarios, which we define as out-of-distribution (OOD) reasoning tasks, remains understudied. This paper is the first to address these cases, introducing a novel dataset for evaluating FMs across multiple modalities, including graphic novels, calligraphy, news articles, and lyrics. The dataset comprises tasks for instance classification, character recognition, token prediction, and text generation. We also propose prompt engineering techniques such as Chain-of-Thought (CoT) and CoT+Few-Shot to enhance performance. Validating FMs with these methods revealed performance improvements. The code repository is available at: https://github.com/MLAI-Yonsei/ExceptionalBenchmark
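As an illustrative sketch (not the paper's implementation), the two prompting strategies mentioned above can be assembled as follows; the function names, the reasoning trigger, and the exemplar format are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of CoT and CoT+Few-Shot prompt construction for an
# OOD reasoning query; names and formats are illustrative assumptions.

COT_TRIGGER = "Let's think step by step."

def build_cot_prompt(question: str) -> str:
    """Plain CoT: append a reasoning trigger after the question."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def build_cot_fewshot_prompt(question: str,
                             exemplars: list[tuple[str, str]]) -> str:
    """CoT+Few-Shot: prepend worked (question, reasoning+answer) pairs
    before the target question, then append the same trigger."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA: {COT_TRIGGER}"
```

The few-shot variant simply concatenates worked exemplars ahead of the CoT query, so the same trigger closes both prompt styles.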