Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols (MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification) as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness ranging from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as the dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
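The rule-script architecture described above can be sketched minimally as follows. This is an illustrative assumption, not the paper's implementation: all names (`RuleScript`, `StepResult`, `contextual_vs_basic`, `llm_contrast_check`) and the toy lexicon are hypothetical, and the LLM call is stubbed with a fixed response. The sketch only shows the stated design: an ordered chain of steps, each deterministic or a controlled LLM call, where every step appends a human-auditable rationale and the final decision carries the full trace.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepResult:
    label: str      # "metaphor", "literal", or "continue" to defer to the next step
    rationale: str  # human-auditable justification for this step's outcome


@dataclass
class RuleScript:
    """A protocol as an ordered chain of steps; each step emits a rationale."""
    name: str
    steps: List[Callable[[str], StepResult]] = field(default_factory=list)

    def run(self, expression: str) -> Dict[str, object]:
        trace: List[str] = []
        for step in self.steps:
            result = step(expression)
            trace.append(result.rationale)
            if result.label != "continue":
                # A step reached a decision; return it with the full rationale trace.
                return {"label": result.label, "rationale": trace}
        return {"label": "literal", "rationale": trace}


# Deterministic step: look up a (toy, hypothetical) basic-meaning lexicon,
# loosely following the MIP contrast between basic and contextual senses.
BASIC_MEANINGS = {"沉淀": "solids settling out of a liquid"}


def contextual_vs_basic(expression: str) -> StepResult:
    for word, basic in BASIC_MEANINGS.items():
        if word in expression:
            return StepResult(
                "continue",
                f"'{word}' has basic sense '{basic}'; contrast it with the contextual sense",
            )
    return StepResult("literal", "no candidate word with a listed basic sense; treated as literal")


# Controlled LLM call, stubbed here with a canned judgement for illustration.
def llm_contrast_check(expression: str) -> StepResult:
    return StepResult(
        "metaphor",
        "contextual sense (experience accumulating over time) contrasts with the basic physical sense",
    )


mip_script = RuleScript("Protocol A (MIP)", [contextual_vs_basic, llm_contrast_check])
decision = mip_script.run("岁月在他身上沉淀出智慧")
print(decision["label"])  # "metaphor", with the step-by-step trace in decision["rationale"]
```

Because the deterministic steps are plain functions and the LLM calls are isolated behind the same `StepResult` interface, a rerun with cached LLM outputs is fully reproducible and each rationale line can be edited or audited independently, matching the transparency properties the abstract claims.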