Large language models (LLMs) trained on massive code corpora increasingly generate code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and motivates a forensic question: "Who (or which LLM) wrote this piece of code?" We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, which captures higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), which capture lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects the modality embeddings into a hyperbolic Poincaré ball, fuses them with a geodesic-cosine similarity-based cross-modal attention (GCSA) mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
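The project, fuse, and back-project pipeline described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not define GCSA, so the attention score below is simply the negative Poincaré geodesic distance (a stand-in, not the paper's geodesic-cosine similarity), the mixing is done in the tangent space at the origin, and the names `exp_map0`, `log_map0`, `poincare_dist`, `fuse`, the curvature `c`, and the temperature `tau` are hypothetical.

```python
import math

EPS = 1e-9


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def exp_map0(v, c=1.0):
    """Exponential map at the origin: Euclidean vector -> Poincaré ball (curvature -c)."""
    n = norm(v)
    if n < EPS:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


def log_map0(y, c=1.0):
    """Logarithmic map at the origin: Poincaré ball -> Euclidean (tangent) vector."""
    n = norm(y)
    if n < EPS:
        return list(y)
    arg = min(math.sqrt(c) * n, 1.0 - 1e-7)  # clamp to keep atanh finite
    scale = math.atanh(arg) / (math.sqrt(c) * n)
    return [scale * x for x in y]


def poincare_dist(x, y, c=1.0):
    """Geodesic distance between two points of the Poincaré ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * norm(x) ** 2) * (1.0 - c * norm(y) ** 2)
    arg = 1.0 + 2.0 * c * diff2 / max(denom, EPS)
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(style_vec, bpea_vecs, tau=1.0, c=1.0):
    """Attend from one stylometry embedding over several BPEA embeddings.

    Assumed scoring rule: softmax over negative geodesic distance.
    Returns the fused vector back in Euclidean space and the attention weights.
    """
    q = exp_map0(style_vec, c)
    keys = [exp_map0(k, c) for k in bpea_vecs]
    weights = softmax([-poincare_dist(q, k, c) / tau for k in keys])
    # Mix the keys in the tangent space at the origin, then map onto the ball.
    tangents = [log_map0(k, c) for k in keys]
    mixed = [sum(w * t[i] for w, t in zip(weights, tangents))
             for i in range(len(style_vec))]
    fused_ball = exp_map0(mixed, c)
    # Back-projection to Euclidean space for a downstream classifier.
    return log_map0(fused_ball, c), weights


# Toy usage: the first BPEA vector lies closer to the stylometry vector.
style = [0.2, 0.1]
bpea = [[0.25, 0.1], [-0.9, 0.8]]
fused, weights = fuse(style, bpea)
```

In this toy setup the first BPEA embedding receives the larger attention weight because it lies closer to the stylometry embedding along the geodesic. A learned model would add trainable projections and a classification head on top of the back-projected vector.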