Predicting the complexity of source code is essential for software development and algorithm analysis. Recently, Baik et al. (2025) introduced CodeComplex for predicting the time complexity of code. The paper shows that LLMs without fine-tuning struggle with certain complexity classes, suggesting that no single LLM excels at every class; rather, each model shows advantages in certain classes. We propose MEC$^3$O, a multi-expert consensus system that extends multi-agent debate frameworks. MEC$^3$O assigns LLMs to complexity classes based on their performance and provides them with class-specialized instructions, turning them into experts. These experts engage in structured debates, and their predictions are integrated through a weighted consensus mechanism. Our expertise assignments to LLMs effectively mitigate Degeneration-of-Thought, reduce reliance on a separate judge model, and prevent convergence to incorrect majority opinions. Experiments on CodeComplex show that MEC$^3$O outperforms the open-source baselines, achieving at least 10% higher accuracy and macro-F1 scores. On average, it also surpasses GPT-4o-mini in macro-F1 and achieves F1 scores on par with GPT-4o and GPT-o4-mini. These results demonstrate the effectiveness of multi-expert debates and the weighted-consensus strategy in generating the final predictions. Our code and data are available at https://github.com/suhanmen/MECO.
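To make the aggregation step concrete, the following is a minimal, generic sketch of a weighted-consensus vote over per-expert predictions. It is an illustration only, not MEC$^3$O's actual mechanism; the expert names, class labels, and weights are hypothetical.

```python
from collections import defaultdict

def weighted_consensus(predictions, weights):
    """Aggregate per-expert complexity predictions into a single label.

    predictions: dict mapping expert name -> predicted complexity class
    weights:     dict mapping expert name -> expertise weight
    """
    scores = defaultdict(float)
    for expert, label in predictions.items():
        # Each expert's vote counts in proportion to its weight.
        scores[label] += weights.get(expert, 1.0)
    # The class with the highest accumulated weight wins.
    return max(scores, key=scores.get)

# Hypothetical example: two experts favor O(n), one favors O(n^2).
preds = {"expert_a": "O(n)", "expert_b": "O(n^2)", "expert_c": "O(n)"}
w = {"expert_a": 0.9, "expert_b": 0.7, "expert_c": 0.6}
print(weighted_consensus(preds, w))  # "O(n)" wins: 0.9 + 0.6 > 0.7
```

A weighted vote like this lets a strong minority expert override a weak majority, which is one way to avoid converging on an incorrect majority opinion.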