The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advances, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we investigate the key factors underlying the development of robust advising systems: model size, context length, confidence estimation, and structured reasoning processes. Our findings show that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as DeepSeek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when restricted to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to substantially improve the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.
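The high-confidence evaluation reported above can be illustrated with a minimal sketch of selective prediction: the system keeps only predictions whose confidence meets a threshold, then measures acceptance rate on that retained subset. The function name, data layout, and threshold below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of confidence-thresholded evaluation.
# Each prediction carries a model confidence score and the ground-truth
# acceptance outcome; names and numbers are illustrative only.

def acceptance_rate(predictions, threshold):
    """Keep predictions with confidence >= threshold; return
    (acceptance rate among kept predictions, coverage of the full set)."""
    kept = [p for p in predictions if p["confidence"] >= threshold]
    if not kept:
        return 0.0, 0.0
    accepted = sum(1 for p in kept if p["accepted"])
    return accepted / len(kept), len(kept) / len(predictions)

preds = [
    {"confidence": 0.95, "accepted": True},
    {"confidence": 0.90, "accepted": True},
    {"confidence": 0.60, "accepted": False},
    {"confidence": 0.40, "accepted": False},
]
rate, coverage = acceptance_rate(preds, 0.85)
```

Raising the threshold trades coverage (fewer submissions receive a verdict) for precision, which is how a selective system can exceed the acceptance rate achievable on the full test set.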