The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as DeepSeek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design.
We evaluate GUIDE-7B on 1,000 random ICLR 2025 submissions (acceptance rate = 31.9%) using three key metrics: Top-5% Precision, Top-30% Precision, and Accept Recall.
Results show that GUIDE-7B achieves the highest Top-30% Precision (51.3%), outperforming larger general-purpose LLMs such as GPT-4o-mini and DeepSeek-R1.
Model | Top-5% Precision | Top-30% Precision | Accept Recall |
---|---|---|---|
GPT-4o-mini | 70.0 ± 4.6% | 47.7 ± 2.4% | 44.8 ± 2.2% |
QwQ-32B | 66.7 ± 1.2% | 48.6 ± 1.5% | 45.8 ± 1.4% |
DeepSeek-R1 | 69.3 ± 4.6% | 50.2 ± 0.5% | 47.2 ± 0.5% |
GUIDE-7B | 72.0 ± 2.0% | 51.3 ± 0.4% | 48.3 ± 0.3% |
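A minimal sketch of how these ranking metrics could be computed, assuming Top-k% Precision is the acceptance rate among the model's top-ranked k% of submissions and Accept Recall is the fraction of accepted papers recovered in the top 30%; the variable names and toy data below are illustrative, not GUIDE's evaluation code.

```python
import numpy as np

def top_k_percent_precision(scores, accepted, k=0.30):
    """Acceptance rate among the top-k fraction of submissions ranked by model score."""
    n_top = max(1, int(round(k * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]     # highest-scored submissions first
    return accepted[top_idx].mean()

def accept_recall(scores, accepted, k=0.30):
    """Fraction of truly accepted papers that land in the model's top-k% shortlist."""
    n_top = max(1, int(round(k * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]
    return accepted[top_idx].sum() / max(1, accepted.sum())

# Toy data standing in for 1,000 submissions with a ~31.9% acceptance rate.
rng = np.random.default_rng(0)
scores = rng.random(1000)                          # model's predicted quality scores
accepted = (rng.random(1000) < 0.319).astype(int)  # ground-truth accept/reject labels
print(top_k_percent_precision(scores, accepted, k=0.05))
print(top_k_percent_precision(scores, accepted, k=0.30))
print(accept_recall(scores, accepted, k=0.30))
```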
Beyond overall precision and recall, GUIDE-7B quantifies its own uncertainty when evaluating research ideas, allowing filtering by confidence level. When only high-confidence predictions are considered, the system achieves over 90% precision in acceptance prediction. This indicates that GUIDE is not only competitive in aggregate ranking metrics, but can also provide trustworthy shortlists of promising papers by exposing confidence estimates alongside rubric-guided evaluations.
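A minimal sketch of the confidence-filtering analysis, assuming the system emits a binary accept prediction plus a scalar self-reported confidence per submission; the function name and threshold are illustrative.

```python
import numpy as np

def precision_at_confidence(pred_accept, confidence, accepted, threshold=0.9):
    """Acceptance-prediction precision restricted to high-confidence positive predictions."""
    pred_accept, confidence, accepted = map(np.asarray, (pred_accept, confidence, accepted))
    mask = (pred_accept == 1) & (confidence >= threshold)
    if mask.sum() == 0:
        return float("nan")              # nothing survives the confidence filter
    return accepted[mask].mean()
```

Sweeping the threshold exposes the precision/coverage trade-off: higher thresholds shortlist fewer papers but make the accept predictions more reliable.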
We also evaluate binary accept/reject prediction, comparing GUIDE-7B against a human-reviewer reference point (NeurIPS) and AI Scientist reviewer baselines:

Baseline / System | Accuracy | F1 |
---|---|---|
Human (NeurIPS) | 73.4% | 48.4% |
AI Scientist with DeepSeek-R1 | 40.7% | 49.5% |
AI Scientist with QwQ-32B | 42.7% | 43.3% |
AI Scientist with GPT-4.1-nano | 61.2% | 20.8% |
GUIDE-7B | 69.1% | 50.1% |
Takeaway: GUIDE-7B achieves human-comparable performance while outperforming all AI Scientist baselines. Despite differences in dataset and acceptance rate, its accuracy and F1 score approach those of human reviewers, demonstrating the reliability of hypothesis-centric advising over full-text review systems.
We conduct two ablations to identify which components most strongly drive GUIDE’s advising quality. First, Modular Summarization tests how different retrieved contents affect precision under a limited context window. Second, Rubric-Guided Prompting isolates the effect of emphasizing specific review rubrics in the system prompt.
Replacing long full-text retrieval with concise, sectioned summaries lets the system pack more relevant literature into the same context window. Summarized Abstract and Method sections contribute the most, while adding Experiment setups yields inconsistent gains because experimental settings often do not align across papers.
Retrieved Content | GPT-4o-mini | QwQ-32B | DeepSeek-R1 |
---|---|---|---|
Full paper (baseline) | 45.0% | 44.3% | 46.7% |
Abstract only | 47.0% (↑2.0%) | 45.7% (↑1.4%) | 49.0% (↑2.3%) |
+ Contribution | 46.7% (↑1.7%) | 46.0% (↑1.7%) | 48.7% (↑2.0%) |
+ Method | 47.7% (↑2.7%) | 48.7% (↑4.4%) | 49.7% (↑3.0%) |
+ Experiment | 47.7% (↑2.7%) | 48.3% (↑4.0%) | 50.3% (↑3.6%) |
Takeaway: Summarized retrieval improves precision by increasing relevant coverage under context limits; Abstract and Method are key components.
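A minimal sketch of context packing with modular summaries, assuming each retrieved paper is stored as per-section summaries keyed by the section names in the table above; the token accounting and helper names are illustrative.

```python
# Assemble retrieved literature as sectioned summaries under a fixed token budget,
# instead of concatenating full papers.

SECTIONS = ["abstract", "contribution", "method", "experiment"]

def rough_token_count(text: str) -> int:
    # Crude whitespace proxy; a real system would use the serving model's tokenizer.
    return len(text.split())

def build_context(retrieved_papers, budget_tokens=8000, sections=SECTIONS):
    """retrieved_papers: list of dicts mapping section name -> summary string,
    ordered from most to least relevant."""
    chunks, used = [], 0
    for paper in retrieved_papers:
        block = "\n".join(
            f"[{name.upper()}] {paper[name]}" for name in sections if name in paper
        )
        cost = rough_token_count(block)
        if used + cost > budget_tokens:
            break                        # stop before overflowing the context window
        chunks.append(block)
        used += cost
    return "\n\n---\n\n".join(chunks)
```

Restricting `sections` controls which summary components appear in context, matching the cumulative rows in the ablation above.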
We compare prompts that emphasize different rubrics. Focusing on Novelty and Significance consistently boosts Top-30% Precision, whereas emphasizing only Soundness may hurt. Using all rubrics yields the best overall performance.
Prompt Type | GPT-4o-mini | Gemini-flash-2.0 | DeepSeek-R1 |
---|---|---|---|
No rubrics (baseline) | 45.3% | 47.0% | 48.0% |
+ Soundness only | 44.7% (↓0.6%) | 43.3% (↓3.7%) | 47.3% (↓0.7%) |
+ Novelty only | 47.3% (↑2.0%) | 48.3% (↑1.3%) | 49.3% (↑1.3%) |
+ Significance only | 47.7% (↑2.4%) | 48.3% (↑1.3%) | 49.3% (↑1.3%) |
+ All rubrics | 47.7% (↑2.4%) | 49.7% (↑2.7%) | 50.3% (↑2.3%) |
Takeaway: Novelty & Significance drive the most gain; Soundness-only can be detrimental; “All rubrics” performs best.
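A minimal sketch of how the rubric-emphasis prompt variants in this ablation could be constructed; the rubric wording and base prompt below are illustrative, not the exact prompts used.

```python
RUBRIC_TEXT = {
    "soundness":    "methodological and experimental rigor of the proposed plan",
    "novelty":      "originality of the idea relative to prior work",
    "significance": "potential impact of the idea if it succeeds",
}

BASE_SYSTEM_PROMPT = (
    "You are an expert reviewer. Rate the following research idea on a 1-10 scale "
    "and briefly justify your rating."
)

def make_prompt(rubrics=()):
    """Return a system prompt emphasizing the given rubrics (empty = no-rubric baseline)."""
    if not rubrics:
        return BASE_SYSTEM_PROMPT
    emphasis = "\n".join(f"- {name.capitalize()}: {RUBRIC_TEXT[name]}" for name in rubrics)
    return BASE_SYSTEM_PROMPT + "\nPay particular attention to:\n" + emphasis

PROMPT_VARIANTS = {
    "No rubrics (baseline)": make_prompt(),
    "+ Soundness only":      make_prompt(["soundness"]),
    "+ Novelty only":        make_prompt(["novelty"]),
    "+ Significance only":   make_prompt(["significance"]),
    "+ All rubrics":         make_prompt(["soundness", "novelty", "significance"]),
}
```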
Our pipeline evaluates research ideas in four stages:
Using modular summaries allows more literature to fit into the same context window, enabling richer and more precise advising at scale. With rubric guidance, the feedback is detailed and multi-dimensional, assessing originality, impact, and methodological & experimental rigor.
1. Warm-up SFT: We start from Qwen2.5-7B-Instruct and fine-tune it on 4k high-quality idea–evaluation pairs synthesized with DeepSeek-R1. These rubric-guided examples teach the model to provide structured advising feedback.
2. Reward Modeling: We build a reward function that scores responses by aligning with human ratings and checking textual similarity (e.g., ROUGE). This helps identify candidates that best reflect expert-style reviews.
3. RAFT: We apply Reward-Ranked Fine-Tuning in an iterative generate–select–train loop: the model produces multiple candidates, the reward model selects the highest-scoring ones, and these are fed back into training. This progressively sharpens rubric-grounded, human-like evaluations.
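A minimal sketch of steps 2-3 (reward scoring plus one generate-select-train round); the reward weighting, 1-10 rating scale, candidate count, and callable signatures are illustrative assumptions, and the tiny ROUGE-L stand-in below replaces a proper implementation.

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """Whitespace-token ROUGE-L F1 (illustrative stand-in for a real ROUGE package)."""
    a, b = candidate.split(), reference.split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def reward(candidate_review, reference_review, predicted_rating, human_rating, alpha=0.5):
    """Blend rating alignment with textual similarity to an expert-style reference review."""
    rating_align = 1.0 - abs(predicted_rating - human_rating) / 10.0  # assumes a 1-10 rating scale
    return alpha * rating_align + (1 - alpha) * rouge_l_f1(candidate_review, reference_review)

def raft_round(ideas, generate, extract_rating, train_on, n_candidates=8):
    """One generate-select-train iteration of reward-ranked fine-tuning.

    generate(prompt)       -> one sampled candidate review from the current policy
    extract_rating(review) -> numeric rating parsed from the review text
    train_on(pairs)        -> fine-tune the policy on the selected (prompt, review) pairs
    """
    selected = []
    for idea in ideas:
        candidates = [generate(idea["prompt"]) for _ in range(n_candidates)]
        best = max(
            candidates,
            key=lambda c: reward(c, idea["reference_review"],
                                 extract_rating(c), idea["human_rating"]),
        )
        selected.append((idea["prompt"], best))
    train_on(selected)   # e.g. another SFT pass on the highest-reward candidates
    return selected
```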
In GUIDE, rubric-based prompts guide the model to provide structured, reliable feedback for early-stage research ideas. The rubrics emphasize three key dimensions—Novelty, Significance, and Soundness. By explicitly structuring evaluations along these axes, GUIDE reduces hallucinations, improves alignment with expert review criteria, and ensures each idea is assessed for originality, impact, and methodological & experimental rigor.
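A minimal sketch of a rubric-guided evaluation prompt and a parser for the structured output it requests; the wording, JSON schema, and field names are illustrative assumptions rather than GUIDE's actual prompt.

```python
import json

RUBRIC_PROMPT = """You are an experienced reviewer advising on an early-stage research idea.
Assess the idea along three rubrics and return JSON only, with fields:
- novelty (1-10): originality relative to the retrieved literature
- significance (1-10): potential impact if the idea succeeds
- soundness (1-10): methodological and experimental rigor of the plan
- overall_rating (1-10), confidence (0-1), and a short justification per rubric

Retrieved literature (modular summaries):
{context}

Research idea:
{idea}
"""

def parse_rubric_review(raw_response: str) -> dict:
    """Parse the model's JSON review; return an empty dict on malformed or incomplete output."""
    try:
        review = json.loads(raw_response)
    except json.JSONDecodeError:
        return {}
    expected = {"novelty", "significance", "soundness", "overall_rating", "confidence"}
    return review if isinstance(review, dict) and expected.issubset(review) else {}
```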
```bibtex
@misc{liu2025guide,
  title         = {GUIDE: Towards Scalable Advising for Research Ideas},
  author        = {Yaowenqi Liu and Bingxu Meng and Rui Pan and Yuxing Liu and Jerry Huang and Jiaxuan You and Tong Zhang},
  year          = {2025},
  eprint        = {2507.08870},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI; cs.LG},
  url           = {https://arxiv.org/abs/2507.08870},
}
```