The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as DeepSeek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design.
We evaluate GUIDE-7B on 1,000 random ICLR 2025 submissions (acceptance rate = 31.9%) using three key metrics: Top-5% Precision, Top-30% Precision, and Accept Recall.
Results show that GUIDE-7B achieves the highest Top-30% Precision (51.3%), outperforming larger general-purpose LLMs such as GPT-4o-mini and DeepSeek-R1.
Model | Top-5% Precision | Top-30% Precision | Accept Recall |
---|---|---|---|
GPT-4o-mini | 70.0 ± 4.6% | 47.7 ± 2.4% | 44.8 ± 2.2% |
QwQ-32B | 66.7 ± 1.2% | 48.6 ± 1.5% | 45.8 ± 1.4% |
DeepSeek-R1 | 69.3 ± 4.6% | 50.2 ± 0.5% | 47.2 ± 0.5% |
GUIDE-7B | 72.0 ± 2.0% | 51.3 ± 0.4% | 48.3 ± 0.3% |
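A minimal sketch of how these ranking metrics could be computed, assuming Top-k% Precision is the acceptance rate among the model's top-ranked k% of submissions and Accept Recall is the fraction of accepted papers recovered in the top 30%; the variable names and toy data below are illustrative, not GUIDE's evaluation code.

```python
import numpy as np

def top_k_percent_precision(scores, accepted, k=0.30):
    """Acceptance rate among the top-k fraction of submissions ranked by model score."""
    n_top = max(1, int(round(k * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]     # highest-scored submissions first
    return accepted[top_idx].mean()

def accept_recall(scores, accepted, k=0.30):
    """Fraction of truly accepted papers that land in the model's top-k% shortlist."""
    n_top = max(1, int(round(k * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]
    return accepted[top_idx].sum() / max(1, accepted.sum())

# Toy data standing in for 1,000 submissions with a ~31.9% acceptance rate.
rng = np.random.default_rng(0)
scores = rng.random(1000)                          # model's predicted quality scores
accepted = (rng.random(1000) < 0.319).astype(int)  # ground-truth accept/reject labels
print(top_k_percent_precision(scores, accepted, k=0.05))
print(top_k_percent_precision(scores, accepted, k=0.30))
print(accept_recall(scores, accepted, k=0.30))
```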
Beyond overall precision and recall, GUIDE-7B quantifies its own uncertainty when evaluating research ideas, allowing filtering by confidence level. When only high-confidence predictions are considered, the system achieves over 90% precision in acceptance prediction. This indicates that GUIDE is not only competitive in aggregate ranking metrics, but can also provide trustworthy shortlists of promising papers by exposing confidence estimates alongside rubric-guided evaluations.
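A minimal sketch of the confidence-filtering analysis, assuming the system emits a binary accept prediction plus a scalar self-reported confidence per submission; the function name and threshold are illustrative.

```python
import numpy as np

def precision_at_confidence(pred_accept, confidence, accepted, threshold=0.9):
    """Acceptance-prediction precision restricted to high-confidence positive predictions."""
    pred_accept, confidence, accepted = map(np.asarray, (pred_accept, confidence, accepted))
    mask = (pred_accept == 1) & (confidence >= threshold)
    if mask.sum() == 0:
        return float("nan")              # nothing survives the confidence filter
    return accepted[mask].mean()
```

Sweeping the threshold exposes the precision/coverage trade-off: higher thresholds shortlist fewer papers but make the accept predictions more reliable.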
We also evaluate binary accept/reject prediction, comparing GUIDE-7B against a human-reviewer reference point (NeurIPS) and AI Scientist reviewer baselines:

Baseline / System | Accuracy | F1 |
---|---|---|
Human (NeurIPS) | 73.4% | 48.4% |
AI Scientist with DeepSeek-R1 | 40.7% | 49.5% |
AI Scientist with QwQ-32B | 42.7% | 43.3% |
AI Scientist with GPT-4.1-nano | 61.2% | 20.8% |
GUIDE-7B | 69.1% | 50.1% |
Takeaway: GUIDE-7B achieves human-comparable performance while outperforming all AI Scientist baselines. Despite differences in dataset and acceptance rate, its accuracy and F1 score approach those of human reviewers, demonstrating the reliability of hypothesis-centric advising over full-text review systems.
We conduct two ablations to identify which components most strongly drive GUIDE’s advising quality. First, Modular Summarization tests how different retrieved contents affect precision under a limited context window. Second, Rubric-Guided Prompting isolates the effect of emphasizing specific review rubrics in the system prompt.
Replacing long full-text retrieval with concise, sectioned summaries lets the system pack more relevant literature into the same context window. Summarized Abstract and Method sections contribute the most, while adding Experiment setups yields inconsistent gains because experimental settings often do not align across papers.
Retrieved Content | GPT-4o-mini | QwQ-32B | DeepSeek-R1 |
---|---|---|---|
Full paper (baseline) | 45.0% | 44.3% | 46.7% |
Abstract only | 47.0% (↑2.0%) | 45.7% (↑1.4%) | 49.0% (↑2.3%) |
+ Contribution | 46.7% (↑1.7%) | 46.0% (↑1.7%) | 48.7% (↑2.0%) |
+ Method | 47.7% (↑2.7%) | 48.7% (↑4.4%) | 49.7% (↑3.0%) |
+ Experiment | 47.7% (↑2.7%) | 48.3% (↑4.0%) | 50.3% (↑3.6%) |
Takeaway: Summarized retrieval improves precision by increasing relevant coverage under context limits; Abstract and Method are key components.
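A minimal sketch of context packing with modular summaries, assuming each retrieved paper is stored as per-section summaries keyed by the section names in the table above; the token accounting and helper names are illustrative.

```python
# Assemble retrieved literature as sectioned summaries under a fixed token budget,
# instead of concatenating full papers.

SECTIONS = ["abstract", "contribution", "method", "experiment"]

def rough_token_count(text: str) -> int:
    # Crude whitespace proxy; a real system would use the serving model's tokenizer.
    return len(text.split())

def build_context(retrieved_papers, budget_tokens=8000, sections=SECTIONS):
    """retrieved_papers: list of dicts mapping section name -> summary string,
    ordered from most to least relevant."""
    chunks, used = [], 0
    for paper in retrieved_papers:
        block = "\n".join(
            f"[{name.upper()}] {paper[name]}" for name in sections if name in paper
        )
        cost = rough_token_count(block)
        if used + cost > budget_tokens:
            break                        # stop before overflowing the context window
        chunks.append(block)
        used += cost
    return "\n\n---\n\n".join(chunks)
```

Restricting `sections` controls which summary components appear in context, matching the cumulative rows in the ablation above.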
We compare prompts that emphasize different rubrics. Focusing on Novelty and Significance consistently boosts Top-30% Precision, whereas emphasizing only Soundness may hurt. Using all rubrics yields the best overall performance.
Prompt Type | GPT-4o-mini | Gemini-flash-2.0 | DeepSeek-R1 |
---|---|---|---|
No rubrics (baseline) | 45.3% | 47.0% | 48.0% |
+ Soundness only | 44.7% (↓0.6%) | 43.3% (↓3.7%) | 47.3% (↓0.7%) |
+ Novelty only | 47.3% (↑2.0%) | 48.3% (↑1.3%) | 49.3% (↑1.3%) |
+ Significance only | 47.7% (↑2.4%) | 48.3% (↑1.3%) | 49.3% (↑1.3%) |
+ All rubrics | 47.7% (↑2.4%) | 49.7% (↑2.7%) | 50.3% (↑2.3%) |
Takeaway: Novelty & Significance drive the most gain; Soundness-only can be detrimental; “All rubrics” performs best.
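A minimal sketch of how the rubric-emphasis prompt variants in this ablation could be constructed; the rubric wording and base prompt below are illustrative, not the exact prompts used.

```python
RUBRIC_TEXT = {
    "soundness":    "methodological and experimental rigor of the proposed plan",
    "novelty":      "originality of the idea relative to prior work",
    "significance": "potential impact of the idea if it succeeds",
}

BASE_SYSTEM_PROMPT = (
    "You are an expert reviewer. Rate the following research idea on a 1-10 scale "
    "and briefly justify your rating."
)

def make_prompt(rubrics=()):
    """Return a system prompt emphasizing the given rubrics (empty = no-rubric baseline)."""
    if not rubrics:
        return BASE_SYSTEM_PROMPT
    emphasis = "\n".join(f"- {name.capitalize()}: {RUBRIC_TEXT[name]}" for name in rubrics)
    return BASE_SYSTEM_PROMPT + "\nPay particular attention to:\n" + emphasis

PROMPT_VARIANTS = {
    "No rubrics (baseline)": make_prompt(),
    "+ Soundness only":      make_prompt(["soundness"]),
    "+ Novelty only":        make_prompt(["novelty"]),
    "+ Significance only":   make_prompt(["significance"]),
    "+ All rubrics":         make_prompt(["soundness", "novelty", "significance"]),
}
```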
Our pipeline evaluates research ideas in four stages:
Using modular summaries allows more literature to fit into the same context window, enabling richer and more precise advising at scale. With rubric guidance, the feedback is detailed and multi-dimensional, assessing originality, impact, and methodological & experimental rigor.
1. Warm-up SFT: We start from Qwen2.5-7B-Instruct and fine-tune it on 4k high-quality idea–evaluation pairs synthesized with DeepSeek-R1. These rubric-guided examples teach the model to provide structured advising feedback.
2. Reward Modeling: We build a reward function that scores responses by aligning with human ratings and checking textual similarity (e.g., ROUGE). This helps identify candidates that best reflect expert-style reviews.
3. RAFT: We apply Reward-Ranked Fine-Tuning in an iterative generate–select–train loop: the model produces multiple candidates, the reward model selects the highest-scoring ones, and these are fed back into training. This progressively sharpens rubric-grounded, human-like evaluations.
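A minimal sketch of steps 2-3 (reward scoring plus one generate-select-train round); the reward weighting, 1-10 rating scale, candidate count, and callable signatures are illustrative assumptions, and the tiny ROUGE-L stand-in below replaces a proper implementation.

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """Whitespace-token ROUGE-L F1 (illustrative stand-in for a real ROUGE package)."""
    a, b = candidate.split(), reference.split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def reward(candidate_review, reference_review, predicted_rating, human_rating, alpha=0.5):
    """Blend rating alignment with textual similarity to an expert-style reference review."""
    rating_align = 1.0 - abs(predicted_rating - human_rating) / 10.0  # assumes a 1-10 rating scale
    return alpha * rating_align + (1 - alpha) * rouge_l_f1(candidate_review, reference_review)

def raft_round(ideas, generate, extract_rating, train_on, n_candidates=8):
    """One generate-select-train iteration of reward-ranked fine-tuning.

    generate(prompt)       -> one sampled candidate review from the current policy
    extract_rating(review) -> numeric rating parsed from the review text
    train_on(pairs)        -> fine-tune the policy on the selected (prompt, review) pairs
    """
    selected = []
    for idea in ideas:
        candidates = [generate(idea["prompt"]) for _ in range(n_candidates)]
        best = max(
            candidates,
            key=lambda c: reward(c, idea["reference_review"],
                                 extract_rating(c), idea["human_rating"]),
        )
        selected.append((idea["prompt"], best))
    train_on(selected)   # e.g. another SFT pass on the highest-reward candidates
    return selected
```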
In GUIDE, rubric-based prompts guide the model to provide structured, reliable feedback for early-stage research ideas. The rubrics emphasize three key dimensions—Novelty, Significance, and Soundness. By explicitly structuring evaluations along these axes, GUIDE reduces hallucinations, improves alignment with expert review criteria, and ensures each idea is assessed for originality, impact, and methodological & experimental rigor.
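A minimal sketch of a rubric-guided evaluation prompt and a parser for the structured output it requests; the wording, JSON schema, and field names are illustrative assumptions rather than GUIDE's actual prompt.

```python
import json

RUBRIC_PROMPT = """You are an experienced reviewer advising on an early-stage research idea.
Assess the idea along three rubrics and return JSON only, with fields:
- novelty (1-10): originality relative to the retrieved literature
- significance (1-10): potential impact if the idea succeeds
- soundness (1-10): methodological and experimental rigor of the plan
- overall_rating (1-10), confidence (0-1), and a short justification per rubric

Retrieved literature (modular summaries):
{context}

Research idea:
{idea}
"""

def parse_rubric_review(raw_response: str) -> dict:
    """Parse the model's JSON review; return an empty dict on malformed or incomplete output."""
    try:
        review = json.loads(raw_response)
    except json.JSONDecodeError:
        return {}
    expected = {"novelty", "significance", "soundness", "overall_rating", "confidence"}
    return review if isinstance(review, dict) and expected.issubset(review) else {}
```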
```bibtex
@misc{liu2025guide,
  title         = {GUIDE: Towards Scalable Advising for Research Ideas},
  author        = {Yaowenqi Liu and Bingxu Meng and Rui Pan and Yuxing Liu and Jerry Huang and Jiaxuan You and Tong Zhang},
  year          = {2025},
  eprint        = {2507.08870},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI; cs.LG},
  url           = {https://arxiv.org/abs/2507.08870},
}
```