Multi-Agent Peer Review Enhances LLM Performance in Medical Question Answering

arXiv CS · June 16, 2026 · 2 min read · Engineering & Technology

Key Takeaways

Multi-agent peer-reviewed reasoning consistently outperformed single-model and majority-voting baselines on medical question answering benchmarks.
The best model combination achieved an average accuracy of 0.820 across HeadQA, MedQA-USMLE, and PubMedQA datasets.
The method scaled effectively as more models took part in the peer-review process.
Peer assessments reliably distinguished between high and low-quality reasoning chains.

Why This Matters

This method improves LLM accuracy, interpretability, and robustness in medical question answering, offering a promising direction for trustworthy biomedical AI. It enables LLMs to function as both solvers and evaluators, enhancing the reliability of AI systems in healthcare.

Overview

A multi-agent peer-reviewed reasoning method has been developed to enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). This method enables multiple LLM agents to engage in a structured process of independent reasoning, answer generation, and peer evaluation. The approach was tested against established benchmarks using state-of-the-art LLMs, demonstrating consistent improvements over single-model and majority-voting baselines.

Research Context

The research aimed to improve LLM performance in medical question answering along three dimensions: accuracy of the answers, interpretability of the reasoning process, and robustness of model performance. It pursued these goals through a multi-agent peer-review design rather than through changes to any single model.

Approach

The research employed a multi-agent peer-reviewed reasoning method. This method involves several steps:

Multiple LLM agents independently generate chain-of-thought reasoning.
Each agent also produces a candidate answer based on its reasoning.
These agents then act as peer reviewers, evaluating each other's reasoning chains.
The evaluation criteria for peer review include factual correctness and logical soundness.
The reasoning chain receiving the highest rating is subsequently selected to inform the final answer.
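
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names Candidate, Reviewer, and select_by_peer_review are hypothetical, and several details are assumptions that this summary does not specify, namely that each review yields a single numeric rating, that agents do not rate their own chains, that ratings are averaged, and that ties go to the first candidate. Hard-coded ratings stand in for real LLM reviewers.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    agent: str      # name of the agent that produced this chain
    reasoning: str  # chain-of-thought text
    answer: str     # candidate answer derived from the reasoning


# A reviewer maps (question, candidate) to a rating. Folding factual
# correctness and logical soundness into one number is an assumption.
Reviewer = Callable[[str, Candidate], float]


def select_by_peer_review(
    question: str,
    candidates: List[Candidate],
    reviewers: Dict[str, Reviewer],
) -> Candidate:
    """Return the candidate whose chain has the highest mean peer rating."""
    mean_rating: Dict[str, float] = {}
    for cand in candidates:
        # Assumption: an agent does not review its own reasoning chain.
        ratings = [
            review(question, cand)
            for name, review in reviewers.items()
            if name != cand.agent
        ]
        if not ratings:
            raise ValueError("each chain needs at least one peer reviewer")
        mean_rating[cand.agent] = mean(ratings)
    # max() keeps the first candidate on ties (an arbitrary choice).
    return max(candidates, key=lambda c: mean_rating[c.agent])


# Toy run: FIXED[(reviewer, author)] replaces a real LLM review call.
FIXED: Dict[Tuple[str, str], float] = {
    ("a", "b"): 0.9, ("a", "c"): 0.4,
    ("b", "a"): 0.6, ("b", "c"): 0.5,
    ("c", "a"): 0.7, ("c", "b"): 0.8,
}


def make_reviewer(name: str) -> Reviewer:
    return lambda question, cand: FIXED[(name, cand.agent)]


candidates = [
    Candidate("a", "chain from agent a", "A"),
    Candidate("b", "chain from agent b", "B"),
    Candidate("c", "chain from agent c", "A"),
]
reviewers = {c.agent: make_reviewer(c.agent) for c in candidates}
winner = select_by_peer_review("toy question", candidates, reviewers)
print(winner.agent, winner.answer)  # prints: b B
```

In this toy run two of the three agents answer "A", yet the chain from agent b is selected because its peers rate it highest. The ratings are invented for illustration and say nothing about how often this happens in practice.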

Experiments were conducted using five state-of-the-art LLMs: Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, and GPT-oss-20B. These models were evaluated on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against two baselines: single-model chain-of-thought reasoning and chain-of-thought-based majority-voting ensembles.
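
For reference, the majority-voting baseline and the accuracy metric can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the tie-breaking rule (the answer seen first wins) is an assumption, since the summary does not say how the study resolved ties.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the agents.

    Counter.most_common orders equal counts by first occurrence,
    so ties go to the answer that appears first (an assumption).
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of questions whose predicted answer matches the gold label."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy data: three agents answer two questions.
per_question_answers = [["A", "B", "A"], ["C", "C", "D"]]
predictions = [majority_vote(a) for a in per_question_answers]
print(predictions)                        # prints: ['A', 'C']
print(accuracy(predictions, ["A", "D"]))  # prints: 0.5
```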

Findings

The proposed peer-reviewed reasoning method consistently outperformed both the single-model chain-of-thought reasoning and the chain-of-thought-based majority voting baselines.
The highest-performing model combination, utilizing the peer-reviewed reasoning method, achieved an average accuracy of 0.820 across the tested datasets.
This performance surpassed that of the strongest single model, which recorded an accuracy of 0.777.
It also exceeded the accuracy achieved by majority voting ensembles, which reached up to 0.789.
Performance scaled effectively as more models participated in the peer-review process.
Peer assessments within the multi-agent system reliably differentiated between high-quality and low-quality reasoning chains generated by the LLM agents.
The approach prioritizes the quality of reasoning over mere agreement among answers as the mechanism for improving performance.
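
To put the reported accuracies in perspective, the short calculation below derives the absolute and relative gains from the three figures quoted above. Only those three numbers come from the study; the helper name gain and the rounding choices are illustrative.

```python
from typing import Tuple

PEER_REVIEW = 0.820    # best model combination with peer review
BEST_SINGLE = 0.777    # strongest single model
BEST_MAJORITY = 0.789  # best majority-voting ensemble


def gain(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent), rounded."""
    absolute = new - old
    relative_pct = absolute / old * 100
    return round(absolute, 3), round(relative_pct, 1)


print(gain(PEER_REVIEW, BEST_SINGLE))    # prints: (0.043, 5.5)
print(gain(PEER_REVIEW, BEST_MAJORITY))  # prints: (0.031, 3.9)
```

That is, peer review adds about 4.3 accuracy points over the strongest single model and about 3.1 points over the best majority-voting ensemble.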

Why This Matters

The multi-agent peer-reviewed reasoning method offers a promising direction for developing trustworthy biomedical AI systems. By improving accuracy, interpretability, and robustness in medical question answering, this approach advances the capabilities of LLMs for critical applications. This methodology enables LLMs to function as both problem solvers and evaluators, contributing to enhanced AI reliability in healthcare contexts.

Research Information

Institution: arXiv CS
Source: arXiv CS

About ICANEWS

ICANEWS is a global research journal for emerging researchers, publishing student and emerging researcher work across all fields.