Overview
A multi-agent peer-reviewed reasoning method has been developed to improve the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering. Multiple LLM agents reason independently, generate candidate answers, and then evaluate one another's reasoning in a structured peer-review step. Tested on three benchmark datasets with five state-of-the-art LLMs, the approach consistently improved on single-model and majority-voting baselines.
Research Context
The research aimed to improve LLM performance in medical question answering along three dimensions: accuracy of the answers, interpretability of the reasoning process, and robustness of model performance. It pursued these goals through a new method design.
Approach
The research employed a multi-agent peer-reviewed reasoning method. This method involves several steps:
- Multiple LLM agents independently generate chain-of-thought reasoning.
- Each agent also produces a candidate answer based on its reasoning.
- The agents then act as peer reviewers, rating one another's reasoning chains.
- Reviewers judge each chain on factual correctness and logical soundness.
- The highest-rated reasoning chain is selected to inform the final answer.
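The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Agent`, `Candidate`, `peer_reviewed_answer`, and `build_toy_agents` are hypothetical, and the source does not specify the rating scale, whether agents rate their own chains, or how ratings are aggregated. The sketch assumes numeric ratings, no self-review, and a simple mean, with ties going to the earliest candidate.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interfaces; a real system would wrap LLM calls here.
Reasoner = Callable[[str], Tuple[str, str]]  # question -> (reasoning chain, answer)
Reviewer = Callable[[str, str], float]       # (question, chain) -> rating


@dataclass
class Agent:
    name: str
    reason: Reasoner
    review: Reviewer


@dataclass
class Candidate:
    author: str
    chain: str
    answer: str
    score: float = 0.0


def peer_reviewed_answer(question: str, agents: List[Agent]) -> Candidate:
    """Return the candidate whose reasoning chain gets the highest mean peer rating."""
    if len(agents) < 2:
        raise ValueError("peer review needs at least two agents")

    # Each agent independently produces a reasoning chain and a candidate answer.
    candidates = []
    for agent in agents:
        chain, answer = agent.reason(question)
        candidates.append(Candidate(agent.name, chain, answer))

    # Every other agent rates each chain (assumption: authors do not rate
    # their own chain, and ratings are averaged).
    for cand in candidates:
        ratings = [a.review(question, cand.chain)
                   for a in agents if a.name != cand.author]
        cand.score = mean(ratings)

    # The highest-rated chain informs the final answer (first one wins ties).
    return max(candidates, key=lambda c: c.score)


def build_toy_agents() -> List[Agent]:
    """Stub agents with canned outputs, standing in for real LLM calls."""
    def stub(name, chain, answer, ratings):
        return Agent(
            name=name,
            reason=lambda question: (chain, answer),
            review=lambda question, other_chain: ratings[other_chain],
        )

    return [
        stub("agent_a", "chain A", "option 1", {"chain B": 4.0, "chain C": 2.0}),
        stub("agent_b", "chain B", "option 2", {"chain A": 3.0, "chain C": 1.0}),
        stub("agent_c", "chain C", "option 1", {"chain A": 2.0, "chain B": 5.0}),
    ]
```

In this toy setup, calling `peer_reviewed_answer("toy question", build_toy_agents())` selects the chain written by `agent_b`, even though the other two agents agree on a different answer. Selection here follows the peer ratings of the reasoning, not the answer count.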
Experiments were conducted using five state-of-the-art LLMs: Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, and GPT-oss-20B. The models were evaluated on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against two baselines: single-model chain-of-thought reasoning and majority voting over chain-of-thought answers from multiple models.
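For contrast, the majority-voting baseline can be sketched as follows. This is an illustrative version only: the function name `majority_vote` is hypothetical, and the tie-breaking rule (the tied answer that appears first in the input wins) is an assumption, since the source does not say how ties were handled.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the models' final answers."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    # Assumed tie-break: the first answer in input order with the top count.
    for answer in answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")
```

Note that this baseline looks only at the final answers and ignores the reasoning chains that produced them.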
Findings
- The peer-reviewed reasoning method consistently outperformed both baselines: single-model chain-of-thought reasoning and chain-of-thought-based majority voting.
- The best-performing model combination using peer-reviewed reasoning achieved an average accuracy of 0.820 across the tested datasets.
- This exceeded the strongest single model, which reached an accuracy of 0.777.
- It also exceeded the majority voting ensembles, which reached up to 0.789.
- Performance improved as more models participated in the peer-review process.
- Peer assessments reliably distinguished high-quality from low-quality reasoning chains.
- The approach improves performance by prioritizing reasoning quality over answer agreement alone.
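The margins implied by the reported accuracies can be worked out with a short calculation. The snippet below assumes the three figures are directly comparable averages; the constant and function names are illustrative and do not come from the source.

```python
from typing import Tuple

# Accuracies as reported in the findings above.
PEER_REVIEW_BEST = 0.820
SINGLE_MODEL_BEST = 0.777
MAJORITY_VOTE_BEST = 0.789


def gain(new: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = new - baseline
    return round(absolute, 3), round(100 * absolute / baseline, 1)
```

Under that assumption, `gain(PEER_REVIEW_BEST, SINGLE_MODEL_BEST)` gives an absolute gain of 0.043 (about 5.5% relative), and `gain(PEER_REVIEW_BEST, MAJORITY_VOTE_BEST)` gives 0.031 (about 3.9% relative).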
Why This Matters
The multi-agent peer-reviewed reasoning method offers a promising direction for developing trustworthy biomedical AI systems. By improving accuracy, interpretability, and robustness in medical question answering, it strengthens LLMs for critical applications. Because the same models act as both problem solvers and evaluators, the method can make AI more reliable in healthcare contexts.