FGDM Framework Enhances Software Bug Detection and Repair Using Multi-Agent LLMs

arXiv CS · April 29, 2026 · 11 min read · Engineering & Technology

Read research and analysis on FGDM Framework Enhances Software Bug Detection and Repair Using Multi-Agent LLMs published by ICANEWS, a global research journal for emerging researchers.

Key Takeaways

FGDM outperforms extant approaches in software bug detection and repair.
Demonstrated a mean reduction of 24.33 in Levenshtein distance for Python programs.
Demonstrated a mean reduction of 8.37 in Levenshtein distance for C programs.
Achieved a cosine similarity of 0.951 for Python programs.
Achieved a cosine similarity of 0.974 for C programs.

Why This Matters

The FGDM framework's ability to precisely detect and repair software bugs in complex codebases could significantly enhance software development efficiency and reliability. By automating crucial steps in the debugging process, it promises reductions in development cycles and maintenance costs, leading to higher quality software products.

Novel Multi-Agent Framework Advances Software Bug Detection with LLMs

A new research development, detailed in the paper FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting (arXiv:2604.24831v1), introduces an advanced system for identifying and resolving software bugs. The Flow-Graph-Driven Multi-Agent Framework (FGDM) leverages the capabilities of Large Language Models (LLMs) to address existing limitations in automated bug detection, particularly in complex codebases.

The core challenge in automated software bug detection, as highlighted by the researchers, lies in the inability of traditional Deep Learning methods to grasp the 'global understanding' of code. This deficiency often leads to performance degradation, especially when these methods are applied to intricate, interconnected codebases or modular programs. The FGDM framework was developed to overcome these barriers by capitalizing on LLMs' effectiveness in capturing dependencies across multiple interconnected modules within code.

Research Goal: Developing a Reasoning-Aware Framework for Bug Detection and Repair

The primary objective of this research was to propose and demonstrate a reasoning-aware multi-agentic framework for automated software bug detection and repair. The researchers aimed to harness the strengths of Large Language Models (LLMs) in understanding intricate code structures and interdependencies. Deep Learning methods, while prominent in automated bug detection, often fall short because they lack a global understanding of the code under analysis, and their performance tends to degrade on large, interconnected codebases or complex modular programs. The research therefore set out to create a framework that not only identifies bugs but also generates repaired code.

To achieve this, the proposed framework, named the Flow-Graph-Driven Multi-Agent Framework (FGDM), was designed with a specific operational sequence. It is composed of four distinct agents that function in a sequential manner. The initial step involves converting the received code into a flow graph, a structural representation that aids in understanding code execution paths. Following this, the framework is tasked with identifying erroneous segments within this flow graph. The ultimate objective is the generation of repaired code, thereby automating a significant portion of the bug-fixing process. A critical aspect of the framework's design is the utilization of Chain-of-Thought (COT) and Tree-of-Thoughts (TOT) prompts by all employed agents, which are integral to their reasoning capabilities. Additionally, the system was augmented with integration to a FAISS vector database, providing a mechanism to retrieve similar previous bugs and their associated repairs, further enhancing its repair capabilities.

Key Findings: Superior Performance in Bug Detection and Repair

The experiments conducted with the FGDM framework have demonstrated its efficacy and superior performance when compared to existing approaches in the field of automated software bug detection and repair. The framework was rigorously tested across a diverse set of 100 programs sourced from various prominent projects. These projects span several well-known libraries and frameworks, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado. Crucially, the evaluation encompassed programs written in two widely used programming languages: C and Python.

Significant Reduction in Levenshtein Distance

One of the primary metrics used to assess the effectiveness of the FGDM framework was the Levenshtein distance, which counts the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. In the context of code repair, a lower Levenshtein distance means the generated repair is textually closer to the reference (correct) code. The experiments reported reductions in this metric for both programming languages tested.
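As a minimal sketch of how this metric can be computed, the snippet below implements the standard dynamic-programming algorithm for Levenshtein distance and applies it to two toy candidate repairs. The code snippets being compared are illustrative assumptions, not examples from the paper, and the paper does not state which implementation it used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # add ch_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Toy data (hypothetical): one reference fix and two candidate repairs.
reference = "if x is None:\n    return 0\n"
candidate_a = "if x == None:\n    return 0\n"
candidate_b = "if x:\n    return 0\n"

dist_a = levenshtein(candidate_a, reference)  # 2: "==" becomes "is"
dist_b = levenshtein(candidate_b, reference)  # 8: " is None" must be inserted
print(dist_a, dist_b)
```

Under this metric, candidate_a counts as the better repair because fewer character edits separate it from the reference.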

The authors summarize the result as follows: "The FGDM demonstrated efficacy over 100 programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado in both C and Python programs. Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance for Python and C, respectively."

Specifically, for Python programs, the framework achieved a mean reduction of 24.33 in Levenshtein distance, meaning its repairs needed 24.33 fewer character edits on average to reach the correct version than those of prior methods. For C programs, the framework yielded a mean reduction of 8.37, a similar but numerically smaller improvement. These figures indicate that FGDM's repairs sit textually closer to the correct code than those produced by the compared techniques.

High Cosine Similarity for Repaired Code

Another crucial metric employed to evaluate the quality of the repaired code generated by the FGDM framework was cosine similarity. Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. In the context of code analysis, it is often used to determine how similar two pieces of text or code are in terms of their content or structure. A higher cosine similarity score indicates a greater degree of similarity between the generated repaired code and the ground truth (correct) code.
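The article does not say how code is turned into vectors before the cosine is taken. The sketch below assumes the simplest option, a bag-of-tokens count vector, purely for illustration; an embedding model could equally have been used. The tokenizer pattern and helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def token_counts(code: str) -> Counter:
    """Represent a code snippet as a sparse token-count vector."""
    return Counter(TOKEN.findall(code))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two non-zero sparse vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy data (hypothetical): a repair that differs from the reference by one token.
repaired = token_counts("return a + b")
reference = token_counts("return a - b")
similarity = cosine_similarity(repaired, reference)
print(round(similarity, 3))  # 0.75: three of four tokens are shared
```

With count vectors, the score reflects token overlap only, so two snippets can score highly while behaving differently at run time.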

The research findings showed high cosine similarity scores for the repaired code produced by FGDM. For Python programs, the framework achieved a cosine similarity of 0.951, suggesting that the repaired Python code is very close to the desired correct code. For C programs, the score was higher still at 0.974, indicating that the repaired C code aligns closely with the correct C code.

Taken together, the reductions in Levenshtein distance and the high cosine similarity scores indicate that, on the reported benchmarks, the FGDM framework outperforms existing approaches and delivers repairs that are closely aligned with the intended correct code in both Python and C.

Methodology: A Sequential Multi-Agent Framework

The Flow-Graph-Driven Multi-Agent Framework (FGDM) is architecturally designed as a sequential system, incorporating four distinct agents that operate in a predefined order to achieve software bug detection and repair. This multi-agent structure is fundamental to the framework's ability to handle complex code and identify intricate dependencies.
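The sequential hand-off between agents can be sketched as a list of functions that each read and update a shared state object. The four agent names and their division of labor below are hypothetical: the source only states that four agents run in order. Each agent here is a rule-based stub standing in for what would be an LLM call in the actual framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RepairState:
    source: str
    graph_nodes: List[str] = field(default_factory=list)
    suspects: List[int] = field(default_factory=list)
    repaired: str = ""
    compiles: bool = False
    trace: List[str] = field(default_factory=list)


Agent = Callable[[RepairState], RepairState]


def graph_agent(state: RepairState) -> RepairState:
    # Stub: treat every source line as one node of the flow graph.
    state.graph_nodes = state.source.splitlines()
    state.trace.append("graph")
    return state


def localize_agent(state: RepairState) -> RepairState:
    # Stub rule: flag nodes that compare against None with "==".
    state.suspects = [i for i, node in enumerate(state.graph_nodes)
                      if "== None" in node]
    state.trace.append("localize")
    return state


def repair_agent(state: RepairState) -> RepairState:
    # Stub fix: rewrite only the flagged nodes.
    lines = list(state.graph_nodes)
    for i in state.suspects:
        lines[i] = lines[i].replace("== None", "is None")
    state.repaired = "\n".join(lines) + "\n"
    state.trace.append("repair")
    return state


def review_agent(state: RepairState) -> RepairState:
    # Stub check: the repaired code must at least parse.
    try:
        compile(state.repaired, "<repaired>", "exec")
        state.compiles = True
    except SyntaxError:
        state.compiles = False
    state.trace.append("review")
    return state


def run_pipeline(source: str, agents: List[Agent]) -> RepairState:
    state = RepairState(source=source)
    for agent in agents:  # strictly sequential: each agent sees the previous output
        state = agent(state)
    return state


AGENTS = [graph_agent, localize_agent, repair_agent, review_agent]
buggy = "def f(x):\n    if x == None:\n        return 0\n    return x\n"
result = run_pipeline(buggy, AGENTS)
print(result.repaired)
```

Passing one state object down the chain keeps each agent's contract small, which is one plausible reason to prefer a sequential design over agents that negotiate with each other.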

Code Conversion to Flow Graph

The initial step in the FGDM framework involves the conversion of the received software code into a flow graph. This is a crucial preprocessing stage, as a flow graph gives a structured representation of the control flow within a program. By transforming linear source text into a graph, the framework gains a more comprehensive, 'global' understanding of the program's execution paths, branching points, and interacting components. This representation helps the subsequent agents analyze the code's structure and identify potential issues that a purely line-by-line analysis might overlook.
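To make the idea of a flow graph concrete, the sketch below builds a statement-level control-flow graph for a single Python function using the standard `ast` module. It is an illustration under simplifying assumptions (no `break`, `continue`, `try`, or loop `else` handling) and is not the paper's construction, which may rely on an LLM agent or a different graph format. The node labels of the form `line:NodeType` are a made-up convention.

```python
import ast
from typing import Dict, List, Set


def build_flow_graph(source: str) -> Dict[str, Set[str]]:
    """Return successor sets for each statement of the first function in source."""
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("expected the source to start with a function definition")
    edges: Dict[str, Set[str]] = {"ENTRY": set(), "EXIT": set()}

    def label(node: ast.stmt) -> str:
        return f"{node.lineno}:{type(node).__name__}"

    def link(sources: List[str], target: str) -> None:
        edges.setdefault(target, set())
        for src in sources:
            edges.setdefault(src, set()).add(target)

    def walk(stmts: List[ast.stmt], incoming: List[str]) -> List[str]:
        """Wire up a statement list; return the nodes that fall through its end."""
        for stmt in stmts:
            name = label(stmt)
            link(incoming, name)
            if isinstance(stmt, ast.If):
                then_out = walk(stmt.body, [name])
                else_out = walk(stmt.orelse, [name]) if stmt.orelse else [name]
                incoming = then_out + else_out
            elif isinstance(stmt, (ast.While, ast.For)):
                body_out = walk(stmt.body, [name])
                link(body_out, name)  # back edge to the loop header
                incoming = [name]     # the header is also the loop's exit point
            elif isinstance(stmt, ast.Return):
                link([name], "EXIT")
                incoming = []         # nothing falls through a return
            else:
                incoming = [name]
        return incoming

    link(walk(func.body, ["ENTRY"]), "EXIT")
    return edges


SAMPLE = """\
def clamp(x, lo, hi):
    if x < lo:
        return lo
    while x > hi:
        x -= 1
    return x
"""

graph = build_flow_graph(SAMPLE)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In the resulting graph the `if` on line 2 has two successors (the early return and the loop header), which is exactly the branching information a line-by-line reading does not make explicit.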

Sequential Agent Operation and Prompting Strategies

Following the flow graph generation, the four agents within the FGDM framework begin their sequential operation. Each agent is tasked with a specific role in the bug detection and repair pipeline. A key distinguishing feature of this framework is the consistent use of advanced prompting strategies for all agents:

Chain-of-Thought (COT) Prompts: These prompts encourage LLMs to articulate their reasoning process step-by-step. By guiding the agents to generate intermediate reasoning steps, COT prompts allow the framework to break down complex problems into manageable sub-problems, leading to more robust and explainable bug detection. This mimics human problem-solving by providing a clear trace of decisions and derivations.
Tree-of-Thoughts (TOT) Prompts: TOT prompts extend the idea of COT by enabling the LLMs to explore multiple reasoning paths and self-correct. Instead of a single linear chain, TOT allows the agents to branch out, consider different hypotheses, and evaluate their consequences before committing to a specific solution. This multi-path exploration capability is particularly beneficial for complex bug scenarios where initial assumptions might be misleading, ensuring a more thorough and accurate analysis.

The combination of COT and TOT prompts enhances the reasoning abilities of the agents, allowing them to not only identify erroneous segments with higher precision but also to generate more coherent and functionally correct repaired code.
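The difference between the two prompting styles can be sketched in a few lines. The prompt wording below is invented for illustration, since the source does not reproduce the paper's prompts. The Tree-of-Thoughts part is shown as a small beam search in which `toy_expand` and `toy_score` are deterministic stand-ins for the LLM calls that would propose and rate reasoning steps.

```python
from typing import Callable, List, Tuple

Thought = Tuple[str, ...]  # one reasoning path: a sequence of steps


def cot_prompt(flow_graph_text: str) -> str:
    """Chain-of-Thought: ask for one linear, step-by-step line of reasoning."""
    return (
        "You are reviewing the flow graph of a program.\n"
        f"{flow_graph_text}\n"
        "Think step by step: (1) summarize each node, (2) trace each path, "
        "(3) name the node most likely to contain the bug and explain why."
    )


def tree_of_thoughts(
    expand: Callable[[Thought], List[str]],
    score: Callable[[Thought], float],
    beam_width: int = 2,
    depth: int = 2,
) -> Thought:
    """Tree-of-Thoughts as beam search: branch, score, keep the best paths."""
    frontier: List[Thought] = [()]
    for _ in range(depth):
        candidates = [path + (step,) for path in frontier for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune weak branches
    return max(frontier, key=score)


def toy_expand(path: Thought) -> List[str]:
    # Stand-in for an LLM proposing next steps (hypothetical content).
    if len(path) == 0:
        return ["hypothesis: loop bound is off by one",
                "hypothesis: None is compared with =="]
    if len(path) == 1:
        return ["fix: adjust the loop bound", "fix: use 'is None'"]
    return []


def toy_score(path: Thought) -> float:
    # Stand-in for an LLM rating a path; here the symptom involves None.
    return sum(1.0 for step in path if "None" in step)


best = tree_of_thoughts(toy_expand, toy_score)
print(cot_prompt("2:If -> 3:Return, 4:While"))
print(best)
```

Keeping more than one hypothesis alive at each level is what lets the search recover when the first guess about a bug turns out to be misleading.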

Identification of Erroneous Segments

Once the code has been converted to a flow graph and processed by the reasoning-aware agents, the framework's subsequent task is to explicitly identify the erroneous segments within the code. This involves pinpointing the exact locations or blocks of code that contain bugs. The global understanding derived from the flow graph and the sophisticated reasoning facilitated by COT/TOT prompts enable the agents to pinpoint these issues with greater accuracy, especially in interconnected or modular programs where bugs might manifest due to interactions between different code parts.

Generation of Repaired Code

The final stage in the FGDM framework’s sequential operation is the generation of repaired code. After successfully identifying the erroneous segments, the agents are designed to formulate and propose corrected versions of these segments. This step transforms the framework from a mere bug detector into a complete bug resolution system. The quality of this repaired code is directly influenced by the preceding steps: the accuracy of the flow graph conversion, the depth of reasoning enabled by COT and TOT prompts, and the precision in identifying the bug locations. The demonstrated reductions in Levenshtein distance and high cosine similarity scores confirm the effectiveness of this repair generation capability.

Integration with FAISS Vector Database

In a further enhancement to its capabilities, the FGDM framework incorporates an integration with a FAISS vector database. FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. This integration serves a crucial role in improving the framework's bug repair process. By leveraging the FAISS database, the system can efficiently retrieve similar previous bugs and their corresponding repairs. This mechanism provides a valuable source of historical knowledge, allowing the agents to learn from past bug-fixing experiences. When a new bug is detected, the framework can query the database for similar known issues, potentially suggesting proven repair strategies or patterns, thereby augmenting the agents' generative repair capabilities and leading to more effective and reliable fixes.
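The retrieval step can be illustrated without the FAISS dependency. The sketch below is a pure-Python stand-in, not FAISS itself: it performs the same kind of exact inner-product search over normalized vectors that a flat FAISS index such as `faiss.IndexFlatIP` provides through its `add` and `search` methods. The `embed` function is a toy hashed bag-of-tokens and the `BugMemory` class is hypothetical; the source does not say which embedding model or index type the authors used.

```python
import math
import re
import zlib
from typing import List, Tuple

DIM = 64  # assumed embedding size for this toy example


def embed(text: str) -> List[float]:
    """Toy embedding: hash each token into one of DIM buckets, then L2-normalize."""
    vec = [0.0] * DIM
    for tok in re.findall(r"\w+|\S", text):
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class BugMemory:
    """Stores (buggy, fixed) pairs and returns the most similar past bugs."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._records: List[Tuple[str, str]] = []

    def add(self, buggy: str, fixed: str) -> None:
        self._vectors.append(embed(buggy))
        self._records.append((buggy, fixed))

    def search(self, query: str, k: int = 1) -> List[Tuple[float, str, str]]:
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), record)  # inner product = cosine here
            for vec, record in zip(self._vectors, self._records)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(score, buggy, fixed) for score, (buggy, fixed) in scored[:k]]


# Toy history (hypothetical bug/fix pairs).
memory = BugMemory()
memory.add("if x == None:", "if x is None:")
memory.add("for i in range(len(a) + 1):", "for i in range(len(a)):")

for score, buggy, fixed in memory.search("if value == None:", k=1):
    print(f"{score:.2f}  {buggy!r} -> {fixed!r}")
```

The retrieved pair could then be placed in the repair agent's prompt as a worked example. A dedicated library becomes worthwhile once the history grows beyond what a linear scan handles comfortably.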

Implications for Software Development and Future Research

The findings from the FGDM framework present significant implications for the field of software development, particularly in the realm of automated quality assurance and maintenance. By demonstrating superior performance in identifying and repairing bugs in both Python and C programs, the framework offers a potential pathway to more efficient and reliable software engineering processes.

The substantial reductions in Levenshtein distance and the high cosine similarity scores reported for FGDM suggest a future where a significant portion of bug detection and initial repair could be automated, and completed faster than with manual processes or less sophisticated automated tools. This could translate into shorter development cycles, lower maintenance costs, and higher-quality deployed software. For developers, the framework could act as an intelligent assistant that offers precise bug locations and suggests accurate fixes, freeing up time for architectural design and innovation rather than tedious debugging.

Furthermore, the reliance on LLMs with Chain-of-Thought (COT) and Tree-of-Thoughts (TOT) prompts indicates a growing trend towards more explainable and robust AI applications in critical domains like software engineering. The sequential multi-agent approach, combined with the integration of a knowledge base like FAISS, sets a precedent for how complex reasoning tasks can be compartmentalized and executed by specialized AI components, offering a modular and scalable solution for future advancements in automated code analysis.

What's Next: Expanding and Refining Automated Code Repair

While the present research successfully demonstrates the efficacy of the FGDM framework, the nature of scientific inquiry suggests continuous opportunities for expansion and refinement. The current methodology, tested across 100 programs from various projects in C and Python, has laid a strong foundation. Future work could potentially explore the framework's applicability to a broader spectrum of programming languages and paradigms, assessing its performance against a wider array of software complexities and domains.

Further investigation might focus on enhancing the granularity and interpretability of the bug detection and repair processes. While COT and TOT prompts contribute to reasoning awareness, delving deeper into how these prompts can be optimized for even more nuanced bug types or context-specific errors could be a productive avenue. The integration with the FAISS vector database, which retrieves similar previous bugs, could also be expanded. Researchers might explore advanced retrieval mechanisms or hybrid approaches that combine historical data with real-time analysis to offer even more adaptive repair solutions. Additionally, exploring the framework’s performance in real-time integration into continuous integration/continuous deployment (CI/CD) pipelines, where automated bug detection and repair could offer immediate feedback, represents a practical and impactful direction for future research.

Research Information

Institution: arXiv CS
Original Study: arXiv:2604.24831v1
Source: arXiv CS

About ICANEWS

ICANEWS is a global research journal for emerging researchers, publishing student and emerging researcher work across all fields.