Overview
This research investigates how routing accuracy degrades in production Large Language Model (LLM) assistants as the catalog of specialized tools and agents expands. The study focuses on single-step routing within a deployed enterprise productivity assistant, measures performance at scales from 10 to 110 agents, and identifies the factors that contribute to the accuracy decline. It also evaluates embedding-based shortlisting as a recovery mechanism.
Research Context
Production LLM assistants are increasingly employed to route user requests to specialized tools. As these tool libraries grow, a key concern is how routing accuracy holds up at scale. The study addresses this for the hardest case: under-specified requests directed at a growing library of specialized tools.
Approach
The study used a catalog of 110 agents and 584 tools derived from a deployed enterprise productivity assistant. The evaluation focused on single-step routing. Three frontier LLMs were assessed at scales ranging from 10 to 110 agents. The primary metric was routing F1 on under-specified requests.
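The summary does not state how routing F1 is computed. As a minimal sketch, assuming each request has a set of gold agents and the router returns a set of predicted agents, a per-request set-based F1 averaged over requests could look like the following. The function names and the example agent names are hypothetical, not taken from the study.

```python
def routing_f1(predicted, gold):
    """Set-based F1 for a single request (assumed definition, not the study's)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing expected, nothing routed
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_routing_f1(examples):
    """Average per-request F1 over (predicted, gold) pairs."""
    scores = [routing_f1(predicted, gold) for predicted, gold in examples]
    return sum(scores) / len(scores)


# Hypothetical requests: one exact match, one with an extra agent predicted.
examples = [
    (["calendar"], ["calendar"]),
    (["calendar", "email"], ["calendar"]),
]
average_f1 = mean_routing_f1(examples)
```

Under this definition, routing to an extra, incorrect agent lowers precision and therefore F1 even when the correct agent is included.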
An oracle analysis was conducted to decompose observed routing degradation into two components:
- Retrieval gap: Attributed to the model's inability to surface the correct tool.
- Confusion gap: The drop in the oracle ceiling that remains even with perfect retrieval, suggesting that the tools themselves become harder to tell apart as the catalog grows.
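One way to read this decomposition, as an assumption rather than the study's stated procedure, is that the confusion gap is the oracle ceiling at small scale minus the oracle ceiling at full scale, and the retrieval gap is the oracle ceiling at full scale minus the observed F1 at full scale. The sketch below does that arithmetic. The function name and all input values are hypothetical placeholders, not results from the study.

```python
def decompose_degradation(oracle_small, oracle_full, observed_full):
    """Split degradation into two gaps (assumed reading of the oracle analysis).

    oracle_small:  oracle-ceiling F1 at the smallest catalog size
    oracle_full:   oracle-ceiling F1 at the full catalog size
    observed_full: F1 actually achieved by the router at full size
    """
    confusion_gap = oracle_small - oracle_full   # lost even with perfect retrieval
    retrieval_gap = oracle_full - observed_full  # lost by not surfacing the right tool
    return {"confusion_gap": confusion_gap, "retrieval_gap": retrieval_gap}


# Placeholder inputs for illustration only.
gaps = decompose_degradation(oracle_small=0.90, oracle_full=0.80, observed_full=0.70)
```

Under this reading, the two gaps point to different remedies: better candidate retrieval addresses the retrieval gap, while the confusion gap calls for making tools easier to tell apart.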
As a recovery mechanism, embedding-based shortlisting was implemented and assessed across all three models and two providers. Its performance was further validated on real traffic through a production annotation study of 1,435 utterances labeled by three human annotators.
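The summary does not describe how the shortlisting was implemented. A minimal sketch of the general technique, assuming each agent has a precomputed embedding and the request is embedded with the same model, ranks agents by cosine similarity and keeps the top k as candidates for the router. The toy vectors, agent names, and the value of k below are hypothetical; a real system would obtain embeddings from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def shortlist(query_embedding, agent_embeddings, k):
    """Return the names of the k agents most similar to the query."""
    ranked = sorted(
        agent_embeddings,
        key=lambda name: cosine_similarity(query_embedding, agent_embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional embeddings for illustration only.
agent_embeddings = {
    "calendar": [1.0, 0.0, 0.0],
    "email": [0.0, 1.0, 0.0],
    "expenses": [0.0, 0.0, 1.0],
}
candidates = shortlist([0.9, 0.3, 0.0], agent_embeddings, k=2)
```

The router then chooses only among the shortlisted candidates instead of the full catalog, which is the intuition behind using shortlisting to counter degradation at scale.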
Findings
- Routing F1 scores on under-specified requests decreased by 16 to 23 percentage points across the three evaluated models when scaling from 10 to 110 agents.
- The oracle analysis indicated a retrieval gap as a contributor to degradation.
- The analysis also identified a confusion gap: even with perfect retrieval, the oracle ceiling dropped by 10 percentage points.
- Embedding-based shortlisting recovered +10 to +11 percentage points in F1 at full scale across all three models and two providers in the initial evaluations.
- The production annotation study, using 1,435 human-labeled utterances from three annotators, confirmed the recovery on real traffic: shortlisting recovered +10 to +17 percentage points in F1, although absolute performance was 10 to 15 percentage points lower than in the initial evaluations.
Why This Matters
The findings quantify the performance degradation of LLM agent routing systems in enterprise settings as the number of agents and tools increases. Identifying both retrieval and confusion as contributing factors provides specific areas for research and development. The demonstrated effectiveness of embedding-based shortlisting as a recovery mechanism offers a practical approach to mitigating accuracy losses in production environments.