Odysseys Benchmark Evaluates Web Agents on Realistic Long-Horizon, Multi-Site Web Tasks

arXiv CS · 7 min read · Engineering & Technology


Key Takeaways

  • Existing web agent benchmarks largely focus on short, single-site tasks where frontier models are approaching saturation.
  • Real-world web use involves long-horizon, multi-site workflows requiring sustained context and cross-site reasoning over potentially hours of browsing.
  • Odysseys, a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet, has been introduced.
  • Binary pass/fail evaluation is inadequate for long-horizon settings; a rubric-based evaluation, with an average of 6.1 graded rubrics per task, yields higher agreement with humans and provides a more fine-grained signal.
  • Leading frontier models achieve a success rate of 44.5% on Odysseys, indicating substantial room for improvement.
  • Efficiency is a first-class concern for long-horizon agents, and the Trajectory Efficiency metric (rubric score per step) reveals even frontier agents achieve only 1.15%, highlighting a need for efficient agents.

Why This Matters

This research provides a more realistic benchmark for evaluating web agents, revealing current limitations in handling complex, long-duration web tasks and highlighting the critical need for improvements in both success rates and operational efficiency for practical applications.

Introduction to Odysseys: A New Benchmark for Web Agent Performance

In the evolving landscape of artificial intelligence, web agents are designed to autonomously navigate and interact with the internet. While significant progress has been made, the assessment of these agents has often relied on benchmarks that may not fully reflect the complexities of real-world web usage. A recently introduced benchmark, named Odysseys, aims to address this by focusing on 'long-horizon' and 'multi-site' tasks, providing a more realistic evaluation framework for web agents.

The research, titled "Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks" and published on arXiv, highlights a critical gap in current evaluation methodologies. Existing web agent benchmarks have 'largely converged on short, single-site tasks' where leading models are reportedly 'approaching saturation'. This suggests that while models perform well on simpler, confined tasks, their capabilities in navigating the more intricate and prolonged interactions characteristic of real-world browsing remain underexplored and, potentially, limited.

Odysseys seeks to bridge this gap by presenting a new set of evaluation criteria derived from actual human browsing sessions. The benchmark integrates elements that demand 'sustained context and cross-site reasoning over potentially hours of browsing', thereby offering a robust platform to measure the proficiency of web agents in scenarios that mirror human activity.

The Research Goal: Measuring Long-Horizon Web Agent Proficiency

The primary research goal behind Odysseys is to capture and evaluate the 'long-horizon, multi-site workflows' that characterize 'real world web use'. The study emphasizes that common web tasks, such as 'comparing products across different domains', 'planning trips across multiple services', or 'summarizing information from multiple search queries', require an agent to maintain context across pages and sites, integrate information from disparate sources, and execute a sequence of steps over an extended period. Existing benchmarks were not designed to assess these capabilities, hence the introduction of Odysseys.

By focusing on these more complex and prolonged interactions, Odysseys aims to provide a more accurate and challenging test environment for the next generation of web agents. The benchmark's design is intended to 'isolate the critical evaluation of long-horizon proficiency in open-web environments', thereby providing 'a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours'. This objective reflects a move towards agents that not only complete tasks but also stay context-aware and efficient over prolonged operation.

Key Findings: Performance Gaps and Evaluation Improvements

The study derived several key findings from applying the Odysseys benchmark to leading models. These findings reveal significant limitations in current web agent capabilities and propose an enhanced evaluation methodology. One of the most striking findings is the performance of 'strongest models' on the new benchmark. These models achieved 'a success rate of 44.5%', which the researchers explicitly state 'leaves substantial room for future improvements'. This indicates that despite advancements, even the most capable agents are far from perfectly handling complex, real-world web navigation tasks.

Addressing Evaluation Deficiencies: Beyond Binary Pass/Fail

A significant methodological improvement introduced by Odysseys concerns the evaluation approach itself. The researchers found that 'binary pass/fail evaluation is inadequate for long-horizon settings'. This traditional method, which only records whether a task was completed, fails to capture the nuances of prolonged, multi-step tasks in which partial success and efficient execution also matter. To overcome this limitation, Odysseys introduces 'a rubric-based evaluation'. Each Odysseys task is 'annotated with an average of 6.1 graded rubrics'. This multi-faceted evaluation provides a 'more fine-grained signal' than 'commonly used trajectory-level LLM-as-a-judge evaluation metrics'. The study also reports that the rubric-based approach 'yields higher agreement with humans', suggesting a more reliable and meaningful assessment of agent performance.
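To make the contrast between the two evaluation styles concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the article does not say how individual rubrics are graded or weighted, so the sketch assumes each rubric is a yes/no criterion with equal weight. The names `Rubric`, `rubric_score`, `binary_pass`, and the example task are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    description: str  # hypothetical sub-goal text, not taken from Odysseys
    satisfied: bool   # assumed yes/no verdict for this criterion


def rubric_score(rubrics: list[Rubric]) -> float:
    """Fraction of criteria satisfied (assumption: equal weights)."""
    if not rubrics:
        raise ValueError("a task needs at least one rubric")
    return sum(r.satisfied for r in rubrics) / len(rubrics)


def binary_pass(rubrics: list[Rubric]) -> bool:
    """Pass/fail view: succeed only if every criterion is satisfied."""
    return all(r.satisfied for r in rubrics)


# Hypothetical trip-planning task with six criteria, four of them met.
trip_task = [
    Rubric("Found a flight within budget", True),
    Rubric("Found a hotel near the venue", True),
    Rubric("Flight and hotel dates match", True),
    Rubric("Compared at least two booking sites", True),
    Rubric("Reported the total cost", False),
    Rubric("Summarized the itinerary", False),
]

print(f"rubric score: {rubric_score(trip_task):.2f}")  # 0.67
print(f"binary pass:  {binary_pass(trip_task)}")       # False
```

Under these assumptions, the pass/fail view records the run as a plain failure, while the rubric view shows that two thirds of the work was done. That difference is the 'more fine-grained signal' the study refers to.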

Efficiency as a Critical Metric for Long-Horizon Agents

Beyond task success, the research emphasizes the importance of efficiency. The authors argue that 'efficiency is a first-class concern for long-horizon agents'. An agent might eventually complete a task, but if it takes an excessively long time or consumes too many resources, its practical utility diminishes. To quantify this, the study introduces a new metric, 'Trajectory Efficiency', defined as 'rubric score per step'. By measuring how much progress (as indicated by the rubric score) an agent makes for each step it takes, the researchers can assess its operational efficiency. The result is stark: 'even frontier agents achieve only 1.15%' in Trajectory Efficiency, a figure the authors describe as 'marking an evident need for agents that can succeed efficiently and not simply eventually'. Current agents, while capable of achieving some success, often do so inefficiently, which is a critical bottleneck for real-world application.
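The 'rubric score per step' definition can be sketched in a few lines of Python. This is an illustration only: the article gives the definition but not the exact aggregation, so averaging per-task values across runs is an assumption here, and the function names and all numbers below are hypothetical rather than taken from the benchmark.

```python
def trajectory_efficiency(rubric_score: float, num_steps: int) -> float:
    """Rubric score earned per step taken in one trajectory."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score / num_steps


def mean_trajectory_efficiency(runs: list[tuple[float, int]]) -> float:
    """Average per-task efficiency (assumption: unweighted mean over tasks)."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(trajectory_efficiency(s, n) for s, n in runs) / len(runs)


# Hypothetical run: half of the rubrics satisfied after 50 steps.
print(f"{trajectory_efficiency(0.5, 50):.2%}")  # 1.00%

# The same rubric score reached in half the steps doubles the efficiency.
print(f"{trajectory_efficiency(0.5, 25):.2%}")  # 2.00%

# Hypothetical (rubric score, steps) pairs for three tasks.
runs = [(0.5, 50), (1.0, 40), (0.25, 100)]
print(f"{mean_trajectory_efficiency(runs):.2%}")
```

Read this way, a value near 1% means an agent earns roughly one percentage point of rubric score per step. The metric therefore rewards reaching the same outcome in fewer steps, which is the distinction the authors draw between succeeding efficiently and succeeding eventually.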

Methodology: Constructing a Realistic Benchmark

The methodology employed for Odysseys involved the creation of '200 long-horizon web tasks'. A crucial aspect of this methodology is that these tasks were 'derived from real world browsing sessions'. This grounding in actual human behavior ensures the benchmark's relevance and realism. Furthermore, the evaluation of these tasks was conducted 'on the live Internet', an important distinction from simulations or static datasets, which might not capture the dynamic nature and potential variability of the real web. The use of the live internet means that agents must contend with actual web pages, forms, and services, including their inherent complexities and occasional unreliability, mirroring the experiences of human users.

The development of the rubric-based evaluation system was central to the methodology. By annotating each task with an average of 6.1 graded rubrics, the researchers moved beyond a simple binary assessment. These rubrics likely break down complex tasks into smaller, measurable sub-goals, allowing for a more granular understanding of where an agent succeeds or fails. This detailed annotation process, and its impact on human agreement, is a direct result of the methodical approach taken in developing Odysseys.

Implications for Future Web Agent Development

The findings from Odysseys carry significant implications for the future development of web agents. The observed '44.5% success rate' of even 'the strongest models' clearly indicates that current artificial intelligence models still have 'substantial room for future improvements' in handling complex web tasks. This suggests a need for research and development efforts to focus on enhancing agent capabilities in areas such as 'sustained context' management, 'cross-site reasoning', and robust error handling across diverse web environments.

The introduction of the Trajectory Efficiency metric and the finding that agents achieve only '1.15%' together show that merely reaching a solution is insufficient. Future web agents must be efficient as well as effective to be truly useful in real-world scenarios. This calls for a shift in focus towards optimizing agent strategies, reducing unnecessary steps, and improving decision-making so that desired outcomes are reached with minimal overhead.

The benchmark's emphasis on tasks derived from 'real world browsing sessions' and evaluation on the 'live Internet' underscores the need for agents that are robust and adaptable to the dynamic and unpredictable nature of the open web. This means moving beyond highly structured or simplified test environments to develop agents that can operate effectively in messy, real-world conditions.

Ultimately, Odysseys provides 'a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours', laying a clear path for future research and engineering efforts.

What's Next: Accessing the Odysseys Benchmark

To facilitate further research and development in web agent capabilities, the creators of Odysseys have made the benchmark publicly available. The official statement indicates: "We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev". This open access allows researchers and developers worldwide to test their own web agents against the challenging tasks presented in Odysseys, compare their performance against frontier models, and contribute to the advancement of long-horizon web agent technology. The availability of evaluation scripts is particularly valuable, as it enables consistent and reproducible assessment of agent performance using the rubric-based methodology. This resource is poised to become a valuable tool for driving innovation in a field that seeks to create more capable and autonomous web agents.

Research Information

Institution
arXiv CS

About ICANEWS

ICANEWS is a global research journal for emerging researchers, publishing student and emerging researcher work across all fields.