JARVIS: A VLM-Driven AR Instruction System for Cross-Reality Task Guidance Improves Usability and Success

arXiv CS · April 15, 2026 · 7 min read · Engineering & Technology

Read research and analysis on JARVIS: A VLM-Driven AR Instruction System for Cross-Reality Task Guidance Improves Usability and Success published by ICANEWS, a global research journal for emerging researchers.

Key Takeaways

JARVIS is a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback.
A formative study of cross-reality tasks identified key requirements for state awareness and cross-reality coordination, categorizing tasks into real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V).
A within-subjects study (N=14) across four domains showed JARVIS improves usability, workload, success rate, and visualization effectiveness over baselines.

Why This Matters

JARVIS addresses the issue of workflow disruption and increased cognitive load caused by traditional tutorials in everyday tasks. By providing in-situ, adaptive AR guidance, it aims to enhance efficiency and reduce mental burden in increasingly common hybrid physical and virtual workspaces.

JARVIS: Advancing Just-in-Time AR Visual Instruction for Cross-Reality Tasks

JARVIS is an augmented reality (AR) instruction system built for workflows that mix digital and physical steps. Driven by a vision-language model (VLM), it provides just-in-time, contextualized assistance for ‘cross-reality’ tasks, which blend physical and virtual actions. By building on recent advances in large language models (LLMs) and VLMs, JARVIS aims to overcome the limitations of traditional tutorials, which frequently interrupt the workflow and impose a significant cognitive burden on users.

Learning and carrying out everyday tasks conventionally requires users to alternate between consulting external resources, such as manuals or video guides, and performing the actual steps. This constant switching disrupts the user's flow and raises cognitive load, which can reduce efficiency and increase the likelihood of errors. JARVIS addresses these problems with in-situ guidance: instructions are delivered contextually and adaptively within the user's immediate environment.

The Core Challenge: Bridging Physical and Virtual Workspaces

JARVIS responds to a gap in existing AI-powered AR tutorial systems. Such systems have made strides in supporting physical procedural tasks, but their support for hybrid physical and virtual workspaces has been limited. Combining actions across real and virtual environments raises two particular challenges: keeping track of the task's current state, and coordinating precisely between the two realities.

A formative study investigated these ‘cross-reality tasks’ and set out the requirements for effective guidance in such scenarios. It focused on two requirements: ‘state awareness’, the system's comprehension of the user's progress and environment, and ‘cross-reality coordination’, the synchronization of instructions for actions spanning the physical and virtual domains. The findings directly informed the architecture and functional specifications of JARVIS.

Introducing JARVIS: A VLM-Driven Solution

JARVIS is an AR instruction system powered by vision-language models (VLMs), which allow it to generate contextual, step-by-step guidance from a single initial prompt. A central feature is ‘real-time state verification’: the system continuously monitors the user's actions and the environment to confirm that steps are being executed correctly and to track the progress of the task. When inconsistencies or deviations occur, JARVIS provides ‘adaptive visual feedback’, adjusting its guidance dynamically to bring the user back in line with the task objectives.

The system's ability to process and interpret both visual information from the real world and linguistic instructions underpins its effectiveness. By leveraging VLMs, JARVIS can effectively 'see' what the user is doing and 'understand' the task instructions simultaneously, translating this understanding into actionable, visual overlays and prompts within the augmented reality environment. This dynamic interaction minimizes the need for users to mentally switch contexts, thereby preserving workflow continuity.
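The loop described above (show a step, check the observed state, and give corrective feedback on a mismatch) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the JARVIS implementation: the names Step, run_guidance, observe, matches and show are all hypothetical, the steps are hard-coded rather than generated by a VLM from a prompt, and plain string comparison stands in for the VLM's judgment of the scene.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    instruction: str     # text that would be rendered as an AR overlay
    expected_state: str  # description of the scene once the step is done


def run_guidance(
    steps: List[Step],
    observe: Callable[[], str],           # hypothetical: returns the current scene state
    matches: Callable[[str, str], bool],  # hypothetical: stands in for VLM verification
    show: Callable[[str], None],          # hypothetical: renders text to the user
    max_checks_per_step: int = 3,
) -> bool:
    """Walk through the steps, verifying the state after each one.

    Returns True if every step was verified, False if any step was still
    unverified after max_checks_per_step observations.
    """
    for step in steps:
        show(step.instruction)
        for _ in range(max_checks_per_step):
            if matches(observe(), step.expected_state):
                break  # step verified, move on to the next one
            # Adaptive feedback: repeat the instruction as a correction.
            show(f"Not there yet: {step.instruction}")
        else:
            return False  # ran out of checks without verifying this step
    return True


# Scripted demo: the first observation does not match, the next two do.
scripted = iter(["lid closed", "lid open", "cable plugged in"])
shown: List[str] = []
ok = run_guidance(
    steps=[
        Step("Open the lid", "lid open"),
        Step("Plug in the cable", "cable plugged in"),
    ],
    observe=lambda: next(scripted),
    matches=lambda seen, expected: seen == expected,
    show=shown.append,
)
print(ok, shown)
```

Passing the observation, matching and display functions in as parameters keeps the control flow separate from the perception model, so the same loop could in principle be driven by a camera feed, a virtual scene graph, or scripted test input as here.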

Formative Study Highlights: Categorizing Cross-Reality Tasks

To ensure JARVIS was precisely tailored to the needs of cross-reality tasks, the development team conducted a formative study. This study was instrumental in characterizing the various types of cross-reality interactions and identifying specific guidance requirements for each. The research categorized cross-reality tasks into four distinct types:

Real-to-Real (R2R): Tasks that primarily involve interactions within the physical world, where AR supplements the physical actions.
Real-to-Virtual (R2V): Tasks where actions performed in the real world directly influence or control elements within a virtual environment.
Virtual-to-Real (V2R): Tasks where actions or instructions originating from a virtual environment are acted upon within the physical world.
Virtual-to-Virtual (V2V): Tasks that are entirely confined to virtual environments, but where AR guidance can still enhance the user experience, perhaps by visualizing complex virtual instructions in an overlaid manner.

This detailed categorization provided a robust framework for understanding the diverse applications and challenges that JARVIS needed to address, ensuring its design accounted for the complexities inherent in each task type. The insights gleaned from this formative study informed the development of JARVIS's adaptive feedback mechanisms and its state awareness capabilities across these varied scenarios.
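One way to read the four labels is as a pair: the realm where the user's action takes place, and the realm where its effect lands. The sketch below encodes that reading as a small lookup. The Realm and TaskType enums and the classify function are hypothetical names introduced for illustration, and the source-and-target interpretation is an assumption drawn from the descriptions above rather than a definition given in the study.

```python
from enum import Enum


class Realm(Enum):
    REAL = "R"
    VIRTUAL = "V"


class TaskType(Enum):
    R2R = "real-to-real"
    R2V = "real-to-virtual"
    V2R = "virtual-to-real"
    V2V = "virtual-to-virtual"


def classify(source: Realm, target: Realm) -> TaskType:
    """Map (where the action starts, where its effect lands) to a task type."""
    # Builds the member name, e.g. "R" + "2" + "V" -> TaskType.R2V
    return TaskType[f"{source.value}2{target.value}"]


# A physical action that controls a virtual element falls under R2V.
print(classify(Realm.REAL, Realm.VIRTUAL).value)
```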

Empirical Validation: A Within-Subjects Study

To evaluate JARVIS, the researchers ran a within-subjects study with fourteen participants (N=14) across four domains, which allowed the system to be assessed in varied contexts. In a within-subjects design every participant experiences all experimental conditions, so differences between individuals do not confound the comparison, and performance under each instructional condition can be compared directly.
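Because every participant sees every condition in a within-subjects design, the order of conditions is usually varied across participants so that practice or fatigue does not favor one condition. The article does not say how the JARVIS study ordered its conditions or how many baselines it used, so the following is only a generic sketch of one common approach, rotating the order with a cyclic Latin square. The condition names and both function names are hypothetical.

```python
from typing import List, Sequence


def latin_square_orders(conditions: Sequence[str]) -> List[List[str]]:
    """Return the cyclic rotations of `conditions`, one per row.

    Each condition appears exactly once in every row and every column.
    """
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(num_participants: int, conditions: Sequence[str]) -> List[List[str]]:
    """Give each participant one row of the square, cycling through the rows."""
    orders = latin_square_orders(conditions)
    return [orders[p % len(orders)] for p in range(num_participants)]


# Hypothetical condition names; the number of baselines is an assumption.
conditions = ["JARVIS", "baseline_1", "baseline_2"]
for participant, order in enumerate(assign_orders(14, conditions), start=1):
    print(participant, order)
```

With 14 participants and three conditions the rows cannot be used equally often, so the balance is approximate. A cyclic square also balances only position, not which condition precedes which.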

The study compared JARVIS against baselines, which likely represent traditional tutorial methods or existing AR instruction systems. The evaluation metrics covered several dimensions of user experience and task success:

Usability: Measuring the ease of use and learnability of the system. This often involves participant feedback on the system's interface, intuitiveness, and overall user satisfaction.
Workload: Quantifying the cognitive effort required by users to complete tasks. Reduced workload is a key objective for AR instruction systems, as it directly addresses the problem of cognitive load inherent in traditional methods.
Success Rate: Assessing the proportion of tasks or task steps completed accurately and efficiently by participants using JARVIS, as compared to baseline methods. A higher success rate indicates improved task execution.
Visualization Effectiveness: Evaluating how well the visual cues, overlays, and instructions provided by JARVIS contributed to task understanding and performance. This metric is crucial for AR systems, where visual guidance is a primary mode of instruction.

The study reports improvements on every one of these measures: JARVIS improved usability, workload, success rate, and visualization effectiveness over the baselines. This suggests that the system's VLM-driven contextual guidance, real-time state verification, and adaptive visual feedback are effective in supporting users in cross-reality task environments.

"JARVIS improves usability, workload, success rate, and visualization effectiveness over baselines."

Implications for Future Cross-Reality Task Guidance

The development and validation of JARVIS have implications for task guidance, particularly in domains requiring intricate interactions between physical and virtual elements. An AR instruction system of this kind points toward more intuitive and less cognitively demanding ways of learning and performing complex tasks. It could benefit training, maintenance, assembly, and other industrial and personal applications where hybrid physical-digital workflows are increasingly common.

The ability of JARVIS to automatically generate contextual, step-by-step guidance from a single prompt, coupled with its real-time state verification and adaptive feedback, positions it as a powerful tool for reducing errors and increasing efficiency. This approach moves beyond static instruction sets, offering a dynamic and responsive learning environment that adapts to the user's progress and immediate needs.

Furthermore, the detailed categorization of cross-reality tasks into R2R, R2V, V2R, and V2V provides a valuable framework for future research and development in this nascent field. This understanding ensures that upcoming AR instruction systems can be designed with a granular appreciation for the specific challenges presented by different types of hybrid tasks, leading to more targeted and effective solutions.

Looking Ahead: Enhancing State Awareness and Coordination

The research underscores the critical importance of 'state awareness' and 'cross-reality coordination' for effective AR guidance in hybrid environments. JARVIS's design, which explicitly addresses these requirements, represents a significant step forward. Future enhancements might involve even more sophisticated VLM capabilities for understanding complex task states, anticipating user intentions, and providing proactive, rather than merely reactive, guidance.

The system's modular architecture, which builds on LLMs and VLMs, suggests a high degree of adaptability and potential for integration with other AI technologies. As these underlying models continue to evolve, JARVIS and similar systems are likely to become more capable, handling a broader range of tasks with greater accuracy and nuance. The implications extend beyond industrial settings to education, personal assistance, and entertainment, where real-time, context-aware guidance can open up new possibilities for interaction and learning.

Work building on JARVIS may further narrow the gap between the physical and the digital, making complex cross-reality tasks more accessible and less burdensome for users. It also strengthens the case for augmented reality as a practical tool for everyday and specialized tasks.

Research Information

Institution: arXiv CS
Source: arXiv CS

About ICANEWS

ICANEWS is a global research journal for emerging researchers, publishing student and emerging researcher work across all fields.