Large Language Models Found to Corrupt Documents During Delegated Tasks
A recent arXiv paper, 'LLMs Corrupt Your Documents When You Delegate,' reports significant reliability problems with Large Language Models (LLMs) in delegated work. The findings indicate that current LLMs introduce errors and silently corrupt documents, a critical issue given their growing role in knowledge work. The study, which tested models across 52 professional domains, highlights the risks of entrusting LLMs with document-editing tasks over long workflows.
The Rise of Delegated Work and the Need for Trust
Large Language Models are rapidly transforming the landscape of knowledge work. A new interaction paradigm, referred to as 'delegated work' (e.g., vibe coding), is emerging, where users assign complex tasks to these AI systems. This shift fundamentally relies on trust – the expectation that an LLM will execute a task faithfully, without introducing unintended errors into critical documents. The integrity of this delegation process is paramount, as the utility of LLMs in professional settings hinges on their ability to maintain data fidelity.
The research emphasizes that for LLMs to be truly effective and widely adopted in delegated workflows, they must demonstrably perform tasks without compromising the content they are asked to manage or modify. Any degradation, even subtle, can have significant implications for the quality and reliability of knowledge work outputs. This critical need for trustworthiness formed the foundation for the investigation into LLM performance in such scenarios.
Introducing DELEGATE-52: A Comprehensive Evaluation Framework
To systematically study the readiness of AI systems in delegated workflows, researchers introduced a new benchmark called DELEGATE-52. This novel framework was specifically designed to simulate long, complex delegated workflows. These workflows necessitate in-depth document editing across a broad spectrum of professional domains. The breadth of coverage in DELEGATE-52 is remarkable, encompassing 52 distinct areas, including highly specialized fields such as coding, crystallography, and music notation.
The design of DELEGATE-52 addresses the limitations of previous evaluations by focusing on the cumulative effects of delegation over time. Unlike short, isolated tasks, DELEGATE-52 models scenarios where LLMs are engaged in sustained interactions, requiring them to manage and modify content iteratively. This approach aims to provide a more realistic assessment of how LLMs would perform in real-world professional contexts where tasks often involve multiple steps and prolonged engagement with documents.
Widespread Document Degradation Across LLM Models
A large-scale experiment using the DELEGATE-52 framework evaluated 19 Large Language Models. The results were stark: current models consistently degrade documents during delegation. Even the frontier models named in the study (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) were not immune, corrupting an average of 25% of document content by the end of long workflows.
"Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely."
Severity varied among the models: the frontier models corrupted an average of 25% of document content by the end of long workflows, and the other models degraded documents even more. The problem is therefore not confined to less sophisticated systems. It appears across the spectrum of available LLMs, which suggests a fundamental limitation of current models when they are applied to delegated document editing.
Factors Exacerbating Document Corruption
The research also delved into additional experiments to identify factors that might influence the severity of document degradation. These experiments revealed several key variables that exacerbate the problem. Specifically, the degradation severity was found to be intensified by document size, the length of interaction, and the presence of distractor files.
- Document Size: As documents grow larger, the propensity for LLMs to introduce errors or corruption increases. This suggests that the complexity associated with managing and editing more extensive content might overwhelm current models, leading to a higher error rate.
- Length of Interaction: The longer an LLM is engaged in an interaction or workflow, the more severe the degradation becomes. This cumulative effect highlights that errors are not isolated incidents but compound over time, making long-term delegated tasks particularly risky.
- Presence of Distractor Files: When extraneous or irrelevant files are present alongside the primary documents an LLM is tasked with editing, the degradation severity increases. This implies that LLMs struggle with contextual filtering and may be prone to errors when the information environment becomes noisier or less focused.
These findings provide crucial insights into the conditions under which LLMs are most likely to underperform in delegated editing tasks. Understanding these contributing factors is essential for designing more robust systems and for users to be aware of the inherent limitations.
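The length-of-interaction effect can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that every editing step independently preserves the same fixed fraction of the document. The paper is not described as using this model, and the function names and step counts are hypothetical.

```python
def preserved_after(steps: int, per_step_retention: float) -> float:
    """Fraction of content still intact after `steps` edits.

    Assumption (not from the paper): each edit independently preserves
    `per_step_retention` of whatever content is still intact.
    """
    return per_step_retention ** steps


def retention_for_target(steps: int, final_preserved: float) -> float:
    """Per-step retention needed to end with `final_preserved` intact."""
    return final_preserved ** (1.0 / steps)


if __name__ == "__main__":
    # 0.75 intact corresponds to the 25% average corruption reported for
    # frontier models; the workflow lengths below are hypothetical.
    for steps in (5, 10, 20):
        r = retention_for_target(steps, 0.75)
        print(f"{steps:>2} steps: each step must preserve {r:.2%} of content")
```

Under this assumption, a 10-step workflow that ends with 75% of the content intact implies that each step preserved roughly 97% of it. A per-step loss that small would be easy to overlook in any single edit, which is one way to read the compounding effect described above.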
Agentic Tool Use Does Not Improve Performance
Another significant finding concerns agentic tool use. The researchers ran additional experiments to test whether giving models such tools could mitigate the observed document degradation. It did not: agentic tool use does not improve performance on DELEGATE-52.
This finding is particularly noteworthy because agentic tool use is often proposed as a method to enhance LLM capabilities, allowing them to interact with external systems or perform more complex actions. However, in the context of delegated document editing and the cumulative errors measured by DELEGATE-52, this approach did not yield positive results. This suggests that the issues leading to document corruption are not merely a matter of lacking external capabilities but may stem from deeper intrinsic limitations in how LLMs process and maintain document integrity over time.
The Nature of Errors: Sparse but Severe
The analysis conducted by the researchers illuminated the specific characteristics of the errors introduced by current LLMs. The study concluded that these LLMs are 'unreliable delegates' because they introduce errors that are "sparse but severe." This means that while errors might not occur on every single character or line, when they do appear, their impact on the document's content is significant.
Furthermore, these errors are described as silently corrupting documents. This implies that the degradation is not immediately obvious or flagged by the LLM itself, making it difficult for human users to detect without thorough, painstaking review. The insidious nature of these silent corruptions, combined with their severe impact, poses a substantial risk to the accuracy and trustworthiness of documents processed by LLMs, especially when these errors compound over long interactions.
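One way a user might surface silent changes is to diff the document before and after each delegated edit and flag anything outside the region the edit was supposed to touch. The following is a minimal sketch using Python's standard difflib module, not the method used in the paper. The function names and the idea of an "allowed" line range are assumptions made for illustration, and a line-level diff only catches textual changes, not edits that are wrong within the allowed region.

```python
import difflib


def changed_line_ranges(before: str, after: str) -> list[tuple[int, int]]:
    """Half-open, 0-based line ranges of `before` that were replaced or deleted.

    Insertions appear as zero-width ranges (i, i) at the insertion point.
    """
    matcher = difflib.SequenceMatcher(
        a=before.splitlines(), b=after.splitlines(), autojunk=False
    )
    return [
        (i1, i2)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def unexpected_changes(
    before: str, after: str, allowed: tuple[int, int]
) -> list[tuple[int, int]]:
    """Changed ranges that are not fully inside the hypothetical `allowed` range."""
    lo, hi = allowed
    return [
        (start, end)
        for start, end in changed_line_ranges(before, after)
        if start < lo or end > hi
    ]


if __name__ == "__main__":
    original = "title\nalpha\nbeta\ngamma"
    edited = "title\nALPHA\nbeta\ngama"  # line 1 was requested; line 3 was not
    print(unexpected_changes(original, edited, allowed=(1, 2)))
```

In this example the requested change to line 1 passes, while the unrequested change to line 3 is reported as the range (3, 4). A check like this reduces, but does not remove, the need for the careful human review the study calls for.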
The compounding effect over long interactions suggests a snowballing problem: small, silent corruptions accumulate over time, potentially rendering a document significantly altered or unusable without extensive manual correction. This highlights a fundamental challenge in relying on LLMs for workflows that demand high fidelity and sustained accuracy.
Implications for Knowledge Work and Future Development
The findings from the DELEGATE-52 study have significant implications for the adoption and development of LLMs in knowledge work. The core problem identified is that these systems cannot yet be trusted to carry out delegated document-editing tasks without damaging the content. This limits the potential for LLMs to 'disrupt' knowledge work in a positive and reliable manner.
For industries and professionals considering the integration of LLMs for document-intensive tasks, this research serves as a crucial caution. The expectation of flawless execution, which is foundational to the concept of delegation, is currently unmet by even the most advanced LLMs. The observed corruption rates and the nature of these errors – sparse but severe, and silently compounding – necessitate a rethinking of how these technologies are deployed and monitored.
Moving forward, the research suggests an urgent need for advances in LLM capabilities that specifically address document fidelity and error propagation in long, delegated workflows. Future development must focus on mechanisms that prevent silent corruption, detect errors effectively, and maintain the integrity of content regardless of document size, interaction length, or the presence of distractor files. Without these improvements, the promise of LLMs as fully reliable delegates in knowledge work remains distant.