Overview
The Memento framework is proposed to address challenges in long-form video generation, particularly concerning the consistent representation of recurring subjects across varying shots, viewpoints, motions, and scene transitions. It approaches subject preservation as an identity grounding problem, founded on the principle that a reliable memory bank for a subject should facilitate its reconstruction from memory.
Research Context
Long-form video generation necessitates that recurring subjects maintain consistency throughout the video's duration. Existing temporal decomposition methods contribute to scalability by generating videos on a shot-by-shot basis. However, these methods primarily focus on optimizing plausible next-shot continuations. A recognized limitation identified is their lack of verification regarding whether historical memory adequately preserves identity-critical subject evidence. This can lead to recurring subjects becoming diluted, overwritten, or forgotten as the generation process continues.
Approach
Memento operates on the premise that if a memory bank faithfully preserves a subject, it should enable the reconstruction of that subject using only the stored memory. The framework jointly trains two components: autoregressive next-shot generation and memory-based subject reconstruction. The reconstruction component aims to recover target appearances by utilizing historical memory in conjunction with global story captions.
To distinguish between long-range subject evidence and short-range contextual cues, Memento incorporates a dual-query memory mechanism. Within this mechanism, one query is designed to retrieve memory relevant to the subject's identity, while the second query selects short-context keyframes to support coherent continuation of the video sequence.
Furthermore, Memento integrates a subject-aware cinematic data pipeline. This pipeline's function is to provide precise reconstruction supervision. It achieves this through the use of consistent, pronoun-free subject descriptions, which aid in maintaining the integrity of subject representation during the generation process.
Findings
Experiments conducted on the Memento framework indicated that it achieved state-of-the-art performance. This performance was observed across several evaluation criteria: long-term subject consistency, cross-shot coherence, and overall visual quality.