RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

University of Washington University of North Carolina at Chapel Hill

code GitHub Model article arXiv picture_as_pdf Paper

+2.1 dB PSNR over unconditional VAE baselines.

Superior subject consistency across extended sequences.

Enhanced background stability in high-motion scenes.

"an asian city street at night with people and bicycles"

Baseline

RefDecoder

drag_indicator

"a yellow boat is cruising in front of a bridge"

Baseline

RefDecoder

drag_indicator

"a city street at night during a snow storm"

Baseline

RefDecoder

drag_indicator

Abstract

Did you know there's still a free lunch 🍲 in video generation? DiT models gobble up your beautiful image details 🌃 inside their compressed latent space and never give them back during decoding. With RefDecoder, we recover those details straight from your input image and hand them right back to you.

RefDecoder improves video decoding quality by up to 10% over even the best frontier open-source video generation models, including Wan and Hunyuan Video. Human evaluators prefer our outputs over vanilla decoding 80% of the time, with even larger gains in content-rich scenes such as urban streets, indoor venues, and sports action. As a plug-and-play module, RefDecoder drops into any video generation pipeline without re-training the heavy DiT transformer, and you can swap it in and out to compare against the original output. Beyond I2V, our reference-conditioning framework is versatile — the same framework extends to video editing, style transfer, and more, unlocking exciting new applications and efficiency gains that bypass the costly, time-consuming diffusion process.

Overview & Architecture

Architecture overview of the REFDECODER. The reference image is encoded into tokens and injected into the decoder through shared attention blocks, enabling interactions between reference and latent tokens during decoding.

Detailed architecture of RefDecoder. The high-fidelity reference signal is injected via shared attention blocks to maintain spatial-temporal consistency.

The Innovation

RefDecoder improves video VAE decoders by conditioning on a high-fidelity reference signal that bypasses the lossy VAE latent round trip. It integrates seamlessly with existing pipelines like Wan 2.1 without requiring costly retraining of the diffusion model. The reference encoder extracts fine-grained spatial features which are injected into the upsampling layers of the decoder.

Key Advantage

Bypasses the bottleneck of low-dimensional latent spaces by providing a direct path for high-frequency details from the source reference image.

Visual Comparison

Qualitative reconstruction comparison between Wan baseline and Ours across two video sequences.

Video generation comparisons. For the same input reference, three frames generated at different timesteps are shown. The baseline distorts the underlying scene structure, whereas RefDecoder preserves the structure while producing sharper and more consistent details across frames.

Application: Style Transfer

Qualitative results on the style transfer task. The reconstructed images follow the reference style while preserving the structural content of the input images.

Application: Video Editing

Qualitative comparison on LoRA-Edit. Reference: first frame of the source video. The baseline decoder distorts non-edited regions, while RefDecoder regenerates them faithfully from the reference, preserving high-quality details such as the text.

Citation

@misc{fan2026refdecoderenhancingvisualgeneration,
      title={RefDecoder: Enhancing Visual Generation with Conditional Video Decoding}, 
      author={Xiang Fan and Yuheng Wang and Bohan Fang and Zhongzheng Ren and Ranjay Krishna},
      year={2026},
      eprint={2605.15196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15196}, 
}