PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

Heyuan Gao1,2,* Bangxun Tang1,3,* Yiren Song1,4,*,‡ Guian Fang1,4 Zijian He1 Jie Yang1 Mike Zheng Shou4,†

1Utopai Studios  2Nanyang Technological University  3University of California, Irvine  4Show Lab, National University of Singapore

*Equal contribution  ‡Project leader  †Corresponding author

Given a foreground video and background reference images, our method synthesizes a complete video with camera-aware background motion, foreground consistency, and scene-consistent relighting at 1080p resolution.

Abstract

We present PAI-Studio, a method for a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting, and foreground identity preservation, which often results in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.

Method

Overview of the PAI-Studio architecture. The multi-condition inputs (multiple background reference images, the illumination-perturbed foreground video, a structured JSON prompt, and the video being denoised) are encoded into tokens and concatenated. A multi-modal attention model then applies bidirectional attention over the concatenated sequence to capture their global correlations. In addition, Temporal Positional Encoding (PE) Cloning is introduced to precisely control the temporal placement of the background reference images in the generated video.
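
As a rough illustration of the idea behind Temporal PE Cloning, the sketch below assigns each background reference image the temporal position index of the video frame it is meant to anchor, so that reference tokens and the matching frame tokens share the same temporal position. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `temporal_position_ids`, the token ordering (references, then foreground video, then the video being denoised), the choice to give the foreground video the same temporal positions as the denoised video, and the fixed `tokens_per_frame` are all hypothetical, and prompt tokens are omitted.

```python
from typing import List


def temporal_position_ids(
    num_frames: int,
    tokens_per_frame: int,
    ref_anchor_frames: List[int],
) -> List[int]:
    """Return one temporal position index per token in the concatenated sequence.

    Hypothetical layout: [reference images | foreground video | denoised video].
    Each reference image "clones" the temporal index of the frame it anchors.
    """
    for anchor in ref_anchor_frames:
        if not 0 <= anchor < num_frames:
            raise ValueError(f"anchor frame {anchor} is outside [0, {num_frames})")

    # Every token of video frame t carries temporal index t.
    video_ids = [t for t in range(num_frames) for _ in range(tokens_per_frame)]

    # Every token of a reference image carries the index of its anchor frame.
    ref_ids = [a for a in ref_anchor_frames for _ in range(tokens_per_frame)]

    # Assumption: foreground and denoised video tokens are aligned frame by frame.
    return ref_ids + video_ids + video_ids


if __name__ == "__main__":
    # Three frames, two tokens per frame, references anchored at first and last frame.
    print(temporal_position_ids(3, 2, [0, 2]))
```

Under these assumptions, a reference anchored at frame 2 receives the same temporal index as the tokens of frame 2, which is one plausible way for bidirectional attention to associate that image with a specific point in time.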

Dataset: CineStudio

Overview of the CineStudio data construction pipeline.

Generation Results

Edge Harmonization

Superior edge harmonization. Compared to baselines that suffer from severe green spill and boundary artifacts (highlighted by red boxes), our model generates clean, natural edges without residual green contamination.

Robustness to Imperfect Segmentation

Robustness to imperfect foreground segmentation. We highlight four challenging cases where the input green-screen videos (top rows) contain mask defects, as indicated by the red bounding boxes. These include severely corrupted body parts, arbitrary holes, and artificial occlusions. Without any explicit inpainting prompts, our model (bottom rows) robustly reconstructs the missing foreground structures while seamlessly compositing them into the generated backgrounds with global illumination and structural harmony.

Implicit Scene-Adaptive Relighting

Implicit scene-adaptive relighting. Our model automatically harmonizes the foreground illumination with the newly synthesized backgrounds without requiring any explicit relighting prompts or maps.

Multi-Frame Background Control

Effect of multi-frame background control. We compare inference results conditioned on 1, 2, and 3 background reference images. The 3-reference setting anchors the background at the beginning, middle, and end of the clip, yielding the most temporally coherent results, which closely match the ground truth (GT). With fewer reference frames, the unconditioned time steps lack background information, causing structural deviations and abrupt disappearance or morphing of background objects.
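
To make the anchoring concrete, the following sketch computes which frame indices the reference images would be tied to for a given clip length. It is an illustration only: the helper `anchor_frame_indices` is hypothetical, and the even-spacing rule is an assumption. The page states only that three references anchor the beginning, middle, and end; placing a single reference at the first frame and two references at the first and last frames is a guess made for this example.

```python
from typing import List


def anchor_frame_indices(num_frames: int, num_refs: int) -> List[int]:
    """Spread num_refs anchor positions evenly over frames 0 .. num_frames - 1.

    Assumption: one reference anchors the first frame; two or more are spaced
    evenly from the first frame to the last (integer division rounds down).
    """
    if num_frames < 1 or num_refs < 1:
        raise ValueError("num_frames and num_refs must be positive")
    if num_refs == 1:
        return [0]
    last = num_frames - 1
    return [(i * last) // (num_refs - 1) for i in range(num_refs)]


if __name__ == "__main__":
    # The 81-frame clip length is an arbitrary example value.
    for k in (1, 2, 3):
        print(k, anchor_frame_indices(81, k))
```

With three references on an 81-frame clip, the anchors fall on frames 0, 40, and 80, so no frame is more than 20 frames from a conditioned position. With a single reference, the last frame is 80 frames from the only anchor, which matches the intuition that background structure drifts where no reference is available.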