Qualcomm AI Research
We present SetDiff, a geometry-grounded multi-view diffusion framework that enhances novel-view renderings produced by 3D Gaussian Splatting. Our method integrates explicit 3D priors (pixel-aligned coordinate maps and pose-aware Plücker ray embeddings) into a set-based diffusion model that jointly processes variable numbers of reference and target views. This formulation enables robust occlusion handling, reduces hallucinations under low-signal conditions, and improves photometric fidelity when restoring visual content. A unified set mixer performs global token-level attention across all input views, supporting scalable multi-camera enhancement while remaining computationally efficient through latent-space supervision and selective decoding. Extensive experiments on EUVS, Para-Lane, nuScenes, and DL3DV demonstrate significant gains in perceptual fidelity, structural similarity, and robustness under severe extrapolation. SetDiff establishes a state-of-the-art diffusion-based solution for realistic and reliable novel-view synthesis in autonomous driving scenarios.
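To make the pose conditioning concrete, the sketch below shows one common way to compute per-pixel Plücker ray embeddings from camera intrinsics and a camera-to-world pose: each pixel's ray is encoded by its normalized direction d and moment m = o × d. The function name, shape conventions, and pinhole intrinsics are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def plucker_ray_embedding(K, c2w, height, width):
    """Per-pixel Plucker ray embeddings (d, o x d) for one camera.

    K    : (3, 3) pinhole camera intrinsics
    c2w  : (4, 4) camera-to-world extrinsics
    Returns an (H, W, 6) array holding the normalized ray direction
    and its moment. Illustrative sketch, not the paper's exact code.
    """
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)

    # Back-project to camera-frame directions, rotate to world frame.
    dirs_cam = pix @ np.linalg.inv(K).T                      # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T                    # (H, W, 3)
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Ray origin is the camera center; the moment m = o x d makes the
    # 6-vector a full Plucker coordinate of the ray.
    origin = c2w[:3, 3]                                      # (3,)
    moment = np.cross(origin, dirs_world)                    # broadcasts

    return np.concatenate([dirs_world, moment], axis=-1)     # (H, W, 6)
```

Because the embedding depends only on relative geometry, concatenating it channel-wise with each view's latent gives the diffusion model a pose-aware, pixel-aligned conditioning signal.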
SetDiff enhances a variable-length set of novel-view images rendered by 3D Gaussian Splatting. It employs a single-step image-diffusion backbone in which the denoising process is conditioned on multiple high-quality reference camera images. The refinement is further guided by geometry-aware 3D correspondences and relative camera poses, enabling consistent, view-dependent detail enhancement across the generated viewpoints.
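The set-mixer idea, joint token-level attention over all reference and target views at once, can be sketched as below. Tensor shapes, module names, and dimensions are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class SetMixerBlock(nn.Module):
    """Illustrative set-mixer block: self-attention over the concatenated
    tokens of a variable-size set of views (references + targets).
    Names and sizes are assumptions, not the paper's implementation.
    """
    def __init__(self, dim=320, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, V, N, C) = batch, views, tokens per view, channels.
        # Flatten the view axis so attention spans every view at once,
        # then restore the per-view layout for the rest of the network.
        B, V, N, C = tokens.shape
        x = tokens.reshape(B, V * N, C)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return (x + attn_out).reshape(B, V, N, C)
```

Because attention operates on the flattened token sequence, the same block handles any number of reference and target views without architectural changes, which is what allows the set to be variable-length.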
Average metrics under 5-second extrapolation.
@article{zanjani2026SetDiff,
  title={Enhancing Novel View Synthesis via Geometry Grounded Set Diffusion},
  author={Farhad G. Zanjani and Hong Cai and Amirhossein Habibian},
  year={2026},
  eprint={2601.07540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.07540},
}