SetDiff
Enhancing Novel-View Synthesis via Geometry Grounded Set Diffusion

Qualcomm AI Research

Teaser: interactive before/after comparisons of 3DGS renderings versus 3DGS + SetDiff (with reference views), shown on Para-Lane novel views (double-lane extrapolation), EUVS novel views (cross-camera enhancement), and DL3DV novel views (sparse novel-view synthesis).
Abstract

We present SetDiff, a geometry-grounded multi-view diffusion framework that enhances novel-view renderings produced by 3D Gaussian Splatting. Our method integrates explicit 3D priors (pixel-aligned coordinate maps and pose-aware Plücker ray embeddings) into a set-based diffusion model that jointly processes variable numbers of reference and target views. This formulation enables robust occlusion handling, reduces hallucinations under low-signal conditions, and improves photometric fidelity in visual content restoration. A unified set mixer performs global token-level attention across all input views, supporting scalable multi-camera enhancement while maintaining computational efficiency through latent-space supervision and selective decoding. Extensive experiments on EUVS, Para-Lane, nuScenes, and DL3DV demonstrate significant gains in perceptual fidelity, structural similarity, and robustness under severe extrapolation. SetDiff establishes a state-of-the-art diffusion-based solution for realistic and reliable novel-view synthesis in autonomous driving scenarios.
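The abstract describes a set mixer that applies global token-level attention jointly across all views in a variable-size set. The sketch below illustrates this idea only in general terms: tokens from every view are concatenated and attend to one another, then are split back per view. The function name, tied single-head projections, and numpy implementation are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def set_mixer_attention(view_tokens):
    """Joint self-attention over tokens pooled from a variable-size view set.

    view_tokens: list of (n_i, d) arrays, one per reference/target view.
    Tokens from all views are concatenated so every token attends to every
    other token, regardless of which view it came from.
    """
    x = np.concatenate(view_tokens, axis=0)        # (N, d), N = sum of n_i
    d = x.shape[-1]
    # Single-head attention with tied (identity) Q/K/V projections for
    # brevity; a real model would use learned projection weights per layer.
    scores = x @ x.T / np.sqrt(d)                  # (N, N) similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    y = attn @ x                                   # (N, d) mixed tokens
    # Split the mixed tokens back into per-view sets
    sizes = [t.shape[0] for t in view_tokens]
    return np.split(y, np.cumsum(sizes)[:-1], axis=0)
```

Because attention is over the concatenated token set, the same weights handle any number of input views, which matches the set-based formulation described above.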

Method Overview

SetDiff enhances a variable‑length set of novel‑view images rendered from 3D Gaussian Splatting. It employs a single‑step image‑diffusion backbone, where the denoising process is conditioned on multiple high‑quality reference camera images. The refinement is further guided by geometry‑aware 3D correspondences and relative camera poses, enabling consistent and view‑dependent detail enhancement across the generated viewpoints.

Overview: A set of rendered novel-view images is enhanced via an image diffusion model, conditioned on reference views, camera poses, and 3D geometric priors.
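The pose conditioning above relies on Plücker ray embeddings, which encode each pixel's viewing ray as a direction plus a moment vector. The following sketch shows one common construction of such a per-pixel map from camera intrinsics and a world-to-camera pose; the function name and conventions (pixel-center offsets, world-frame rays) are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker ray map of shape (H, W, 6) for one camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation/translation, so a
    world point X maps to R @ X + t. The camera center in world
    coordinates is o = -R.T @ t.
    """
    o = -R.T @ t                                       # camera origin (world)
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous
    # Back-project pixels: K^{-1} to camera frame, R^T to world frame
    # (row-vector form: p @ K^{-T} @ R).
    d = pix @ np.linalg.inv(K).T @ R
    d /= np.linalg.norm(d, axis=-1, keepdims=True)     # unit ray directions
    m = np.cross(o[None, None, :], d)                  # moments m = o x d
    return np.concatenate([d, m], axis=-1)             # (H, W, 6)
```

The (direction, moment) pair identifies the ray itself rather than the camera, so it gives the diffusion model a pixel-aligned, pose-aware conditioning signal that is comparable across views.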

Results

Novel-view metrics (Para-Lane): single-lane and double-lane extrapolation.

Novel-view metrics (EUVS): Setting 1 (camera translation), Setting 2 (camera rotation), Setting 3 (camera rotation + translation).

Novel-view metrics on temporally-extrapolated trajectories (nuScenes): averaged over 5 seconds of extrapolation.

Novel-view metrics on sparse views (DL3DV): three, six, and nine views.

BibTex

              
@article{zanjani2026SetDiff,
  title={Enhancing Novel View Synthesis via Geometry Grounded Set Diffusion},
  author={Farhad G. Zanjani and Hong Cai and Amirhossein Habibian},
  year={2026},
  eprint={2601.07540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.07540},
}