Multi-Scale Local Speculative Decoding for Image Generation

Qualcomm AI Research

Accepted @ CVPR 2026

[Figure: generated samples, Tar-1.5B @ 1024 vs. Tar-1.5B @ 1024 + MuLo-SD]

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to 5x -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.

Method

  1. Sequential sampling from the drafter: draft tokens are sampled sequentially from the low-resolution model.
  2. Upsampling: the draft tokens are upsampled with the trained up-sampling module.
  3. Verification: the target model verifies the upsampled sequence in parallel.
  4. Acceptance rule: an acceptance threshold is applied to keep or reject each token.
  5. Local neighborhood expansion: the rejection mask is expanded so that the neighbors of rejected tokens are also rejected.
  6. Sequential sampling from the target: rejected tokens are resampled sequentially with the target autoregressive model.
  7. Downsampling: the verified tokens are downsampled and appended to the accepted prefix, which serves as the prefix for the next low-resolution draft.
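
Steps 4 and 5 can be illustrated with a minimal sketch on a toy 2D token grid. This is not the authors' implementation. The acceptance rule used here (keep a token when the target model's probability for it is at least `tau`), the square neighborhood of radius 1, and all function names are assumptions made for illustration. For contrast, the sketch also counts how many tokens would be discarded under the usual speculative decoding convention, where everything after the first rejection in raster-scan order is thrown away.

```python
from typing import List, Tuple

Grid = List[List[bool]]


def acceptance_mask(target_probs: List[List[float]], tau: float) -> Grid:
    """True where a drafted token is kept.

    Assumed rule (hypothetical): keep a token when the target model's
    probability for it is at least tau.
    """
    return [[p >= tau for p in row] for row in target_probs]


def expand_rejections(accepted: Grid, radius: int = 1) -> Grid:
    """Also reject every token within `radius` rows/columns of a rejected token."""
    h, w = len(accepted), len(accepted[0])
    out = [row[:] for row in accepted]
    for i in range(h):
        for j in range(w):
            if accepted[i][j]:
                continue
            # Mark the square neighborhood around the rejected token.
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        out[ni][nj] = False
    return out


def rejected_positions(accepted: Grid) -> List[Tuple[int, int]]:
    """Rejected (row, col) positions in raster-scan order."""
    return [(i, j) for i, row in enumerate(accepted)
            for j, ok in enumerate(row) if not ok]


def raster_scan_discard_count(accepted: Grid) -> int:
    """Tokens discarded if everything from the first rejection onward is dropped."""
    flat = [ok for row in accepted for ok in row]
    for idx, ok in enumerate(flat):
        if not ok:
            return len(flat) - idx
    return 0


# Toy 6x6 grid of target probabilities with a single low-confidence token.
probs = [[0.9] * 6 for _ in range(6)]
probs[1][2] = 0.1

accepted = acceptance_mask(probs, tau=0.5)
local_mask = expand_rejections(accepted, radius=1)

local_count = len(rejected_positions(local_mask))
raster_count = raster_scan_discard_count(accepted)
print(local_count, raster_count)
```

In this toy case a single rejected token leads to 9 tokens being resampled locally, versus 28 tokens discarded when resampling restarts at the first rejection in raster-scan order.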

Experiments

Quantitative Results

Method            Speedup  GenEval      DPG-Bench    FID          HPS-v2
Tar-1.5B @ 512    1.00×    77.7         82.8         33.0         28.6
  + ZipAR-16      1.88×    76.6 (-1.1)  82.8 (-0.0)  33.0 (-0.0)  28.5 (-0.1)
  + EAGLE-2       0.72×    77.7 (-0.0)  82.8 (-0.0)  33.0 (-0.0)  28.6 (-0.0)
  + LANTERN       1.08×    75.9 (-1.8)  82.1 (-0.9)  32.7 (-0.3)  27.7 (-0.8)
  + MuLo-SD (2×)  1.94×    76.4 (-1.3)  82.6 (-0.2)  33.4 (+0.4)  28.3 (-0.3)
Tar-7B @ 512      1.00×    85.1         81.3         38.6         29.8
  + ZipAR-16      1.88×    85.3 (+0.2)  81.0 (-0.3)  38.7 (+0.1)  29.8 (-0.0)
  + EAGLE-2       0.76×    85.1 (-0.0)  81.3 (-0.0)  38.6 (-0.0)  29.8 (-0.0)
  + LANTERN       1.20×    84.9 (-0.2)  80.5 (-0.8)  36.9 (-1.8)  28.7 (-0.9)
  + MuLo-SD (2×)  2.03×    85.1 (-0.0)  80.8 (-0.5)  38.2 (-0.4)  29.5 (-0.3)
Tar-1.5B @ 1024   1.00×    77.1         82.3         32.4         29.5
  + ZipAR-16      3.65×    76.6 (-0.5)  82.5 (+0.2)  32.4 (-0.0)  29.6 (+0.1)
  + EAGLE-2       0.78×    77.1 (-0.0)  82.3 (-0.0)  32.4 (-0.0)  29.5 (-0.0)
  + LANTERN       1.42×    75.4 (-1.7)  82.3 (-0.0)  31.1 (-1.3)  28.7 (-1.0)
  + MuLo-SD (2×)  3.90×    76.8 (-0.3)  82.2 (-0.1)  31.3 (+0.4)  28.7 (-0.8)
Tar-7B @ 1024     1.00×    85.2         80.4         37.9         30.5
  + ZipAR-16      3.65×    85.2 (-0.0)  80.3 (-0.1)  37.9 (-0.0)  30.5 (-0.0)
  + EAGLE-2       0.83×    85.2 (-0.0)  80.4 (-0.0)  37.9 (-0.0)  30.5 (-0.0)
  + LANTERN       1.45×    82.9 (-2.3)  80.5 (+0.1)  34.6 (-3.3)  29.4 (-0.9)
  + MuLo-SD (4×)  5.33×    85.4 (+0.2)  80.8 (+0.4)  34.8 (-3.1)  29.5 (-0.8)

Qualitative Results

[Figure: generated samples, Tar-1.5B @ 1024 vs. Tar-1.5B @ 1024 + MuLo-SD]

[Figure: generated samples, Tar-1.5B @ 512 vs. Tar-1.5B @ 512 + MuLo-SD]

BibTeX

@misc{peruzzo2026multiscalelocalspeculativedecoding,
  title={Multi-Scale Local Speculative Decoding for Image Generation},
  author={Elia Peruzzo and Guillaume Sautière and Amirhossein Habibian},
  year={2026},
  eprint={2601.05149},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.05149},
}