Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and a lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method pairs a low-resolution drafter with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism that corrects draft errors efficiently by resampling only the spatial neighborhood of each rejected token, rather than resampling every token that follows the first rejection in raster-scan order. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups (up to 5x), outperforming both speculative decoding and parallel decoding baselines in terms of acceleration while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID and HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state of the art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
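
To make the difference between local and raster-scan resampling concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`verify_draft`, `expand_neighborhood`, `raster_scan_mask`) are hypothetical, the acceptance test is the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q)) applied independently per position, and the neighborhood is assumed to be a square window of a given radius. The abstract does not specify these details, and probability pooling is not modeled here.

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Return a boolean (H, W) mask that is True where a drafted token is rejected.

    Assumption: the standard speculative-sampling test, accept with probability
    min(1, p/q), where q is the drafter's and p the target's probability of the
    drafted token. draft_tokens: (H, W) ints; q_probs, p_probs: (H, W, V).
    """
    h, w = draft_tokens.shape
    rows, cols = np.indices((h, w))
    q = q_probs[rows, cols, draft_tokens]
    p = p_probs[rows, cols, draft_tokens]
    accept_prob = np.minimum(1.0, p / np.maximum(q, 1e-12))
    return rng.random((h, w)) >= accept_prob

def expand_neighborhood(rejected, radius):
    """Dilate the rejection mask with a (2*radius + 1)-wide square window.

    Assumption: a square neighborhood; the actual neighborhood shape may differ.
    """
    h, w = rejected.shape
    padded = np.pad(rejected, radius, mode="constant", constant_values=False)
    out = np.zeros_like(rejected)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def raster_scan_mask(rejected):
    """Baseline behavior: resample everything from the first rejection onward."""
    flat = rejected.ravel()
    out = np.zeros_like(flat)
    if flat.any():
        out[int(np.argmax(flat)):] = True
    return out.reshape(rejected.shape)

# Toy example: a single rejected token at row 3, column 4 of an 8x8 token grid.
rejected = np.zeros((8, 8), dtype=bool)
rejected[3, 4] = True
local_count = int(expand_neighborhood(rejected, radius=1).sum())  # 9 tokens
raster_count = int(raster_scan_mask(rejected).sum())              # 36 tokens
```

In this toy case the local mask covers 9 of the 64 positions, whereas the raster-scan mask covers 36, which is the intuition behind resampling only spatial neighborhoods.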
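
Values in parentheses are differences from the corresponding unaccelerated baseline (the 1.00× row of each group). Lower is better for FID; higher is better for all other columns.

| Method | Speedup ↑ | GenEval ↑ | DPG-Bench ↑ | FID ↓ | HPSv2 ↑ |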
|---|---|---|---|---|---|
| Tar-1.5B @ 512 | 1.00× | 77.7 | 82.8 | 33.0 | 28.6 |
| + ZipAR-16 | 1.88× | 76.6 (-1.1) | 82.8 (-0.0) | 33.0 (-0.0) | 28.5 (-0.1) |
| + EAGLE-2 | 0.72× | 77.7 (-0.0) | 82.8 (-0.0) | 33.0 (-0.0) | 28.6 (-0.0) |
| + LANTERN | 1.08× | 75.9 (-1.8) | 82.1 (-0.9) | 32.7 (-0.3) | 27.7 (-0.8) |
| + MuLo-SD (2×) | 1.94× | 76.4 (-1.3) | 82.6 (-0.2) | 33.4 (+0.4) | 28.3 (-0.3) |
| Tar-7B @ 512 | 1.00× | 85.1 | 81.3 | 38.6 | 29.8 |
| + ZipAR-16 | 1.88× | 85.3 (+0.2) | 81.0 (-0.3) | 38.7 (+0.1) | 29.8 (-0.0) |
| + EAGLE-2 | 0.76× | 85.1 (-0.0) | 81.3 (-0.0) | 38.6 (-0.0) | 29.8 (-0.0) |
| + LANTERN | 1.20× | 84.9 (-0.2) | 80.5 (-0.8) | 36.9 (-1.8) | 28.7 (-0.9) |
| + MuLo-SD (2×) | 2.03× | 85.1 (-0.0) | 80.8 (-0.5) | 38.2 (-0.4) | 29.5 (-0.3) |
| Tar-1.5B @ 1024 | 1.00× | 77.1 | 82.3 | 32.4 | 29.5 |
| + ZipAR-16 | 3.65× | 76.6 (-0.5) | 82.5 (+0.2) | 32.4 (-0.0) | 29.6 (+0.1) |
| + EAGLE-2 | 0.78× | 77.1 (-0.0) | 82.3 (-0.0) | 32.4 (-0.0) | 29.5 (-0.0) |
| + LANTERN | 1.42× | 75.4 (-1.7) | 82.3 (-0.0) | 31.1 (-1.3) | 28.7 (-1.0) |
| + MuLo-SD (2×) | 3.90× | 76.8 (-0.3) | 82.2 (-0.1) | 31.3 (+0.4) | 28.7 (-0.8) |
| Tar-7B @ 1024 | 1.00× | 85.2 | 80.4 | 37.9 | 30.5 |
| + ZipAR-16 | 3.65× | 85.2 (-0.0) | 80.3 (-0.1) | 37.9 (-0.0) | 30.5 (-0.0) |
| + EAGLE-2 | 0.83× | 85.2 (-0.0) | 80.4 (-0.0) | 37.9 (-0.0) | 30.5 (-0.0) |
| + LANTERN | 1.45× | 82.9 (-2.3) | 80.5 (+0.1) | 34.6 (-3.3) | 29.4 (-0.9) |
| + MuLo-SD (4×) | 5.33× | 85.4 (+0.2) | 80.8 (+0.4) | 34.8 (-3.1) | 29.5 (-0.8) |

@misc{peruzzo2026multiscalelocalspeculativedecoding,
  title={Multi-Scale Local Speculative Decoding for Image Generation},
  author={Elia Peruzzo and Guillaume Sautière and Amirhossein Habibian},
  year={2026},
  eprint={2601.05149},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.05149},
}