Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and a lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method pairs a low-resolution drafter with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism that corrects draft errors by focusing on the spatial neighborhood of each rejected token, rather than resampling in raster-scan order after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to 1.7×, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state of the art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
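To make the contrast between local resampling and raster-scan resampling concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard speculative-sampling acceptance test (accept a draft token when `u < min(1, p/q)`), which the abstract does not specify, and a square neighborhood of a hypothetical `radius`. All function names are illustrative, and the drafter, up-samplers, probability pooling, and the actual resampling by the target model are left out.

```python
import random

def rejected_positions(p_target, q_draft, rng):
    """Return (row, col) positions that fail the assumed test u < min(1, p/q)."""
    rejected = []
    for r, (p_row, q_row) in enumerate(zip(p_target, q_draft)):
        for c, (p, q) in enumerate(zip(p_row, q_row)):
            if rng.random() >= min(1.0, p / q):
                rejected.append((r, c))
    return rejected

def expand_neighborhood(rejected, height, width, radius):
    """Grow each rejection into a (2*radius+1)^2 window, clipped to the grid."""
    mask = set()
    for r, c in rejected:
        for rr in range(max(0, r - radius), min(height, r + radius + 1)):
            for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                mask.add((rr, cc))
    return mask

def raster_scan_discard(rejected, height, width):
    """Baseline: discard every position from the first rejection onward."""
    if not rejected:
        return set()
    first = min(r * width + c for r, c in rejected)
    return {(i // width, i % width) for i in range(first, height * width)}

# Toy 8x8 token grid: the target agrees with the drafter everywhere
# except at (1, 1), where it assigns the draft token zero probability.
H, W = 8, 8
q_draft = [[0.5] * W for _ in range(H)]
p_target = [[0.5] * W for _ in range(H)]
p_target[1][1] = 0.0

rng = random.Random(0)
rejected = rejected_positions(p_target, q_draft, rng)
local_mask = expand_neighborhood(rejected, H, W, radius=1)
raster_mask = raster_scan_discard(rejected, H, W)

print(len(local_mask), len(raster_mask))  # 9 55
```

In this toy case a single rejection near the top of the grid forces the raster-scan baseline to regenerate 55 of 64 tokens, while the local scheme touches only the 9 tokens around the error. The result is independent of the seed because the acceptance ratios are exactly 0 or 1.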
| Method | Speedup ↑ | GenEval ↑ | DPG-Bench ↑ | FID ↓ | HPSv2 ↑ |
|---|---|---|---|---|---|
| Chameleon-7B | – | 39.0 | – | – | – |
| LWM-7B | – | 47.0 | – | – | – |
| Lumina-mGPT-7B | – | 56.0 | 79.7 | – | – |
| ILLUME-7B | – | 61.0 | – | – | – |
| Transfusion-7B | – | 63.0 | – | – | – |
| Janus-Pro-7B | – | 80.0 | 84.2 | – | – |
| Show-O-1.3B | – | 53.0 | – | – | – |
| Janus-1.3B | – | 61.0 | 79.8 | – | – |
| D-DiT-2B | – | 65.0 | – | – | – |
| Emu3 | – | 66.0 | 80.6 | – | – |
| Janus-Pro-1B | – | 73.0 | 82.6 | – | – |
| Harmon-1.5B | – | 76.0 | 82.6 | – | – |
| Tar-1.5B @ 512 | 1.00× | 77.7 | 82.8 | 33.0 | 28.6 |
|   + ZipAR-16 | 1.88× | 76.6 (-1.1) | 82.8 (-0.0) | 33.0 (-0.0) | 28.5 (-0.1) |
|   + EAGLE-2 | 0.72× | 77.7 (-0.0) | 82.8 (-0.0) | 33.0 (-0.0) | 28.6 (-0.0) |
|   + LANTERN | 1.08× | 75.9 (-1.8) | 82.1 (-0.9) | 32.7 (-0.3) | 27.7 (-0.8) |
|   + MuLo-SD (2×) | 1.22× | 76.0 (-1.7) | 82.4 (-0.4) | 33.7 (+0.7) | 27.8 (-0.7) |
| Tar-1.5B @ 1024 | 1.00× | 77.1 | 82.3 | 32.4 | 29.5 |
|   + ZipAR-16 | 3.65× | 76.6 (-0.5) | 82.5 (+0.2) | 32.4 (-0.0) | 29.6 (+0.1) |
|   + EAGLE-2 | 0.78× | 77.1 (-0.0) | 82.3 (-0.0) | 32.4 (-0.0) | 29.5 (-0.0) |
|   + LANTERN | 1.42× | 75.4 (-1.7) | 82.3 (-0.0) | 31.1 (-1.3) | 28.5 (-1.0) |
|   + MuLo-SD (4×) | 1.68× | 76.3 (-0.8) | 82.0 (-0.3) | 32.8 (+0.4) | 28.4 (-1.1) |

@misc{peruzzo2026multiscalelocalspeculativedecoding,
  title={Multi-Scale Local Speculative Decoding for Image Generation},
  author={Elia Peruzzo and Guillaume Sautière and Amirhossein Habibian},
  year={2026},
  eprint={2601.05149},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.05149},
}