MuLo-SD

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to 5x -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.

Method

Sequential sampling from the drafter: the process begins by sequentially sampling draft tokens from the low-resolution model.
Upsampling: these tokens are then upsampled with the trained module.
Verification: the target model verifies the upsampled sequence in parallel.
Acceptance rule: an acceptance threshold is applied to keep or reject each token.
Local neighborhood expansion: the rejection mask is expanded so that the neighbors of rejected tokens are rejected as well.
Sequential sampling from the target: rejected tokens are resampled sequentially using the target autoregressive model.
Downsampling: verified tokens are downsampled and appended to the accepted prefix, which serves as the prefix for the next round of low-resolution draft sampling.
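
The acceptance and neighborhood-expansion steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the plain probability threshold, the square (8-connected) neighborhood, the `radius` parameter, and all function names are assumptions made for illustration, since the page only says that tokens are thresholded and that neighbors of rejected tokens are rejected too.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def accept_mask(p_target: List[List[float]], threshold: float) -> Mask:
    """Mark a draft token as accepted when its target probability >= threshold.

    ASSUMPTION: a plain threshold stands in for the acceptance rule, whose
    exact form is not given on this page.
    """
    return [[p >= threshold for p in row] for row in p_target]


def expand_rejections(accepted: Mask, radius: int = 1) -> Mask:
    """Also reject every token within `radius` rows/columns of a rejected token.

    ASSUMPTION: a square neighborhood; the real neighborhood shape may differ.
    """
    h, w = len(accepted), len(accepted[0])
    out = [row[:] for row in accepted]  # copy so the input mask is not mutated
    for i in range(h):
        for j in range(w):
            if accepted[i][j]:
                continue
            # Spread the rejection at (i, j) to its in-bounds neighbors.
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        out[ni][nj] = False
    return out


def resample_order(accepted: Mask) -> List[Tuple[int, int]]:
    """Raster-scan list of positions that the target model would resample."""
    return [
        (i, j)
        for i, row in enumerate(accepted)
        for j, ok in enumerate(row)
        if not ok
    ]


# Toy 4x4 grid of target probabilities for the upsampled draft tokens.
probs = [
    [0.9, 0.8, 0.7, 0.9],
    [0.9, 0.1, 0.8, 0.9],
    [0.9, 0.8, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9],
]
mask = expand_rejections(accept_mask(probs, threshold=0.5), radius=1)
to_resample = resample_order(mask)
print(len(to_resample))  # 9: the single rejected token plus its 8 neighbors
```

In this toy grid only the token at row 1, column 1 falls below the threshold, so only the 3×3 block around it is resampled and the other 7 tokens are kept. Resampling everything after the first rejection in raster-scan order would instead redo 11 of the 16 tokens.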

Quantitative Results

Method	Speedup	GenEval	DPG-Bench	FID	HPS-v2
Tar-1.5B @ 512	1.00×	77.7	82.8	33.0	28.6
+ ZipAR-16	1.88×	76.6 (-1.1)	82.8 (-0.0)	33.0 (-0.0)	28.5 (-0.1)
+ EAGLE-2	0.72×	77.7 (-0.0)	82.8 (-0.0)	33.0 (-0.0)	28.6 (-0.0)
+ LANTERN	1.08×	75.9 (-1.8)	82.1 (-0.9)	32.7 (-0.3)	27.7 (-0.8)
+ MuLo-SD (2×)	1.94×	76.4 (-1.3)	82.6 (-0.2)	33.4 (+0.4)	28.3 (-0.3)
Tar-7B @ 512	1.00×	85.1	81.3	38.6	29.8
+ ZipAR-16	1.88×	85.3 (+0.2)	81.0 (-0.3)	38.7 (+0.1)	29.8 (-0.0)
+ EAGLE-2	0.76×	85.1 (-0.0)	81.3 (-0.0)	38.6 (-0.0)	29.8 (-0.0)
+ LANTERN	1.20×	84.9 (-0.2)	80.5 (-0.8)	36.9 (-1.8)	28.7 (-0.9)
+ MuLo-SD (2×)	2.03×	85.1 (-0.0)	80.8 (-0.5)	38.2 (-0.4)	29.5 (-0.3)
Tar-1.5B @ 1024	1.00×	77.1	82.3	32.4	29.5
+ ZipAR-16	3.65×	76.6 (-0.5)	82.5 (+0.2)	32.4 (-0.0)	29.6 (+0.1)
+ EAGLE-2	0.78×	77.1 (-0.0)	82.3 (-0.0)	32.4 (-0.0)	29.5 (-0.0)
+ LANTERN	1.42×	75.4 (-1.7)	82.3 (-0.0)	31.1 (-1.3)	28.7 (-1.0)
+ MuLo-SD (2×)	3.90×	76.8 (-0.3)	82.2 (-0.1)	31.3 (+0.4)	28.7 (-0.8)
Tar-7B @ 1024	1.00×	85.2	80.4	37.9	30.5
+ ZipAR-16	3.65×	85.2 (-0.0)	80.3 (-0.1)	37.9 (-0.0)	30.5 (-0.0)
+ EAGLE-2	0.83×	85.2 (-0.0)	80.4 (-0.0)	37.9 (-0.0)	30.5 (-0.0)
+ LANTERN	1.45×	82.9 (-2.3)	80.5 (+0.1)	34.6 (-3.3)	29.4 (-0.9)
+ MuLo-SD (4×)	5.33×	85.4 (+0.2)	80.8 (+0.4)	34.8 (-3.1)	29.5 (-0.8)

Qualitative Results

[Figure: side-by-side image samples, baseline vs. + MuLo-SD. One pair for Tar-1.5B @ 1024 and two pairs for Tar-1.5B @ 512.]

BibTeX

@misc{peruzzo2026multiscalelocalspeculativedecoding,
  title={Multi-Scale Local Speculative Decoding for Image Generation},
  author={Elia Peruzzo and Guillaume Sautière and Amirhossein Habibian},
  year={2026},
  eprint={2601.05149},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.05149},
}

Multi-Scale Local Speculative Decoding for Image Generation

Qualcomm AI Research

Accepted @ CVPR 2026
