Qualcomm AI Research
Abstract
Despite recent advances in personalized image generation, existing models consistently fail to produce reliable multi-human scenes, often merging or losing facial identity. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect predicts structured layouts, specifying where each person should appear. The Artist then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities.
We develop two Architect variants, both seamlessly integrated with our diffusion-based Artist model, which is optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
Framework
Ar2Can decomposes multi-human generation into two stages. The Architect first generates a structured spatial plan — bounding boxes (and optionally pose) specifying where and how each person should appear. The Artist then renders the photorealistic image conditioned on this spatial plan, reference identities, and the text prompt. This explicit decomposition eliminates the identity merging and swapping failures that plague single-stage methods.
Fig 1. Ar2Can Framework Overview. Our two-stage approach decomposes multi-human generation into spatial planning (Architect) and identity-preserving rendering (Artist). The Architect first visualizes where each person should be placed; the Artist then renders the complete scene with realistic pose and lighting.
Fig 2. Ar2Can generates highly photorealistic multi-human scenes with 1–5 people while preserving individual identities. Our two-stage architecture produces natural poses, realistic lighting, and proper spatial arrangements without identity merging or blending artifacts.
Method
The Architect generates facial bounding boxes and/or pose from textual descriptions, focusing on accurate instance counts and spatially plausible placements. We design two complementary variants with different efficiency trade-offs:
The first variant is built on Qwen-2.5 (0.5B), extended with special layout tokens (<SoL>, <EoL>, <C>) and dual prediction heads for continuous coordinate regression alongside standard token generation.
The second variant fine-tunes Flux-Schnell via GRPO with count-accuracy and HPSv3 rewards, leveraging the model's 2D spatial priors to synthesize layout sketches from which bounding boxes and human pose coordinates are extracted.
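The token-based variant's output can be understood as a layout token stream plus regression-head values. The sketch below is a hypothetical decoding routine (the function name, the one-value-per-<C> convention, and the four-coordinates-per-box grouping are our illustrative assumptions, not the paper's code):

```python
def decode_layout(tokens, coords):
    """Decode a layout token sequence into face bounding boxes.

    tokens: generated token strings delimited by <SoL>/<EoL>; each <C>
            marks a coordinate slot filled by the regression head.
    coords: continuous regression-head outputs, consumed in order,
            one normalized value per <C> token (assumed convention).
    """
    assert tokens[0] == "<SoL>" and tokens[-1] == "<EoL>"
    flat = [c for c, t in zip(coords, (t for t in tokens if t == "<C>"))]
    # Group every four coordinates into one (x1, y1, x2, y2) face box.
    return [tuple(flat[i:i + 4]) for i in range(0, len(flat), 4)]

# Two people -> eight <C> tokens -> two boxes.
tokens = ["<SoL>"] + ["<C>"] * 8 + ["<EoL>"]
coords = [0.10, 0.20, 0.30, 0.40, 0.55, 0.25, 0.75, 0.45]
boxes = decode_layout(tokens, coords)
```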
The Artist, a Flux-Kontext model fine-tuned via GRPO, renders photorealistic multi-human images conditioned on the Architect's layout, reference identity images, and the text prompt. Rather than supervised fine-tuning (which requires expensive paired annotations), we use RL to directly optimize non-differentiable objectives via a four-component compositional reward.
A key innovation is the spatially-grounded face matching reward: Hungarian algorithm matching first establishes spatial correspondence between Architect-predicted centroids and detected face locations, then ArcFace identity similarity is computed only between spatially-matched pairs. This prevents copy-paste artifacts while jointly optimizing location accuracy and identity preservation. Token dropping (2× speedup) and shared RoPE encodings for overlapping regions further improve efficiency and occlusion handling.
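The two-step structure of this reward can be sketched in a few lines. This is an illustrative implementation, not the paper's code: centroids are (x, y) tuples, ArcFace embeddings are assumed pre-extracted and L2-normalized, and the Hungarian step is brute-forced over permutations, which is exact for the ≤5-person scenes considered here:

```python
from itertools import permutations
from math import dist

def face_match_reward(pred_centroids, det_centroids, ref_embs, det_embs):
    """Spatially-grounded face matching reward (illustrative sketch)."""
    n = len(pred_centroids)
    # Step 1: minimum-cost spatial assignment of Architect-predicted
    # centroids to detected face locations (Hungarian matching,
    # brute-forced here since n <= 5 keeps this cheap and exact).
    best = min(permutations(range(n)),
               key=lambda p: sum(dist(pred_centroids[i], det_centroids[p[i]])
                                 for i in range(n)))
    # Step 2: identity similarity only between spatially-matched pairs;
    # the dot product equals cosine similarity for normalized embeddings.
    sims = [sum(a * b for a, b in zip(ref_embs[i], det_embs[best[i]]))
            for i in range(n)]
    return sum(sims) / n
```

Because similarity is only scored between spatially-matched pairs, a face rendered at the wrong location cannot earn identity credit, which is what discourages copy-paste solutions.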
Count accuracy: binary reward; the generated face count must exactly match the target count in the prompt.
HPSv3: normalized human preference score ensuring prompt alignment and preventing visual artifacts.
Face matching: Hungarian centroid matching plus ArcFace similarity, jointly optimizing spatial location and identity.
Pose correction: frontality score from facial landmarks suppresses copy-paste artifacts and encourages natural poses.
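A minimal sketch of how the four terms might combine into a single scalar for RL training. The equal weighting and the hard gating on the binary count reward are our illustrative assumptions, not values from the paper:

```python
def compositional_reward(count_ok, hps, face_match, pose,
                         weights=(1.0, 1.0, 1.0)):
    """Combine the four reward terms (illustrative assumption: equal
    weights, with the binary count reward acting as a hard gate)."""
    # A wrong person count zeroes the whole reward.
    if not count_ok:
        return 0.0
    w_hps, w_face, w_pose = weights
    return (w_hps * hps + w_face * face_match + w_pose * pose) / sum(weights)
```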
Fig 3. Artist training pipeline with GRPO. Given the input canvas and text prompt, the model samples a group of images and optimizes over four compositional rewards: count accuracy, prompt alignment / aesthetic quality (HPSv3), spatially-grounded face matching, and pose correction. Hungarian centroid matching (right) establishes flexible spatial correspondence before computing ArcFace identity similarity.
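The group-sampling step above is what makes GRPO "group relative": each prompt yields a group of sampled images, each is scored by the compositional reward, and advantages are the reward z-scores within the group. The sketch below shows this standard GRPO advantage computation (the group size and epsilon are illustrative, not paper hyperparameters):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sample's reward by the
    mean and standard deviation of its own sampled group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of four images sampled for one prompt, scored by the
# compositional reward; above-average samples get positive advantage.
adv = grpo_advantages([0.2, 0.8, 0.5, 0.5])
```

Because advantages are relative within the group, no learned value model is needed; samples that beat their siblings are reinforced and the rest are suppressed.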
Results
Ar2Can substantially outperforms both proprietary systems (GPT-Image-1, Nanobanana) and open-source methods on identity preservation and count accuracy, while maintaining competitive perceptual quality and action alignment.
Fig 4. Qualitative comparison with state-of-the-art methods. Left: reference images and text prompts from MultiHuman-Testbench. Right: outputs from all methods. Scorecards indicate ID preservation and prompt alignment. Existing methods frequently fail at one or both objectives, while Ar2Can consistently achieves both across diverse multi-human scenes.
Quantitative Analysis
Ar2Can maintains consistent identity preservation and prompt alignment as person count scales from 1 to 5, with a strongly favorable latency-quality trade-off.
Fig 5. Quantitative analysis and scalability. (a) Latency–quality trade-off on A100 GPU. (b) Multi-ID similarity vs. person count. (c) HPS vs. person count. Ar2Can maintains consistent identity preservation and prompt alignment across 1–5 people, with token sharing/dropping providing a further 2× speedup.
Citation
If you find Ar2Can useful in your research, please cite our paper:
@article{borse2025ar2can,
title={Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation},
author={Borse, Shubhankar and Pham, Phuc and Farhadzadeh, Farzad and Choi, Seokeon
and Nguyen, Phong Ha and Tran, Anh Tuan and Yun, Sungrack
and Hayat, Munawar and Porikli, Fatih},
journal={arXiv preprint arXiv:2511.22690},
year={2025}
}