Qualcomm AI Research
Motivation
Modern text-to-image models produce strikingly photorealistic imagery—but ask them to render a group of people and they consistently fail: faces are duplicated, individuals look eerily identical, and the requested headcount is wrong. We call this the identity crisis.
This severely limits real-world applications: synthetic data for group photo personalisation, consistent character storytelling, narrative media, educational content, and social simulation all require reliably different people in every scene.
Fig 1. State-of-the-art text-to-image models frequently generate near-identical faces when prompted for multiple people, even when overall image quality is high. DisCo eliminates these failures.
Our Approach
DisCo fine-tunes flow-matching text-to-image models (e.g. Flux-Dev, Krea-Dev) via Group-Relative Policy Optimisation (GRPO), guided by a compositional reward that simultaneously targets identity uniqueness, count accuracy, and perceptual quality—without requiring any additional annotations.
A key finding: optimising intra-image diversity alone causes global diversity collapse—duplicate identities simply migrate across samples rather than disappearing. DisCo addresses this with a novel group-wise counterfactual reward that penalises cross-sample identity repetition under GRPO. Explicit count and quality rewards prevent reward hacking (grid artefacts, face undercounting). A single-stage curriculum anneals from simple (2–4 person) to complex (2–7 person) scenes for stable convergence.
Intra-image diversity reward: penalises high cosine similarity between ArcFace embeddings of different faces within the same generated image.
Group-wise counterfactual reward: a "remove-one" reward that directly discourages identity recurrence across the GRPO sample group for the same prompt.
Count reward: a binary reward enforcing that the number of detected faces exactly matches the target person count in the prompt.
Quality reward: a normalised HPSv3 score that preserves perceptual quality, prevents grid artefacts, and reinforces fine-grained prompt following.
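The four components above can be sketched as a single compositional reward. This is a minimal illustration, not the released implementation: it assumes unit-normalised ArcFace-style face embeddings, and the function names, similarity threshold, and equal weights are our own choices.

```python
import numpy as np

def intra_image_diversity(embs: np.ndarray) -> float:
    """Penalise high cosine similarity between faces in one image.
    embs: (n_faces, d) unit-normalised face embeddings."""
    n = len(embs)
    if n < 2:
        return 1.0
    sims = embs @ embs.T                     # pairwise cosine similarities
    iu = np.triu_indices(n, k=1)             # upper triangle: distinct pairs
    # High reward when even the most similar pair is still dissimilar.
    return float(1.0 - sims[iu].max())

def counterfactual_group_reward(group_embs: list, i: int,
                                thresh: float = 0.5) -> float:
    """'Remove-one' reward: penalise sample i for identities that also
    appear in the other samples of the same GRPO group (threshold is an
    illustrative assumption)."""
    own = group_embs[i]
    others = np.concatenate([e for j, e in enumerate(group_embs) if j != i])
    sims = own @ others.T                    # (n_own, n_other) cosine sims
    repeats = sims.max(axis=1) > thresh      # faces recurring elsewhere
    return float(1.0 - repeats.mean())

def count_reward(n_detected: int, n_target: int) -> float:
    """Binary reward: exact face-count match with the prompt."""
    return 1.0 if n_detected == n_target else 0.0

def composite_reward(embs, group_embs, i, n_target, quality,
                     w=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the four components; equal weights are assumed.
    `quality` stands in for a normalised HPSv3 score in [0, 1]."""
    terms = (intra_image_diversity(embs),
             counterfactual_group_reward(group_embs, i),
             count_reward(len(embs), n_target),
             quality)
    return float(np.dot(w, terms))
```

In practice the embeddings would come from a face detector plus ArcFace encoder run on each generated image; here they are plain arrays so the reward logic stands alone.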
Fig 2. DisCo training overview. For each prompt the model generates a group of images, evaluates all four reward components, and updates via GRPO. The curriculum gradually increases the number of prompted individuals for stable convergence.
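The group-relative update and the curriculum described above can be sketched as follows. Both functions are our own illustrative reconstructions: GRPO standardises each sample's reward against its group, and the schedule here assumes a simple linear anneal of the person-count range, which the paper does not specify.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardise rewards within one prompt's
    sample group, so above-average samples are reinforced and
    below-average ones suppressed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def curriculum_count_range(step: int, total_steps: int,
                           start=(2, 4), end=(2, 7)):
    """Anneal the prompted person-count range from simple (2-4 person)
    to complex (2-7 person) scenes; the linear schedule is an assumption."""
    frac = min(step / max(total_steps, 1), 1.0)
    hi = int(round(start[1] + frac * (end[1] - start[1])))
    return start[0], hi
```

At each training step, prompts would be drawn with person counts from `curriculum_count_range`, a group of images generated per prompt, scored with the compositional reward, and the resulting advantages used in the GRPO policy-gradient update.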
Qualitative Results
DisCo consistently produces groups of individuals with genuinely distinct faces, accurate counts, and high perceptual quality—across both a generalist backbone (Flux-Dev) and a specialist backbone (Krea-Dev).
Fig 3. DisCo vs. related work. DisCo consistently generates the correct number of visually distinct individuals while maintaining high perceptual quality and prompt alignment.
Citation
If you find DisCo useful in your research, please cite our paper:
@article{borse2025disco,
title={{DisCo}: Reinforcement with Diversity Constraints for Multi-Human Generation},
author={Borse, Shubhankar and Farhadzadeh, Farzad and Hayat, Munawar and Porikli, Fatih},
journal={arXiv preprint arXiv:2510.01399},
year={2025}
}