Qualcomm AI Research
Motivation
Modern text-to-image models produce strikingly photorealistic imagery—but ask them to render a group of people and they consistently fail: faces are duplicated, individuals look eerily identical, and the requested headcount is wrong. We call this the identity crisis.
This severely limits real-world applications: synthetic data for group photo personalisation, consistent character storytelling, narrative media, educational content, and social simulation all require reliably different people in every scene.
Fig 1. State-of-the-art text-to-image models frequently generate near-identical faces when prompted for multiple people, even when overall image quality is high. DisCo eliminates these failures.
Our Approach
DisCo fine-tunes flow-matching text-to-image models (e.g. Flux-Dev, Krea-Dev) via Group-Relative Policy Optimisation (GRPO), guided by a compositional reward that simultaneously targets identity uniqueness, count accuracy, and perceptual quality—without requiring any additional annotations.
A key finding: optimising intra-image diversity alone causes global diversity collapse—duplicate identities simply migrate across samples rather than disappearing. DisCo addresses this with a novel group-wise counterfactual reward that penalises cross-sample identity repetition under GRPO. Explicit count and quality rewards prevent reward hacking (grid artefacts, face undercounting). A single-stage curriculum anneals from simple (2–4 person) to complex (2–7 person) scenes for stable convergence.
Intra-image diversity reward: penalises high cosine similarity between ArcFace embeddings of different faces within the same generated image.
Group-wise counterfactual reward: a "remove-one" reward that directly discourages identity recurrence across the GRPO sample group for the same prompt.
Count reward: a binary reward enforcing that the number of detected faces exactly matches the target person count in the prompt.
Quality reward: a normalised HPSv3 score that preserves perceptual quality, prevents grid artefacts, and reinforces fine-grained prompt following.
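The four components above can be sketched as a single compositional reward. This is a minimal illustration, not the released implementation: it assumes unit-normalised ArcFace-style face embeddings, and the function names, similarity threshold, and equal weights are our own choices.

```python
import numpy as np

def intra_image_diversity(embs: np.ndarray) -> float:
    """Penalise high cosine similarity between faces in one image.
    embs: (n_faces, d) unit-normalised face embeddings."""
    n = len(embs)
    if n < 2:
        return 1.0
    sims = embs @ embs.T                     # pairwise cosine similarities
    iu = np.triu_indices(n, k=1)             # upper triangle: distinct pairs
    # High reward when even the most similar pair is still dissimilar.
    return float(1.0 - sims[iu].max())

def counterfactual_group_reward(group_embs: list, i: int,
                                thresh: float = 0.5) -> float:
    """'Remove-one' reward: penalise sample i for identities that also
    appear in the other samples of the same GRPO group (threshold is an
    illustrative assumption)."""
    own = group_embs[i]
    others = np.concatenate([e for j, e in enumerate(group_embs) if j != i])
    sims = own @ others.T                    # (n_own, n_other) cosine sims
    repeats = sims.max(axis=1) > thresh      # faces recurring elsewhere
    return float(1.0 - repeats.mean())

def count_reward(n_detected: int, n_target: int) -> float:
    """Binary reward: exact face-count match with the prompt."""
    return 1.0 if n_detected == n_target else 0.0

def composite_reward(embs, group_embs, i, n_target, quality,
                     w=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the four components; equal weights are assumed.
    `quality` stands in for a normalised HPSv3 score in [0, 1]."""
    terms = (intra_image_diversity(embs),
             counterfactual_group_reward(group_embs, i),
             count_reward(len(embs), n_target),
             quality)
    return float(np.dot(w, terms))
```

In practice the embeddings would come from a face detector plus ArcFace encoder run on each generated image; here they are plain arrays so the reward logic stands alone.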
Fig 2. DisCo training overview. For each prompt the model generates a group of images, evaluates all four reward components, and updates via GRPO. The curriculum gradually increases the number of prompted individuals for stable convergence.
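The group-relative update and the curriculum described above can be sketched as follows. Both functions are our own illustrative reconstructions: GRPO standardises each sample's reward against its group, and the schedule here assumes a simple linear anneal of the person-count range, which the paper does not specify.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardise rewards within one prompt's
    sample group, so above-average samples are reinforced and
    below-average ones suppressed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def curriculum_count_range(step: int, total_steps: int,
                           start=(2, 4), end=(2, 7)):
    """Anneal the prompted person-count range from simple (2-4 person)
    to complex (2-7 person) scenes; the linear schedule is an assumption."""
    frac = min(step / max(total_steps, 1), 1.0)
    hi = int(round(start[1] + frac * (end[1] - start[1])))
    return start[0], hi
```

At each training step, prompts would be drawn with person counts from `curriculum_count_range`, a group of images generated per prompt, scored with the compositional reward, and the resulting advantages used in the GRPO policy-gradient update.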
Qualitative Results
DisCo consistently produces groups of individuals with genuinely distinct faces, accurate counts, and high perceptual quality—across both a generalist backbone (Flux-Dev) and a specialist backbone (Krea-Dev).
Fig 3. DisCo vs. related work. DisCo consistently generates the correct number of visually distinct individuals while maintaining high perceptual quality and prompt alignment.
Citation
If you find DisCo useful in your research, please cite our paper:
@article{borse2025disco,
title={{DisCo}: Reinforcement with Diversity Constraints for Multi-Human Generation},
author={Borse, Shubhankar and Farhadzadeh, Farzad and Hayat, Munawar and Porikli, Fatih},
journal={arXiv preprint arXiv:2510.01399},
year={2025}
}