Generative models have become a powerful tool for image editing tasks, including object insertion. However, these methods often lack spatial awareness, placing objects at unrealistic locations and scales or unintentionally altering the scene background. A key challenge lies in maintaining visual coherence, which requires both a geometrically suitable object location and a high-quality image edit.
In this paper, we focus on the former, creating a location model dedicated to identifying realistic object locations. Specifically, we train an autoregressive model that generates bounding box coordinates, conditioned on the background image and the desired object class. This formulation allows us to handle sparse placement annotations effectively and to incorporate implausible locations into a preference dataset via direct preference optimization. Our extensive experiments demonstrate that our generative location model, when paired with an inpainting method, substantially outperforms state-of-the-art instruction-tuned models and location modeling baselines in object insertion tasks, delivering accurate and visually coherent results.
We approach the problem of generative object insertion by decomposing it into two steps. First, we train a location model to determine where a specific object category could plausibly be placed within a given scene. Then, we provide this location (along with the scene and the object category to be added) to a pretrained inpainting model, which renders the object in place.
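To make the decomposition concrete, here is a minimal sketch of the two-step pipeline in Python. The interfaces (`location_model.sample_bbox`, `inpainting_model.inpaint`) are hypothetical placeholders for illustration, not the actual implementation:

```python
# Minimal sketch of the two-step insertion pipeline. The interfaces
# (location_model.sample_bbox, inpainting_model.inpaint) are hypothetical
# placeholders, not the actual implementation.

def insert_object(scene, category, location_model, inpainting_model):
    """Insert an object of `category` into `scene` in two explicit steps."""
    # Step 1: the location model proposes where an object of this category
    # could plausibly go, as a normalized (x_min, y_min, x_max, y_max) box.
    bbox = location_model.sample_bbox(scene, category)

    # Step 2: a pretrained inpainting model (e.g. PowerPaint) renders the
    # object inside the proposed box; pixels outside the box stay untouched.
    edited = inpainting_model.inpaint(scene, bbox, prompt=category)
    return edited
```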
This strategy contrasts with the common practice of editing scenes with instruction-based methods, which typically operate holistically, regenerating the whole scene and deciding where to place new objects implicitly during the rendering process.
Explicitly modeling locations allows us to protect background areas that should not be changed, and it enables fine-grained control over the editing process. For instance, it becomes easier to bias the edit towards preferred regions, or to force multiple edits in different areas of the scene.
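As an illustration of such control, one simple way to bias edits toward a preferred region is to reject proposals that fall outside it. This is a minimal sketch, assuming a callable that wraps a box sampler like the hypothetical one above; the names `sample_bbox_fn` and `region` are illustrative:

```python
# Illustrative sketch of biasing edits toward a preferred region via
# rejection sampling. `sample_bbox_fn` returns a normalized
# (x_min, y_min, x_max, y_max) box; `region` is a user-chosen box in the
# same coordinates.

def sample_bbox_in_region(sample_bbox_fn, region, max_tries=50):
    """Resample until the proposed box lies inside the preferred region."""
    rx0, ry0, rx1, ry1 = region
    for _ in range(max_tries):
        x0, y0, x1, y1 = sample_bbox_fn()
        if rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1:
            return (x0, y0, x1, y1)
    return None  # no in-region proposal found within the budget
```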
Our location model is generative and operates autoregressively, decoding bounding boxes one coordinate at a time:
Our location model first observes a background scene and an object category to be added to it, encoding them into visual and textual tokens, respectively. These tokens are then processed by a transformer that outputs a plausible location in the form of a bounding box, whose coordinates are generated one at a time in an autoregressive fashion.
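The following is a minimal sketch of this autoregressive decoding loop, assuming discretized coordinates and placeholder modules; `image_encoder`, `text_encoder`, `decoder`, and the binning scheme are illustrative assumptions, not the exact architecture:

```python
import torch

# Minimal sketch of autoregressive bounding-box decoding. The encoders and
# the decoder are placeholders; the actual architecture and tokenization
# may differ.

@torch.no_grad()
def sample_bbox(image, category, image_encoder, text_encoder, decoder,
                num_bins=1000, temperature=1.0):
    """Decode a bounding box as four discretized coordinate tokens."""
    # Condition on the background scene and the object category.
    context = torch.cat([image_encoder(image), text_encoder(category)], dim=1)

    coords = []
    for _ in range(4):  # (x_min, y_min, x_max, y_max)
        # Predict a distribution over coordinate bins given the context
        # and the coordinates generated so far.
        logits = decoder(context, torch.tensor([coords]))  # (1, num_bins)
        probs = torch.softmax(logits / temperature, dim=-1)
        coords.append(torch.multinomial(probs, 1).item())

    # Map discrete bins back to normalized [0, 1] image coordinates.
    return [c / (num_bins - 1) for c in coords]
```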
Training such a model only requires examples of plausible object placements. However, we show that implausible placements can also benefit training when used as negative preferences or rewards, e.g. with Direct Preference Optimization (DPO).
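For reference, here is a minimal sketch of the DPO objective applied to a placement preference pair, assuming the model exposes summed log-probabilities of a box's coordinate tokens (the argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_good, policy_logp_bad,
             ref_logp_good, ref_logp_bad, beta=0.1):
    """DPO loss on a preference pair: a plausible (preferred) and an
    implausible (rejected) bounding box for the same scene and category.

    Each argument is the summed log-probability of the box's coordinate
    tokens under the trained policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference, for preferred and rejected boxes.
    good = policy_logp_good - ref_logp_good
    bad = policy_logp_bad - ref_logp_bad
    # Standard DPO objective: push the preferred box's margin above the
    # rejected one's, scaled by beta.
    return -F.logsigmoid(beta * (good - bad)).mean()
```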
In terms of pure location modeling performance, our autoregressive model achieves a competitive balance between true and false positive rates (TPR, FPR), performing favorably compared with other location modeling and object placement approaches.
When used in conjunction with an off-the-shelf inpainter (PowerPaint), locations proposed by our model result in highly realistic edits. A user study involving 46 participants showed that edits from this pipeline are typically preferred over those of modern instruction-based models, which edit the image globally and reason about the location of the added object only implicitly.
Our model is also preferred over a discriminative baseline (Faster R-CNN) whose proposed locations are fed to the same inpainter.
By combining an explicit location model with a pretrained inpainting method, we observe higher diversity in edits compared with instruction-based architectures; this diversity spans both the location of the edit and its appearance.
@article{yun2024locationmodeling,
author = {Yun, Jooyeol and Abati, Davide and Omran, Mohamed and Choo, Jaegul and Habibian, Amir and Wiggers, Auke},
title = {Generative Location Modeling for Spatially Aware Object Insertion},
journal = {arXiv},
year = {2024},
}