MCOW

A Generate-then-Compose Training-Free Framework for Diffusion-based Generative Models

University of Sheffield, Computer Vision Group

Figure: Comparison of different methods on sub-tasks in T2I-CompBench. We compare our proposed method (MCOW) with three baselines: AAE, GLIGEN, and SD3. AAE is a training-free method that enhances object-centric cross-attention but consistently fails across all sub-tasks. GLIGEN, fine-tuned on Stable Diffusion with bounding-box support, performs well on spatial reasoning but suffers from binding issues and degraded image quality. SD3 handles binding tasks well but lacks spatial reasoning due to the absence of position conditioning. Our method successfully generates all target objects with correct attributes and spatial configurations.

Abstract

We propose MCOW (Multi-subjects Cyclic-One-Way Diffusion), a training-free framework for compositional text-to-image generation that overcomes the limitations of traditional one-shot diffusion models. MCOW follows a Generate-then-Compose paradigm by first generating individual object images and then composing them based on spatial layouts. Unlike prior layout-to-image methods, MCOW decouples content generation and spatial arrangement, enabling stronger attribute binding, object localization, and numeracy capabilities. Experimental results on the T2I-CompBench benchmark demonstrate that MCOW outperforms existing baselines in shape attribute binding and texture attribute binding. Additionally, we show that MCOW can be seamlessly transferred to domain-specific diffusion models such as DiffusionSat for satellite image synthesis. Despite its effectiveness, MCOW is limited by its reliance on DDIM-based diffusion models and its assumption of context independence during subject generation. We discuss these challenges and highlight potential future directions to further improve compositional generalization in generative models.

Method


Overview of the MCOW pipeline. Given a compositional prompt (e.g., "A realistic photograph of a cat and a dog sitting together in a cozy outdoor setting during sunset"), MCOW independently generates each subject and composes them on a gray background (Image Initialization). During Cyclic Diffusion, ground-truth subject latents are repeatedly injected to preserve identity. The final image is obtained via forward diffusion from a stabilized latent.
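The caption above summarizes the two stages; below is a minimal sketch of that loop in Python, assuming a DDIM-based Stable Diffusion backbone through Hugging Face diffusers. The layout coordinates, tile size, number of cycles, and re-noising depth are illustrative choices, not the authors' released implementation; classifier-free guidance is omitted, and the whole subject tile is pasted where the real pipeline would paste a segmented subject.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import DDIMScheduler, StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# 1) Image Initialization: generate each subject from its own sub-prompt and
# paste it onto a gray canvas at its layout position (coordinates are
# illustrative; the real pipeline pastes a segmented subject, not the tile).
layout = {"a photo of a cat": (30, 180), "a photo of a dog": (280, 180)}
canvas = Image.new("RGB", (512, 512), (128, 128, 128))
for sub_prompt, (x, y) in layout.items():
    tile = pipe(sub_prompt, num_inference_steps=50).images[0].resize((200, 200))
    canvas.paste(tile, (x, y))

@torch.no_grad()
def encode(img):
    """PIL image -> scaled VAE latents."""
    x = torch.from_numpy(np.array(img)).permute(2, 0, 1)[None]
    x = x.to(device, torch.float16) / 127.5 - 1.0
    return pipe.vae.encode(x).latent_dist.sample() * pipe.vae.config.scaling_factor

gt = encode(canvas)          # ground-truth composite latents
mask = torch.zeros_like(gt)  # 1 inside subject boxes (latents are 1/8 scale)
for x, y in layout.values():
    mask[..., y // 8 : (y + 200) // 8, x // 8 : (x + 200) // 8] = 1.0

scene = "a cat and a dog sitting together in a cozy outdoor setting during sunset"
ids = pipe.tokenizer(scene, padding="max_length", truncation=True,
                     max_length=pipe.tokenizer.model_max_length,
                     return_tensors="pt").input_ids.to(device)

# 2) Cyclic Diffusion: re-noise the composite part-way and denoise it back,
# several times; at every step the subject regions are overwritten with the
# noised ground-truth latents, so information flows one way (subjects -> scene).
pipe.scheduler.set_timesteps(50, device=device)
start = 20           # re-noising depth per cycle (illustrative)
latents = gt.clone()
with torch.no_grad():
    emb = pipe.text_encoder(ids)[0]
    for _ in range(3):  # number of cycles (illustrative)
        noise = torch.randn_like(latents)
        latents = pipe.scheduler.add_noise(latents, noise, pipe.scheduler.timesteps[start])
        for t in pipe.scheduler.timesteps[start:]:
            pinned = pipe.scheduler.add_noise(gt, noise, t)  # subjects at noise level t
            latents = mask * pinned + (1 - mask) * latents   # one-way injection
            eps = pipe.unet(latents, t, encoder_hidden_states=emb).sample
            latents = pipe.scheduler.step(eps, t, latents).prev_sample

    # Decode the stabilized latent into the final image.
    img = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample[0]
out = ((img.float().cpu().permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5).numpy().astype("uint8")
Image.fromarray(out).save("mcow_sketch.png")
```

The per-step overwrite is what makes the diffusion one-way: the scene can adapt its background and lighting to the injected subjects, while the subjects themselves are never altered.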

MCOW on a Specific Domain


Left: traditional generative models such as SD2 exhibit limitations in generating high-quality satellite imagery. Middle: DiffusionSat, though tailored to this domain, struggles with compositional generation. Right: applying the MCOW framework to DiffusionSat overcomes these challenges, demonstrating superior performance in compositional satellite image synthesis.
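Because MCOW is training-free, this transfer amounts, in principle, to swapping the backbone pipeline in the Method sketch. The snippet below illustrates that idea; the checkpoint path is a placeholder, and DiffusionSat's extra metadata conditioning may require its released pipeline code rather than the stock diffusers class.

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

# Placeholder checkpoint path: substitute the released DiffusionSat weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/diffusionsat-checkpoint", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
# ...then run the same generate-then-compose loop as in the Method sketch.
```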

Limitations


MCOW may fail when automated segmentation misinterprets object semantics. In one case, a "red orange" is segmented as its pulp rather than the whole fruit, so the object's identity is lost after masking. Such issues are exacerbated in fully automated pipelines; with minimal human intervention, however, MCOW achieves accurate results.
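One form of that intervention is substituting a hand-corrected mask for the automated one at composition time. A minimal sketch, with illustrative file names and coordinates:

```python
from PIL import Image

# Hand-corrected mask replaces the automated segmentation for this subject.
canvas = Image.new("RGB", (512, 512), (128, 128, 128))
subject = Image.open("red_orange.png").convert("RGB").resize((200, 200))
mask = Image.open("red_orange_mask.png").convert("L").resize((200, 200))
canvas.paste(subject, (150, 150), mask)  # whole fruit kept, not just the pulp
```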


The Generate-then-Compose paradigm gives MCOW a natural edge on numeracy tasks. In one example, MCOW correctly generates "eight bottles," preserving each object's integrity, whereas SD3 fails at the same prompt. However, layout-to-image approaches may struggle to maintain quality as the number of objects grows, which remains an open issue for future work.