MCOW

A Generate-then-Compose Training-Free Framework for Diffusion-based Generative Models

University of Sheffield, Computer Vision Group

Figure: Comparison of different methods on sub-tasks in T2I-CompBench. We compare our proposed method (MCOW) with three baselines: AAE, GLIGEN, and SD3. AAE is a training-free method that enhances object-centric cross-attention but consistently fails across all sub-tasks. GLIGEN, fine-tuned on Stable Diffusion with bounding-box support, performs well on spatial reasoning but suffers from binding issues and degraded image quality. SD3 handles binding tasks well but lacks spatial reasoning due to the absence of position conditioning. Our method successfully generates all target objects with correct attributes and spatial configurations.

Abstract

We propose MCOW (Multi-subjects Cyclic-One-Way Diffusion), a training-free framework for compositional text-to-image generation that overcomes the limitations of traditional one-shot diffusion models. MCOW follows a Generate-then-Compose paradigm by first generating individual object images and then composing them based on spatial layouts. Unlike prior layout-to-image methods, MCOW decouples content generation and spatial arrangement, enabling stronger attribute binding, object localization, and numeracy capabilities. Experimental results on the T2I-CompBench benchmark demonstrate that MCOW outperforms existing baselines in shape attribute binding and texture attribute binding. Additionally, we show that MCOW can be seamlessly transferred to domain-specific diffusion models such as DiffusionSat for satellite image synthesis. Despite its effectiveness, MCOW is limited by its reliance on DDIM-based diffusion models and its assumption of context independence during subject generation. We discuss these challenges and highlight potential future directions to further improve compositional generalization in generative models.

Method


Overview of the MCOW pipeline. Given a compositional prompt (e.g., "A realistic photograph of a cat and a dog sitting together in a cozy outdoor setting during sunset"), MCOW independently generates each subject and composes them on a gray background (Image Initialization). During Cyclic Diffusion, ground-truth subject latents are repeatedly injected to preserve identity. The final image is obtained via forward diffusion from a stabilized latent.
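The caption above summarizes the two stages; below is a minimal sketch of that loop in Python, assuming a DDIM-based Stable Diffusion backbone through Hugging Face diffusers. The layout coordinates, tile size, number of cycles, and re-noising depth are illustrative choices, not the authors' released implementation; classifier-free guidance is omitted, and the whole subject tile is pasted where the real pipeline would paste a segmented subject.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import DDIMScheduler, StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# 1) Image Initialization: generate each subject from its own sub-prompt and
# paste it onto a gray canvas at its layout position (coordinates are
# illustrative; the real pipeline pastes a segmented subject, not the tile).
layout = {"a photo of a cat": (30, 180), "a photo of a dog": (280, 180)}
canvas = Image.new("RGB", (512, 512), (128, 128, 128))
for sub_prompt, (x, y) in layout.items():
    tile = pipe(sub_prompt, num_inference_steps=50).images[0].resize((200, 200))
    canvas.paste(tile, (x, y))

@torch.no_grad()
def encode(img):
    """PIL image -> scaled VAE latents."""
    x = torch.from_numpy(np.array(img)).permute(2, 0, 1)[None]
    x = x.to(device, torch.float16) / 127.5 - 1.0
    return pipe.vae.encode(x).latent_dist.sample() * pipe.vae.config.scaling_factor

gt = encode(canvas)          # ground-truth composite latents
mask = torch.zeros_like(gt)  # 1 inside subject boxes (latents are 1/8 scale)
for x, y in layout.values():
    mask[..., y // 8 : (y + 200) // 8, x // 8 : (x + 200) // 8] = 1.0

scene = "a cat and a dog sitting together in a cozy outdoor setting during sunset"
ids = pipe.tokenizer(scene, padding="max_length", truncation=True,
                     max_length=pipe.tokenizer.model_max_length,
                     return_tensors="pt").input_ids.to(device)

# 2) Cyclic Diffusion: re-noise the composite part-way and denoise it back,
# several times; at every step the subject regions are overwritten with the
# noised ground-truth latents, so information flows one way (subjects -> scene).
pipe.scheduler.set_timesteps(50, device=device)
start = 20           # re-noising depth per cycle (illustrative)
latents = gt.clone()
with torch.no_grad():
    emb = pipe.text_encoder(ids)[0]
    for _ in range(3):  # number of cycles (illustrative)
        noise = torch.randn_like(latents)
        latents = pipe.scheduler.add_noise(latents, noise, pipe.scheduler.timesteps[start])
        for t in pipe.scheduler.timesteps[start:]:
            pinned = pipe.scheduler.add_noise(gt, noise, t)  # subjects at noise level t
            latents = mask * pinned + (1 - mask) * latents   # one-way injection
            eps = pipe.unet(latents, t, encoder_hidden_states=emb).sample
            latents = pipe.scheduler.step(eps, t, latents).prev_sample

    # Decode the stabilized latent into the final image.
    img = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample[0]
out = ((img.float().cpu().permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5).numpy().astype("uint8")
Image.fromarray(out).save("mcow_sketch.png")
```

The per-step overwrite is what makes the diffusion one-way: the scene can adapt its background and lighting to the injected subjects, while the subjects themselves are never altered.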

MCOW on a Specific Domain


Left: traditional generative models such as SD2 exhibit limitations in generating high-quality satellite imagery. Middle: DiffusionSat, though tailored to this domain, struggles with compositional generation. Right: applying the MCOW framework to DiffusionSat overcomes these challenges, demonstrating superior performance in compositional satellite image synthesis.
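Because MCOW is training-free, this transfer amounts, in principle, to swapping the backbone pipeline in the Method sketch. The snippet below illustrates that idea; the checkpoint path is a placeholder, and DiffusionSat's extra metadata conditioning may require its released pipeline code rather than the stock diffusers class.

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

# Placeholder checkpoint path: substitute the released DiffusionSat weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/diffusionsat-checkpoint", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
# ...then run the same generate-then-compose loop as in the Method sketch.
```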

Limitations


MCOW may fail when automated segmentation misinterprets object semantics. In one case, a "red orange" is segmented as its pulp rather than the whole fruit, so the object's identity is lost after masking. Such issues are exacerbated in fully automated pipelines; with minimal human intervention, however, MCOW achieves accurate results.
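One form of that intervention is substituting a hand-corrected mask for the automated one at composition time. A minimal sketch, with illustrative file names and coordinates:

```python
from PIL import Image

# Hand-corrected mask replaces the automated segmentation for this subject.
canvas = Image.new("RGB", (512, 512), (128, 128, 128))
subject = Image.open("red_orange.png").convert("RGB").resize((200, 200))
mask = Image.open("red_orange_mask.png").convert("L").resize((200, 200))
canvas.paste(subject, (150, 150), mask)  # whole fruit kept, not just the pulp
```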


The Generate-then-Compose paradigm gives MCOW a natural edge on numeracy tasks. In one example, MCOW correctly generates "eight bottles," preserving each object's integrity, whereas SD3 fails at the same prompt. However, layout-to-image approaches may struggle to maintain quality as the number of objects grows, which remains an open issue for future work.