Trio Diffusion | Ehsan Nazari

"What if I could generate images that go on forever? – seemed like a fun problem to tackle."

This project continues my TailorGAN work but with diffusion models, aiming to generate images of arbitrary size with no real boundaries through auto-regressive patch generation.

The Basic Idea

Standard diffusion models generate fixed-size images. For larger outputs, you’re limited to upscaling or tiling. My approach: generate images piece by piece, auto-regressively, where each new patch conditions on three neighboring patches in an L-shape configuration.

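The model looks at a “trio” of patches (top-left, top-right, bottom-left) and generates the missing bottom-right patch. The new patch shares its top edge with the top-right patch and its left edge with the bottom-left patch, so it is conditioned on both of the borders it has to match.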

Trio Configuration:

[Top-Left]     [Top-Right]
[Bottom-Left]  [To-Be-Generated]
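
To make the layout concrete, here is a minimal sketch of how a trio could be packed into a single conditioning input. It is only an illustration, not the project's actual code: the function name build_trio_canvas, the 2x2 canvas, and the binary mask are all assumptions, borrowed from the way inpainting-style models are commonly conditioned.

```python
import numpy as np

def build_trio_canvas(top_left, top_right, bottom_left):
    """Place three known patches on a 2x2 canvas and mask the missing one.

    Hypothetical helper: each patch is an (H, W, C) array. Returns the canvas
    and a mask that is 1 where pixels still have to be generated.
    """
    h, w, c = top_left.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=top_left.dtype)
    canvas[:h, :w] = top_left      # top-left quadrant
    canvas[:h, w:] = top_right     # top-right quadrant
    canvas[h:, :w] = bottom_left   # bottom-left quadrant
    # The bottom-right quadrant stays zero and is flagged in the mask.
    mask = np.zeros((2 * h, 2 * w), dtype=np.uint8)
    mask[h:, w:] = 1
    return canvas, mask
```

The mask marks exactly one quadrant, which is the patch the model is asked to fill in.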

Loss Function Architecture

The model uses a RobustCombinedLoss with five key components:

MSE Loss (well-known) - Base diffusion reconstruction loss
Perceptual Loss (well-known) - VGG-based perceptual similarity
LPIPS Loss (well-known) - Learned perceptual metric
Edge Loss (well-known) - Preserves high-frequency details
Boundary Loss (added for this project) - Encourages spatial continuity across patch boundaries
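
As a rough illustration of how five terms like these can be folded into one training objective, the sketch below uses a plain weighted sum. Both the weighted-sum form and the weight values are assumptions made for the example; the actual RobustCombinedLoss may combine its terms differently.

```python
def combine_losses(terms, weights):
    """Weighted sum of named loss terms; a missing weight defaults to 1.0."""
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())

# Hypothetical per-term values for one training step.
terms = {"mse": 0.10, "perceptual": 0.40, "lpips": 0.30, "edge": 0.05, "boundary": 0.20}
# Hypothetical weights, chosen only to show the mechanics.
weights = {"mse": 1.0, "perceptual": 0.1, "lpips": 0.1, "edge": 0.5, "boundary": 1.0}

total = combine_losses(terms, weights)
```

In practice the weights are hyperparameters, and the balance between the boundary term and the perceptual terms would need tuning.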

Boundary Loss Details

The Boundary Continuity Loss extracts edge regions from generated patches and compares them to expected boundaries from context patches using both pixel-level MSE and gradient-based MSE (via Sobel filters). This helps reduce visible seams at patch boundaries, though it’s just one piece of the auto-regressive generation puzzle.
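
A minimal NumPy sketch of the idea, under a few assumptions: patches are 2D grayscale arrays, only the top and left edge strips are compared (the two edges that touch context in the trio layout), and the expected boundary is passed in directly, since how it is derived from the context patches isn't spelled out here. The function names and the strip width are hypothetical.

```python
import numpy as np

def sobel_gradients(img):
    """Return (gx, gy) Sobel responses for a 2D array (valid region only)."""
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return gx, gy

def strip_loss(generated_strip, expected_strip):
    """Pixel-level MSE plus MSE between Sobel gradients of two strips."""
    pixel = np.mean((generated_strip - expected_strip) ** 2)
    ggx, ggy = sobel_gradients(generated_strip)
    egx, egy = sobel_gradients(expected_strip)
    grad = np.mean((ggx - egx) ** 2) + np.mean((ggy - egy) ** 2)
    return pixel + grad

def boundary_continuity_loss(generated, expected, width=4):
    """Average strip loss over the top and left edge regions.

    `width` must be at least 3 so the 3x3 Sobel kernel fits inside a strip.
    """
    generated = np.asarray(generated, dtype=float)
    expected = np.asarray(expected, dtype=float)
    top = strip_loss(generated[:width, :], expected[:width, :])
    left = strip_loss(generated[:, :width], expected[:, :width])
    return 0.5 * (top + left)
```

Because only the edge strips enter the loss, differences in the interior of the patch do not change it; those are left to the other loss terms.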

Early Results

Example of auto-regressive infinite image generation - showing the patch-based approach in action

Another run from the same initial margin patches, generating a larger image.

Another run, starting from different margin patches.

Yet another run, starting from different margin patches.

Example of extending a ChatGPT-generated image.

Second example of extending a ChatGPT-generated image.

The current model shows promise with improved local continuity compared to naive tiling. The boundary loss effectively reduces visible seams, though generating complex and realistic objects with long-range coherence remains challenging and requires further experimentation.

Generation Process:

Patches along the top and left edges of the initial image condition on real images
All subsequent patches condition on previously generated content
Auto-regressive generation allows the image to be extended indefinitely, at least in principle
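
The process above can be sketched as a raster-order loop over a grid of patches. This is a simplified illustration under stated assumptions: the margin (top row and left column) is given as ready-made patches, and generate_patch is a hypothetical stand-in for the diffusion sampler. Here it just averages its three neighbors so that the example runs on its own.

```python
import numpy as np

def generate_grid(corner, seed_top, seed_left, generate_patch):
    """Fill a grid of patches row by row, left to right.

    corner:    patch at grid position (0, 0)
    seed_top:  patches for row 0, columns 1..n (the top margin)
    seed_left: patches for column 0, rows 1..m (the left margin)
    generate_patch(top_left, top_right, bottom_left) -> new patch
    """
    rows, cols = len(seed_left) + 1, len(seed_top) + 1
    grid = [[None] * cols for _ in range(rows)]
    grid[0][0] = corner
    for c, patch in enumerate(seed_top, start=1):
        grid[0][c] = patch
    for r, patch in enumerate(seed_left, start=1):
        grid[r][0] = patch
    # Each new patch sees the trio above and to its left.
    for r in range(1, rows):
        for c in range(1, cols):
            grid[r][c] = generate_patch(grid[r - 1][c - 1], grid[r - 1][c], grid[r][c - 1])
    # Stitch the patches into one image.
    return np.concatenate([np.concatenate(row, axis=1) for row in grid], axis=0)

def average_neighbors(top_left, top_right, bottom_left):
    """Placeholder for the model: the mean of the three context patches."""
    return (top_left + top_right + bottom_left) / 3.0

def const(value, size=4):
    """Constant-valued square patch, used only for this demo."""
    return np.full((size, size), float(value))

image = generate_grid(const(0), [const(1), const(2)], [const(3), const(4)], average_neighbors)
```

Since every patch depends only on patches above and to its left, the grid can be grown by adding rows or columns without revisiting earlier patches, which is what makes open-ended extension possible.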