Personalizing Diffusion Models

An interactive guide to six methods for teaching text-to-image models new concepts — from full fine-tuning to embedding optimization, consistency distillation, and reinforcement learning.

Independent study by Meghana Nanuvala under the guidance of Professor Mohammad Al Hasan · Indiana University

Stable Diffusion v1.5 · HuggingFace Diffusers · NVIDIA H100 · IU Quartz HPC

The Personalization Problem

Diffusion models like Stable Diffusion can generate stunning images from text prompts. But what if you want them to generate images of your specific dog, in your favorite art style, or capturing your unique concept? That's the personalization problem: adapting a general-purpose model to faithfully reproduce specific subjects, styles, or concepts from just 3–5 reference images.

The challenge is finding the right balance — the model must learn the new concept without forgetting everything else it knows. Different methods make different trade-offs between fidelity (how well it captures the concept), efficiency (how much compute and storage it needs), and flexibility (how easily it can be shared and combined).

Six Approaches, One Goal

Full Fine-Tuning

DreamBooth

Fine-tunes the entire U-Net (~860M params) to bind your subject to a rare token like "sks".

~3.4 GB • Highest fidelity • 400 steps

Parameter-Efficient

LoRA

Injects small trainable matrices into frozen attention layers. Merges at inference with zero overhead.

~3 MB • Great for styles • 15k steps

Minimal Footprint

Textual Inversion

Learns just a text embedding vector for a new pseudo-token. The entire model stays frozen.

~3-24 KB • Lightest method • 800 steps

Speed

LCM Distillation

Distills the model into a student that generates in 1–4 steps instead of 20–50.

~4x faster inference • In progress

Selective Fine-Tuning

Custom Diffusion

Trains only cross-attention K & V projections + a modifier token. Supports multi-concept composition.

~75 MB • Multi-concept • In progress

Reinforcement Learning

DDPO

Treats denoising as an MDP and uses PPO to optimize arbitrary reward functions.

Custom rewards • In progress

Evaluation Metrics — How Do I Measure Success?

Before comparing methods, it is important to understand what is being measured. Personalization quality isn't a single number — it has multiple dimensions. A generated image of "your dog on the beach" needs to (1) actually show a beach scene, (2) look like your specific dog, and (3) preserve the dog's structural identity (shape, pose, features). No single metric captures all three, so I use three complementary metrics that together give a complete picture.


CLIP-T — Text-Image Alignment

Question it answers: "Does the generated image match what the prompt asked for?"

CLIP-T measures how well the generated image aligns with the text prompt. It uses OpenAI's CLIP model, which was trained on 400 million image-text pairs to understand the relationship between images and natural language. CLIP encodes both text and images into the same embedding space, so their similarity can be directly measured.

How it works step-by-step:

1. Encode the text prompt through CLIP's text encoder to get a text embedding vector.
2. Encode the generated image through CLIP's image encoder to get an image embedding vector.
3. Compute cosine similarity between the two vectors (range: -1 to 1).

$$\text{CLIP-T} = \cos\big(E_{\text{text}}(\text{prompt}),\, E_{\text{image}}(\text{generated})\big) = \frac{e_t \cdot e_i}{\lVert e_t \rVert \, \lVert e_i \rVert}$$

0.15–0.20: Poor (image doesn't match prompt)
0.20–0.27: Moderate (partially matches)
0.27–0.35: Good (strong text alignment)

Why is this important?

A personalized model that can only reproduce the training images verbatim isn't useful. CLIP-T tells us if the model can compose, that is, place your subject into new scenes described by text. If you ask for "sks dog wearing a spacesuit" and get a generic dog in a spacesuit, CLIP-T will be high but the personalization failed. If you get your specific dog but no spacesuit, CLIP-T drops. Both components matter.
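As a rough sketch of step 3, the cosine similarity behind CLIP-T fits in a few lines of plain Python. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions. Real CLIP embeddings come from the text and image encoders and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy stand-ins for a text embedding e_t and an image embedding e_i.
e_t = [1.0, 0.0, 0.0]
e_i = [1.0, 1.0, 0.0]
clip_t = cosine_similarity(e_t, e_i)  # 1 / sqrt(2), since the vectors are 45 degrees apart
```

Because the score depends only on the angle between the vectors, rescaling either embedding leaves it unchanged.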


CLIP-I — Image-Image Similarity

Question it answers: "Does the generated image look visually similar to the reference photos?"

CLIP-I measures the visual similarity between the generated image and the original reference images of your subject. Unlike CLIP-T which compares text-to-image, CLIP-I compares image-to-image — both are passed through CLIP's image encoder, and their closeness in the shared embedding space is measured.

How it works step-by-step:

1. Encode each reference image through CLIP's image encoder.
2. Encode the generated image through the same image encoder.
3. Compute cosine similarity between the generated image and each reference, then average.

$$\text{CLIP-I} = \frac{1}{N} \sum_{i=1}^{N} \cos\big(E_{\text{image}}(\text{generated}),\, E_{\text{image}}(\text{reference}_i)\big)$$

<0.75: Poor (doesn't resemble subject)
0.75–0.85: Good (recognizable similarity)
0.85–0.95: Excellent (strong visual match)

Why is this important?

CLIP-I captures the high-level "visual gestalt" — color palette, texture patterns, and overall appearance. It's good at catching whether the generated image looks like the subject at a glance. However, CLIP was trained for semantic understanding, not fine-grained identity — it might give high scores to two different golden retrievers. That's why DINO-I is also needed.
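The averaging in step 3 can be sketched as follows. This is a minimal illustration with made-up 2-dimensional embeddings. The helper names `_cos` and `mean_reference_similarity` are hypothetical, and in practice the vectors would come from CLIP's image encoder.

```python
import math

def _cos(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_reference_similarity(generated, references):
    """Average cosine similarity between one generated embedding and N references."""
    if not references:
        raise ValueError("need at least one reference embedding")
    return sum(_cos(generated, ref) for ref in references) / len(references)

# One generated embedding scored against three toy reference embeddings.
generated = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = mean_reference_similarity(generated, references)  # (1 + 0 + 1/sqrt(2)) / 3
```

The same averaging applies to any image-to-image metric of this form. Only the feature extractor that produces the embeddings changes.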


DINO-I — Structural / Identity Similarity

Question it answers: "Is this the same subject, not just a similar-looking one?"

DINO-I uses DINOv2 (ViT-S/14) — a vision transformer trained with self-supervised learning on images only (no text). Unlike CLIP, which aligns images with language, DINOv2 learns purely visual features through self-distillation. This makes it exceptionally sensitive to structural details: specific shapes, poses, ear positions, fur patterns — the things that distinguish your dog from other dogs of the same breed.

How it works step-by-step:

1. Extract DINOv2 CLS token features from the generated image.
2. Extract DINOv2 CLS token features from each reference image.
3. Compute cosine similarity, which is sensitive to shape, pose, and fine-grained identity.

$$\text{DINO-I} = \frac{1}{N} \sum_{i=1}^{N} \cos\big(\text{DINO}(\text{generated}),\, \text{DINO}(\text{reference}_i)\big)$$

<0.45: Poor (wrong identity/structure)
0.45–0.65: Moderate (partial identity match)
0.65–0.80: Good (clear identity preservation)

Why is this important?

DINO-I is the most demanding metric. CLIP-I might give a high score to any similar-looking dog, but DINO-I drops sharply if the structural identity changes: wrong ear shape, different face markings, altered body proportions. When DreamBooth's DINO-I dropped from 0.588 to 0.462 at λ=1.0, it meant the model was generating plausible but generic dogs rather than the specific subject. DINO-I caught what CLIP-I (0.800) might have obscured.

Why all three? Each metric has a blind spot. CLIP-T doesn't care about subject identity (any dog in a bucket scores high). CLIP-I captures visual similarity but can be fooled by similar-looking subjects. DINO-I is strict on identity but doesn't evaluate prompt adherence. Together, they form a triangle: text alignment + visual similarity + structural identity = comprehensive evaluation.
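To read the three scores side by side, the interpretation bands listed for each metric can be turned into a small lookup. This is only a sketch. Treating each band as half-open and returning "Unrated" outside the listed ranges are assumptions, and `BANDS`, `rate`, and `rate_all` are hypothetical names.

```python
# Bands copied from the interpretation ranges given for each metric.
# Treating each band as half-open [low, high) is an assumption.
BANDS = {
    "CLIP-T": [(0.15, 0.20, "Poor"), (0.20, 0.27, "Moderate"), (0.27, 0.35, "Good")],
    "CLIP-I": [(float("-inf"), 0.75, "Poor"), (0.75, 0.85, "Good"), (0.85, 0.95, "Excellent")],
    "DINO-I": [(float("-inf"), 0.45, "Poor"), (0.45, 0.65, "Moderate"), (0.65, 0.80, "Good")],
}

def rate(metric, score):
    """Return the band label for a score, or "Unrated" if no listed band covers it."""
    for low, high, label in BANDS[metric]:
        if low <= score < high:
            return label
    return "Unrated"

def rate_all(scores):
    """Rate a dict of metric name -> score."""
    return {metric: rate(metric, value) for metric, value in scores.items()}

# DreamBooth at lambda = 1.0, using the values reported in the results.
report = rate_all({"CLIP-T": 0.269, "CLIP-I": 0.800, "DINO-I": 0.462})
```

For this run CLIP-I still lands in the "Good" band while DINO-I sits near the bottom of "Moderate", which matches the blind-spot argument above.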

[Figure: evaluation pipeline. The prompt "sks dog on the beach" and the generated image pass through the CLIP text and image encoders to give the CLIP-T score (text-image alignment). The generated and reference images pass through the CLIP image encoder to give the CLIP-I score (visual similarity), and through DINOv2 (ViT-S/14) to give the DINO-I score (structural identity).]

How Each Method Works

Each method below is covered in turn: the core idea, the training objective, and what makes it unique.

DreamBooth — Fine-Tuning the Entire Model

DreamBooth (Ruiz et al., CVPR 2023) takes the most direct approach: it fine-tunes all ~860M parameters of the U-Net denoiser. You provide 3–5 images of your subject and bind them to a rare identifier token like "sks". After training, prompting with "a photo of sks dog on the beach" generates your specific dog in that scene.

The key innovation is the prior-preservation loss — during training, the model simultaneously generates generic class images (e.g., random dogs) and ensures it doesn't forget what a "dog" looks like in general. This prevents language drift, where fine-tuning makes the model associate "dog" exclusively with your specific dog.

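$$\mathcal{L} = \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(\alpha_t x + \sigma_t \epsilon,\, c_{\text{inst}}) \rVert^2\big] + \lambda \cdot \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(\alpha_t x_{\text{pr}} + \sigma_t \epsilon,\, c_{\text{pr}}) \rVert^2\big]$$
  First term: instance loss (learn your subject)  |  Second term: prior loss (don't forget the class)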

Architecture Deep Dive: What Happens During Training

1. Collect Reference Images

Gather 3–5 photos of your subject. These are your "instance" images — the ground truth the model will learn from. Quality matters more than quantity.

2. Bind to Rare Token

Choose a rare identifier like "sks". The prompt becomes "a photo of sks dog". The rarity avoids colliding with existing vocabulary the model already knows.

3. Generate Class Images

Before training, the original model generates ~100 generic class images (e.g., "a photo of a dog"). These serve as the "memory" of what the class should look like.

4. Dual-Path Training

Each training step runs two forward passes: one with your subject (instance path) and one with a generic class image (prior path). The U-Net learns your subject while the prior path prevents forgetting.

5. Full U-Net Update

Gradients flow through the entire U-Net — all ~860M parameters in the encoder, middle block, decoder, and attention layers. This is why DreamBooth achieves the highest fidelity but also the largest checkpoint.

6. Noise Prediction

Like standard diffusion training, the model learns to predict the noise ε added to the image at each timestep. The CLIP text encoder conditions this prediction on the prompt embedding.

Why does prior preservation matter?

Without it, after seeing just 3 dog photos, the model "forgets" what dogs in general look like. The word "dog" becomes synonymous with your specific dog. With the prior loss weighted by λ, the model maintains a balance: "sks dog" = your dog, "a dog" = any dog. My experiments show λ=1.0 goes too far the other way: the prior loss dominates and suppresses the subject's unique features (DINO-I dropped about 21%, from 0.588 at λ=0.25 to 0.462).

The λ parameter is critical: too low and the model overfits to your subject; too high and it over-regularizes, suppressing subject-specific features. My experiments show λ ∈ [0.50, 0.75] is the sweet spot.
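A minimal sketch of how λ enters the objective, using plain lists in place of tensors. Averaging the squared error (rather than summing it) and the toy numbers are assumptions for illustration. `mse` and `dreambooth_loss` are hypothetical helper names, not functions from any training script.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dreambooth_loss(inst_pred, inst_noise, prior_pred, prior_noise, prior_weight):
    """Instance loss plus the lambda-weighted prior-preservation loss."""
    instance_loss = mse(inst_pred, inst_noise)
    prior_loss = mse(prior_pred, prior_noise)
    return instance_loss + prior_weight * prior_loss

# Toy noise targets and predictions for the two paths.
inst_noise, inst_pred = [0.0, 1.0], [0.0, 0.0]      # instance MSE = 0.5
prior_noise, prior_pred = [1.0, 1.0], [0.0, 0.0]    # prior MSE = 1.0

# Same predictions, scored under each prior weight from the sweep.
losses = {lam: dreambooth_loss(inst_pred, inst_noise, prior_pred, prior_noise, lam)
          for lam in (0.25, 0.50, 0.75, 1.0)}
```

With these toy values the prior term contributes a third of the total at λ=0.25 and two thirds at λ=1.0, which illustrates how a larger λ shifts the gradient signal toward the generic class.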

LoRA — Low-Rank Adaptation

LoRA (Hu et al., ICLR 2022) takes a smarter approach: instead of updating all 860M weights, it freezes the pre-trained model and injects tiny trainable "adapter" matrices into each attention layer. These adapters are low-rank decompositions — two small matrices A and B whose product approximates the change needed.

The beauty of LoRA is at inference time: the adapter matrices can be merged directly into the original weights (W' = W + BA), so there's literally zero additional latency. You get a ~3 MB file instead of a ~3.4 GB checkpoint, and you can swap adapters in and out.

h = Wx + BAx    where W is frozen (d×k), A is (r×k), B is (d×r)
  rank 4: ~1.6M params (0.2%)  |  rank 8: ~3.2M (0.4%)  |  rank 16: ~6.4M (0.7%)

Architecture Deep Dive: Low-Rank Decomposition in Attention

1. Freeze Everything

The original U-Net weights W are completely frozen — no gradients flow through them. This preserves all the knowledge the base model learned during pre-training on billions of images.

2. Inject Adapter Pairs

For each attention layer's Q, K, V, and output projections, insert a parallel path: a "down-projection" matrix A (r×k, compresses to rank r) and an "up-projection" matrix B (d×r, expands back). Only A and B are trainable.

3. Parallel Forward Pass

During training (and at inference, if the adapter has not been merged), input x flows through both paths: the frozen Wx and the adapter BAx. The outputs are summed: h = Wx + BAx. The adapter path is a "correction" to the original computation.

4. Why "Low-Rank"?

The key insight: the change ΔW needed for fine-tuning has much lower rank than the full weight matrix. For d=k=768 and r=4, a full update needs 589,824 params per layer; LoRA needs only 6,144 — a 96x reduction.

5. Merge at Inference

After training, compute W' = W + BA and replace the original weights. The adapter paths are eliminated entirely — the model architecture is identical to the original, just with modified weights. Zero latency overhead.

6. Composability

Because adapters are additive (W + BA), you can combine multiple LoRAs: W + B1A1 + B2A2. For example, a "Naruto style" LoRA can be mixed with a "watercolor" LoRA. Full DreamBooth checkpoints offer no comparably simple way to do this, since each one replaces the entire U-Net.
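The merge step can be checked numerically on a tiny example. The sketch below uses hand-picked matrices with d=2, k=3, r=1 (all values are illustrative assumptions) and confirms that the parallel path Wx + BAx gives the same output as the merged weight (W + BA)x.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, x):
    """Multiply a matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weight, d x k
A = [[0.5, -1.0, 2.0]]                     # down-projection, r x k
B = [[2.0], [-1.0]]                        # up-projection, d x r
x = [1.0, 2.0, 3.0]

# Parallel path: frozen output plus adapter correction.
unmerged = [w + b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Merged path: fold BA into the weights once, then do a single matmul.
merged = matvec(add(W, matmul(B, A)), x)
```

Both paths give [16.0, -5.5] here. A second adapter would be folded in the same way by adding its own product of B and A to the merged matrix.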

What does rank r actually control?

Rank r determines the adapter's expressiveness — how complex a modification it can represent. Think of it as resolution: rank 4 can capture broad style changes (color palettes, line styles), while rank 16 can represent finer adjustments. But more capacity means more risk of overfitting — my rank 16 experiments collapsed because the adapter memorized training data noise rather than learning generalizable style features.

Higher rank doesn't always mean better! My experiments show rank 16 caused mode collapse (black outputs) at 15k training steps, while rank 4 produced the best results. The effective dimensionality of the style manifold was lower than expected.
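The per-layer arithmetic quoted in the deep dive (589,824 vs. 6,144 parameters for d=k=768, r=4) can be reproduced directly. This sketch counts one d×k projection only. The totals for the whole U-Net depend on how many layers are adapted, which is not modeled here.

```python
def full_update_params(d, k):
    """Parameters in a dense d x k weight update."""
    return d * k

def lora_params(d, k, r):
    """Parameters in a rank-r adapter: A is r x k and B is d x r."""
    return r * k + d * r

d = k = 768
full = full_update_params(d, k)                             # 589,824
per_rank = {r: lora_params(d, k, r) for r in (4, 8, 16)}   # 6,144 / 12,288 / 24,576
reduction = {r: full / n for r, n in per_rank.items()}     # 96x / 48x / 24x
```

Adapter size grows linearly with r, so going from rank 4 to rank 16 quadruples the trainable parameters per layer.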

Textual Inversion — Teaching Through Words

Textual Inversion (Gal et al., ICLR 2023) is the most minimalist approach. Instead of changing any model weights, it learns a new word — specifically, a new embedding vector v* for a pseudo-token like <sks-cat>. The entire U-Net and text encoder stay completely frozen.

With multiple vectors, the pseudo-token expands to [v1, v2, ..., vN], giving a richer representation. Even at 8 vectors, you're only training 6,144 parameters (~24 KB) — orders of magnitude less than LoRA or DreamBooth.

$$v^* = \arg\min_{v} \; \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(\alpha_t x + \sigma_t \epsilon,\, c_\theta(\text{prompt with } v)) \rVert^2\big]$$
  Only v* gets gradients; all 983M model parameters stay frozen

Architecture Deep Dive: How a Single Embedding Captures a Concept

1. Add Token to Vocabulary

A new pseudo-token <sks-cat> is added to the CLIP tokenizer's vocabulary. Its embedding is initialized randomly or from a similar word (e.g., "cat"). This embedding lives in the 768-dimensional CLIP embedding space.

2. Freeze Everything Else

The CLIP text encoder (~123M params) and the entire U-Net (~860M params) are frozen. Only the embedding vector(s) for the new token receive gradients. That's 768 params for 1 vector — literally 0.00008% of the model.

3. Forward Through Frozen Pipeline

During training, the prompt "a photo of <sks-cat>" is tokenized, the new embedding is looked up, and the entire sequence passes through the frozen CLIP encoder and U-Net. The denoising loss is computed normally.

4. Backward to Embedding Only

Gradients from the denoising loss propagate back through the frozen U-Net, through the frozen text encoder, all the way to the embedding lookup table. But only the new token's entry gets updated — all other embeddings are frozen too.

5. Multi-Vector Expansion

With N vectors, the token expands to N consecutive embeddings [v1,...,vN] in the sequence. This is like giving the model N "words" to describe your concept instead of one. 4 vectors = 3,072 params (~12 KB).

6. Use in Any Prompt

The learned embedding works in any prompt: "a painting of <sks-cat> in space", "<sks-cat> as a cartoon". The frozen model handles composition; the embedding just tells it what the subject looks like.
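Step 4 (gradients reach the whole embedding table, but only the new token's row changes) can be sketched with a toy table. Plain SGD, the 2-dimensional embeddings, and the function name `sgd_step_new_token_only` are assumptions for illustration. A real run would use the 768-dimensional CLIP table and typically an adaptive optimizer.

```python
def sgd_step_new_token_only(embeddings, grads, new_token_ids, lr):
    """Apply a gradient step to the new token rows and leave every other row untouched."""
    trainable = set(new_token_ids)
    updated = []
    for token_id, (row, grad_row) in enumerate(zip(embeddings, grads)):
        if token_id in trainable:
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            updated.append(list(row))  # frozen vocabulary entry
    return updated

# Toy table: 3 existing tokens plus 1 new pseudo-token (id 3).
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
grads = [[1.0, 1.0] for _ in embeddings]   # gradients arrive for every row
new_table = sgd_step_new_token_only(embeddings, grads, new_token_ids=[3], lr=0.5)
```

After the step, rows 0 to 2 are identical to the originals and only row 3 has moved, which is the sense in which the vocabulary stays frozen.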

Why does this work at all with so few parameters?

The frozen model already "knows" about cats, textures, lighting, and composition. All Textual Inversion needs to do is find the right point in the text embedding space that describes your specific cat to the existing model. It's not learning to generate; it's learning to describe. The frozen model is a very strong regularizer: with only a few thousand trainable parameters, there is little capacity to overfit. This may be why TI with 4 vectors (3,072 params) achieved higher CLIP-I (0.857) than DreamBooth's 860M params (0.845), although the two runs use different subjects (cat toy vs. dog), so the comparison is indicative only.

Despite being ~280,000x more parameter-efficient than DreamBooth (3,072 vs. ~860M trainable parameters), Textual Inversion with 4 vectors achieves the highest CLIP-I score (0.857) across all my experiments! The frozen model acts as a strong regularizer.
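The parameter and storage figures for Textual Inversion follow from simple arithmetic. The sketch below assumes float32 storage (4 bytes per parameter) and uses the approximate 860M U-Net size quoted in this guide. `ti_footprint` is a hypothetical helper name.

```python
EMBED_DIM = 768              # CLIP embedding width stated in this guide
BYTES_PER_PARAM = 4          # assumes float32 storage
UNET_PARAMS = 860_000_000    # approximate U-Net size quoted in this guide

def ti_footprint(num_vectors):
    """Trainable parameters, storage in KiB, and ratio to the U-Net size."""
    params = num_vectors * EMBED_DIM
    return {
        "params": params,
        "kib": params * BYTES_PER_PARAM / 1024,
        "unet_ratio": UNET_PARAMS / params,
    }

footprints = {n: ti_footprint(n) for n in (1, 2, 4, 8)}
```

For 4 vectors this gives 3,072 parameters and 12 KiB, about 280,000 times fewer parameters than the U-Net. For 8 vectors it gives 6,144 parameters and 24 KiB, about 140,000 times fewer.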

Latent Consistency Distillation

LCM (Luo et al., 2023) tackles a different problem: speed. Standard diffusion needs 20–50 denoising steps to generate an image. LCM distills the model into a student that can do it in 1–4 steps — a 5–50x speedup.

The idea is elegant: treat the multi-step denoising as solving an ODE, then train the student to predict the ODE solution directly. The consistency constraint ensures predictions at different points along the ODE trajectory all map to the same clean output.

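$$\mathcal{L}_{\text{LCD}} = \mathbb{E}\big[d\big(f_\theta(z_{t_{n+1}}, t_{n+1}, c),\; f_{\theta^-}(\hat{z}^{\phi}_{t_n}, t_n, c)\big)\big]$$
  Student output at $t_{n+1}$ must be consistent with EMA target at $t_n$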

Architecture Deep Dive: From Multi-Step ODE to One-Step Prediction

1. Teacher Provides Trajectory

The pre-trained SD v1.5 model acts as a "teacher." Given a noisy latent $z_{t_{n+1}}$, the teacher uses one ODE solver step to produce an estimate $\hat{z}_{t_n}$, that is, what the latent should look like one step closer to clean.

2. Student Learns Shortcuts

The student model takes the same noisy input but tries to predict the final clean output directly. It's learning to skip the intermediate steps the teacher would need.

3. EMA Target for Stability

An exponential moving average (EMA) of the student acts as the target model. The student's prediction at tn+1 must be consistent with the EMA target's prediction at tn. This self-consistency is the key constraint.

4. 1-4 Steps at Inference

After distillation, the student can generate images in just 1–4 denoising steps instead of the teacher's 20–50. This enables real-time generation and interactive editing applications.
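The distance d in the consistency loss and the EMA target update can be sketched on toy vectors. The pseudo-Huber form sqrt(e² + c²) − c used below is an assumption about what "Huber" means here, and the decay of 0.5 is chosen only to make the arithmetic easy to follow. All function names are hypothetical.

```python
import math

def l2_distance(pred, target):
    """Mean squared difference between student and target outputs."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pseudo_huber_distance(pred, target, c=0.001):
    """Smooth Huber-style penalty: quadratic near zero, close to |error| for large errors."""
    return sum(math.sqrt((p - t) ** 2 + c * c) - c for p, t in zip(pred, target)) / len(pred)

def ema_update(target_params, student_params, decay):
    """Move the target network a small step toward the student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(target_params, student_params)]

student_out = [0.0, 2.0]   # stands in for the student output at t_{n+1}
target_out = [0.0, 0.0]    # stands in for the EMA target output at t_n
loss_l2 = l2_distance(student_out, target_out)               # (0 + 4) / 2 = 2.0
loss_huber = pseudo_huber_distance(student_out, target_out)  # close to (0 + 2) / 2 = 1.0
new_target = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.5)  # [0.5, 1.5]
```

For the same error the L2 distance grows quadratically while the Huber-style distance grows roughly linearly, so large per-element errors weigh less under the latter.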

Both L2 and Huber loss variants achieve comparable CLIP-T scores (~0.25) with a ~27× latency reduction (108ms vs. 2,913ms for DreamBooth). L2 shows slightly higher variance across prompts (0.191–0.293) compared to Huber (0.232–0.266), while the choice of loss function does not significantly affect text-image alignment.

Custom Diffusion — Selective Cross-Attention Fine-Tuning

Custom Diffusion (Kumari et al., CVPR 2023) finds a middle ground between DreamBooth's full fine-tuning and Textual Inversion's embedding-only approach. It trains only the key (K) and value (V) projection matrices in the U-Net's cross-attention layers — the layers where text conditioning meets visual features. Additionally, it learns a modifier token embedding (like <V1>) similar to Textual Inversion.

The breakthrough feature is multi-concept composition: Custom Diffusion can learn two or more concepts simultaneously (e.g., your dog + your couch) and compose them in a single prompt. It uses real images retrieved via CLIP for regularization instead of model-generated class images, which provides stronger diversity.

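$$\mathcal{L} = \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(\alpha_t x + \sigma_t \epsilon,\, c_{\text{mod}}) \rVert^2\big] + \lambda \cdot \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(\alpha_t x_{\text{reg}} + \sigma_t \epsilon,\, c_{\text{class}}) \rVert^2\big]$$
  Only K, V projections + modifier token receive gradients    |    Regularization with real retrieved images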

Architecture Deep Dive: Why Cross-Attention K & V Are the Sweet Spot

1. Identify the Critical Layers

The U-Net has two types of attention: self-attention (spatial features attend to each other) and cross-attention (visual features attend to text). Custom Diffusion's key insight is that cross-attention K and V projections are where concept identity is encoded — they map text to the visual "what."

2. Freeze Everything Except K & V

The entire U-Net is frozen except for to_k and to_v weights in cross-attention layers. Optionally, to_q and to_out can also be trained (using --freeze_model=crossattn). This targets ~75 MB of parameters — ~2% of the model, far less than DreamBooth but more expressive than TI.

3. Learn a Modifier Token

Like Textual Inversion, a new modifier token (e.g., <V1>) is added and its embedding is initialized from a semantically close word. The text encoder is frozen except for this token's embedding, providing a dual learning signal: the token describes the concept, and K/V projections learn to attend to it.

4. Real-Image Regularization

Instead of generating class images with the base model (DreamBooth's approach), Custom Diffusion retrieves ~200 real images via CLIP retrieval from LAION. Real images provide more diverse regularization, reducing overfitting more effectively than model-generated samples.

5. Multi-Concept Joint Training

The star feature: train on multiple concepts simultaneously by providing a JSON config with each concept's images, prompts, and class data. The model learns separate modifier tokens (<V1>, <V2>) and shared K/V updates. At inference: "<V1> dog sitting on <V2> couch".

6. Efficient Storage & Composability

The saved checkpoint contains only the modified K/V weights (~75 MB) plus the modifier token embeddings. At inference, these are loaded via load_attn_procs() on top of the base model. Multiple Custom Diffusion checkpoints can be composed for novel concept combinations.
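The selection in step 2 amounts to filtering parameters by name. The sketch below assumes a naming scheme in which cross-attention modules are called `attn2` and self-attention modules `attn1`. The example names and the helper `select_trainable` are hypothetical, so the filter should be checked against the actual model before use.

```python
def select_trainable(param_names, mode="crossattn_kv"):
    """Return the parameter names that would stay trainable under each variant."""
    if mode == "crossattn_kv":
        suffixes = ("to_k.weight", "to_v.weight")
    elif mode == "crossattn":
        suffixes = ("to_q.weight", "to_k.weight", "to_v.weight", "to_out.0.weight")
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Keep only cross-attention ("attn2") projections with a matching suffix.
    return [n for n in param_names if ".attn2." in n and n.endswith(suffixes)]

block = "down_blocks.0.attentions.0.transformer_blocks.0"
names = [
    f"{block}.attn1.to_k.weight",       # self-attention: always frozen
    f"{block}.attn2.to_q.weight",
    f"{block}.attn2.to_k.weight",
    f"{block}.attn2.to_v.weight",
    f"{block}.attn2.to_out.0.weight",
    "mid_block.resnets.0.conv1.weight",  # convolution: always frozen
]
kv_only = select_trainable(names, "crossattn_kv")
all_cross = select_trainable(names, "crossattn")
```

Under these assumed names, the `crossattn_kv` variant keeps two tensors per cross-attention block and the `crossattn` variant keeps four.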

Why are K and V projections so special?

In cross-attention, Q comes from the visual features (what the image "asks about"), while K and V come from the text (what the text "offers" as answers). K determines where the model looks in the text for each spatial position, and V determines what information flows back. By only modifying K and V, Custom Diffusion changes how the model interprets text as visual features — without touching spatial reasoning (self-attention) or the denoising backbone. This is why it can learn new concepts with ~2% of the parameters while maintaining the model's compositional abilities.
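A toy single-head cross-attention makes the roles of Q, K, and V concrete. The queries stand in for image positions and the keys and values for already-projected text tokens. All numbers and function names are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head attention: each image query mixes the text values by key similarity."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

queries = [[10.0, 0.0], [0.0, 10.0]]            # two spatial positions (from the image)
keys = [[1.0, 0.0], [0.0, 1.0]]                 # two text tokens, after the K projection
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # same tokens, after the V projection
out, attn = cross_attention(queries, keys, values)
```

The first position attends almost entirely to the first token and the second position to the second. Changing only the K and V projections alters which token each position selects and what content it receives, while the queries stay fixed.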

Experiments in progress. Custom Diffusion occupies a unique niche: more expressive than Textual Inversion, more efficient than DreamBooth, and the only method here designed to train several concepts jointly for multi-concept composition.

DDPO — Reinforcement Learning Meets Diffusion

DDPO (Black et al., 2024) is fundamentally different: it treats each denoising step as a policy action in a Markov Decision Process. The "reward" comes from evaluating the final generated image — this could be an aesthetic score, CLIP similarity, or any custom metric. PPO (Proximal Policy Optimization) updates the model to maximize this reward.

This is powerful because it can optimize objectives that can't be expressed as simple reconstruction losses — like "make the image more aesthetically pleasing" or "better match human preferences."

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[R(x_0, c)\big]$$
  Maximize expected reward over denoising trajectories using PPO

Architecture Deep Dive: Denoising as a Markov Decision Process

1. Formulate Denoising as MDP

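Each denoising step is reframed as an RL action. The state is the current noisy latent x_t (together with the timestep t and the prompt c), the action is the next latent x_{t-1}, which the scheduler samples from the distribution defined by the noise prediction ε_θ(x_t, t, c), and the transition simply moves to that latent. The full T-step trajectory τ = (x_T, x_{T-1}, ..., x_0) forms one episode.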

2. Generate Full Trajectories

The policy (U-Net) generates complete denoising trajectories from pure noise xT to clean image x0. Unlike supervised training that sees real data, DDPO trains entirely on self-generated images. Multiple trajectories are sampled per batch to reduce gradient variance.

3. Evaluate with Reward Function

Only the final image x_0 is scored by the reward function R(x_0, c). The reward does not need to be differentiable: it can be CLIP similarity to a text prompt, an aesthetic predictor, a human preference model, or a combination of several objectives. The reward is sparse, assigned only at the end of the episode.

4. Credit Assignment via PPO

PPO distributes the sparse end-of-episode reward back to each denoising step. It computes advantages: how much better was each action than average? The clipped objective prevents any single update from changing the policy too drastically, which is critical for keeping diffusion model training stable.

5. Policy Gradient Update

The gradient $\nabla_\theta J = \mathbb{E}\big[\sum_t \nabla_\theta \log p_\theta(a_t \mid s_t) \cdot A_t\big]$ updates the U-Net weights. Steps that contributed to high-reward images get reinforced; steps that led to poor images get suppressed. A KL penalty against the original model helps limit reward hacking.

6. Iterate with Fresh Samples

Unlike offline methods, DDPO is on-policy: each training iteration generates new trajectories with the current policy, evaluates them, and updates. This means the model continually explores new regions of the image space, adapting its generation strategy based on what the reward function values.
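Steps 4 and 5 can be sketched as two small functions: one that turns raw rewards into advantages and one that evaluates the clipped surrogate for a single denoising step. The clip range of 0.2, the batch-level normalization, and the function names are assumptions for illustration, not settings taken from these experiments.

```python
import math

def normalize_rewards(rewards, eps=1e-8):
    """Turn raw rewards into advantages with zero mean and roughly unit variance."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """Negative clipped surrogate for one denoising step (lower is better)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return -min(ratio * advantage, clipped * advantage)

advantages = normalize_rewards([5.0, 6.0, 7.0])           # e.g. three aesthetic scores
loss_same = ppo_clipped_loss(-1.0, -1.0, advantage=1.0)   # ratio 1, no clipping
loss_big = ppo_clipped_loss(0.0, -1.0, advantage=1.0)     # ratio e, clipped at 1.2
```

In the second call the policy has moved far from the one that generated the sample, so the objective stops rewarding further movement once the ratio exceeds 1 + clip_range.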

Why is RL-based optimization game-changing for personalization?

Traditional methods (DreamBooth, LoRA, TI) minimize a reconstruction loss — they can only learn to reproduce what's in the training data. DDPO can optimize for arbitrary objectives that may not have a differentiable loss function. Want images that are more aesthetically pleasing? Use an aesthetic scorer as reward. Want better text-image alignment? Use CLIP similarity. Want to match human preferences? Use a reward model trained on human rankings. This flexibility means DDPO can improve qualities that reconstruction-based methods fundamentally cannot target, making it complementary to them rather than a replacement.

The LoRA variant achieves a higher aesthetic score (6.28) than full U-Net fine-tuning (6.05) with comparable CLIP-T (~0.23), suggesting LoRA's constrained parameter space effectively regularizes reward optimization. Both variants share steady-state latency of ~1,190ms at 50 inference steps.

Architecture Animations

[Animated diagrams, one per method. Color key: purple = trained, blue = frozen, green dashed = data.]

DreamBooth: "a photo of sks dog" with subject images (3-5 photos) and "a photo of a dog" with class images (100 generic) each pass through the frozen CLIP encoder and the noise scheduler into the U-Net (all 860M parameters trained), which predicts ε. L = L_inst + λ · L_prior.
LoRA: input x goes through the frozen pretrained W (d×k) and, in parallel, through A (r×k, down-project) and B (d×r, up-project). h = Wx + BAx. r=4: ~1.6M (0.2%) | r=8: ~3.2M (0.4%) | r=16: ~6.4M (0.7%).
Textual Inversion: subject images (3-5 photos) optimize the trained embedding v* for <sks-cat>; the CLIP encoder and U-Net are frozen, and the gradient flows only to v*. 1 vec = 768 params (3 KB) | 4 vec = 3,072 (12 KB) | 8 vec = 6,144 (24 KB). Frozen: text encoder (~123M) + U-Net (~860M).
LCM: the noisy latent z_t(n+1) goes to the frozen teacher (SD v1.5) with a 1-step ODE solver and the EMA target f_θ-, and to the trained student f_θ; the two are compared by the consistency loss L_LCD. The student predicts the ODE solution directly: 1-4 steps instead of 20-50.
Custom Diffusion: "photo of <V1> cat" with subject images (4-5 photos) and the <V1> modifier embedding pass through the frozen CLIP encoder into the cross-attention layer, where only K and V are trained; the U-Net body is frozen. "a photo of a cat" with 200 CLIP-retrieved real images provides regularization. L = L_concept + λ · L_reg.
DDPO: the denoising trajectory is an MDP episode. Starting from noise x_T, the U-Net (policy) produces x_T-1, ..., x_1, x_0; the reward R(x_0, c) drives the PPO policy gradient ∇θ.

Experimental Setup

All experiments run on Stable Diffusion v1.5 (except Custom Diffusion, which uses SD v1.4) with NVIDIA H100 80 GB GPUs on IU Quartz HPC. For each method, I sweep its most impactful hyperparameter.

Setting | DreamBooth | LoRA | Textual Inversion
Task | Subject (dog) | Style (Naruto) | Subject (cat)
Dataset | 3 dog images | naruto-blip-captions | 3 cat images
Steps | 400 | 15,000 | 800
Learning rate | 5×10⁻⁶ | 1×10⁻⁴ | 5×10⁻⁴
Variable | λ ∈ {0.25, 0.50, 0.75, 1.0} | rank r ∈ {4, 8, 16} | vectors ∈ {1, 2, 4, 8}
Metrics | CLIP-T, CLIP-I, DINO-I | CLIP-T only* | CLIP-T, CLIP-I, DINO-I

*LoRA is a style-transfer task with no reference images, so CLIP-I/DINO-I are not applicable.

Setting | Custom Diffusion | LCM Distillation | DDPO
Task | Subject (dog) | General acceleration | Aesthetic reward
Base model | SD v1.4 (CompVis) | SD v1.5 | SD v1.5
Dataset | 5 dog images | LAION-CC12M | RL-generated
Epochs / Steps | 250 steps | 10,000 steps | 200 epochs
Learning rate | 1×10⁻⁵ | 1×10⁻⁶ | 3×10⁻⁴
Variable | crossattn vs crossattn_kv | Loss: L2 vs Huber | No LoRA vs LoRA
Inference steps | 30 | 4 | 50
Metrics | CLIP-T, CLIP-I, DINO-I | CLIP-T, Latency† | Aesthetic, CLIP-T, Latency†

†LCM and DDPO generate generic images (no subject); CLIP-I/DINO-I not applicable. CLIP-T is measured for text-image alignment.

Input Datasets

Subject-driven methods require a small set of reference images. LoRA and LCM use large-scale datasets for style transfer and distillation respectively. DDPO, as an on-policy RL method, trains entirely on its own generated images scored by an aesthetic reward model.

Dog — DreamBooth (3 imgs) & Custom Diffusion (5 imgs)

[Images: dog 1, dog 2, dog 3]

Source: diffusers/dog-example — same subject, different image counts

Cat toy — Textual Inversion (3 images)

[Images: cat 1, cat 2, cat 3]

Source: diffusers/cat_toy_example

Large-scale Datasets

LoRA: lambdalabs/naruto-blip-captions
~1,200 Naruto-style illustrations with BLIP captions

LCM: LAION-CC12M
Web-crawled image–text pairs for consistency distillation

DDPO: Self-generated
On-policy RL; trains on its own outputs scored by aesthetic reward

Results

DreamBooth: Effect of Prior Weight λ

λ | CLIP-T ↑ | CLIP-I ↑ | DINO-I ↑
0.25 | 0.274 | 0.823 | 0.588
0.50 | 0.272 | 0.823 | 0.560
0.75 | 0.269 | 0.845 | 0.564
1.00 | 0.269 | 0.800 | 0.462

DINO-I (structural similarity) by λ:

λ = 0.25: 0.588
λ = 0.50: 0.560
λ = 0.75: 0.564
λ = 1.00: 0.462

Generated Samples

Prompts: "sks dog in a bucket" | "sks dog on grassy field" | "sks dog wearing bandana"

λ = 0.25 (best CLIP-T & DINO-I)

[Images: bucket, field, bandana]

λ = 0.75 (best CLIP-I)

[Images: bucket, field, bandana]

λ = 1.00 (over-regularized)

[Images: bucket, field, bandana]

Key Finding: Lower λ (0.25) yields the highest CLIP-T (0.274) and DINO-I (0.588), letting the model capture subject-specific structural details. λ = 0.75 maximizes CLIP-I (0.845), preserving the subject's overall visual appearance. λ = 1.0 causes over-regularization, with DINO-I dropping to 0.462. The sweet spot is λ ∈ [0.50, 0.75].
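The size of the DINO-I drop can be computed directly from the table. This short sketch uses only the reported values. `relative_drop` is a hypothetical helper name.

```python
# DINO-I by prior weight, copied from the DreamBooth results table.
dino_i = {0.25: 0.588, 0.50: 0.560, 0.75: 0.564, 1.00: 0.462}

def relative_drop(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

best_lambda = max(dino_i, key=dino_i.get)                             # 0.25
drop_from_best = relative_drop(dino_i[best_lambda], dino_i[1.00])     # about 0.214
drop_from_075 = relative_drop(dino_i[0.75], dino_i[1.00])             # about 0.181
```

Relative to the best setting, λ = 1.0 loses roughly a fifth of the DINO-I score, while the three lower settings stay within about 0.03 of each other.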

LoRA: Effect of Rank

Prompt (CLIP-T ↑) | Rank 4 | Rank 8 | Rank 16
Bill Gates with a hoodie | 0.242 | 0.172 | 0.172
John Oliver Naruto style | 0.287 | 0.290 | 0.155
Hello Kitty Naruto style | 0.314 | 0.277 | 0.253
Mickael Jackson as ninja | 0.154 | 0.238 | 0.154
Mean CLIP-T | 0.249 | 0.244 | 0.183
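The mean row can be recomputed from the per-prompt scores, which also makes it easy to add further prompts later. The values are copied from the table, and `mean_by_rank` is a hypothetical helper name.

```python
# Per-prompt CLIP-T by LoRA rank, copied from the table (prompt text kept verbatim).
clip_t = {
    "Bill Gates with a hoodie": {4: 0.242, 8: 0.172, 16: 0.172},
    "John Oliver Naruto style": {4: 0.287, 8: 0.290, 16: 0.155},
    "Hello Kitty Naruto style": {4: 0.314, 8: 0.277, 16: 0.253},
    "Mickael Jackson as ninja": {4: 0.154, 8: 0.238, 16: 0.154},
}

def mean_by_rank(table):
    """Average each rank's score over all prompts."""
    ranks = sorted(next(iter(table.values())))
    return {r: sum(row[r] for row in table.values()) / len(table) for r in ranks}

means = mean_by_rank(clip_t)
best_rank = max(means, key=means.get)
```

Rank 4 and rank 8 differ by only about 0.005 in mean CLIP-T, while rank 16 is lower by more than 0.06.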

Generated Samples

Rank 4 (best overall)

[Images: Bill Gates, John Oliver, Hello Kitty]

Rank 8

[Images: John Oliver, Hello Kitty, M. Jackson]

Rank 16 (mode collapse)

[Images: John Oliver (collapsed), Hello Kitty, M. Jackson (collapsed)]

Key Finding: Rank 4 achieves the highest mean CLIP-T (0.249) with only ~1.6M trainable parameters. Rank 8 performs comparably (0.244) and outperforms on some prompts, suggesting prompt-dependent optimal capacity. Rank 16 suffers mode collapse, producing black/collapsed outputs (mean CLIP-T drops to 0.183). Rank 4–8 is optimal; increasing rank beyond the effective dimensionality of the style manifold is counterproductive.

Textual Inversion: Effect of Number of Vectors

Vectors | CLIP-T ↑ | CLIP-I ↑ | DINO-I ↑
1 | 0.241 | 0.798 | 0.628
2 | 0.237 | 0.819 | 0.664
4 | 0.271 | 0.857 | 0.687
8 | 0.277 | 0.834 | 0.690

Generated Samples

Prompts: "<sks-cat> in a basket" | "on a table" | "in a garden"

1 Vector (768 params, ~3 KB)

[Images: basket, table, garden]

4 Vectors (3,072 params, ~12 KB), best CLIP-I

[Images: basket, table, garden]

8 Vectors (6,144 params, ~24 KB), best CLIP-T & DINO-I

[Images: basket, table, garden]

Key Finding: 1 vector (768 params) has insufficient capacity, yielding the lowest CLIP-I (0.798) and DINO-I (0.628). 4 vectors peak on CLIP-I (0.857); 8 vectors peak on CLIP-T (0.277) and DINO-I (0.690). With only ~3–24 KB of storage, Textual Inversion achieves competitive or superior CLIP-I and DINO-I compared to DreamBooth's ~3.4 GB checkpoint, although the two methods were evaluated on different subjects. The frozen model appears to act as a powerful regularizer.

Custom Diffusion Results

I compare two Custom Diffusion variants: crossattn (trains all cross-attention projections) vs. crossattn_kv (trains only K and V projections + modifier token). Both use 5 dog images, 250 training steps, SD v1.4, modifier token <new1>.

Variant | CLIP-T | CLIP-I | DINO-I | Latency (ms)
crossattn (all) | 0.258 | 0.723 | 0.005 | 645
crossattn_kv | 0.257 | 0.741 | 0.081 | 644

crossattn_kv (K, V only) — Best variant

[Images: <new1> dog in the park | <new1> dog wearing sunglasses | <new1> dog as a cartoon]

crossattn (all cross-attention)

[Images: <new1> dog in the park | <new1> dog wearing sunglasses | <new1> dog as a cartoon]
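Key Finding: The crossattn_kv variant (fewer params) outperforms full crossattn on both CLIP-I (0.741 vs 0.723) and DINO-I (0.081 vs 0.005), while CLIP-T (0.257 vs 0.258) and latency (644 vs 645 ms) are essentially tied. One plausible explanation is that training the additional projections adds noise when only 250 steps are available. DINO-I is low for both variants, which suggests more training steps would improve structural fidelity.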

LCM Distillation Results

LCM targets inference speed, not subject fidelity. I compare two distillation loss variants — L2 and Huber (c=0.001) — both trained for 10k steps on LAION-CC12M from SD v1.5, generating images in only 4 denoising steps vs. 30 for other methods.
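
For reference, the two loss variants can be sketched as follows. This assumes the Huber variant is the pseudo-Huber form commonly used in LCM distillation scripts, sqrt(d² + c²) − c averaged over elements, with c = 0.001 as stated above; the exact form used in these runs is not spelled out here, so treat the function as illustrative.

```python
import math

def l2_loss(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pseudo_huber_loss(pred, target, c=0.001):
    """Mean of sqrt(d^2 + c^2) - c: quadratic near zero, close to |d| for |d| >> c."""
    return sum(math.sqrt((p - t) ** 2 + c ** 2) - c for p, t in zip(pred, target)) / len(pred)

# Toy values: one exact prediction and one large residual (an outlier).
pred, target = [0.0, 0.0], [0.0, 2.0]
print("L2:   ", l2_loss(pred, target))
print("Huber:", pseudo_huber_loss(pred, target))
```

With a small c the pseudo-Huber loss grows roughly linearly in the residual, so a single large error contributes far less than it does under L2.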

Prompt | L2 CLIP-T | Huber CLIP-T | L2 (ms) | Huber (ms)
a cat sitting on a sofa | 0.280 | 0.266 | 576* | 17,063*
a car parked on a street | 0.191 | 0.247 | 108 | 111
a bowl of fruit on a table | 0.293 | 0.257 | 106 | 107
a person riding a bicycle | 0.243 | 0.232 | 106 | 107
Mean | 0.252 | 0.250 | 224 | 4,347
Mean (excl. warmup) | - | - | 108 | 108

* First image includes GPU warmup overhead (varies across runs). Steady-state latency is identical for both loss variants.

Latency Comparison Across Methods

Method | Inference Steps | Latency (ms) | Speedup (vs. DreamBooth)
DreamBooth (λ=0.75) | 30 | 2,913 | 1.0×
LoRA (r=8) | 30 | 835 | 3.5×
Custom Diff. (kv) | 30 | 644 | 4.5×
Textual Inv. (4 vec) | 30 | 613 | 4.8×
LCM (L2 / Huber) | 4 | 108 | 27.0×
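
The speedup column is the DreamBooth latency divided by each method's latency. The short sketch below recomputes it from the table values and also shows how the LCM speedup changes when a different 30-step method is used as the baseline; the dictionary keys and the helper name are illustrative.

```python
# Steady-state latencies in milliseconds, copied from the table above.
latency_ms = {
    "DreamBooth": 2913,
    "LoRA": 835,
    "Custom Diffusion (kv)": 644,
    "Textual Inversion": 613,
    "LCM": 108,
}

def speedup(baseline_ms: float, method_ms: float) -> float:
    """How many times faster a method is than the chosen baseline."""
    return baseline_ms / method_ms

vs_dreambooth = {m: round(speedup(latency_ms["DreamBooth"], ms), 1) for m, ms in latency_ms.items()}
lcm_vs_others = {m: round(speedup(ms, latency_ms["LCM"]), 1) for m, ms in latency_ms.items() if m != "LCM"}

print(vs_dreambooth)
print(lcm_vs_others)
```

Relative to DreamBooth the LCM speedup is 27.0×; relative to the other 30-step methods it is between about 5.7× (Textual Inversion) and 7.7× (LoRA), so the headline figure depends on the baseline chosen.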

LCM Generated Samples — L2 Loss (4 steps)

[Images: cat on sofa | car on street | fruit bowl | person on bicycle]

LCM Generated Samples — Huber Loss (4 steps)

[Images: cat on sofa | car on street | fruit bowl | person on bicycle]
Key Finding: Both L2 and Huber loss variants reach the same steady-state latency (~108 ms), a ~27× speedup over DreamBooth's 30-step inference (2,913 ms) and roughly 6–8× over the other 30-step methods (613–835 ms). Mean CLIP-T scores are comparable (L2: 0.252, Huber: 0.250), which suggests the choice of loss has little effect on speed or average text-image alignment. Individual prompts still differ: L2 spans a wider per-prompt range (0.191–0.293) than Huber (0.232–0.266).

DDPO Results

DDPO uses reinforcement learning with an aesthetic reward function. I compare two variants: No LoRA (full U-Net RL fine-tuning) and LoRA (RL fine-tuning with low-rank adapters), both trained for 200 epochs on SD v1.5, evaluated with 50 inference steps at guidance 7.5.

Prompt | Aesthetic (No-LoRA) | Aesthetic (LoRA) | CLIP-T (No-LoRA) | CLIP-T (LoRA) | Latency ms (No-LoRA) | Latency ms (LoRA)
portrait (soft lighting) | 6.14 | 5.67 | 0.208 | 0.217 | 3,629* | 16,589*
landscape (mountains) | 6.32 | 6.57 | 0.252 | 0.251 | 1,216 | 1,190
dog in a park | 5.12 | 6.78 | 0.205 | 0.204 | 1,199 | 1,190
futuristic city (neon) | 6.61 | 6.09 | 0.260 | 0.258 | 1,202 | 1,181
Mean | 6.05 | 6.28 | 0.231 | 0.232 | 1,812 | 5,038
Mean (excl. warmup) | - | - | - | - | 1,206 | 1,187

* First image includes GPU warmup overhead (varies across runs).
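
The two latency means in the table differ only in whether the first (warmup) image is counted. A minimal sketch of that calculation, using the DDPO latencies above; the helper name `steady_state_mean` is illustrative.

```python
from statistics import mean

# Per-prompt latencies in milliseconds, in table order (first entry includes warmup).
no_lora_ms = [3629, 1216, 1199, 1202]
lora_ms = [16589, 1190, 1190, 1181]

def steady_state_mean(latencies, warmup=1):
    """Mean latency after dropping the first `warmup` measurements."""
    return mean(latencies[warmup:])

for label, values in (("No-LoRA", no_lora_ms), ("LoRA", lora_ms)):
    print(f"{label}: mean {mean(values):.1f} ms, excl. warmup {steady_state_mean(values):.1f} ms")
```

Dropping the first image changes the LoRA mean from about 5,038 ms to 1,187 ms, which is why the steady-state row is the one used for comparison.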

DDPO Generated Samples — No-LoRA (full U-Net RL)

[Images: portrait | landscape | dog in park | city neon]

DDPO Generated Samples — LoRA

[Images: portrait | landscape | dog in park | city neon]
Key Finding: The LoRA variant achieves a higher mean aesthetic score (6.28 vs. 6.05) despite training fewer parameters, which may indicate that LoRA's constrained parameter space regularizes reward optimization. The gain is not uniform: LoRA scores higher on the landscape and dog prompts but lower on the portrait and city prompts, and the mean is driven largely by "dog in a park" (6.78 vs. 5.12). Both variants produce comparable CLIP-T (~0.23) and steady-state latency (~1,200 ms).

Method Comparison

Property | DreamBooth | LoRA | Textual Inv. | Custom Diff. | LCM | DDPO
What's trained | Entire U-Net | Low-rank adapters | Embedding only | Cross-attn K,V + token | Student model | U-Net (RL)
Trainable params | ~860M | ~1.6-6.4M | ~768-6,144 | ~57M (~2%) | Full model | Full model
Storage | ~3.4 GB | ~3 MB | ~3-24 KB | ~75 MB | ~3.4 GB | ~3.4 GB
Best CLIP-T | 0.274 | 0.249 | 0.277 | 0.258 | 0.252 | 0.232
Best CLIP-I | 0.845 | N/A | 0.857 | 0.741 | N/A* | N/A*
Best DINO-I | 0.588 | N/A | 0.690 | 0.081 | N/A* | N/A*
Aesthetic | - | - | - | - | - | 6.28
Latency | 2,913 ms | 835 ms | 613 ms | 644 ms | 108 ms | 1,187 ms
Inference | Normal | Normal (merged) | Normal | Normal | ~27× faster | Normal (50 steps)
Multi-concept | No | Additive | No | Native | No | No

* LCM and DDPO generate generic images (no subject); CLIP-I/DINO-I not applicable.

The Fidelity–Efficiency Spectrum

DreamBooth: ~860M params | ~3.4 GB
Custom Diff.: ~57M params | ~75 MB
LoRA: ~1.6-6.4M params | ~3 MB
Textual Inv.: ~768-6K params | ~3-24 KB

Surprising finding: Textual Inversion, the simplest and most lightweight method, achieved the highest CLIP-I (0.857) and DINO-I (0.690) of the three methods evaluated on subject fidelity. The frozen model likely acts as a strong regularizer, preventing the overfitting that can affect full fine-tuning approaches.

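Speed champion: LCM reduces inference from 30 steps to just 4 and runs in 108 ms per image, a 27× speedup over DreamBooth (2,913 ms) and roughly 6–8× over the other 30-step methods (613–835 ms). Less can also be more on the training side: for Custom Diffusion, training only the K and V projections outperforms training all cross-attention parameters on CLIP-I and DINO-I.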

RL reward optimization: DDPO with LoRA achieves a higher mean aesthetic score (6.28) than full U-Net RL fine-tuning (6.05), suggesting that a constrained parameter space can regularize reward-driven optimization. With only four evaluation prompts, this result should be read as indicative rather than conclusive.