A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance

Link: arxiv, paper, gh-page, code

Abstract

Claim: Define a latent space for stochastic diffusion models (DDPM)
Zero-shot Image Editing with DDPM latents
Guidance with DDPM latents

Deterministic/Stochastic Denoising

Latent Space of Diffusion Models

Deterministic Diffusion Models (DDIM): ODE-based; the latent has the same size as the image
Stochastic Diffusion Models (DDPM): SDE-based; fresh noise is added at every denoising step

In diffusion models, each denoising step follows Equation (12) in Denoising Diffusion Implicit Models:

\[ x_{t-1} = \sqrt{\overline{\alpha}_{t-1}} \underbrace{\left( \frac{x_t-\sqrt{1-\overline{\alpha}_t}\varepsilon_{\theta}^{(t)}(x_t)}{\sqrt{\overline{\alpha}_t}} \right)}_{\text{predicted } x_0} + \underbrace{\sqrt{1-\overline{\alpha}_{t-1}-\sigma_t^2}\,\varepsilon_{\theta}^{(t)}(x_t)}_{\text{direction pointing to } x_t} + \underbrace{\sigma_t \varepsilon_t}_{\text{random noise}} \]

\(\sigma_t^2\) is defined as \(\eta \tilde{\beta}_t\).

When \(\eta=1\), the sampler is DDPM (stochastic)
When \(\eta=0\), the sampler is DDIM (deterministic)

Notations in Lil'Log

\[ \begin{gathered} \overline{\alpha}_t=\prod_{i=1}^t\alpha_i=\prod_{i=1}^t(1-\beta_i)\\ \tilde{\beta}_t=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t \end{gathered} \]
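The update in Equation (12), together with the schedule notation above, can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the linear \(\beta\) schedule, the function names `make_schedule` and `denoise_step`, and the 0-indexed timestep are all assumptions made for the example, and `eps_pred` stands in for the network output \(\varepsilon_{\theta}^{(t)}(x_t)\).

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and the derived quantities."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)                       # \bar{alpha}_t
    alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])  # \bar{alpha}_{t-1}, with \bar{alpha}_0 := 1
    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas
    return betas, alpha_bar, alpha_bar_prev, beta_tilde

def denoise_step(x_t, eps_pred, t, sched, eta, noise):
    """One reverse step of Eq. (12); t is a 0-based index into the schedule."""
    _, alpha_bar, alpha_bar_prev, beta_tilde = sched
    sigma2 = eta * beta_tilde[t]                              # sigma_t^2 = eta * beta_tilde_t
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    direction = np.sqrt(1.0 - alpha_bar_prev[t] - sigma2) * eps_pred
    return np.sqrt(alpha_bar_prev[t]) * x0_pred + direction + np.sqrt(sigma2) * noise
```

With `eta=0.0` the `noise` argument has no effect, which is the deterministic DDIM case; with `eta=1.0` the step injects fresh noise, which is the DDPM case.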

Let \(z\) denote the latent. In DDIM, we have

\[ \begin{gathered} {\color{blue} z := x_T} \sim \mathcal{N}(0, I)\\ x_{t-1} = \mu_T(x_t, t),\quad t=T,\cdots,1 \end{gathered} \]

In DDPM, the paper defines a new latent space by concatenating \(x_T\) with the noise drawn at every step:

\[ \begin{gathered} {\color{blue} z := x_T\oplus \varepsilon_T \oplus \cdots \oplus \varepsilon_1} \sim \mathcal{N}(0, I)\\ x_{t-1} = \mu_T(x_t, t) + \sigma_t\odot{\color{blue} \varepsilon_t},\quad t=T,\cdots,1 \end{gathered} \]
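To make the concatenated latent concrete, here is a small NumPy sketch in which sampling becomes a deterministic function of \(z\). The mean function `mu`, the toy values of `sigmas`, and the sizes `T` and `d` are hypothetical placeholders chosen for illustration; a real model would use the learned \(\mu_T\) and its own noise schedule.

```python
import numpy as np

T, d = 5, 3                           # toy number of steps and data dimension
sigmas = np.linspace(0.1, 0.5, T)     # sigma_1..sigma_T, assumed toy values

def mu(x_t, t):
    """Hypothetical stand-in for the learned mean mu_T(x_t, t)."""
    return 0.9 * x_t

def generate(z):
    """Deterministic map from z = x_T (+) eps_T (+) ... (+) eps_1 to x_0."""
    parts = z.reshape(T + 1, d)
    x = parts[0]                      # first block is x_T
    for i, t in enumerate(range(T, 0, -1)):
        eps_t = parts[1 + i]          # blocks are ordered eps_T, ..., eps_1
        x = mu(x, t) + sigmas[t - 1] * eps_t
    return x

rng = np.random.default_rng(0)
z = rng.standard_normal((T + 1) * d)  # the whole latent is drawn from N(0, I)
x0 = generate(z)
```

The latent here has \((T+1)\cdot d\) entries instead of the \(d\) entries of the DDIM latent, and calling `generate` twice on the same `z` returns the same output.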

Application

Unpaired Image-to-Image Translation

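Forward Process: Input \(x\) (image of a cat), conditioned on the source text \(t\) ("cat")
Reverse Process: Output \(\hat{x}\) (image of a dog), conditioned on the target text \(\hat{t}\) ("dog")

Note that in this section \(t\) and \(\hat{t}\) denote text conditions, not diffusion timesteps.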

\[ {\color{blue} z} \sim \operatorname{DPMEnc}({\color{blue} z}|{\color{red} x},\; G_t),\quad {\color{red} \hat{x}} = G_{\hat{t}}({\color{blue} z}). \]
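The encode-then-decode pattern above can be sketched on a toy model. The sketch assumes the encoder samples \(x_1,\dots,x_T\) with the forward noising chain and then solves each reverse step for the noise that reproduces \(x_{t-1}\); the conditional mean `mu`, the condition vectors `cat` and `dog`, and the schedule values are hypothetical and only serve the illustration.

```python
import numpy as np

T, d = 4, 2
betas = np.array([0.1, 0.2, 0.3, 0.4])     # beta_1..beta_T, toy values
sigmas = np.array([0.2, 0.3, 0.4, 0.5])    # sigma_1..sigma_T, toy values

def mu(x_t, t, cond):
    """Hypothetical conditional mean; the condition simply shifts the output."""
    return 0.8 * x_t + cond

def dpm_encode(x0, cond, rng):
    """Sample x_1..x_T forward, then solve x_{t-1} = mu + sigma_t * eps_t for eps_t."""
    xs = [x0]
    for t in range(1, T + 1):
        noise = rng.standard_normal(d)
        xs.append(np.sqrt(1.0 - betas[t - 1]) * xs[-1] + np.sqrt(betas[t - 1]) * noise)
    eps = [(xs[t - 1] - mu(xs[t], t, cond)) / sigmas[t - 1] for t in range(T, 0, -1)]
    return xs[T], eps                       # x_T and [eps_T, ..., eps_1]

def generate(x_T, eps, cond):
    """Deterministic decoder G_cond(z) for z = (x_T, eps_T, ..., eps_1)."""
    x = x_T
    for t, e in zip(range(T, 0, -1), eps):
        x = mu(x, t, cond) + sigmas[t - 1] * e
    return x

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])                                # toy "source image"
cat, dog = np.array([0.5, 0.0]), np.array([0.0, 0.5])    # toy condition vectors
x_T, eps = dpm_encode(x, cat, rng)
x_rec = generate(x_T, eps, cat)   # same condition: reconstructs x
x_hat = generate(x_T, eps, dog)   # new condition: translated output
```

Decoding with the source condition returns the input up to floating-point error, while decoding the same latent with the target condition yields a different output that shares all the sampled noise.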

Metrics

FID, KID: Distance between generated images and real images
PSNR: Pixel-level quality of the output relative to a reference image
SSIM: Structural similarity between two images

Details in Metrics.

Zero-shot Image Editing

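In DDIM, the noise \(\varepsilon_{\theta}(x_t, t)\) is predicted by the network. In DDPM, by contrast, \(x_{t-1}\) is sampled first, the noise is then computed from it (rearranging the DDPM update gives \(\varepsilon_t = (x_{t-1} - \mu_T(x_t, t)) / \sigma_t\), elementwise), and denoising continues.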

\[ \begin{gathered} \mathcal{S}_{\text{CLIP}}=\cos\langle \operatorname{CLIP}_{\text{img}}(\hat{x}),\; \operatorname{CLIP}_{\text{text}}(\hat{t}) \rangle\\ \mathcal{S}_{\text{D-CLIP}}=\cos\langle \operatorname{CLIP}_{\text{img}}(\hat{x}) - \operatorname{CLIP}_{\text{img}}(x),\; \operatorname{CLIP}_{\text{text}}(\hat{t}) - \operatorname{CLIP}_{\text{text}}(t) \rangle \end{gathered} \]
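Both scores are cosine similarities, so they can be computed directly once the embeddings are available. The sketch below takes precomputed embedding vectors as plain NumPy arrays and does not call any CLIP library; the function names are chosen for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_score(img_hat, txt_hat):
    """S_CLIP: alignment of the edited image with the target text."""
    return cosine(img_hat, txt_hat)

def directional_clip_score(img_hat, img_src, txt_hat, txt_src):
    """S_D-CLIP: alignment of the image change with the text change."""
    return cosine(img_hat - img_src, txt_hat - txt_src)
```

The directional score ignores whatever the source and edited images share and only rewards the part of the image change that moves in the same direction as the text change.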

Metrics

CLIPScore: Similarity between text and image
PSNR: Quality
SSIM: Similarity between real image and generated image

Details in Metrics.

Plug-and-Play Guidance

Given a condition \(\mathcal{C}\), define the guided image distribution as an energy-based model (EBM):

\[ p({\color{red} x}|\mathcal{C})\propto p_{\color{red} x}({\color{red} x})e^{-\lambda E({\color{red} x}|\mathcal{C})} \]

Sampling \({\color{red} x}\sim p_{\color{red} x}({\color{red} x} | \mathcal{C})\) is equivalent to sampling a latent and decoding it:

\[ {\color{blue} z}\sim p_{\color{blue} z}({\color{blue} z} | \mathcal{C}),\quad {\color{red} x}=G({\color{blue} z}) \]

Any model-agnostic sampler can be used to draw \({\color{blue} z}\sim p_{\color{blue} z}({\color{blue} z} | \mathcal{C})\); the authors use Langevin dynamics:

\[ \begin{gathered} {\color{blue} z^{\langle 0 \rangle}}\sim \mathcal{N}(0, I),\quad {\color{blue} z} := {\color{blue} z}^{\langle n \rangle}\\ {\color{blue} z^{\langle k+1 \rangle}} = {\color{blue} z^{\langle k \rangle}} + \frac{\sigma}{2}\nabla_{\color{blue} z}\left( \log p_{\color{blue} z}({\color{blue} z^{\langle k \rangle}}) - E(G({\color{blue} z^{\langle k \rangle}}) | \mathcal{C}) \right) + \sqrt{\sigma}\,\omega^{\langle k \rangle}\\ \omega^{\langle k \rangle}\sim \mathcal{N}(0, I) \end{gathered} \]
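The Langevin update can be tried on a toy problem where the answer is known in closed form. The sketch assumes \(p_z=\mathcal{N}(0, I)\), an identity generator \(G(z)=z\), and a quadratic energy \(E(x|\mathcal{C})=\tfrac{1}{2}\lVert x-c\rVert^2\), so the gradient is written by hand instead of obtained by backpropagation through a diffusion model and CLIP. The defaults `n=200` and `sigma=0.05` mirror the values these notes report; `chains`, `c`, and the function names are hypothetical.

```python
import numpy as np

def langevin_sample(grad_log_target, dim, n=200, sigma=0.05, chains=5000, seed=0):
    """Run `chains` independent Langevin chains for n steps and return the final states."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((chains, dim))            # z^<0> ~ N(0, I)
    for _ in range(n):
        omega = rng.standard_normal((chains, dim))    # omega^<k> ~ N(0, I)
        z = z + 0.5 * sigma * grad_log_target(z) + np.sqrt(sigma) * omega
    return z

c = np.array([2.0, -1.0])                             # toy condition target

def grad_log_target(z):
    """Gradient of log p_z(z) - E(G(z)|C) for the toy setup described above."""
    return -z - (z - c)

samples = langevin_sample(grad_log_target, dim=2)
print(samples.mean(axis=0))   # should be close to c / 2
```

For this toy target the guided distribution is Gaussian with mean \(c/2\) and per-coordinate variance \(1/2\), so the empirical mean of the chains gives a quick sanity check on the step size and number of steps.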

The authors use \(n=200\) and \(\sigma=0.05\), and define a CLIP-based energy \(E_{\text{CLIP}}({\color{red} x}|t)\) for \(E({\color{red} x}|\mathcal{C})\):

\[ E_{\text{CLIP}}({\color{red} x}|t) = \frac{1}{L}\sum_{l=1}^L \left( 1-\cos\langle \operatorname{CLIP}_{\text{img}}(\operatorname{DiffAug}_l({\color{red} x})), \operatorname{CLIP}_{\text{text}}(t) \rangle \right) \]

\(\operatorname{DiffAug}_l\) denotes the \(l\)-th differentiable augmentation, which mitigates the adversarial effect.