Diffusion Models for Generative AI

Subrata Goswami
Feb 27, 2023


Recently, diffusion-based generative networks such as Stable Diffusion, DALL-E 2, Imagen, etc. have garnered a lot of publicity. In this article, I try to cover some of the basics of diffusion models. The fundamentals of diffusion models are steeped in statistics and mathematics. Hence, a Statistics Refresher is included for some of the concepts.

The article is divided into a number of sections — Beginning, Basic Concepts, DULNT and DDPM codes, Derivations of a few Equations in the DDPM Paper, βt Schedule and Training, State of the Art(ish), Statistics Refresher, Glossary, and References. The last sections have more content than needed for understanding diffusion models.

A lot of the content in this article, especially the Statistics Refresher section, was derived or liberally lifted from the excellent books and articles mentioned in the References section.

The equations in the article are written in a number of forms — pseudo-code, Mathcha, etc. The pseudo-code equations are easier to write, but quickly become hard to read. Writing Mathcha equations is somewhat laborious, but the end result is more pleasing.

Beginning:

Generative diffusion models are considered to have started with the following two papers.

  1. Deep Unsupervised Learning using Non-equilibrium Thermodynamics (DULNT) (March 2015)

This is the breakthrough paper with a number of original ideas. At a general level, the authors showed that it is possible to model images as a combination of simple functions, roughly analogous to how very short straight line segments (a linear spline) can approximate a complex curve. This paper is generally regarded as the first that attempted image generation through a diffusion process. The authors used a Markov chain to gradually convert one distribution to another. Some of the basic forward and backward diffusion equations (section 2, Algorithms) are covered in much more detail in this paper. This paper also considers both Gaussian and Binomial noise.

2. Denoising Diffusion Probabilistic Models (DDPM)(June 2020)

This paper extends the previous paper's work; hence we will look at both papers together in more detail here. The advances in the DDPM paper are: simplification of training, simplification of the forward diffusion by keeping the diffusion parameters constant, comparison with auto-regressive decoding, and derivation of a closed-form expression that removes the need for Monte Carlo simulation.

Basic Concepts:

The original notion of the DULNT paper was to fit a probabilistic model to a dataset. Simple probabilistic models are not capable of capturing the essence of complex structured data. On the other hand, flexible and complex probabilistic models are not tractable (e.g. the normalization constant is not integrable). The authors of the DULNT paper devised a novel way to generate tractable yet complex probabilistic models. They used ideas from Simulated Annealing, Annealed Importance Sampling (AIS), Statistical Physics (the Jarzynski Equality), the Fokker-Planck diffusion equation, and the Langevin Equation.

They used a Markov chain to gradually convert one distribution into another. They do this by adding noise (Gaussian or Binomial) to the original image in small steps until the resulting image becomes just noise. This step of adding noise is similar to how noise accumulates in Langevin Dynamics. Each step in the diffusion chain has an analytically tractable probability, hence the full chain is also analytically tractable. The Kolmogorov forward and backward equations show that for many forward diffusion processes, the reverse diffusion processes can be described using the same functional form.

The authors formulate a joint probability distribution over all the steps of the diffusion. To evaluate this intractable joint probability distribution, the authors took cues from AIS: they evaluate the relative probability of the forward and reverse trajectories, integrated over the forward trajectory.

The forward process adds a small amount of Gaussian noise to the original image over a number of time steps until the image becomes almost pure Gaussian noise. The reverse process goes in the other direction to arrive at the original image. The goal is to find the best possible forward (noising) and reverse (de-noising) functions that maximize the fit to the original image starting from the corrupted image (almost pure noise).

Forward diffusion noising ( from Hugging Face annotated_diffusion.ipynb)
Reverse diffusion denoising ( from DDPM)

The original image, x₀, is a sample from the data probability distribution, q(x₀); here the subscript 0 indicates that the distribution is “centered” at x₀. For example, x₀ can be a 3-dimensional tensor of pixel values (in practice, raw pixel values are transformed into other forms).

Forward and reverse diffusion ( from DDPM)

This distribution is then gradually converted into an analytically tractable distribution, such as a Gaussian, in T steps. At each step, a conditional probability, q(x(t)|x(t-1)), is used to generate a noisier image. If the conditional probability is Gaussian, as it commonly is, then the distribution is characterized by just 2 numbers, the mean and the variance, as in the following equation, where βt is a time-step-dependent scalar constant and I is the identity tensor of the same shape as x. βt increases with step t.

When t=1, we have the following equation; q(x₀) ~ x₀, that is, the data distribution is “centered” at (proportional to) x₀.

x₁ is a sample from q(x₁), or q(x₁|x₀), as in the following equation, where ε is a sample from the 0-mean, unit-variance Gaussian.
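As a concrete sketch of this reparameterized sample, x₁ = sqrt(1-β₁) x₀ + sqrt(β₁) ε (this is my own illustration, not the papers' code; the tensor shape and variable names are arbitrary):

import math
import torch

# Illustrative only: one forward noising step, x1 = sqrt(1 - beta1) * x0 + sqrt(beta1) * eps
x0 = torch.rand(3, 64, 64) * 2 - 1        # stand-in "image" scaled to [-1, 1]
beta1 = 1e-4                              # first value of the DDPM linear beta schedule
eps = torch.randn_like(x0)                # epsilon drawn from N(0, I)
x1 = math.sqrt(1.0 - beta1) * x0 + math.sqrt(beta1) * eps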

The final corrupted data, p(xT), after T steps is a Normal/Gaussian distribution with mean 0 and variance 1. In the following equation, xT represents the image tensor at time step T.

The probability of generating the original data through the reverse process is an integration of the joint probability distribution of all the reverse steps with respect to the data at each step.

p(x0, θ) = ∫p(x0,x1,x2,…xT)dx1dx2…dxT = ∫p(x0:T)dx1:T

In the above equation, p(x0:T) is the joint probability distribution of all the steps. x1…xT are the image tensors at those steps, also known as latent variables. θ is the set of parameters that describes the distribution. The above integral is intractable, but is made tractable by the Importance Sampling trick of using a ratio of functions.

p(x0, θ) = ∫ q(x1,…,xT) ( p(x0,…,xT) / q(x1,…,xT) ) dx1 dx2 … dxT

Equation (3) of DDPM, shown below, is a derivation from the above, detailed in DULNT and also in DDPM. The expectation, Eq, is the integral with respect to q above. The KL terms appear after application of Jensen's inequality (e.g. log-expectation to expectation-log) and with most integrals evaluating to 1.

Equations 1,2 and 3 of DDPM paper

Mathematically, the goal then is to maximize the likelihood (or the log likelihood, or equivalently to minimize the negative log likelihood), L, of p(x₀), the prediction or generation, with respect to q(x₀), the original.

L = ∫ln (p(x0)) q(x0) dx0

p(x(t-1)|x(t)) = N( x(t-1); μ_θ(x(t),t), Σ_θ(x(t),t) )

p(x(0:T)) = p(x(T)) Π p(x(t-1)|x(t))

q(x(t)|x(t-1)) = N( x(t); sqrt(1-β(t)) x(t-1), β(t) I )

q(x(1:T)|x0) = Π q(x(t)|x(t-1))

q(x(t)|x0) = N( x(t); sqrt(αbar(t)) x0, (1-αbar(t)) I ) ( see below for the derivation of DDPM Equation 4 )
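The last equation (DDPM Equation 4) allows jumping from x₀ to any x(t) in one shot. A minimal PyTorch sketch, assuming the DDPM linear β schedule and illustrative names of my own:

import torch

# Closed-form forward jump: x_t = sqrt(alphabar_t) * x_0 + sqrt(1 - alphabar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)               # linear beta schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alphabar_t = product over s of (1 - beta_s)

def q_sample(x0, t, noise):
    # jump directly from x0 to the noised image at (0-indexed) step t
    abar = alphas_cumprod[t]
    return torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * noise

x0 = torch.rand(3, 64, 64) * 2 - 1
x500 = q_sample(x0, t=500, noise=torch.randn_like(x0))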

The DULNT paper applies Jensen's inequality to arrive at a lower bound, K, where L ≥ K. K is composed of KL divergence terms and entropies. The expressions that are optimized in the 2 papers are as follows.

Log likelihood and variational lower bound expression in the DULNT paper
Variational lower bound of negative log likelihood (L) in the DDPM paper

The DULNT paper (section 2.2, Reverse Trajectory) mentions that for small β, the reverse of a diffusion process has an identical functional form to the forward one. Hence a Gaussian in the forward process implies a Gaussian in the reverse process.

The DDPM paper keeps the βt's constant in the forward trajectory, whereas the DULNT paper learns the βt's by gradient ascent on K.

Both the DULNT and the DDPM papers optimize random terms of L through SGD (see the later section on training for an explanation of this). The DDPM paper, in Section 3, goes through each of the 3 terms above. LT is a constant with respect to θ, and L0 is a discrete decoder that constrains the distribution to ±1/255 of the mean, so that the mean remains within 1 step of pixel intensity. In a simplified version of L, L_SIMPLE, the authors replaced the discrete L0 decoder with a Gaussian, and also did away with the time-dependent multiplication factor.
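A minimal sketch of L_SIMPLE, assuming a noise-predicting network model(x_noisy, t) and the alphas_cumprod tensor from the sketch above (the names are mine, not the papers'): noise a clean batch with the closed-form jump, ask the network for the noise it thinks was added, and take a plain MSE, dropping the time-dependent weighting of the full variational bound.

import torch
import torch.nn.functional as F

def p_losses(model, x0, t, alphas_cumprod):
    # x0: (batch, C, H, W) clean images in [-1, 1]; t: (batch,) integer timesteps
    noise = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_noisy = torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * noise
    predicted_noise = model(x_noisy, t)           # epsilon_theta(x_t, t)
    return F.mse_loss(predicted_noise, noise)     # L_SIMPLE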

The DULNT paper uses a custom convolution network. The DDPM paper uses a UNet (PixelCNN++) type of network. The input data comes from images, and the output is also an image. The network outputs the mean (or the mean of the noise) at every pixel location, and hence retains the tensor shape of the image. The UNet model holds the means of all the reverse-direction steps in its architecture. Hence, for the network to distinguish each step, it needs to be fed the time-step information in some form. A consequence of this is that parameters are shared across time, significantly reducing the size of the network and the compute needed. The DDPM paper embeds time into the model through a sinusoidal positional embedding as used in Transformers.
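A sketch of such a sinusoidal embedding of the step index t, similar in spirit to the SinusoidalPositionEmbeddings module in the annotated-diffusion code (this is my own simplified version, not a copy):

import math
import torch

def timestep_embedding(t, dim):
    # t: (batch,) tensor of integer diffusion steps; returns a (batch, dim) embedding
    # built from sines and cosines at geometrically spaced frequencies
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]), dim=128)   # shape (3, 128)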

The following shows the training and sampling algorithms of the DDPM paper.

The training and sampling algorithms of the DDPM paper.

The training algorithm in the DULNT paper looks similar to Algorithm 1 of the DDPM paper. The DULNT paper handles the time dependency with a sequence of time-dependent bump functions applied to the output of the convolution network.

In both papers, the training algorithm at each iteration is run at a random diffusion time step. Sampling is done by running the reverse diffusion through all the time steps, one after another.
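Putting the pieces together, a rough sketch of Algorithm 1 of the DDPM paper (training), reusing the illustrative p_losses and alphas_cumprod from the earlier sketches and assuming a dataloader of clean images scaled to [-1, 1]:

import torch

def train(model, dataloader, alphas_cumprod, T=1000, epochs=1, lr=1e-3):
    # each iteration draws one random timestep per image, computes the simplified
    # loss, and takes a gradient step
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x0 in dataloader:                          # clean images scaled to [-1, 1]
            t = torch.randint(0, T, (x0.shape[0],))    # a random diffusion step per sample
            loss = p_losses(model, x0, t, alphas_cumprod)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()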

DULNT and DDPM codes :

An annotated implementation of the DDPM paper in PyTorch is available from Hugging Face (URL in the References section). The following shows a summary of the UNet model used for denoising, followed by the sampling function. These are the 2 most significant parts of the code, from which the loss is calculated for convergence with an Adam optimizer.

================================================================================
Layer (type:depth-idx) Param #
================================================================================
Unet --
├─Conv2d: 1-1 900
├─Sequential: 1-2 --
│ └─SinusoidalPositionEmbeddings: 2-1 --
│ └─Linear: 2-2 3,248
│ └─GELU: 2-3 --
│ └─Linear: 2-4 12,656
├─ModuleList: 1-3 --
│ └─ModuleList: 2-5 --
│ │ └─ConvNextBlock: 3-1 26,882
│ │ └─ConvNextBlock: 3-2 33,040
│ │ └─Residual: 3-3 14,476
│ │ └─Conv2d: 3-4 12,572
│ └─ModuleList: 2-6 --
│ │ └─ConvNextBlock: 3-5 91,308
│ │ └─ConvNextBlock: 3-6 122,528
│ │ └─Residual: 3-7 28,952
│ │ └─Conv2d: 3-8 50,232
│ └─ModuleList: 2-7 --
│ │ └─ConvNextBlock: 3-9 355,096
│ │ └─ConvNextBlock: 3-10 470,848
│ │ └─Residual: 3-11 57,904
│ │ └─Identity: 3-12 --
├─ModuleList: 1-4 --
│ └─ModuleList: 2-8 --
│ │ └─ConvNextBlock: 3-13 332,192
│ │ └─ConvNextBlock: 3-14 122,528
│ │ └─Residual: 3-15 28,952
│ │ └─ConvTranspose2d: 3-16 50,232
│ └─ModuleList: 2-9 --
│ │ └─ConvNextBlock: 3-17 92,400
│ │ └─ConvNextBlock: 3-18 33,040
│ │ └─Residual: 3-19 14,476
│ │ └─ConvTranspose2d: 3-20 12,572
├─ConvNextBlock: 1-5 --
│ └─Sequential: 2-10 --
│ │ └─GELU: 3-21 --
│ │ └─Linear: 3-22 12,656
│ └─Conv2d: 2-11 5,600
│ └─Sequential: 2-12 --
│ │ └─GroupNorm: 3-23 224
│ │ └─Conv2d: 3-24 226,016
│ │ └─GELU: 3-25 --
│ │ └─GroupNorm: 3-26 448
│ │ └─Conv2d: 3-27 225,904
│ └─Identity: 2-13 --
├─Residual: 1-6 --
│ └─PreNorm: 2-14 --
│ │ └─Attention: 3-28 57,456
│ │ └─GroupNorm: 3-29 224
├─ConvNextBlock: 1-7 --
│ └─Sequential: 2-15 --
│ │ └─GELU: 3-30 --
│ │ └─Linear: 3-31 12,656
│ └─Conv2d: 2-16 5,600
│ └─Sequential: 2-17 --
│ │ └─GroupNorm: 3-32 224
│ │ └─Conv2d: 3-33 226,016
│ │ └─GELU: 3-34 --
│ │ └─GroupNorm: 3-35 448
│ │ └─Conv2d: 3-36 225,904
│ └─Identity: 2-18 --
├─Sequential: 1-8 --
│ └─ConvNextBlock: 2-19 --
│ │ └─Conv2d: 3-37 1,400
│ │ └─Sequential: 3-38 28,476
│ │ └─Identity: 3-39 --
│ └─Conv2d: 2-20 29
================================================================================
Total params: 2,996,315
Trainable params: 2,996,315
Non-trainable params: 0
================================================================================

The sampling code implements Algorithm 2 of the paper.
@torch.no_grad()
def p_sample(model, x, t, t_index):
    betas_t = extract(betas, t, x.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        sqrt_one_minus_alphas_cumprod, t, x.shape
    )
    sqrt_recip_alphas_t = extract(sqrt_recip_alphas, t, x.shape)

    # Equation 11 in the paper
    # Use our model (noise predictor) to predict the mean
    model_mean = sqrt_recip_alphas_t * (
        x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t
    )

    if t_index == 0:
        return model_mean
    else:
        posterior_variance_t = extract(posterior_variance, t, x.shape)
        noise = torch.randn_like(x)
        # Algorithm 2 line 4:
        return model_mean + torch.sqrt(posterior_variance_t) * noise
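The notebook drives p_sample with an outer loop that starts from pure noise and walks the timesteps backwards; a simplified paraphrase (not a verbatim copy of the notebook's p_sample_loop) is shown below.

import torch

@torch.no_grad()
def p_sample_loop(model, shape, timesteps):
    # Algorithm 2: start from pure Gaussian noise and denoise one step at a time
    device = next(model.parameters()).device
    img = torch.randn(shape, device=device)
    for i in reversed(range(timesteps)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        img = p_sample(model, img, t, i)
    return img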

The DULNT paper has an implementation in Theano (URL in the References section). Although it was written in 2015, and not in TensorFlow or PyTorch, it is not hard to follow the code. The network is called MLP, and the variable names match the names in the paper. The following is a pseudo call stack of the code.

# in model.py
generate_forward_diffusion_sample(self, X_noiseless):
    # choose a timestep in [1, self.trajectory_length-1].

cost(self, X_noiseless):
    cost_single_t(self, X_noiseless):
        X_noisy, t, mu_posterior, sigma_posterior = \
            self.generate_forward_diffusion_sample(X_noiseless)
        mu, sigma = self.get_mu_sigma(X_noisy, t)
        negL_bound = self.get_negL_bound(mu, sigma, mu_posterior, sigma_posterior)

# in train.py
## initialize the model
dpm = model.DiffusionModel(spatial_width, n_colors, uniform_noise=uniform_noise, **model_args)
dpm.initialize()

## set up optimization
features = T.matrix('features', dtype=theano.config.floatX)
cost = dpm.cost(features)
blocks_model = blocks.model.Model(cost)

Derivations of a few equations in the DDPM paper :

Equation 3 of DDPM:

The derivation parallels equations 10-14 in the DULNT paper and equations 17-26 in the DDPM paper. Both papers leave out some details. DDPM does not explicitly show the Jensen's Inequality step; the DULNT paper shows that in equations 11-12. Neither paper explicitly shows some integrals evaluating to 1. The KL terms result from the application of these tricks.

Equation 4 of DDPM :

q(x(1:t)|x0) = Π q(x(s)|x(s-1)) , s = 1 … t

x(1) = sqrt(1-β(1)) x(0) + sqrt(β(1)) ε(1) , where ε(1) is a sample from N(0,I)

x(2) = sqrt(1-β(2)) [ sqrt(1-β(1)) x(0) + sqrt(β(1)) ε(1) ] + sqrt(β(2)) ε(2)

x(2) = sqrt( (1-β(1)) (1-β(2)) ) x(0) + sqrt(1-β(2)) sqrt(β(1)) ε(1) + sqrt(β(2)) ε(2)

Invoking the variance law for a sum of independent Gaussian random variables on ε(1) and ε(2) (the variances add):

x(2) = sqrt( (1-β(1)) (1-β(2)) ) x(0) + sqrt( (1-β(2)) β(1) + β(2) ) ε

x(2) = sqrt( (1-β(1)) (1-β(2)) ) x(0) + sqrt( 1 - (1-β(2)) (1-β(1)) ) ε

… and so on till t, which gives the following.

q(x(t)|x0) = N( x(t); sqrt(αbar(t)) x0, (1-αbar(t)) I )

where αbar(t) = (1-β(1)) (1-β(2)) … (1-β(t)) = Π(1-β(s))
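A quick numeric sanity check of this closed form (my own, not from the papers): noise a scalar x₀ for t steps iteratively, many times in parallel, and compare the empirical mean and variance with sqrt(αbar(t)) x₀ and 1-αbar(t).

import torch

betas = torch.linspace(1e-4, 2e-2, 1000)
t, x0, n = 100, 1.0, 100_000
x = torch.full((n,), x0)
for beta in betas[:t]:                                   # iterate the single-step update
    x = torch.sqrt(1.0 - beta) * x + torch.sqrt(beta) * torch.randn(n)
alphabar_t = torch.prod(1.0 - betas[:t])
print(x.mean().item(), torch.sqrt(alphabar_t).item() * x0)   # empirical vs analytic mean
print(x.var().item(), (1.0 - alphabar_t).item())             # empirical vs analytic variance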

Equation 8 of DDPM :

It is a straightforward application of the KL divergence between 2 Gaussians. See KL Divergence in the Statistics Refresher below.

βt Schedule and Training:

There appear to be a number of reasons for choosing sqrt(1-βt) to scale the mean. The first is the closed-form expression obtained above. The second is to gradually move the mean toward 0; setting it to 0 removes any history of the original pixel value. However, diffusion models scale βt between 1e-4 and 2e-2 and diffuse for a fixed number of steps (e.g. 1000). Hence the mean does not get scaled to 0, but to a small number (e.g. 6.4e-3). This scaling is also deterministic. The noise term that is added is random. As a result the noised image is random, but likely still retains a vestige of the original picture. The starting image, called x_start in the code, has values ranging between -1 and 1 for each element. The noise, which is a tensor of the same shape as the image, varies between -∞ and +∞ with a standard deviation of 1 before multiplication by the scaling term (Equation 4 of DDPM).
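The following small sketch (illustrative, using the same linear schedule as above) computes the schedule and confirms the residual scaling of roughly 6e-3 after 1000 steps.

import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)            # DDPM linear schedule
alphabar = torch.cumprod(1.0 - betas, dim=0)
print(torch.sqrt(alphabar[-1]).item())           # ~0.006: residual weight of x_0 in x_T
print((1.0 - alphabar[-1]).item())               # ~1.0: variance of the accumulated noise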

As mentioned above, both the DULNT and the DDPM papers optimize random terms of L (see the log likelihood and variational lower bound expressions above) through SGD. The rationale for this is that a sum of squared values is minimized when each value is minimized. This also makes training simpler and less compute-intensive. Another reason for optimizing individual terms of L is that the network also learns the individual steps, thus preserving the principles of AIS.

State of the Art(ish):

  1. High-Resolution Image Synthesis with Latent Diffusion Models (LDM):
LDM block diagram ( from LDM)

This paper introduced new features and changes to the base diffusion model, namely the following.

i) An encoder to generate features from image pixels. The diffusion model works in this feature space. This reduces the compute required and hence makes higher-resolution image generation practical. The encoder from a VQVAE is used to produce the features.

The following figure shows the high-level block diagram of a VQGAN (Vector Quantized GAN). It essentially has 4 parts — a CNN encoder and decoder, a codebook, and a transformer. The CNN encoder produces features of the input image. The codebook further simplifies the features into a fixed number of buckets, hence the name quantization. The transformer learns the relationship between the quantized features. The codebook size is denoted by Z, the encoder down-sampling factor of the input image by f, and the embedding dimension by d.

VQGAN ( from Taming Transformers )

ii) A conditioning mechanism (e.g. a text prompt) to generate a specific category of images.

Latent Diffusion code :

Unconditional Model ( without prompt) :

The high-level torchinfo summary of the CelebAHQ model with VQVAE (f=4, VQ (Z=8192, d=3)), corresponding to the config celebahq-ldm-vq-4.yaml, shows the main parts depicted in the LDM architecture picture shown earlier: the VQVAE and the UNet denoiser, without any conditioner. The VQVAE does not appear to include the transformer of a VQGAN for this specific model.

=====================================================================================                                                                                                                      
Layer (type:depth-idx) Param #
=====================================================================================
LatentDiffusion --
├─DiffusionWrapper: 1-1 --
│ └─UNetModel: 2-1 --
│ │ └─Sequential: 3-1 1,005,312
│ │ └─ModuleList: 3-2 73,170,272
│ │ └─TimestepEmbedSequential: 3-3 33,736,192
│ │ └─ModuleList: 3-4 166,137,888
│ │ └─Sequential: 3-5 6,499
├─LitEma: 1-2 --
├─VQModelInterface: 1-3 --
│ └─Encoder: 2-2 --
│ │ └─Conv2d: 3-6 (3,584)
│ │ └─ModuleList: 3-7 (11,824,000)
│ │ └─Module: 3-8 (10,494,976)
│ │ └─GroupNorm: 3-9 (1,024)
│ │ └─Conv2d: 3-10 (13,827)
│ └─Decoder: 2-3 --
│ │ └─Conv2d: 3-11 (14,336)
│ │ └─Module: 3-12 (10,494,976)
│ │ └─ModuleList: 3-13 (22,447,744)
│ │ └─GroupNorm: 3-14 (256)
│ │ └─Conv2d: 3-15 (3,459)
│ └─Identity: 2-4 --
│ └─VectorQuantizer2: 2-5 --
│ │ └─Embedding: 3-16 (24,576)
│ └─Conv2d: 2-6 (12)
│ └─Conv2d: 2-7 (12)
=====================================================================================
Total params: 329,378,945
Trainable params: 274,056,163
Non-trainable params: 55,322,782
=====================================================================================


| Name | Type | Params
-------------------------------------------------------
0 | model | DiffusionWrapper | 274 M
1 | model_ema | LitEma | 0
2 | first_stage_model | VQModelInterface | 55.3 M
-------------------------------------------------------
274 M Trainable params
55.3 M Non-trainable params
329 M Total params
1,317.516 Total estimated model params size (MB)

In the above summary, ModuleList 3-2 and 3-4 are the down-sampling and up-sampling parts of the UNet respectively. The following model summary shows one TimestepEmbedSequential layer in more detail. This particular one contains the first attention block of the UNet decoder.

    │    │    └─TimestepEmbedSequential: 4-27               --
│ │ │ │ └─ResBlock: 5-53 --
│ │ │ │ │ └─Sequential: 6-176 --
│ │ │ │ │ │ └─GroupNorm32: 7-145 1,344
│ │ │ │ │ │ └─SiLU: 7-146 --
│ │ │ │ │ │ └─Conv2d: 7-147 2,709,952
│ │ │ │ │ └─Identity: 6-177 --
│ │ │ │ │ └─Identity: 6-178 --
│ │ │ │ │ └─Sequential: 6-179 --
│ │ │ │ │ │ └─SiLU: 7-148 --
│ │ │ │ │ │ └─Linear: 7-149 401,856
│ │ │ │ │ └─Sequential: 6-180 --
│ │ │ │ │ │ └─GroupNorm32: 7-150 896
│ │ │ │ │ │ └─SiLU: 7-151 --
│ │ │ │ │ │ └─Dropout: 7-152 --
│ │ │ │ │ │ └─Conv2d: 7-153 1,806,784
│ │ │ │ │ └─Conv2d: 6-181 301,504
│ │ │ │ └─AttentionBlock: 5-54 --
│ │ │ │ │ └─GroupNorm32: 6-182 896
│ │ │ │ │ └─Conv1d: 6-183 603,456
│ │ │ │ │ └─QKVAttentionLegacy: 6-184 --
│ │ │ │ │ └─Conv1d: 6-185 201,152
│ │ │ │ └─Upsample: 5-55 --
│ │ │ │ │ └─Conv2d: 6-186 1,806,784

Conditional Model ( with prompt) :

The configs/latent-diffusion/cin-ldm-vq-f8.yaml config in the git repo uses a simple class embedder and has the following parts and parameters.

===============================================================================================
Layer (type:depth-idx) Param #
===============================================================================================
LatentDiffusion --
├─DiffusionWrapper: 1-1 --
│ └─UNetModel: 2-1 --
│ │ └─Sequential: 3-1 1,312,768
│ │ └─ModuleList: 3-2 109,752,320
│ │ └─TimestepEmbedSequential: 3-3 61,901,824
│ │ └─ModuleList: 3-4 222,007,552
│ │ └─Sequential: 3-5 9,732
├─LitEma: 1-2 --
├─VQModelInterface: 1-3 --
│ └─Encoder: 2-2 --
│ │ └─Conv2d: 3-6 (3,584)
│ │ └─ModuleList: 3-7 (16,879,744)
│ │ └─Module: 3-8 (10,494,976)
│ │ └─GroupNorm: 3-9 (1,024)
│ │ └─Conv2d: 3-10 (18,436)
│ └─Decoder: 2-3 --
│ │ └─Conv2d: 3-11 (18,944)
│ │ └─Module: 3-12 (10,494,976)
│ │ └─ModuleList: 3-13 (29,736,320)
│ │ └─GroupNorm: 3-14 (256)
│ │ └─Conv2d: 3-15 (3,459)
│ └─Identity: 2-4 --
│ └─VectorQuantizer2: 2-5 --
│ │ └─Embedding: 3-16 (65,536)
│ └─Conv2d: 2-6 (20)
│ └─Conv2d: 2-7 (20)
├─ClassEmbedder: 1-4 --
│ └─Embedding: 2-8 512,000
===============================================================================================
Total params: 463,213,491
Trainable params: 395,496,196
Non-trainable params: 67,717,295
===============================================================================================

LatentDiffusion: Also optimizing conditioner params!

| Name | Type | Params
-------------------------------------------------------
0 | model | DiffusionWrapper | 394 M
1 | model_ema | LitEma | 0
2 | first_stage_model | VQModelInterface | 67.7 M
3 | cond_stage_model | ClassEmbedder | 512 K
-------------------------------------------------------
395 M Trainable params
67.7 M Non-trainable params
463 M Total params
1,852.854 Total estimated model params size (MB)

The DDPM part of the code, in ddpm.py, is somewhat similar to the code for the original DDPM paper. The file defines 3 classes — DDPM(pl.LightningModule), LatentDiffusion(DDPM), and DiffusionWrapper(pl.LightningModule). As in the original DDPM paper, in the linear schedule βt is varied between 1e-4 and 2e-2.

The input to the PyTorch Lightning fit callback method on_train_batch_start has 5 more components in addition to the image, as shown in the following trace. The image at this point has a shape of [1, 256, 256, 3] for batch size 1. The class_label and human_label are the corresponding ImageNet class index and name for that image, 992 and agaric in this case. The image gets encoded to [1, 4, 32, 32] by the method first_stage_model.encode(x). The condition string gets encoded by the method get_learned_conditioning() to a tensor of shape [1, 1, 512].

latent-diffusion/main.py(749)<module>()
-> trainer.fit(model, data)
-> self.trainer._call_callback_hooks("on_train_batch_start", batch, batch_idx)
lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(1597)_call_callback_hooks()
-> fn(self, self.lightning_module, *args, **kwargs)
> latent-diffusion/main.py(429)on_train_batch_start()
-> print("Batch train starting")

Input data:
p batch.keys()
dict_keys(['image', 'relpath', 'synsets', 'class_label', 'human_label', 'file_path_'])
p batch['image'].shape, batch['class_label'], batch['human_label']
(torch.Size([1, 256, 256, 3]), tensor([992], device='cuda:0'), ['agaric'])

#.............................

latent-diffusion/main.py(749)<module>()
-> trainer.fit(model, data)
latent-diffusion/ldm/models/diffusion/ddpm.py(864)encode_first_stage()
-> return self.first_stage_model.encode(x)

p type(self)
<class 'ldm.models.diffusion.ddpm.LatentDiffusion'>

#..............................................

lib/python3.10/site-packages/torch/nn/modules/module.py(1110)_call_impl()
-> return forward_call(*input, **kwargs)
latent-diffusion/ldm/models/diffusion/ddpm.py(876)forward()
-> c = self.get_learned_conditioning(c)
latent-diffusion/ldm/models/diffusion/ddpm.py(563)get_learned_conditioning()
-> return c

#before the call to get_learned_conditioning
p c.keys()
dict_keys(['image', 'relpath', 'synsets', 'class_label', 'human_label', 'file_path_'])

#after the call to get_learned_conditioning
p c.shape
torch.Size([1, 1, 512])

For training of a conditional network (e.g. with a text prompt), usually both the VQVAE and the text embedder (e.g. BERT, CLIP, etc.) are pre-trained. However, the ClassEmbedder, as it has only 512K parameters, is trained together with the UNet. The UNet gets 2 types of inputs simultaneously — the image and the corresponding textual description in embedded form. The embedded text can also be pre-generated.

The following stack trace shows the forward-call phase of the UNet model. The input to the UNet model has 3 parts — x (the noised image tensor), t (the diffusion step), and context/cc (the conditioning for cross-attention). The shapes are [1, 4, 32, 32], [1], and [1, 1, 512] for x, t, and cc respectively.

  latent-diffusion/ldm/models/diffusion/ddpm.py(880)forward()
-> return self.p_losses(x, c, t, *args, **kwargs)
latent-diffusion/ldm/models/diffusion/ddpm.py(1016)p_losses()
-> model_output = self.apply_model(x_noisy, t, cond)
latent-diffusion/ldm/models/diffusion/ddpm.py(988)apply_model()
-> x_recon = self.model(x_noisy, t, **cond)
lib/python3.10/site-packages/torch/nn/modules/module.py(1110)_call_impl()
-> return forward_call(*input, **kwargs)
> latent-diffusion/ldm/models/diffusion/ddpm.py(1411)forward()
-> out = self.diffusion_model(x, t, context=cc)

x.shape, t.shape, cc.shape
(torch.Size([1, 4, 32, 32]), torch.Size([1]), torch.Size([1, 1, 512]))


type(self.diffusion_model)
<class 'ldm.modules.diffusionmodules.openaimodel.UNetModel'>

A TensorFlow Keras version is also available on Kaggle (see the reference below) with succinct descriptions of training on the Laion 6.5+ dataset.

2. Stable Diffusion (SDM) —

Stable Diffusion v1 is a specific configuration of the LDM architecture that uses a down-sampling-factor-8 autoencoder with an 860M-parameter UNet and a CLIP ViT-L/14 text encoder. The model was pre-trained on 256x256 images and then fine-tuned on 512x512 images.

Statistics Refresher:

The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability is the probability that an event occurs given that another specific event has already occurred; the calculation for one variable depends on another variable. The marginal probability is the sum, over all outcomes of the other event, of the joint probabilities of the two events. The name marginal originates from writing down this sum in the margin of a row or column of a table of probabilities.

A Probability Space (Ω, 𝓕, P) is composed of 3 parts — the sample space of all possible outcomes; a set of events, where an event is a subset of the sample space; and a probability function that assigns a probability to each event.

A σ-algebra is a family of subsets, F, of Ω with the properties: Ø∈F and Ω∈F; if A∈F, then (Ω−A)∈F; and F is closed under countable unions and intersections, that is, if A1, A2, … are events in F, then the union of the As and the intersection of the As are both in F.

Filtration is a family or sequence of σ-algebras that is increasing, that is, if s ≤ t , then Fs is a subset of Ft .

A Random variable, X(w), is a single-valued real function that assigns a real number, called the value of X(w), to each sample point w∈Ω, where Ω is the sample space. Thus a random variable is a mapping of the sample space onto the real line.

A Random Process or Stochastic Process is a family of random variables defined over a given probability space and indexed by the time parameter t. A stochastic process becomes a random variable when time is fixed at some particular value.

Given a probability space (Ω, B, P), a random variable is a measurable map

X: Ω → R

A stochastic process is a family of random variables indexed by T, where T is a discrete set or a continuum of times.

X: Ω × T → R

A Martingale is a stochastic process where all of its random variables have finite means, and the best prediction of its future value is its current value.

A stochastic process [X(t), t ∈ T] is called a first-order Markov process if for any t0 < t1 < … < tn-1 < tn, the conditional CDF of X(tn) depends only on X(tn-1). This means that the future state depends only on the present and is independent of the past. A Markov process can be either a discrete-state or a continuous-state Markov process. A discrete-state Markov process is also called a Markov chain.

Brownian Motion, or random walk, can be thought of as a continuous-time process in which, over every infinitely small time interval, the entity moves one “step” in a certain direction. Brownian Motion is sometimes also called a Wiener Process; Norbert Wiener first laid its mathematical foundation. Brownian motion is also a stochastic process that models random continuous motion. The following expresses classical Brownian Motion algebraically. The equations intuitively imply that as time goes on, the particle has a higher probability of being farther away, as the variance increases with time.

W(t) ~ N(0, σ²t)

B(t) = W(t)/σ , E(B(t)) = 0, Var(B(t)) = t

=> B(t) ~ N(0, t)
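A small simulation (illustrative only) confirms that the variance of B(t) grows linearly with t.

import numpy as np

rng = np.random.default_rng(0)
dt, steps, paths = 0.01, 1000, 20_000
increments = rng.normal(0.0, np.sqrt(dt), size=(paths, steps))  # independent N(0, dt) steps
B = increments.cumsum(axis=1)                                   # B(t) sampled on a time grid
print(B[:, -1].var(), steps * dt)                               # empirical variance vs t = 10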

A Diffusion process is a Markov chain that gradually adds noise to the data in the forward direction, the opposite direction of sampling, until the signal is destroyed. Diffusion processes are continuous-time, continuous-state processes whose sample paths are everywhere continuous but nowhere differentiable. Diffusion can be considered as the spontaneous “spreading” of particles from a region of high concentration to a region of lower concentration. Brownian motion is a diffusion process on the interval (-∞, +∞) with 0 mean and constant variance. The forward diffusion process provides the probability of a state at some future time. The backward diffusion provides the probability of a state in the past.

The Diffusion Equation, also called the Fokker-Planck equation, can be derived from a random walk. In 1 dimension (x) it is as follows, where D₀ is the diffusion coefficient and μ is the drift.

When the drift μ is 0, the Diffusion Equation reduces to the Heat Equation.

The Langevin Equation (LE) describes the evolution of a system that is subjected to random forces. The LE for Brownian motion starts with the basic force equation, which has a velocity-based drag and a noise term, ma = -λv + N(0,σ²). The Fokker-Planck equation can be derived from the Langevin equation; this shows how to define a Gaussian diffusion process that has any target distribution as its equilibrium.

Stochastic gradient Langevin dynamics (Welling & Teh 2011) can produce samples from a probability density using only its gradients in a Markov chain of updates. Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
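A minimal sketch of (unadjusted) Langevin dynamics sampling, using a 1-D standard Gaussian as the target so that grad log p(x) = -x (the step size and iteration counts are arbitrary choices of mine):

import numpy as np

def grad_log_p(x):
    return -x                        # score of a standard Gaussian target

rng = np.random.default_rng(0)
x, step, samples = 5.0, 0.1, []
for _ in range(100_000):
    # move along the score and inject Gaussian noise
    x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.normal()
    samples.append(x)
print(np.mean(samples[1000:]), np.var(samples[1000:]))   # roughly 0 and roughly 1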

A Stochastic Differential Equation (SDE) can be considered an Ordinary Differential Equation (ODE) with a noise term. ODEs are deterministic descriptions of a dynamical (a function of time or another variable) system. An SDE can be viewed as a description of the dynamics of a stochastic process.

Geometric Brownian Motion (GBM) — GBM is a continuous-time stochastic process in which the logarithm of a randomly varying parameter ( Y(t)) follows a Brownian motion (B(t)).

log(Y(t)) = X(t) = μ*t+ σ*B(t),

Latent variable is a random variable that is not observable directly.

In Bayesian statistics, the conditional probability p(z|x) of random variable z given random variable x is written as follows. The joint probability is p(z,x), and p(x) is the marginal (unconditional) probability of the observable variable x, also called the evidence. Inference in a Bayesian model amounts to computing the posterior p(z|x).

p(z|x) = p(z,x)/p(x) , p(x) = ∫ p(z,x) dz

Probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes.

A Likelihood function of variables X1, X2, … and parameter θ is defined as follows.

L(θ) = L(θ | X1, X2, …) = p(X1, X2, … ; θ)

The value of the parameter of a distribution function that maximizes the probability of observing the data is called the maximum-likelihood estimate. The log likelihood is the natural log (ln) of the likelihood function. It is often easier to work with the log (as likelihoods mostly involve products of probabilities), and since ln is monotonic, the maxima of the likelihood and the log likelihood coincide. Maximization of the log likelihood is the same as minimization of the negative log likelihood.

The Kullback-Leibler (KL) divergence is a non-symmetric (the KL from p to q can be different from that from q to p) measure of the difference between two probability distributions p(x) and q(x), and is defined as follows for discrete distributions. It is closely related to relative entropy, information divergence, and information for discrimination. In the case of continuous variables, the summation (Σ) over x is replaced by an integral (∫). The divergence between 2 Gaussians can be derived by just plugging in the expression and performing plain algebraic reduction.

Kullback-Leibler (KL) divergence
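For the 1-D case, the closed form (the identity used in DDPM Equation 8) can be written and checked directly; a small sketch of my own:

import numpy as np

# KL( N(mu1, s1^2) || N(mu2, s2^2) ) = log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2
def kl_gaussians(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gaussians(0.0, 1.0, 0.0, 1.0))   # 0.0 for identical distributions
print(kl_gaussians(0.0, 1.0, 1.0, 1.0))   # 0.5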

Entropy — the statistical definition of entropy (Gibbs) is as follows. If the p's are equal, Boltzmann's entropy results. Entropy is a measure of disorder: the higher the entropy, the higher the disorder.

H = -k Σp ln(p)

Information Entropy or Shannon Entropy is defined as follows.

H = minimal #bits = -Σ p log2(p)

Variational Inference — the goal of variational inference is to approximate the conditional density of latent variables given observed variables. A family of densities over the latent variables is used, parameterized by free “variational parameters”. The optimization finds the member of this family (i.e. the setting of the parameters) that is closest in KL divergence to the conditional of interest. The fitted variational density then serves as a proxy for the exact conditional density. MCMC (Markov Chain Monte Carlo) and variational inference (VI) are different approaches to solving the same problem. MCMC algorithms sample a Markov chain; variational algorithms solve an optimization problem. MCMC algorithms approximate the posterior with samples from the chain; variational algorithms approximate the posterior with the result of the optimization.

Evidence Lower Bound (ELBO) — in variational Bayesian methods, the ELBO (also sometimes called the variational lower bound or the negative variational free energy) is a lower bound on the log-likelihood of the observed data. It is derived by employing Jensen's Inequality, which exchanges log-expectation for expectation-log as follows; the integral is replaced with an expectation (Eq) with respect to q.

p(x) = ∫ p(x,z) dz = ∫ p(x|z) p(z) dz = ∫ p(x|z) (p(z)/q(z)) q(z) dz

ln p(x) = ln ∫ p(x|z) (p(z)/q(z)) q(z) dz = ln Eq( p(x|z) p(z)/q(z) ) ≥ Eq( ln p(x|z) ) - KL( q(z) || p(z) )

Monte Carlo (MC) — MC methods essentially draw samples from a distribution and then use those sample values to compute quantities such as integrals, expected values, variances, etc. The samples are independent and drawn from the same distribution — independent and identically distributed (IID).

Markov Chain Monte Carlo (MCMC) — a Markov Chain is a sequence of random variables, where the value of each variable depends only on the previous variable. It is memoryless, as the future value depends only on the current state.

Metropolis-Hastings (MH) — a way of drawing samples from a target distribution. It handles distributions that are known only up to a constant; in other words, the normalizing constant or partition function is not calculable. In that scenario, it is not possible to sample directly from the distribution function. MH constructs a Markov Chain such that the probability of choosing the next state is based on the ratio of the probabilities of the generated proposal state (not yet the next state) and the current state. In the following equation, x(t+1) is accepted with probability a. Metropolis first came up with the idea, which Hastings later extended.

a = min( p(x(t+1)) / p(x(t)), 1 )
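A minimal sketch of MH with a symmetric random-walk proposal, targeting an unnormalized standard Gaussian (illustrative, not from any of the referenced codebases):

import numpy as np

def p_tilde(x):
    return np.exp(-0.5 * x**2)       # unnormalized target; constant deliberately omitted

rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(0.0, 1.0)               # symmetric random-walk proposal
    a = min(p_tilde(proposal) / p_tilde(x), 1.0)      # acceptance probability
    if rng.uniform() < a:
        x = proposal
    samples.append(x)
print(np.mean(samples), np.var(samples))              # roughly 0 and roughly 1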

Gibbs Sampling is a type of Markov Chain sampler. Gibbs sampling starts with the joint probability distribution function of many random variables and random initial values of those variables. Then, for the next state, each variable is sampled in turn from its conditional distribution with all the other variables held constant. It is a special case of MH that accepts every proposed state unconditionally. It is helpful in handling high-dimensional (e.g. many-variable) functions. Also, these conditional distributions have to be known exactly, not up to a constant as in MH.

Importance Sampling (IS) — often, in many areas of science and mathematics, a function (e.g. a probability density function) needs to be integrated, for example to normalize it. This integration can be analytically intractable. Monte Carlo methods (e.g. sampling) are used to perform these types of integration numerically and approximately. IS is a collection of Monte Carlo methods where the mathematical expectation (E) of a function h(y) with respect to a target distribution f(y) is approximated by a weighted average of random draws from another distribution, called the proposal distribution g(y). These methods become important when the target distribution is not accessible for Monte Carlo (MC) integration (the normalizing constant is not known, hence samples cannot be drawn to mimic the distribution), and a proposal distribution (whose normalizing constant is known) is used to approximate the target distribution. A good proposal distribution has higher values where the integrand is large; in other words, it gives more importance to those points that significantly affect the expectation. Mathematically it is expressed as follows.

Ef(h) = ∫ h(y) f(y) dy

∫ h(y) f(y) dy = ∫ ( h(y) f(y) / g(y) ) g(y) dy ~ (1/N) Σi h(yi) f(yi) / g(yi)

f(y)/g(y) is the importance weight in the weighted sum above. Importance sampling works well if g(y) is a good approximation of the target f(y); otherwise the importance weights will not be appropriate. Finding a good proposal distribution can be hard.
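A small sketch (my own choice of target, proposal, and h) that estimates Ef(h) = E[Y²] = 1 for a standard Gaussian target f using a wider Gaussian proposal g:

import numpy as np

rng = np.random.default_rng(0)
def f(y): return np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)              # target density N(0, 1)
def g(y): return np.exp(-0.5 * (y / 2)**2) / (2 * np.sqrt(2 * np.pi))  # proposal density N(0, 4)
def h(y): return y**2                                                  # function of interest

y = rng.normal(0.0, 2.0, size=100_000)     # draws from the proposal g
weights = f(y) / g(y)                      # importance weights
print(np.mean(h(y) * weights))             # close to the true value E_f[Y^2] = 1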

Annealed Importance Sampling (AIS) addresses the shortcoming of IS by applying a well-known simple probability distribution function (pdf) over a number of steps; in other words, it anneals one distribution into another. AIS is a type of Simulated Annealing algorithm; essentially it is a marriage of Simulated Annealing and Importance Sampling. Loosely, it can be viewed as analogous to fitting a curve with very short straight lines. AIS is considered equivalent to the Jarzynski Equality used in estimating free energy in statistical physics. The following equation is one example distribution function used at step t.

Example AIS function

The Score function (Stein) is the gradient of the log probability density with respect to the data. To make maximum likelihood training feasible, likelihood-based models either restrict their model architectures to make the normalizing constant tractable, or use computationally expensive approximations of the normalizing constant. A score function does not require a tractable normalizing constant, as the following expression shows.

Score function does not require normalization constant

Score matching (Hyvärinen, 2005) methods reduce the L2 distance between a score function and the score of the true data distribution. Naive score matching suffers from low contribution from low-density regions. Multiple-noise-perturbation score matching applies noise to the original data at a number of scales (e.g. higher and higher variance).

Naive score matching (Song)
Multiple noise perturbation score matching (Song)

Score matching with Langevin dynamics (SMLD) — estimates the score at each noise scale, and then uses Langevin dynamics to sample from a sequence of decreasing noise scales during generation.

Glossary:

Autoregressive Models (ARM) — autoregressive models forecast a variable using a linear combination of its past values.

An Autoencoder has 2 parts — an encoder and a decoder. The encoder takes the input and outputs a value or values; this is called the bottleneck, or the latent space. The decoder takes this output as input, and outputs values that are supposed to be the same as the input to the encoder.

A Denoising Autoencoder also has 2 stages, as in an autoencoder. However, instead of feeding the data directly into the encoder, it is fed corrupted input data, and the decoder is asked to reconstruct the original uncorrupted data.

Variational Auto-encoders (VAE) are auto-encoders that output a probability distribution (e.g. a Gaussian mean and variance) at the bottleneck. The decoder starts with a sample from the encoder's output probability distribution and expands it to match the encoder input, as in auto-encoders.

Vector Quantized Variational AutoEncoder (VQ-VAE) — an auto-encoder whose latent space is a quantized, discrete vector instead of a continuous one; an autoregressive prior is typically learned over the discrete codes.

VQGAN (Vector Quantized GAN) — see the VQGAN description in the State of the Art(ish) section above.

Inception Score (IS) — IS uses Inception V3 (without the classifier layer) to generate the conditional label probability distribution, p(y|x) ∈ [0, 1]¹⁰⁰⁰, and defines a metric with 2 specific properties. The first property is that the conditional label distribution should have low entropy; a low entropy indicates the network is able to distinguish among labels well. The second property is that the marginal probability, p(y) = ∫ p(y|x = G(z)) dz, should have high entropy; a high entropy indicates that the network can generate many labels with more or less equal probability. The expression exp( Ex KL( p(y|x) || p(y) ) ) is defined as the IS.

Frechet Distance (FD) — is a measure of closeness between 2 distributions. It is also known as Wasserstein-2 distance .

Fréchet Distance ( from Statistics reference 18 )

Fréchet Inception Distance (FID)

While the Inception Score uses the label probability, FID uses the output of the coding layer of Inception, modeled as a Gaussian distribution. The Fréchet Distance is then taken between the Gaussians for the original and generated images. Both original and generated images are available during training of the model.
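A sketch of the Fréchet distance between two multivariate Gaussians, the quantity behind FID (this uses scipy for the matrix square root; in real FID the means and covariances come from Inception features of real and generated images, not the toy values below):

import numpy as np
from scipy.linalg import sqrtm

# d^2 = ||mu1 - mu2||^2 + Tr( C1 + C2 - 2 (C1 C2)^(1/2) )
def frechet_distance(mu1, c1, mu2, c2):
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):           # guard against tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu1 - mu2)**2) + np.trace(c1 + c2 - 2 * covmean)

mu1, c1 = np.zeros(2), np.eye(2)
mu2, c2 = np.ones(2), 2 * np.eye(2)
print(frechet_distance(mu1, c1, mu2, c2))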

References:

Statistics :

  1. Score-based generative modeling through stochastic differential equations, https://arxiv.org/abs/2011.13456
  2. http://bactra.org/notebooks/stoch-diff-eqs.html
  3. https://www.sciencedirect.com/topics/mathematics/standard-brownian-motion
  4. Oliver Ibe, Markov Processes for Stochastic Modeling, Springer.
  5. Importance Sampling, https://dept.stat.lsa.umich.edu/~jasoneg/Stat406/lab7.pdf, http://www2.stat.duke.edu/~st118/Publication/impsamp.pdf, https://astrostatistics.psu.edu/su14/lectures/cisewski_is.pdf
  6. Variational Inference: A Review for Statisticians https://arxiv.org/pdf/1601.00670.pdf
  7. Maximum Likelihood Estimates, https://ocw.mit.edu/courses/18-05-introduction-to-probability-and-statistics-spring-2014/4a8de32565ebdefbb7963b4ebda904b2_MIT18_05S14_Reading10b.pdf
  8. Kullback-Leibler Divergence http://hanj.cs.illinois.edu/cs412/bk3/KL-divergence.pdf
  9. Importance sampling, https://www.statlect.com/asymptotic-theory/importance-sampling
  10. https://agustinus.kristia.de/techblog/2017/12/23/annealed-importance-sampling/
  11. Annealed Importance Sampling Meets Score Matching, https://www.deepmind.com/publications/annealed-importance-sampling-meets-score-matching
  12. Annealed Importance Sampling, https://arxiv.org/pdf/physics/9803008.pdf
  13. MCMC, MH, https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm, https://www.statlect.com/fundamentals-of-statistics/Markov-chain, https://www.statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo, https://stephens999.github.io/fiveMinuteStats/MH_intro.html, https://www.cs.ubc.ca/~schmidtm/Courses/540-W20/L31.pdf
  14. Prof Chris Jarzynski — Scaling Down the Laws of Thermodynamics, https://www.youtube.com/watch?v=TKmC5GzxEgk
  15. A Statistical Definition of Entropy, https://web.mit.edu/16.unified/www/FALL/thermodynamics/notes/node56.html
  16. STAT 414, https://online.stat.psu.edu/stat414/
  17. Kullback Leibler divergence between two normal pdfs, https://www.youtube.com/watch?v=TNJwYuKjqVM
  18. The Fréchet distance between two multivariate normals, https://core.ac.uk/download/pdf/82269844.pdf

Diffusion, Latent Variable, VAE:

  1. Understanding Diffusion Models: A Unified Perspective, https://arxiv.org/pdf/2208.11970.pdf
  2. How diffusion models work: the math from scratch, https://theaisummer.com/diffusion-models/
  3. Diffusion Models: A Comprehensive Survey of Methods and Applications, https://arxiv.org/pdf/2209.00796.pdf
  4. The theory behind Latent Variable Models: formulating a Variational Autoencoder, https://theaisummer.com/latent-variable-models/
  5. What are Diffusion Models?, https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
  6. Extracting and Composing Robust Features with Denoising Autoencoders, https://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf
  7. A Connection Between Score Matching and Denoising Autoencoders, https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf
  8. Langevin equation, https://encyclopediaofmath.org/wiki/Langevin_equation
  9. The Fokker-Plank equation: equivalence with the Langevin, https://www2.ph.ed.ac.uk/~dmarendu/ASP/Section15.pdf

Generative AI :

  1. Tutorial on Denoising Diffusion-based Generative Modeling: Foundations and Applications, https://www.youtube.com/watch?v=cS6JQpEY9cs
  2. MIT 6.S192 — Lecture 22: Diffusion Probabilistic Models, Jascha Sohl-Dickstein, https://www.youtube.com/watch?v=XCUlnHP1TNM
  3. MIT 6.S192 — Lecture 20: Generative art using diffusion, Prafulla Dhariwal, https://www.youtube.com/watch?v=xYJEvihz3OI
  4. https://huggingface.co/blog/annotated-diffusion, https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/annotated_diffusion.ipynb
  5. Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models, https://chenwu98.github.io/PromptGen/
  6. https://imagen.research.google/, https://imagen.research.google/video/
  7. https://github.com/Stability-AI/diffusers
  8. CLIP (Contrastive Language-Image Pre-Training), https://github.com/openai/CLIP
  9. Latent Diffusion Models, https://github.com/compvis/latent-diffusion , https://github.com/CompVis/stable-diffusion
  10. https://en.wikipedia.org/wiki/Stable_Diffusion
  11. https://github.com/lucidrains/DALLE2-pytorch, https://github.com/lucidrains/x-transformers, https://github.com/lucidrains/imagen-pytorch
  12. The Annotated Diffusion Model, https://huggingface.co/blog/annotated-diffusion
  13. Generating Diverse High-Fidelity Images
    with VQ-VAE-2, https://arxiv.org/pdf/1906.00446.pdf
  14. Understanding VQ-VAE (DALL-E Explained Pt. 1), https://ml.berkeley.edu/blog/posts/vq-vae/
  15. DIFFEDIT: DIFFUSION-BASED SEMANTIC IMAGE EDITING WITH MASK GUIDANCE, https://arxiv.org/pdf/2210.11427.pdf
  16. High-Resolution Image Synthesis with Latent Diffusion Models, https://arxiv.org/pdf/2112.10752.pdf
  17. Diffusion Models Beat GANs on Image Synthesis, https://arxiv.org/pdf/2105.05233.pdf
  18. Denoising Diffusion Probabilistic Models, Jonathan Ho, Ajay Jain, and Pieter Abbeel, In NeurIPS, 2020. https://arxiv.org/pdf/2006.11239.pdf, https://github.com/hojonathanho/diffusion, https://hojonathanho.github.io/diffusion/, https://medium.com/mlearning-ai/enerating-images-with-ddpms-a-pytorch-implementation-cef5a2ba8cb1 , https://github.com/labmlai/annotated_deep_learning_paper_implementations,
  19. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, https://arxiv.org/pdf/1503.03585.pdf ;https://github.com/Sohl-Dickstein/Diffusion-Probabilistic-Models
  20. Elucidating the Design Space of Diffusion-Based
    Generative Models, https://arxiv.org/pdf/2206.00364.pdf,
    https://github.com/NVlabs/edm
  21. symbols, https://medium.com/@greekalphabet/list-of-greek-alphabet-letters-22bf2f751700, https://www.rapidtables.com/math/symbols/Basic_Math_Symbols.html
  22. Online Mathematics Editor, https://www.mathcha.io/
  23. LION: Latent Point Diffusion Models for 3D Shape Generation, https://nv-tlabs.github.io/LION/
  24. Prafulla Dhariwal and Alex Nichol. Diffusion models beat
    GANs on image synthesis
  25. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Scorebased generative modeling through stochastic differential equations.
  26. https://twitter.com/s_scardapane/status/1405182634616049665
  27. Generative Modeling by Estimating Gradients of the Data Distribution, https://yang-song.net/blog/2021/score/
  28. DreamBooth: Fine Tuning Text-to-Image Diffusion Models
    for Subject-Driven Generation, https://arxiv.org/pdf/2208.12242.pdf
  29. Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise, https://arxiv.org/abs/2208.09392, github.com/arpitbansal297/Cold-Diffusion-Models,
  30. DiffusionDB is the first large-scale text-to-image prompt dataset., https://poloclub.github.io/diffusiondb/#dataset-summary
  31. Diffusion Models in Vision: A Survey, https://arxiv.org/pdf/2209.04747.pdf
  32. Stable Diffusion colab notebook in HuggingFace , https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
  33. Semi-Parametric Neural Image Synthesis, https://arxiv.org/abs/2204.11824
  34. Inception Score (IS) ; Improved Techniques for Training GANs, https://arxiv.org/pdf/1606.03498.pdf .
  35. FID ; GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, https://arxiv.org/pdf/1706.08500.pdf
  36. Taming Transformers for High-Resolution Image Synthesis, https://github.com/CompVis/taming-transformers
  37. Train Latent Diffusion in Keras from Scratch, https://www.kaggle.com/code/apapiu/train-latent-diffusion-in-keras-from-scratch/notebook
