How does DALL-E 2 work?


Recently, OpenAI released one of its most astonishing deep learning models, DALL-E 2, which can create images from simple text descriptions.

So, let’s dive deep into it!

What is DALL-E 2? DALL-E 2 is an AI system capable of generating realistic, high-resolution images from a natural-language description. It can also edit existing images based on natural-language captions.

Below are some examples of what DALL-E 2 can produce:

[Image: A bowl of soup that is a portal to another dimension as digital art]

[Image: An astronaut riding a horse in a photorealistic style]

[Image: Teddy bears mixing sparkling chemicals as mad scientists as digital art]

That’s amazing, right? All of the above images were created from just a simple text description, and you can create many more using natural language!

But how does this thing actually work?

Architecture: DALL-E 2 consists of two parts:

  • A prior, which converts the caption (the text input) into a representation of an image.
  • A decoder, which converts this image representation into an actual image.

[Image by author]

The text and image embeddings in this architecture come from another OpenAI model, CLIP (Contrastive Language-Image Pre-training). CLIP is a neural network that returns the best possible caption for a given image. In that sense, it does the reverse of what we are trying to achieve with DALL-E 2.

[Image by author]

CLIP is a contrastive model: rather than classifying images into fixed categories, it tries to match each image with its corresponding caption, so it is trained on image-caption pairs. The objective CLIP optimizes is to maximize the similarity score between an image embedding and the embedding of its caption (text), while keeping mismatched pairs dissimilar.

[Image: Matching pairs using CLIP]

To do this matching, CLIP trains two encoders: an image encoder and a text encoder.

The image encoder encodes images into image embeddings, and the text encoder encodes captions (text) into text embeddings.

[Image by author]

If you are not familiar with the term embedding, think of it as a way to represent a piece of information as a vector of numbers that a model can work with.
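To make the contrastive idea concrete, here is a minimal sketch of CLIP-style training in PyTorch. The toy encoders, dimensions, and random batch are my own stand-ins (the real CLIP uses a large image backbone and a text transformer trained on hundreds of millions of image-caption pairs); only the symmetric contrastive loss over the image-text similarity matrix reflects the actual training objective.

```python
# A minimal sketch of CLIP-style contrastive training.
# The encoders and data below are toy stand-ins, not the real CLIP models.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Linear(3 * 32 * 32, embed_dim)  # stands in for a CNN/ViT backbone

    def forward(self, images):
        return self.net(images.flatten(1))

class ToyTextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, embed_dim)  # stands in for a text transformer

    def forward(self, token_ids):
        return self.emb(token_ids)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature
    # Matching pairs sit on the diagonal, so the target class for row i is i.
    targets = torch.arange(logits.size(0))
    # Symmetric cross-entropy pulls matching pairs together and pushes mismatches apart.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# A toy batch: 8 random "images" paired with 8 random "captions" (token ids).
images = torch.randn(8, 3, 32, 32)
captions = torch.randint(0, 1000, (8, 16))
loss = clip_contrastive_loss(ToyImageEncoder()(images), ToyTextEncoder()(captions))
print(loss.item())
```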

In the image below, the black one is the CLIP text embedding, which the CLIP text encoder generates from the caption, and the pink one is the CLIP image embedding, which is generated by the prior.

[Image by author]

After the CLIP text encoder generates the CLIP text embedding, that embedding is fed to the prior, which in turn generates the CLIP image embedding. According to the DALL-E 2 research paper, the research scientists at OpenAI tried two types of prior, an autoregressive prior and a diffusion prior, and concluded after experimentation that the diffusion prior performed better.
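Conceptually, the prior is just a learned mapping from a CLIP text embedding to a CLIP image embedding. The sketch below shows only that interface, using a toy MLP trained by plain regression on made-up embeddings; the actual diffusion prior is a transformer that denoises the image embedding step by step while conditioning on the caption and its CLIP text embedding.

```python
# A deliberately simplified sketch of the prior's job:
# text embedding in, image embedding out. Not the real diffusion prior.
import torch
import torch.nn as nn

embed_dim = 64
prior = nn.Sequential(  # toy MLP standing in for the prior
    nn.Linear(embed_dim, 256), nn.GELU(), nn.Linear(256, embed_dim)
)
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-3)

# Pretend these came from a frozen CLIP model (random stand-ins here).
text_embeddings = torch.randn(32, embed_dim)
image_embeddings = torch.randn(32, embed_dim)

for step in range(100):
    predicted = prior(text_embeddings)                      # predicted CLIP image embeddings
    loss = nn.functional.mse_loss(predicted, image_embeddings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```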

To understand what a diffusion prior is, let’s first discuss what diffusion models are!

Diffusion models are a type of generative model that takes some data (say, an image) and gradually adds noise to it until the image is no longer recognizable. The model then learns to reverse this process and reconstruct the image from the noise, step by step. If you are interested in knowing more about diffusion models, here is a good blog for that.
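To get a feel for the forward (noising) half of that process, here is a small sketch assuming a standard linear noise schedule. The reverse network that learns to undo the noise, which is where the actual generation happens, is omitted.

```python
# Forward diffusion: mix Gaussian noise into an image until little of it remains.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def noisy_image_at_step(x0, t):
    """Sample x_t from q(x_t | x_0) in closed form."""
    noise = torch.randn_like(x0)
    keep = alphas_cumprod[t].sqrt()                # how much of the original survives
    corrupt = (1.0 - alphas_cumprod[t]).sqrt()     # how much noise is mixed in
    return keep * x0 + corrupt * noise

x0 = torch.rand(3, 64, 64)                         # a stand-in "image" in [0, 1]
print(noisy_image_at_step(x0, 10).mean())          # early step: still mostly the image
print(noisy_image_at_step(x0, 999).mean())         # late step: essentially pure noise
```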

Once the prior creates the CLIP image embedding, the next step is to create the image itself from that embedding, and the decoder is responsible for this. While trying to understand how DALL-E 2 works, I wondered: could we skip the prior and feed the CLIP text embedding directly to the decoder? What would happen if we removed the prior from the architecture?

Well, the research scientists at OpenAI thought about this too and ran the experiment in three ways, with the following results:

  • Passing the image caption (input text) directly to the decoder:

[Image by author]

  • Generating a CLIP text embedding using the image caption and passing it to the decoder.

[Image by author]

  • Generating a CLIP text embedding using the image caption, feeding it to the prior, generating a CLIP image embedding, and then passing it to the decoder.

[Image by author]

From the above experiments, we can conclude that including the prior gives much more accurate images. One may argue that directly passing the CLIP text embedding to the decoder gives acceptable output, so why not choose that architecture? Because by doing so we would lose the ability to generate variations of an image. If you wish to learn more about CLIP, please refer to this post by OpenAI.

Now let’s discuss the decoder!

The decoder is also a diffusion model, but a tuned and modified one. To serve as the decoder, OpenAI used GLIDE, another image generation model developed by OpenAI. GLIDE differs from a traditional diffusion model in that it conditions not only on the image (data) but also on the text embedding of the caption, so it generates images guided by that text embedding. If you wish to learn more about GLIDE, please refer to its research paper.

In DALL-E 2, the decoder is conditioned not only on the text embeddings but also on the CLIP image embeddings to support image generation. Once an initial 64 x 64 image is generated, two more upsampling steps enlarge it to 256 x 256 and then to 1024 x 1024.

[Image by author]

And that’s how high-resolution image generation takes place in DALL-E 2!
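Putting the pieces together, here is a sketch of the data flow described above. Every model is replaced by a placeholder function of my own naming (clip_text_encoder, prior, decoder, upsample), so only the flow is real; in DALL-E 2 the upsamplers are themselves diffusion models, not simple interpolation.

```python
# End-to-end data flow: caption -> text embedding -> image embedding -> 64x64 image
# -> 256x256 -> 1024x1024. All functions below are illustrative placeholders.
import torch
import torch.nn.functional as F

def clip_text_encoder(caption: str) -> torch.Tensor:
    return torch.randn(512)                        # CLIP text embedding (stand-in)

def prior(text_emb: torch.Tensor) -> torch.Tensor:
    return torch.randn(512)                        # CLIP image embedding (stand-in)

def decoder(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    return torch.rand(3, 64, 64)                   # base 64 x 64 image (stand-in)

def upsample(image: torch.Tensor, size) -> torch.Tensor:
    # The real upsamplers are diffusion models; interpolation just marks the step.
    return F.interpolate(image.unsqueeze(0), size=size, mode="bilinear",
                         align_corners=False).squeeze(0)

caption = "an astronaut riding a horse in a photorealistic style"
text_emb = clip_text_encoder(caption)
image_emb = prior(text_emb)
image_64 = decoder(image_emb, text_emb)
image_256 = upsample(image_64, (256, 256))
image_1024 = upsample(image_256, (1024, 1024))
print(image_1024.shape)                            # torch.Size([3, 1024, 1024])
```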

DALL-E 2 is also capable of generating variations of an image, meaning it retains the main style and components of the image while changing minor details.

[Image by author]

These variations are generated by encoding the original image into a CLIP image embedding with the CLIP image encoder and feeding that embedding to the decoder.

[Image by author]
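In code terms, a variation is simply the same stochastic decoder run several times on one CLIP image embedding; each run samples from different noise and so changes the minor details. As before, the functions are placeholders of my own naming, not a real API.

```python
# Variations: one CLIP image embedding, several stochastic decoder samples.
import torch

def clip_image_encoder(image: torch.Tensor) -> torch.Tensor:
    return torch.randn(512)                        # CLIP image embedding (stand-in)

def decoder(image_emb: torch.Tensor) -> torch.Tensor:
    return torch.rand(3, 64, 64)                   # each call samples a new image

original = torch.rand(3, 64, 64)
embedding = clip_image_encoder(original)
variations = [decoder(embedding) for _ in range(4)]  # same content, different details
```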

And that completes the working of DALL-E 2!

Model evaluation: After creating an AI model, it is important to evaluate whether it meets the required expectations. So let’s look at what the evaluation process for DALL-E 2 involved!

DALL-E 2 was evaluated through systematic human evaluations, in which people compared outputs along specific dimensions such as photorealism, caption similarity, and sample diversity.

Risks and limitations: Even though DALL-E 2 is powerful and really good at what it does, it has several limitations:

  • It is not yet reliable at assigning physical attributes (color, shape, etc.) to the correct objects. For example:

For the input “red cube on top of the blue cube,” it was somewhat confused about which cube should be red and which one should be blue.

[Image: DALL-E 2 output]

Another generative model, GLIDE, was able to perform better for the same caption:

[Image: GLIDE output]

  • It can’t generate coherent, consistent text within an image when asked to do so.

For example:

For the input “A sign that says deep learning”, it generated the following image.

[Image: A sign that says deep learning]

  • It struggles to generate detailed scenes within an image.

For example:

For the input “A high quality photo of Times Square”, the screens in the generated image didn’t have any readable or understandable information on them.

[Image: A high quality photo of Times Square]

  • There will always be the risk of creating fake images with inappropriate or malicious content.

To know more about the risks and limitations of DALL-E 2, please refer to this.

Well, that’s it for today! I hope you enjoyed this blog!

If you did, don’t forget to give some claps 😉. Also, feel free to reach out to me for any further questions.

References:

DALL·E 2 (openai.com)

dall-e-2.pdf (openai.com)
