How do DALL-E, Midjourney, Stable Diffusion and other forms of generative AI work?

DALL-E is scary. Not so long ago, it was easy to conclude that AI technologies would never generate anything approaching the quality of human artistic composition or writing. Today, the generative models that power DALL-E 2 and Google’s LaMDA chatbot produce images and words that look suspiciously like the work of a real person. DALL-E creates artistic or photorealistic images of a wide variety of objects and scenes.

How do these image-generating models work? Do they work the way a person does, and should we consider them intelligent?

How diffusion models work

Generative Pre-trained Transformer 3 (GPT-3) is at the cutting edge of AI technology. Its proprietary computer code was developed by the misnamed OpenAI, a Bay Area tech operation that started out as a nonprofit before going for-profit and licensing GPT-3 to Microsoft. GPT-3 was designed to produce words, but OpenAI adapted a version of it to produce DALL-E and its sequel, DALL-E 2, using a technique called diffusion modeling.

Diffusion models perform two sequential processes: they corrupt images, then they try to reconstruct them. Programmers give the model real images with human-assigned meanings: dog, oil painting, banana, sky, 1960s sofa, etc. The model diffuses them – that is, degrades them – through a long chain of sequential steps. In the corruption sequence, each step slightly alters the image handed to it by the previous step, adding random noise in the form of meaningless scattered pixels, then passes it on to the next step. Repeated again and again, this causes the original image to gradually fade into static, and its meaning disappears.
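The corruption sequence described above can be sketched in a few lines of code. This is a minimal toy illustration, not DALL-E's actual implementation: the function name, the fixed noise level `beta`, and the tiny 8×8 "image" are all assumptions made for the sketch.

```python
import numpy as np

def forward_diffusion(image, num_steps=500, beta=0.02):
    """Corrupt an image by mixing in a little Gaussian noise at each step."""
    x = image.astype(float)
    trajectory = [x]
    for _ in range(num_steps):
        noise = np.random.randn(*x.shape)
        # Each step slightly shrinks the remaining signal and adds fresh
        # random noise, so the image gradually fades into static.
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

image = np.random.rand(8, 8)   # stand-in for a real training image
steps = forward_diffusion(image)
```

After enough steps, the pixels are statistically indistinguishable from pure static: nothing of the original image's meaning survives.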

When this process is complete, the model runs it in reverse. Starting with the nearly meaningless noise, it pushes the image back through the series of sequential steps, this time trying to reduce the noise and restore the meaning. At each step, the model’s performance is judged by the probability that the less noisy image created at that step has the same meaning as the original real image.

While noising the image is a mechanical process, denoising it is a search for something like meaning. The model is gradually “trained” by adjusting hundreds of billions of parameters – think of little dimmer knobs that adjust a lighting circuit from fully off to fully on – in the code’s neural networks to “encourage” steps that improve the likelihood of the image’s meaning, and to “discourage” steps that do not. Running this process over and over on many images, adjusting the model parameters each time, eventually tunes the model to take a meaningless image and evolve it through a series of steps into an image that resembles the original training image.
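The knob-turning described above can be sketched as a toy training loop. Here the "denoiser" is a single affine layer whose entries stand in for the dimmer knobs, and the loop nudges them to better guess the noise hidden in each corrupted input; everything here is an illustrative assumption, vastly simpler than a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy denoiser: one affine layer. Its 272 entries play the role of the
# "dimmer knob" parameters described above (real models have billions).
W = rng.normal(0, 0.1, size=(16, 16))
b = np.zeros(16)

clean = rng.random(16) - 0.5   # stand-in for one flattened training image
lr = 0.02
losses = []
for step in range(3000):
    noise = rng.normal(size=16)
    noisy = clean + noise
    pred = noisy @ W + b            # guess the noise hidden in the input
    err = pred - noise
    losses.append(float(np.mean(err ** 2)))
    # Nudge every parameter in the direction that shrinks the error:
    W -= lr * np.outer(noisy, err) / 16
    b -= lr * err / 16
```

Over many iterations the recorded losses shrink: the knobs settle into values that let the model subtract the predicted noise and recover something close to the original image.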


To produce images with associated textual meanings, the words describing the training images are taken through the noising and denoising chains at the same time. In this way, the model is trained not only to produce an image with a high probability of meaning, but one with a high probability that those same descriptive words are associated with it. The creators of DALL-E trained it on a giant corpus of images, with associated captions, scraped from all over the web. DALL-E can produce images matching such an odd range of input sentences because that is what was on the internet.
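The idea of conditioning the denoiser on words can be sketched as feeding a caption embedding into the denoising step alongside the noisy image. The vocabulary, embedding table, and weight shapes below are illustrative assumptions, not DALL-E’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy caption vocabulary and embeddings (illustrative only).
VOCAB = {"dog": 0, "banana": 1, "sky": 2}
EMBED = rng.normal(size=(len(VOCAB), 8))

def denoise(noisy_image, caption, W_img, W_txt):
    """One toy denoising step conditioned on a text description."""
    txt = EMBED[VOCAB[caption]]
    # The caption embedding enters alongside the noisy image, steering
    # the output toward images whose meaning matches the words.
    return noisy_image @ W_img + txt @ W_txt

W_img = rng.normal(0, 0.1, size=(16, 16))
W_txt = rng.normal(0, 0.1, size=(8, 16))
img = rng.normal(size=16)                 # a flattened, noisy toy "image"
out = denoise(img, "dog", W_img, W_txt)
```

The same noisy input denoised under a different caption yields a different output, which is exactly the lever that lets a text prompt shape the final image.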

Generative AI

These images were created using a generative AI called Stable Diffusion, which is similar to DALL-E. The prompt used to generate the images: “Color photo of Abraham Lincoln drinking beer in front of the Seattle Space Needle with Taylor Swift.” Taylor Swift came out looking a little scary in the first picture, but maybe that’s what she looks like to Abraham Lincoln after a few beers. (Credit: Big Think, Stable Diffusion)

The inner workings of a diffusion model are complex. Despite the organic feel of its creations, the process is entirely mechanical, built on a foundation of probability calculations. (This document works through some of the equations. Warning: the math is difficult.)

Essentially, the calculations break difficult operations into separate, smaller, simpler steps that are almost as good but much faster for computers to perform. The mechanics of the code are understandable, but the tangle of adjusted parameters that its neural networks pick up during training is complete gibberish. A set of parameters that produces good images is indistinguishable from one that produces bad images – or nearly perfect images with an unknown but fatal flaw. Thus, we cannot predict how well, or even why, an AI like this works. We can only judge whether its outputs look good.

Are generative AI models smart?

It is therefore very difficult to say how much DALL-E is like a person. The best answer is probably not at all. Humans don’t learn or create this way. We do not take in sensory data from the world and then reduce it to random noise; nor do we create new things by starting with total randomness and then denoising it. The towering linguist Noam Chomsky has pointed out that a generative model like GPT-3 does not produce words in a meaningful language any differently than it would produce words in a meaningless or impossible language. In this sense, it has no conception of the meaning of language, a fundamentally human trait.

Generative AI

These images were created using a generative AI called Stable Diffusion, which is similar to DALL-E. The prompt used to generate the images: “portrait of conan obrien in the style of vincent van gogh”. (Credit: Big Think, Stable Diffusion)

Even if they are not like us, are they intelligent in some other way? In the sense that they can do very complex things, sort of. Then again, a computer-controlled lathe can create very complex metal parts. By the definition of the Turing test (that is, determining whether a machine’s output is indistinguishable from that of a real person), a generative model certainly could be called intelligent. Then again, extremely simplistic and hollow chatbot programs have been passing that bar for decades. Yet no one thinks machine tools or rudimentary chatbots are smart.

A better intuition for today’s generative AI programs may be to think of them as extraordinarily capable idiot mimics. They are like a parrot that can listen to human speech and produce not just human words, but groups of words in the right patterns. If a parrot listened to soap operas for a million years, it could probably learn to string together overwrought, dramatic interpersonal dialogue. And if you spent those million years giving it crackers for coming up with better phrases and yelling at it for the bad ones, it might get better still.

Or consider another analogy. DALL-E is like a painter who has lived his whole life in a gray, windowless room. You show him millions of landscape paintings with the names of their colors and subjects attached. Then you give him paints with color labels and ask him to match the colors and make patterns that statistically mimic the subject labels. He makes millions of random paintings, comparing each one to a real landscape, then tweaks his technique until they start to look realistic. Yet he could not tell you anything about what a real landscape is.

Another way to get a feel for diffusion models is to look at the images produced by a simpler one. DALL-E 2 is the most sophisticated of its kind. The first version of DALL-E often produced images that were almost right, but clearly not quite, such as giraffe-dragons whose wings did not attach properly to their bodies. A less powerful open-source competitor is known for producing disturbing, dreamlike, bizarre, and not-quite-realistic images. The flaws inherent in a diffusion model’s meaningless statistical mashups are not hidden as they are in the much more refined DALL-E 2.

The future of generative AI

Whether you find it marvelous or horrifying, it seems we have just entered an era in which computers can generate convincing fake images and sentences. It is bizarre that an image with meaning to a person can be generated from mathematical operations on nearly meaningless statistical noise. While the machinations are lifeless, the result looks like something more. We will see whether DALL-E and other generative models evolve into something with a deeper kind of intelligence, or whether they remain the world’s greatest dumb imitators.

