Writing prompts for generative AI models, such as Stable Diffusion or Midjourney, can be challenging, especially if you want to generate high-quality images. For beginners, this is often a source of frustration: we are bombarded with amazing images all over the internet, but when we try it ourselves, the results are just... really bad.

As one of the founders of visoid.com, I've spent my fair share of time writing and perfecting the prompts we use, ensuring that we can consistently output high-quality images. In this article, I want to share how I approach prompting and how you can achieve consistently good output by selecting the right words for your prompt. This guide focuses on Stable Diffusion, but I'm sure these tricks are useful for most models out there, as they're all based on the same fundamental concept: training on labeled data.

Prompts are important

If you run Stable Diffusion using the Automatic1111 interface, you're probably aware of how many settings there are to choose from. While choosing the correct settings will certainly improve the output quality, I would argue that the single most important variable is the prompt. Using the right words in your prompt can significantly improve your results, and similarly, even a single bad word can trip up the model and result in an ugly image.
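As an aside, everything in this guide also applies if you run Stable Diffusion from code instead of the Automatic1111 UI. Below is a minimal sketch using Hugging Face's diffusers library, mirroring the settings used for the images in this article. The checkpoint repo id is an assumption on my part; substitute whichever model you actually use:

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load a Stable Diffusion 1.x checkpoint. The repo id below is an
# assumption; swap in the checkpoint you actually use.
pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V2.0", torch_dtype=torch.float16
)

# "DPM++ 2M Karras" in diffusers terms: the multistep DPM-Solver
# scheduler with Karras sigma spacing enabled.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
pipe = pipe.to("cuda")
```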

How not to do it

Typically, beginners approach prompts the way they'd communicate with humans or with language models like ChatGPT. We instinctively want to tell the model what we want directly, in either short or full sentences. To give an example, here is a prompt I could have written in my early days of testing Stable Diffusion:

A cosy cabin by the lake with a beautiful sunset in the background. The cabin is surrounded by pine forest, tall grass and flowers.

Example output using the above prompt. While the output isn’t too bad, it’s not great either. We can do better.
Model: Realistic Vision 2, Scheduler: DPM++ 2M Karras, Sampling steps: 40.

While the model will be able to pick out certain words from the prompt and give you output that resembles what you’re asking for, it won’t be amazingly good. Furthermore, you won’t get exactly what you’re asking for. Unfortunately, models like Stable Diffusion cannot be accurately controlled using just the prompts. If you want a person at a specific position in the image, you would have to use other tools like ControlNet to achieve this.

Stable Diffusion doesn’t understand language the same way as we humans do. Even ChatGPT, which is also an AI model, perceives language differently than Stable Diffusion. So to master prompting with Stable Diffusion, we first have to learn the language that Stable Diffusion understands.

It’s all about association

To be able to write really good prompts, it's important to first understand how Stable Diffusion was trained. The model was trained on a large dataset of over 2.3 billion images. Accompanying each image is a text description in the form of a title, free text, or tags. These labels don't have a standard format, and there is huge variety in how the images have been labeled.

Through the training process, Stable Diffusion learns to associate certain words with certain types of images. For example, if we have “sunset” in our prompt, this will push the model towards images found in the training dataset that have labels containing the word “sunset”. Similarly, if we have “photo” in our prompt, the output image will look more like a photo.

As you can see, the most important factor in a good prompt is including words that Stable Diffusion associates with certain images it was trained on. Put another way, we have to find out which labels were used on the training images that resemble the output we're trying to generate.
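To get a feel for what the model actually receives, you can inspect how a prompt is tokenized. Stable Diffusion 1.x encodes prompts with OpenAI's CLIP ViT-L/14 text encoder, so its tokenizer shows how your words are split up before any encoding happens. A small illustrative sketch:

```python
from transformers import CLIPTokenizer

# Stable Diffusion 1.x encodes prompts with CLIP ViT-L/14, so this
# tokenizer shows the units the model learned associations for.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("photo of a cosy cabin at sunset"))
# Output is a list of subword tokens, e.g. ['photo</w>', 'of</w>', 'a</w>', ...]
```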

In the next section, we’ll go into the specifics of how to construct a good prompt. Before we do that, I highly recommend having a look at this image database, which contains a subset of the images that Stable Diffusion was trained on. Seeing how the images are labeled can be very useful when building your prompt. Furthermore, I often search the database for images that look similar to what I want to generate and include words from their labels in my prompt.

Prompt format

I generally write prompts as a comma-separated list of words, “sunset, photo, cozy cabin, …” and so on, since Stable Diffusion doesn’t really follow sentences that well anyway. Incomplete sentences can sometimes make sense if you need specific elements in the image to have a certain relationship to each other. For example, writing “person in front of cozy cabin” might yield a more correct placement of the person than writing “person, cozy cabin”. I normally don’t use full sentences, as they can be a source of noise: many of the words in a sentence don’t carry meaning on their own, and they don’t push the output in the direction we want.

First step

Let’s say our goal is to make the output look like a real photo. How can we achieve this? When building a prompt from scratch, I like to start by describing the content of the image first. If we take the above example, we can now rewrite this as:

cosy cabin, lake, beautiful sunset, pine forest, tall grass, flowers

Output is similar to the previous example, so at least we didn’t make things worse!
Model: Realistic Vision 2, Scheduler: DPM++ 2M Karras, Sampling steps: 40.

As you can see, we've trimmed it down and removed all excess words. The prompt is more focused and serves as a good base for us to build upon. The image also contains all the elements that we described, so we haven’t lost any information. While the image looks slightly like a photo, it has a "painting style" that we are going to try to get rid of in the next section.
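If you're following along in code, this step looks roughly like the sketch below, continuing from the pipeline set up earlier. The seed and file name are arbitrary choices of mine; fixing the seed makes later comparisons fair, since only the prompt changes between runs:

```python
import torch

# A fixed seed means differences between runs come from the prompt,
# not from the random noise the sampler starts from.
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    "cosy cabin, lake, beautiful sunset, pine forest, tall grass, flowers",
    num_inference_steps=40,
    generator=generator,
).images[0]
image.save("cabin_base.png")
```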

Adding style words

Although the output looks okay, it doesn't feel quite real yet. We can further improve this image by adding style words to the prompt: words that push the output towards a certain style. Popular choices are the names of certain artists or the type of medium the image was created in. In our case, we want the image to look more like a photo, so we can add words that the model associates with photos and cameras:

cosy cabin, lake, beautiful sunset, pine forest, tall grass, flowers, photography, photo, DSLR, RAW

The image certainly looks more like a photograph now.
Model: Realistic Vision 2, Scheduler: DPM++ 2M Karras, Sampling steps: 40.

We could have used many more words in this example, but a few specific ones take us 80% of the way. I added four words to the prompt in this example that are typically associated with photography. In addition to "photo" and "photography", I added "DSLR" and "RAW", which are words often associated with more professional photography. This can give the output a better composition and a more "professional" look. You could even try to remove "photo" and "photography" from the prompt, as this would put more emphasis on the professional side of photography.

Adding quality words

This last part consists of words that improve the quality of the output. They're more generic and can be included in most prompts where you're looking for detailed output. These are words that Stable Diffusion associates with high-quality images, and they can improve the output quality even further:

cosy cabin, lake, beautiful sunset, pine forest, tall grass, flowers, photography, photo, DSLR, RAW, 4k, 8k, 16k, uhd, professional, beautiful, sharp focus, high resolution

In this case, we see a slight improvement in both sharpness and composition.
Model: Realistic Vision 2, Scheduler: DPM++ 2M Karras, Sampling steps: 40.

As you can see, we use words that are associated with high quality, high resolution, and good image composition. Depending on what you're trying to create, there are a lot of different words you can use to get the image quality you want.
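At this point the prompt has a clear three-part structure: content, style, and quality words. When iterating, I find it convenient to keep these as separate blocks and join them, so each block can be swapped out independently. A trivial sketch of that idea:

```python
# Keep the three word groups separate so each can be tweaked independently.
content = ["cosy cabin", "lake", "beautiful sunset", "pine forest",
           "tall grass", "flowers"]
style = ["photography", "photo", "DSLR", "RAW"]
quality = ["4k", "8k", "16k", "uhd", "professional", "beautiful",
           "sharp focus", "high resolution"]

prompt = ", ".join(content + style + quality)
print(prompt)
```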

Negative prompt

Finally, we can add a negative prompt. This is useful if you want to push the model away from certain elements or styles. In general, most people use a long list of words associated with bad image quality. You can also add styles that you don't want the output to resemble, for example, "painting" or "watercolor". Here is a “default” negative prompt that you can use:

ugly, low resolution, deformed, out of frame, watermark, sketch, drawing, white edges, out of focus, lens glare, morbid, blurry, jpeg artifacts, low quality, painting, water color, analog, bad, jpeg-artifact, jpeg-artifacts, (((artifacts))), bad image quality, video glitches, noise, color fringing, chromatic aberration, blur, watermark, logo, low resolution, glitch, halos, vignette

The image might have improved slightly, but since our output was already quite good, the effect isn’t that big.
Model: Realistic Vision 2, Scheduler: DPM++ 2M Karras, Sampling steps: 40.


I find the negative prompt quite useful for getting more consistency, as it reduces the likelihood that the model outputs low-quality images. Be aware that Stable Diffusion 1.5 and earlier don't weight negative prompts very heavily, so they often have little effect on the output.
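In code, the negative prompt is just another argument on the pipeline call. A sketch reusing the pipeline from earlier, with both prompts shortened for readability:

```python
# The negative prompt is passed alongside the positive prompt;
# both strings are shortened here for readability.
image = pipe(
    prompt="cosy cabin, lake, beautiful sunset, pine forest, tall grass, "
           "flowers, photography, photo, DSLR, RAW, 4k, uhd, professional, "
           "sharp focus, high resolution",
    negative_prompt="ugly, low resolution, deformed, out of frame, "
                    "watermark, blurry, jpeg artifacts, painting, watercolor",
    num_inference_steps=40,
).images[0]
```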

Final touch

You might have noticed that while the image quality has improved, the cabin itself still looks quite bad. This is probably because Stable Diffusion was trained on few, mostly low-quality images of "cozy cabins" in the context we are placing it in. In this case, I would try different words that might yield a nicer building, for example, words that are associated with typical nice photos of cabins.

When I search the database for "cozy cabin", I get a good mix of images, many of which have this ugly, oversaturated yellow light. After some tweaking in an attempt to get rid of the lights, I ended up with the following prompt:

+ summer cabin, lake, beautiful sunset, pine forest, tall grass, flowers, photography, photo, DSLR, RAW, 4k, 8k, 16k, uhd, professional, beautiful, sharp focus, high resolution

- over saturated, ((yellow lights)), ((yellow windows)), ((night)), ((winter)), old, destroyed, ugly, low resolution, deformed, out of frame, watermark, sketch, drawing, white edges, out of focus, lens glare, morbid, blurry, jpeg artifacts, low quality, painting, water color, analog, bad, jpeg-artifact, jpeg-artifacts, (((artifacts))), bad image quality, video glitches, noise, color fringing, chromatic aberration, blur, watermark, logo, low resolution, glitch, halos, vignette


The texture is more realistic and we got rid of the over-saturated window lights.

The prompt was changed from "cosy cabin" to "summer cabin" in an attempt to get rid of the yellow tint. Summer is typically associated with more daylight, while "cosy" is more of a winter thing, with darker days where lights are more visible. Additionally, we added some negative words to steer the model away from winter and night photos, which more often contain yellow lights. The double parentheses around words like "((yellow lights))" are Automatic1111's attention syntax: each pair of parentheses multiplies the weight of the enclosed words by 1.1, so two pairs emphasize them by roughly 1.2x. For good measure, we added "old, destroyed" in an attempt to make the actual construction look nicer.
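If you script with diffusers rather than Automatic1111, note that the pipeline doesn't parse the ((…)) syntax itself. A third-party helper such as the compel library provides equivalent weighting with its own syntax, e.g. (yellow lights)1.3 for an explicit weight. A rough sketch, under the assumption that compel's API looks as it does in its documentation:

```python
from compel import Compel

# compel converts weighted prompt strings into embeddings the pipeline accepts.
compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

prompt_embeds = compel.build_conditioning_tensor(
    "summer cabin, lake, beautiful sunset, pine forest, photo, DSLR, RAW"
)
# compel uses explicit weights, e.g. (yellow lights)1.3, rather than
# Automatic1111's nested parentheses.
negative_embeds = compel.build_conditioning_tensor(
    "over saturated, (yellow lights)1.3, (night)1.2, (winter)1.2, old, destroyed"
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=40,
).images[0]
```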

Overall, I would say that Stable Diffusion isn't that good at generating images of log cabins. With more time spent testing out different prompts, I’m pretty sure it would be possible to find word combinations that improve the quality of the cabin.

Conclusion

In this article, we managed to go from an image that didn't look real at all to one that looks close to a real photo. Furthermore, we showed how we can nudge the model in different directions using specific words in our prompt.

The key takeaway is that Stable Diffusion doesn't understand language the same way we humans do. Instead of telling it what we want, we have to build up a prompt with words that the model associates with what we want. Once you start thinking in this way, it becomes much easier to generate great output.

With that, we've reached the end of this tutorial. If you find it challenging to generate photorealistic images from Stable Diffusion, I can recommend giving Visoid a try. We're building a tool where you can easily generate high-quality images without having to know all the nitty-gritty details of Stable Diffusion. While we mainly focus on architectural design, the application works surprisingly well with other use cases too.
