
From Prompt to Puppy: How AI Image Generation Works

A look under the hood of image models, using a simple Z-Image-Turbo workflow in ComfyUI.

May 24, 2026 | 12 min read | Difficulty: Advanced beginner to intermediate (4/5)
Figure: ComfyUI workflow generating a cute puppy with Z-Image-Turbo.
The example workflow generates one image from a tiny prompt: A cute puppy. The workflow is simple enough to inspect, but it contains most of the important parts of modern image generation.

The Short Version

When you ask an AI image system for a cute puppy, it can feel like the model simply reads the sentence and draws a picture. That is not really what happens.

A modern image model usually works like this:

  • A text encoder turns your prompt into numerical meaning.
  • The model starts with noise in a compressed latent space.
  • A sampler decides how the model moves from noise toward structure.
  • The diffusion model predicts how to clean the image step by step.
  • A VAE decoder turns the final latent result into visible pixels.

Metaphor

The prompt is not the painting. It is the brief given to a restoration team. The model starts with a wall of static and slowly uncovers an image that fits the brief.

Our Example: A Cute Puppy In ComfyUI

In the ComfyUI workflow, we generate one image from a tiny prompt: A cute puppy. The output looks simple, but the workflow shows the whole chain.

  • Load Diffusion Model: loads Z-Image-Turbo, the image model used for generation.
  • Load CLIP or text encoder: loads the model that converts text into conditioning.
  • Load VAE: loads the model that converts between latent space and visible images.
  • Empty latent image: creates the starting latent canvas at the chosen resolution.
  • Text encode prompt: turns the prompt into conditioning data.
  • KSampler: runs the denoising process.
  • VAE Decode: converts the final latent into pixels.
  • Save Image: writes the final puppy image.
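One way to read the node list above is as a small dependency graph: each node can only run once the nodes it depends on have produced their outputs. The sketch below is an illustration in plain Python, not ComfyUI's own file format. The node names follow the list, while the `workflow` dictionary layout and the `execution_order` helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose outputs it needs.
# Hypothetical layout for illustration; this is not ComfyUI's workflow format.
workflow = {
    "Load Diffusion Model": set(),
    "Load Text Encoder": set(),
    "Load VAE": set(),
    "Empty Latent Image": set(),
    "Text Encode Prompt": {"Load Text Encoder"},
    "KSampler": {"Load Diffusion Model", "Text Encode Prompt", "Empty Latent Image"},
    "VAE Decode": {"KSampler", "Load VAE"},
    "Save Image": {"VAE Decode"},
}


def execution_order(graph):
    """Return one valid order in which every node runs after its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
for position, name in enumerate(order, start=1):
    print(position, name)
```

The loaders and the empty latent have no inputs, so they can run in any order. The KSampler has to wait for the model, the conditioning, and the latent canvas, and decoding and saving always come last.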

ComfyUI is useful because it makes the machine visible. A chat interface hides most of this. That is convenient, but less educational.

How An Image Model Is Trained

Before a model can generate a puppy, it has to learn what visual patterns tend to match words like puppy, cute, fur, paws, eyes, floor, photo, and soft lighting.

Training usually involves huge collections of images and captions. The model is shown images, text, and noise. In classic diffusion training, it learns a task that sounds strange at first: given a noisy version of an image, predict the noise that was added.

Newer systems can use related targets, such as velocity or flow, rather than literally predicting added noise at every step. The practical idea is similar: the model learns how to move a corrupted visual state toward a cleaner one that matches the conditioning.

That training game becomes powerful. If a model gets very good at recognizing which direction the noisy state should move, it can later start from pure noise and walk backward toward a clean image.
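The noise-prediction game can be sketched with a few numbers. This is a toy version of the classic formulation, assuming the common setup where a noisy sample is a weighted blend of the clean signal and Gaussian noise. The function names, the four-number "image", and the `alpha_bar` value are all made up for illustration, and no neural network is involved.

```python
import math
import random


def add_noise(clean, noise, alpha_bar):
    """Forward process: blend a clean signal with Gaussian noise."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [keep * c + mix * n for c, n in zip(clean, noise)]


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise."""
    total = sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise))
    return total / len(true_noise)


def recover_clean(noisy, predicted_noise, alpha_bar):
    """Invert add_noise using an estimate of the noise."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [(x - mix * e) / keep for x, e in zip(noisy, predicted_noise)]


rng = random.Random(0)
clean = [0.8, -0.2, 0.5, 0.1]      # stand-in for a tiny "image"
noise = [rng.gauss(0.0, 1.0) for _ in clean]
alpha_bar = 0.3                    # fraction of signal kept; low = heavily corrupted

noisy = add_noise(clean, noise, alpha_bar)

# A "model" that always guesses zero noise gets a nonzero loss.
zero_guess_loss = noise_prediction_loss([0.0] * len(clean), noise)
# A perfect noise prediction has zero loss and recovers the clean signal.
perfect_loss = noise_prediction_loss(noise, noise)
recovered = recover_clean(noisy, noise, alpha_bar)

print("loss of zero guess:", zero_guess_loss)
print("loss of perfect guess:", perfect_loss)
print("recovered:", recovered)
```

Training pushes the loss down across many images, captions, and noise levels. The last line shows why that matters: a good enough noise estimate is all you need to get back to the clean signal.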

Metaphor

Training is like teaching someone to restore damaged photographs. At first, you show them photos with a little dust. Then photos with scratches. Then photos almost completely covered in static. After enough practice, they learn what kinds of shapes usually hide underneath the damage.

For text-to-image models, the caption matters too. The model is not just learning how to clean noise. It is learning how to clean noise in a direction that matches language.

What The Text Encoder Does

A prompt like "A cute puppy" is not sent into the diffusion model as plain English. The text encoder turns it into embeddings, which are long lists of numbers that represent meaning.

Those numbers act like a steering signal. The phrase points the model toward puppy-like, soft-looking, friendly visual concepts.

Metaphor

Imagine giving directions to a sculptor who only understands coordinates. You do not say "cute" directly. The encoder turns "cute puppy" into a set of coordinates in meaning space.

This is why prompt wording matters. "Puppy" points somewhere different from "dog". "Cute puppy" points somewhere different from "wet angry dog". "Studio photo of a golden retriever puppy" points somewhere more specific again.
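To make "points somewhere different" concrete, here is a toy sketch with hand-written vectors. Everything in it is invented for illustration: real encoders learn hundreds or thousands of dimensions and usually output one vector per token rather than one per phrase. Cosine similarity is just one common way to compare directions.

```python
import math

# Hand-made 3-number "embeddings". The axes (animal-ness, youth, aggression)
# and all values are invented; real embeddings are learned, not designed.
toy_embeddings = {
    "puppy": [0.9, 0.9, 0.1],
    "dog": [0.9, 0.3, 0.3],
    "wet angry dog": [0.9, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """1.0 means same direction; lower values mean the vectors diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


puppy_vs_dog = cosine_similarity(toy_embeddings["puppy"], toy_embeddings["dog"])
puppy_vs_angry = cosine_similarity(toy_embeddings["puppy"], toy_embeddings["wet angry dog"])

print(f"puppy vs dog: {puppy_vs_dog:.3f}")
print(f"puppy vs wet angry dog: {puppy_vs_angry:.3f}")
```

In this toy space, "puppy" sits closer to "dog" than to "wet angry dog". The diffusion model is steered by those differences in direction, not by the words themselves.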

What Latent Space Means

Latent space is a compressed working space. The model usually does not work directly with every final pixel. Instead, it works with a dense representation called a latent.

A latent is not an image you can look at. It is more like a bundle of visual instructions that still needs to be decoded.

  • The VAE encoder can compress an image into latent space.
  • The VAE decoder can expand a latent back into visible pixels.
  • In this text-to-image workflow, we start with an empty latent and decode only at the end.
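A quick bit of arithmetic shows why working in latent space is cheaper. The sketch below assumes a VAE that shrinks each side by a factor of 8 and uses 16 latent channels. Those numbers are typical of some recent models but are assumptions here, so check the VAE you actually load. Both function names are hypothetical.

```python
def latent_shape(width, height, downscale=8, channels=16):
    """Shape (channels, height, width) of the latent for a given image size."""
    if width % downscale or height % downscale:
        raise ValueError("width and height should be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)


def compression_ratio(width, height, downscale=8, channels=16):
    """How many RGB pixel values each latent value stands in for."""
    c, h, w = latent_shape(width, height, downscale, channels)
    return (width * height * 3) / (c * h * w)


print(latent_shape(1024, 1024))       # (16, 128, 128)
print(compression_ratio(1024, 1024))  # 12.0
```

Under these assumptions, a 1024 x 1024 image is handled as a 128 x 128 grid with 16 numbers per cell, about 12 times fewer values than the raw pixels.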

Metaphor

Pixel space is the finished cake. Latent space is the recipe plus the batter. You cannot serve it yet, but the important structure is already there.

How The Model Works After Training

Once the model is trained, generation starts from noise. At the beginning, the latent image is basically static. It has no puppy yet.

The model then repeatedly asks: given this noisy latent, the current noise level, and the prompt conditioning, what should be removed, predicted, or changed?

Each step makes the latent slightly less random and slightly more image-like. Large structure tends to form early. Details tend to sharpen later.
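The repeated question can be written as a short loop. This is a minimal sketch, not how any particular model is implemented. `toy_denoiser` is a hypothetical stand-in that already "knows" the answer, the three-number target pretends to be the clean latent for the prompt, and the update rule is a plain Euler step over a made-up list of noise levels.

```python
import random


def toy_denoiser(latent, sigma, conditioning):
    """Stand-in for the trained model: returns its guess of the clean latent.

    A real model is a neural network. This hypothetical version simply
    returns the conditioning target and ignores the latent and sigma.
    """
    return list(conditioning)


def generate(conditioning, sigmas, seed):
    rng = random.Random(seed)
    # Start from pure noise, scaled to the highest noise level.
    latent = [rng.gauss(0.0, 1.0) * sigmas[0] for _ in conditioning]
    history = []
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(latent, sigma, conditioning)
        # Direction from the clean guess toward the current noisy latent.
        direction = [(x - d) / sigma for x, d in zip(latent, denoised)]
        # Euler step: move the latent to the next, lower noise level.
        latent = [x + g * (sigma_next - sigma) for x, g in zip(latent, direction)]
        history.append(max(abs(x - t) for x, t in zip(latent, conditioning)))
    return latent, history


target = [0.5, -1.0, 0.25]                # pretend this encodes "a cute puppy"
sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]  # noise levels, high to low
final_latent, errors = generate(target, sigmas, seed=42)
print("distance from target after each step:", [round(e, 3) for e in errors])
```

The printed distances shrink with every step and reach zero at the end. A real model's guesses are imperfect and change from step to step, which is why the choice of steps, sampler, and scheduler affects the result.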

Metaphor

Imagine a foggy window. At first, you see almost nothing. With each wipe, shapes appear: floor, background, small body, head, eyes, fur. The sampler decides how the wiping happens. The model decides what should appear as the fog clears.

What A Sampler Does

The sampler is the procedure that controls the denoising journey. The model predicts useful changes. The sampler decides how to apply those predictions over time.

Same prompt, same model, same seed, different sampler: you can get a different image.

  • Some samplers feel sharper.
  • Some feel smoother.
  • Some handle low step counts better.
  • Some are more stable for specific model families.
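Here is a toy illustration of "same model, same seed, different sampler". It assumes one-dimensional data that is always -1 or +1, for which the ideal denoiser has a closed form. Euler and Heun are two classic integration methods that also appear as sampler names in many tools. The function names, noise levels, and starting value are invented for this sketch.

```python
import math


def denoise(x, sigma):
    """Ideal denoiser for toy 1-D data that is always -1 or +1."""
    return math.tanh(x / sigma ** 2)


def slope(x, sigma):
    """Direction the sampler follows at the current noise level."""
    return (x - denoise(x, sigma)) / sigma


def sample_euler(x, sigmas):
    """One model call per step; follow the slope at the start of each step."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x = x + slope(x, sigma) * (sigma_next - sigma)
    return x


def sample_heun(x, sigmas):
    """Two model calls per step; average the slopes at both ends of the step."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        d = slope(x, sigma)
        x_pred = x + d * (sigma_next - sigma)
        if sigma_next == 0:
            x = x_pred  # final step: plain Euler, the slope is undefined at 0
        else:
            d_next = slope(x_pred, sigma_next)
            x = x + 0.5 * (d + d_next) * (sigma_next - sigma)
    return x


sigmas = [10.0, 5.0, 2.0, 1.0, 0.0]
start = 0.3 * sigmas[0]  # the same "seed" noise for both samplers
euler_result = sample_euler(start, sigmas)
heun_result = sample_heun(start, sigmas)
print("euler:", euler_result)
print("heun: ", heun_result)
```

Both runs use the same denoiser, starting noise, and noise levels, yet they land on different values because each sampler applies the model's predictions differently. With only four steps the gap is easy to see. It usually narrows as the step count grows.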

Metaphor

The model is the engine. The sampler is the driver. The same car can feel different if one driver takes smooth turns and another makes aggressive corrections.

What The Scheduler Does

Figure: Scheduler plot showing sigma values across denoising steps.
A scheduler plot makes the noise schedule visible. The important idea is the curve: how the sampler moves from high noise toward low noise across the steps.

The scheduler controls how noise levels change across the steps. In many workflows, the current noise level is written as sigma: a high sigma means the latent is still mostly noise, and a sigma near zero means it is nearly clean.

High noise early means the model is still deciding big structure. Low noise late means the model is refining details.

Metaphor

The scheduler is the map of the road. It decides whether the journey starts with big fast turns, small careful turns, or a balanced route.

A scheduler that spends more attention at high noise can affect composition and large shapes. A scheduler that gives more attention to low noise can affect texture and finishing detail.
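Two schedules with the same endpoints can spend their steps very differently. The sketch below compares evenly spaced noise levels with the spacing proposed by Karras et al. (2022), which many tools expose as a "karras" scheduler. The sigma range and step count are illustrative assumptions, and this is not necessarily the schedule used in the example workflow.

```python
def linear_sigmas(n, sigma_min, sigma_max):
    """Evenly spaced noise levels from sigma_max down to sigma_min."""
    step = (sigma_max - sigma_min) / (n - 1)
    return [sigma_max - i * step for i in range(n)]


def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    """Karras et al. spacing: interpolate evenly in sigma ** (1 / rho)."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]


def count_below(sigmas, threshold):
    """How many noise levels fall in the low-noise (detail) range."""
    return sum(1 for s in sigmas if s < threshold)


linear = linear_sigmas(10, sigma_min=0.03, sigma_max=14.6)
karras = karras_sigmas(10, sigma_min=0.03, sigma_max=14.6)

print("linear:", [round(s, 2) for s in linear])
print("karras:", [round(s, 2) for s in karras])
print("levels below 1.0:", count_below(linear, 1.0), "vs", count_below(karras, 1.0))
```

Both lists start and end at the same noise levels, but the Karras spacing drops quickly through high noise and places more of its steps at low noise, where fine detail is settled.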

Sampler Settings And What They Change

| Setting | What it controls | What changes in practice |
| --- | --- | --- |
| Seed | The starting noise. | Same seed helps reproduce a result. New seed explores a new image. |
| Steps | The number of denoising passes. | More can refine the image, but too many may waste time or overwork it. |
| Guidance | How strongly the model follows the prompt. | Too low can drift. Too high can look harsh or forced. |
| Sampler | The denoising method. | Changes sharpness, stability, texture, and speed. |
| Scheduler | How noise levels are spaced over the steps. | Changes how early structure forms and how late details settle. |
| Denoise | How much freedom the model has to change the latent. | 1.0 means full generation. Lower values preserve more input structure. |
| Resolution | The shape and size of the canvas. | Affects framing, memory use, render time, and sometimes composition. |

Seed

The seed controls the starting noise. If you keep the same seed and the same settings, you should usually get the same image. If you change the seed, the model starts from a different noise pattern.
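The idea is easy to check with an ordinary random number generator. This sketch uses Python's standard `random` module as a stand-in for the generator an image tool would use. `starting_noise` is a hypothetical helper, and real pipelines can still differ slightly across hardware or library versions even with a fixed seed.

```python
import random


def starting_noise(seed, size):
    """Draw `size` Gaussian values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


a = starting_noise(seed=1234, size=4)
b = starting_noise(seed=1234, size=4)
c = starting_noise(seed=1235, size=4)

print(a == b)  # True: same seed, same starting noise
print(a == c)  # False: a different seed gives a different pattern
```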

Metaphor

The seed is the block of marble. Same sculptor, same instructions, different block, different final statue.

Steps

Steps control how many denoising passes happen. Too few steps can look unfinished. Too many steps can waste time or push the image into an overworked look. Turbo models are often designed to work well with fewer steps.

Metaphor

Steps are brush passes. One pass is rough. Ten may be enough. A hundred can start ruining the painting.

Guidance

Guidance controls how strongly the model follows the prompt. Higher is not automatically better. Too low can drift. Too high can make the image look stiff, crunchy, or over-forced.
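One widely used mechanism behind this setting is classifier-free guidance: the model makes one prediction with the prompt and one without, and the guidance scale stretches the difference between them. The article does not say which mechanism its workflow uses, so treat this as a sketch of the general idea. The function name and the numbers are invented.

```python
def apply_guidance(unconditional, conditional, scale):
    """Classifier-free guidance: push the prediction along (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(unconditional, conditional)]


uncond = [0.0, 0.2]  # prediction with an empty prompt (made-up numbers)
cond = [1.0, 0.4]    # prediction with "A cute puppy" (made-up numbers)

off = apply_guidance(uncond, cond, scale=0.0)     # ignores the prompt
plain = apply_guidance(uncond, cond, scale=1.0)   # the prompted prediction as-is
strong = apply_guidance(uncond, cond, scale=7.0)  # well past the prompted prediction

print(off, plain, strong)
```

At a scale of 1.0 the result is simply the prompted prediction. Larger values overshoot it, which matches the stiff, over-forced look described above.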

Metaphor

Guidance is how loudly you give instructions. Too quiet, and the model improvises. Too loud, and it starts gripping the pencil too hard.

Denoise

In pure text-to-image workflows, denoise is often 1.0. That means full generation from noise. In image-to-image workflows, lower denoise values preserve more of the starting image.
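A rough way to picture the denoise setting is that it picks how much of the noise schedule actually runs. The sketch below is a simplified model of that idea, not the exact logic of any tool, and real implementations may compute the shortened schedule differently. `sigmas_for_denoise` and the schedule values are hypothetical.

```python
def sigmas_for_denoise(full_sigmas, denoise):
    """Keep only the tail of a schedule, as a rough model of the denoise setting.

    full_sigmas runs from high noise down to 0. denoise=1.0 keeps everything.
    """
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    total_steps = len(full_sigmas) - 1
    steps_to_run = max(1, round(total_steps * denoise))
    return full_sigmas[-(steps_to_run + 1):]


full = [10.0, 7.0, 5.0, 3.0, 2.0, 1.2, 0.7, 0.4, 0.2, 0.1, 0.0]

full_run = sigmas_for_denoise(full, 1.0)
light_touch = sigmas_for_denoise(full, 0.2)

print(full_run)     # the whole schedule, starting at sigma 10.0
print(light_touch)  # [0.2, 0.1, 0.0]
```

With a low denoise value, the input image gets only a little noise added and the model runs just the last low-noise steps, so composition survives and mostly texture changes.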

Metaphor

Denoise is renovation strength. At 0.2, you repaint the room. At 1.0, you demolish the house and rebuild it.

Why Z-Image-Turbo Is A Useful Baseline

Z-Image-Turbo is useful for this kind of explanation because it is a fast image generation model built from the Z-Image family through few-step distillation. Because each image needs only a few steps, it is quick to rerun the workflow and compare sampler and scheduler choices. The point is not that every model behaves exactly like this one. The point is that the workflow exposes the major pieces clearly.

Most modern image generation systems still revolve around the same ideas:

  • Text becomes conditioning.
  • Noise becomes structure.
  • The sampler controls the denoising path.
  • The decoder turns latent information into pixels.

ComfyUI Compared To Asking ChatGPT For An Image

When you ask ChatGPT to generate an image, the experience is much simpler. You type something like "Create an image of a cute puppy sitting on a wooden floor." Then an image appears.

Behind the scenes, many of the same broad ideas can still apply: prompt interpretation, image generation, safety checks, output sizing, model defaults, and sometimes prompt refinement.

The difference is control.

ComfyUI shows the controls

  • Model
  • Text encoder
  • VAE
  • Seed
  • Resolution
  • Sampler and scheduler
  • Steps, guidance, and denoise

ChatGPT hides the controls

  • Cleaner user experience
  • Less setup
  • Fewer technical choices
  • More conversational iteration
  • Less direct access to sampler-level behavior

Metaphor

ChatGPT is like ordering a finished meal from a good kitchen. ComfyUI is like standing inside the kitchen and adjusting the oven, pan, ingredients, timing, and plating yourself.

A Simple Way To Learn

If you are learning AI image generation, do not only change the prompt. Change one setting at a time.

  1. Keep the same prompt.
  2. Keep the same seed.
  3. Keep the same resolution.
  4. Change only the steps.
  5. Then change only the scheduler.
  6. Then change only the sampler.
  7. Then change only guidance.

This turns image generation from guessing into testing. The fixed puppy seed is useful because it lets you see what each setting does without the entire image changing randomly.
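The same discipline can be written down as a small test plan. This sketch only builds the list of settings to try and does not call any image tool. Every value in `baseline` and `variations` is a placeholder, so substitute the options your own workflow offers.

```python
# Placeholder settings; use the values from your own workflow.
baseline = {
    "prompt": "A cute puppy",
    "seed": 42,
    "width": 1024,
    "height": 1024,
    "steps": 8,
    "scheduler": "simple",
    "sampler": "euler",
    "guidance": 1.0,
}

# Settings to explore, in the order suggested above.
variations = {
    "steps": [4, 8, 12],
    "scheduler": ["simple", "karras"],
    "sampler": ["euler", "heun"],
    "guidance": [1.0, 2.0, 4.0],
}


def one_at_a_time(baseline, variations):
    """Yield (changed_key, settings) pairs that differ from baseline in one key."""
    for key, values in variations.items():
        for value in values:
            if value == baseline[key]:
                continue  # the baseline itself is already covered
            yield key, {**baseline, key: value}


runs = list(one_at_a_time(baseline, variations))
for changed, settings in runs:
    print(f"{changed} -> {settings[changed]}")
```

Each run changes exactly one value relative to the baseline, so any difference in the output image can be attributed to that setting.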

Final Thought

AI image generation is often presented as a mystery, but the underlying process is mechanical and learnable.

The model does not simply draw a puppy. It uses trained visual patterns, language conditioning, latent space, noise schedules, sampler behavior, and decoding to arrive at an image.

Once you understand that, prompts become only one part of the craft. The real control comes from understanding the machine underneath.
