From Prompt to Puppy: How AI Image Generation Works
A look under the hood of image models, using a simple Z-Image-Turbo workflow in ComfyUI.
The Short Version
When you ask an AI image system for a cute puppy, it can feel like the model simply reads the sentence and draws a picture. That is not really what happens.
A modern image model usually works like this:
- A text encoder turns your prompt into numerical meaning.
- The model starts with noise in a compressed latent space.
- A sampler decides how the model moves from noise toward structure.
- The diffusion model predicts how to clean the image step by step.
- A VAE decoder turns the final latent result into visible pixels.
Our Example: A Cute Puppy In ComfyUI
In the ComfyUI workflow, we generate one image from a tiny prompt: A cute puppy. The output looks simple, but the workflow shows the whole chain.
- Load Diffusion Model: loads Z-Image-Turbo, the image model used for generation.
- Load CLIP or text encoder: loads the model that converts text into conditioning.
- Load VAE: loads the model that converts between latent space and visible images.
- Empty latent image: creates the starting latent canvas at the chosen resolution.
- Text encode prompt: turns the prompt into conditioning data.
- KSampler: runs the denoising process.
- VAE Decode: converts the final latent into pixels.
- Save Image: writes the final puppy image.
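The node list above can be read as a small dependency graph. The sketch below is a plain-Python illustration of that idea, not ComfyUI's actual workflow format: the node names are the labels used in this article, and the connections are an assumed reading of the workflow. It orders the nodes so that each one runs only after its inputs are ready.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose output it needs.
# Names follow the labels in this article; the edges are an assumption
# made for illustration, not an export from ComfyUI.
WORKFLOW = {
    "Load Diffusion Model": set(),
    "Load text encoder": set(),
    "Load VAE": set(),
    "Empty latent image": set(),
    "Text encode prompt": {"Load text encoder"},
    "KSampler": {"Load Diffusion Model", "Text encode prompt", "Empty latent image"},
    "VAE Decode": {"KSampler", "Load VAE"},
    "Save Image": {"VAE Decode"},
}

def execution_order(graph):
    """Return one valid order in which every node runs after its inputs."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(WORKFLOW)
for position, node in enumerate(order, start=1):
    print(position, node)
```

Several orders are valid because the loader nodes do not depend on each other. What is fixed is that the sampler waits for the model, the conditioning, and the latent, and that decoding waits for the sampler.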
ComfyUI is useful because it makes the machine visible. A chat interface hides most of this. That is convenient, but less educational.
How An Image Model Is Trained
Before a model can generate a puppy, it has to learn what visual patterns tend to match words like puppy, cute, fur, paws, eyes, floor, photo, and soft lighting.
Training usually involves huge collections of images and captions. The model is shown images, text, and noise. In classic diffusion training, it learns a task that sounds strange at first: given a noisy version of an image, predict the noise that was added.
Newer systems can use related targets, such as velocity or flow, rather than literally predicting added noise at every step. The practical idea is similar: the model learns how to move a corrupted visual state toward a cleaner one that matches the conditioning.
That training game becomes powerful. If a model gets very good at recognizing which direction the noisy state should move, it can later start from pure noise and walk backward toward a clean image.
For text-to-image models, the caption matters too. The model is not just learning how to clean noise. It is learning how to clean noise in a direction that matches language.
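The noise-prediction game described above can be sketched in a few lines. This is a minimal toy in the classic noise-prediction style, not the training code of any particular model: the four-number "image", the `alpha_bar` mixing value, and the function names are assumptions chosen for illustration, and caption conditioning is left out.

```python
import math
import random

def add_noise(clean, noise, alpha_bar):
    """Mix a clean signal with noise.
    alpha_bar near 1 keeps the signal mostly clean; near 0 it is mostly noise."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [keep * c + mix * n for c, n in zip(clean, noise)]

def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between the predicted and the actual noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

rng = random.Random(0)
alpha_bar = 0.3
clean = [0.2, -0.5, 0.9, 0.1]                 # stand-in for an image or latent
noise = [rng.gauss(0.0, 1.0) for _ in clean]  # the noise that gets added
noisy = add_noise(clean, noise, alpha_bar)

# A perfect model would return exactly the noise that was added.
perfect_loss = noise_prediction_loss(noise, noise)
# A model that always guesses "no noise" is penalised.
lazy_loss = noise_prediction_loss([0.0] * len(noise), noise)

# Knowing the noise is enough to undo the corruption,
# which is why predicting it is a useful skill.
recovered = [
    (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
    for x, n in zip(noisy, noise)
]
```

Training repeats this over many images, noise samples, and noise levels, nudging the model's weights so that its loss moves from the lazy case toward the perfect one.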
What The Text Encoder Does
A prompt like A cute puppy is not sent into the diffusion model as plain English. The text encoder turns it into embeddings, which are long lists of numbers that represent meaning.
Those numbers act like a steering signal. The phrase points the model toward puppy-like, soft-looking, friendly visual concepts.
This is why prompt wording matters. Puppy points somewhere different from dog. Cute puppy points somewhere different from wet angry dog. Studio photo of a golden retriever puppy points somewhere more specific again.
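One way to picture "points somewhere different" is to compare directions of vectors. The sketch below uses hand-written toy vectors, not output from a real text encoder, and the four axes are hypothetical. Real encoders produce much longer vectors, usually one per token, with no human-readable axes.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated, negative means opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for illustration only.
# Hypothetical axes: [dog-ness, youth, friendliness, wetness]
TOY_EMBEDDINGS = {
    "puppy": [0.9, 0.9, 0.6, 0.0],
    "cute puppy": [0.9, 0.9, 0.9, 0.0],
    "dog": [0.9, 0.2, 0.4, 0.0],
    "wet angry dog": [0.9, 0.2, -0.8, 0.9],
}

sim_cute = cosine_similarity(TOY_EMBEDDINGS["puppy"], TOY_EMBEDDINGS["cute puppy"])
sim_dog = cosine_similarity(TOY_EMBEDDINGS["puppy"], TOY_EMBEDDINGS["dog"])
sim_angry = cosine_similarity(TOY_EMBEDDINGS["puppy"], TOY_EMBEDDINGS["wet angry dog"])
```

In this toy setup, "cute puppy" stays close to "puppy", "dog" drifts a little, and "wet angry dog" points in a clearly different direction.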
What Latent Space Means
Latent space is a compressed working space. The model usually does not work directly with every final pixel. Instead, it works with a dense representation called a latent.
A latent is not an image you can look at. It is more like a bundle of visual instructions that still needs to be decoded.
- The VAE encoder can compress an image into latent space.
- The VAE decoder can expand a latent back into visible pixels.
- In this text-to-image workflow, we start with an empty latent, let the sampler fill it with seeded noise, and decode only at the end.
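A little arithmetic shows why working in latent space is cheaper. The downscale factor of 8 and the 16 channels below are illustrative assumptions; the real numbers depend on the VAE a given model uses.

```python
def latent_shape(width, height, downscale=8, channels=16):
    """Shape of the latent grid for a given pixel resolution.
    downscale and channels are illustrative assumptions; real VAEs differ."""
    if width % downscale or height % downscale:
        raise ValueError("resolution should be divisible by the downscale factor")
    return (channels, height // downscale, width // downscale)

def count_values(shape):
    """Total number of numbers stored in a tensor of this shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_values = count_values((3, 1024, 1024))             # RGB image
latent_values = count_values(latent_shape(1024, 1024))   # compressed working space
compression = pixel_values / latent_values
```

Under these assumptions a 1024 by 1024 image becomes a 128 by 128 grid, and the model handles roughly a twelfth as many numbers per step as it would in pixel space.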
How The Model Works After Training
Once the model is trained, generation starts from noise. At the beginning, the latent is essentially random static, like an untuned TV channel. It has no puppy yet.
The model then repeatedly asks: given this noisy latent, the current noise level, and the prompt conditioning, what should be removed, predicted, or changed?
Each step makes the latent slightly less random and slightly more image-like. Large structure tends to form early. Details tend to sharpen later.
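The loop described above can be sketched with a pretend model. This is a toy under stated assumptions: `toy_model` simply knows the answer, whereas a real network uses learned weights and the prompt conditioning, and the blending rule is one simple choice rather than the update any specific sampler uses.

```python
import random

TARGET = [0.8, -0.3, 0.5, 0.1]  # stand-in for "the puppy latent"

def toy_model(latent, noise_level):
    """Pretend network: returns its guess of the clean latent.
    It ignores its inputs and simply knows the answer."""
    return TARGET

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def generate(seed, steps):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in TARGET]   # pure noise
    history = [distance(latent, TARGET)]
    for step in range(steps):
        noise_level = 1.0 - step / steps              # starts at 1.0
        next_level = 1.0 - (step + 1) / steps         # ends at 0.0
        guess = toy_model(latent, noise_level)
        # Keep a shrinking share of the current latent and blend in the guess.
        keep = next_level / noise_level
        latent = [keep * x + (1.0 - keep) * g for x, g in zip(latent, guess)]
        history.append(distance(latent, TARGET))
    return latent, history

final_latent, history = generate(seed=42, steps=8)
```

The `history` list records how far the latent is from the target after each step. It shrinks every step and reaches zero at the end, which is the toy version of noise turning into structure.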
What A Sampler Does
The sampler is the procedure that controls the denoising journey. The model predicts useful changes. The sampler decides how to apply those predictions over time.
Same prompt, same model, same seed, different sampler: you can get a different image.
- Some samplers feel sharper.
- Some feel smoother.
- Some handle low step counts better.
- Some are more stable for specific model families.
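To see why the sampler alone can change the result, here is a toy comparison of two well-known update rules, Euler and Heun, on a one-number "image". Everything else is held fixed: the same starting value, the same noise levels, and the same denoiser. The denoiser is a hypothetical stand-in for a trained network, chosen because the exact answer is known and the two samplers can be scored against it.

```python
def toy_denoiser(x, sigma, data_std=1.0):
    """Ideal denoiser when the 'data' is zero-mean Gaussian with std data_std.
    Stands in for a trained network."""
    return x * data_std**2 / (data_std**2 + sigma**2)

def derivative(x, sigma):
    """Direction of travel implied by the denoiser's prediction."""
    return (x - toy_denoiser(x, sigma)) / sigma

def sample_euler(x, sigmas):
    """One model call per step: follow the current direction."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x = x + derivative(x, sigma) * (sigma_next - sigma)
    return x

def sample_heun(x, sigmas):
    """Two model calls per step: probe ahead, then average the directions."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        step = sigma_next - sigma
        d1 = derivative(x, sigma)
        x_probe = x + d1 * step                 # tentative Euler step
        d2 = derivative(x_probe, sigma_next)    # look again at the new point
        x = x + 0.5 * (d1 + d2) * step
    return x

# The schedule stops above 0 so the toy derivative stays defined.
SIGMAS = [10.0, 8.0, 6.0, 4.0, 2.0, 1.0]
START = 10.0  # same "seed" for both samplers

# Closed-form answer for this toy problem, used only to score the samplers.
EXACT = START * (1.0 + SIGMAS[-1] ** 2) ** 0.5 / (1.0 + SIGMAS[0] ** 2) ** 0.5

euler_result = sample_euler(START, SIGMAS)
heun_result = sample_heun(START, SIGMAS)
```

The two results differ even though nothing but the update rule changed. In this toy, Heun lands closer to the exact answer at the cost of roughly twice as many model calls, which is the kind of quality versus speed trade-off that sampler choice involves.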
What The Scheduler Does

The scheduler controls how noise levels change across the steps. In many workflows, sigma is a way of describing the current noise level.
High noise early means the model is still deciding big structure. Low noise late means the model is refining details.
A scheduler that spends more of its steps at high noise levels can affect composition and large shapes. A scheduler that spends more of its steps at low noise levels can affect texture and finishing detail.
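Two spacings make the idea concrete. The sketch below compares an evenly spaced schedule with the spacing proposed by Karras et al. (2022), which is even in `sigma ** (1 / rho)` rather than in sigma itself. The article does not name either schedule; they are used here as examples, and the sigma range of 10.0 to 0.1 is an illustrative assumption.

```python
def linear_sigmas(steps, sigma_max=10.0, sigma_min=0.1):
    """Evenly spaced noise levels from sigma_max down to sigma_min."""
    return [sigma_max + (sigma_min - sigma_max) * i / (steps - 1) for i in range(steps)]

def karras_sigmas(steps, sigma_max=10.0, sigma_min=0.1, rho=7.0):
    """Spacing from Karras et al. (2022): evenly spaced in sigma ** (1 / rho)."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    return [(hi + (lo - hi) * i / (steps - 1)) ** rho for i in range(steps)]

linear = linear_sigmas(9)
karras = karras_sigmas(9)

# How many of the nine noise levels sit in the low-noise region below sigma = 1?
linear_low = sum(1 for s in linear if s < 1.0)
karras_low = sum(1 for s in karras if s < 1.0)
```

Both schedules start and end at the same noise levels, but the second one drops quickly and places more of its steps at low noise, where details are refined.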
Sampler Settings And What They Change
| Setting | What it controls | What changes in practice |
|---|---|---|
| Seed | The starting noise. | Same seed helps reproduce a result. New seed explores a new image. |
| Steps | The number of denoising passes. | More can refine the image, but too many may waste time or overwork it. |
| Guidance | How strongly the model follows the prompt. | Too low can drift. Too high can look harsh or forced. |
| Sampler | The denoising method. | Changes sharpness, stability, texture, and speed. |
| Scheduler | How noise levels are spaced over the steps. | Changes how early structure forms and how late details settle. |
| Denoise | How much freedom the model has to change the latent. | 1.0 means full generation. Lower values preserve more input structure. |
| Resolution | The shape and size of the canvas. | Affects framing, memory use, render time, and sometimes composition. |
Seed
The seed controls the starting noise. If you keep the same seed and the same settings, you should usually get the same image. If you change the seed, the model starts from a different noise pattern.
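A seeded random generator shows the idea. This sketch uses Python's standard library rather than the GPU generators real tools use, and the four-number noise list is a stand-in for a full latent.

```python
import random

def starting_noise(seed, size=4):
    """Draw the initial noise from a generator fixed by the seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

same_a = starting_noise(seed=123)
same_b = starting_noise(seed=123)
different = starting_noise(seed=124)
```

Here `same_a` and `same_b` are identical, while `different` is a new pattern. In real workflows the match is only "usually" exact because hardware, library versions, and other settings can also influence the result.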
Steps
Steps control how many denoising passes happen. Too few steps can look unfinished. Too many steps can waste time or push the image into an overworked look. Turbo models are often designed to work well with fewer steps.
Guidance
Guidance controls how strongly the model follows the prompt. Higher is not automatically better. Too low can drift. Too high can make the image look stiff, crunchy, or over-forced.
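One widely used mechanism behind a guidance setting is classifier-free guidance: the model makes one prediction with the prompt and one without, and the scale decides how far to push along the difference. The article does not say which mechanism this workflow uses, so treat the sketch below as an illustration of the general idea with made-up numbers.

```python
def apply_guidance(unconditional, conditional, scale):
    """Classifier-free guidance: start from the no-prompt prediction and
    move toward (or past) the prompt-conditioned prediction."""
    return [u + scale * (c - u) for u, c in zip(unconditional, conditional)]

# Made-up model outputs for illustration.
uncond = [0.10, -0.20, 0.05]   # prediction without the prompt
cond = [0.30, -0.10, 0.00]     # prediction with the prompt

ignore_prompt = apply_guidance(uncond, cond, scale=0.0)
follow_prompt = apply_guidance(uncond, cond, scale=1.0)
strong_push = apply_guidance(uncond, cond, scale=7.0)
```

A scale of 1.0 reproduces the prompt-conditioned prediction. Larger values extrapolate beyond it, which is where the stiff or over-forced look can come from.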
Denoise
In pure text-to-image workflows, denoise is often 1.0. That means full generation from noise. In image-to-image workflows, lower denoise values preserve more of the starting image.
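One way to picture denoise is as a choice of where to enter the noise schedule. The function below is a hypothetical convention written for illustration; real tools compute the partial schedule in their own ways, and the sigma values are made up.

```python
def partial_schedule(sigmas, denoise):
    """Keep only the tail of a noise schedule.
    denoise=1.0 keeps everything (start from full noise);
    smaller values start from a lower noise level.
    A simple illustrative convention, not any tool's exact rule."""
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    transitions = len(sigmas) - 1
    keep = max(1, round(transitions * denoise))
    return sigmas[-(keep + 1):]

FULL = [10.0, 6.0, 3.0, 1.5, 0.5, 0.0]  # made-up noise levels

full_generation = partial_schedule(FULL, denoise=1.0)
light_touch = partial_schedule(FULL, denoise=0.4)
```

With `denoise=0.4` the process enters at a noise level of 1.5 instead of 10.0, so the starting image is only lightly disturbed and most of its structure survives.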
Why Z-Image-Turbo Is A Useful Baseline
Z-Image-Turbo is useful for this kind of explanation because it is a fast image generation model built from the Z-Image family through few-step distillation. Because each image needs only a few steps, sampler and scheduler choices are quick to test. The point is not that every model behaves exactly like this one. The point is that the workflow exposes the major pieces clearly.
Most modern image generation systems still revolve around the same ideas:
- Text becomes conditioning.
- Noise becomes structure.
- The sampler controls the denoising path.
- The decoder turns latent information into pixels.
ComfyUI Compared To Asking ChatGPT For An Image
When you ask ChatGPT to generate an image, the experience is much simpler. You type something like "Create an image of a cute puppy sitting on a wooden floor." Then an image appears.
Behind the scenes, many of the same broad ideas can still apply: prompt interpretation, image generation, safety checks, output sizing, model defaults, and sometimes prompt refinement.
The difference is control.
ComfyUI shows the controls
- Model
- Text encoder
- VAE
- Seed
- Resolution
- Sampler and scheduler
- Steps, guidance, and denoise
ChatGPT hides the controls
- Cleaner user experience
- Less setup
- Fewer technical choices
- More conversational iteration
- Less direct access to sampler-level behavior
A Simple Way To Learn
If you are learning AI image generation, do not only change the prompt. Change one setting at a time.
- Keep the same prompt.
- Keep the same seed.
- Keep the same resolution.
- Change only the steps.
- Then change only the scheduler.
- Then change only the sampler.
- Then change only guidance.
This turns image generation from guessing into testing. The fixed puppy seed is useful because it lets you see what each setting does without the entire image changing randomly.
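The one-change-at-a-time routine can be written down as a small test plan. All values below are placeholders: the seed, the resolution, and names such as `sampler_a` are hypothetical and should be replaced with whatever your workflow offers.

```python
# Placeholder baseline; every value here is an assumption for illustration.
BASELINE = {
    "prompt": "A cute puppy",
    "seed": 42,
    "width": 1024,
    "height": 1024,
    "steps": 8,
    "scheduler": "scheduler_a",
    "sampler": "sampler_a",
    "guidance": 1.0,
}

# Alternative values to try, one setting at a time.
VARIATIONS = {
    "steps": [4, 12],
    "scheduler": ["scheduler_b"],
    "sampler": ["sampler_b"],
    "guidance": [2.0],
}

def one_at_a_time(baseline, variations):
    """Build runs that each differ from the baseline in exactly one setting."""
    runs = [("baseline", dict(baseline))]
    for setting, values in variations.items():
        for value in values:
            run = dict(baseline)
            run[setting] = value
            runs.append((f"{setting}={value}", run))
    return runs

runs = one_at_a_time(BASELINE, VARIATIONS)
for label, settings in runs:
    print(label, settings["steps"], settings["scheduler"],
          settings["sampler"], settings["guidance"])
```

Because the prompt, seed, and resolution never change, any visible difference between a run and the baseline can be attributed to the single setting named in its label.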
Final Thought
AI image generation is often presented as a mystery, but the underlying process is mechanical and learnable.
The model does not simply draw a puppy. It uses trained visual patterns, language conditioning, latent space, noise schedules, sampler behavior, and decoding to arrive at an image.
Once you understand that, prompts become only one part of the craft. The real control comes from understanding the machine underneath.