How an LLM Finds a Lower-Loss Solution
A matrix-first look at logits, cross-entropy, gradients, AdamW, and where learning-rate schedulers actually enter the training loop.
The Short Version
An LLM does not calculate one perfect global optimum in a clean closed-form equation. Training is too large, too noisy, and too non-convex for that. Instead, the model repeatedly moves toward lower loss on many batches of text.
The loop is simple in shape:
1. Run tokens through matrix operations.
2. Convert hidden states into next-token probabilities.
3. Measure error with cross-entropy loss.
4. Backpropagate gradients through the matrices.
5. Let an optimizer such as AdamW choose the parameter update.
6. Let a learning-rate scheduler scale the update over time.
What Optimum Means In Practice
In classical optimization, an optimum is the parameter setting that minimizes an objective function. For an LLM, the parameters are all learned weights. We can gather them into one huge vector:
\theta = [W_E, W_Q, W_K, W_V, W_O, W_1, W_2, W_U, b, ...]
Goal:
\theta^* = \arg\min_\theta J(\theta)
The objective J is usually the expected next-token loss over the training data. We do not evaluate all possible text at once. We sample batches, calculate a noisy estimate of the loss, and update from that.
Dataset objective:
J(\theta) = E_{(x,y) ~ data} [ L(f_\theta(x), y) ]
Mini-batch estimate:
\hat{J}_B(\theta) = (1 / B) \sum_{i=1}^{B} L(f_\theta(x_i), y_i)
That is why people often say training is stochastic. Each batch is a small window into the whole data distribution. The optimizer is always steering from partial evidence.
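A minimal sketch of the difference between the dataset objective and a mini-batch estimate. The per-example losses here are made-up numbers standing in for L(f_\theta(x_i), y_i), and the names `full_objective` and `batch_estimate` are hypothetical helpers, not part of any library.

```python
import random

# Hypothetical per-example losses standing in for L(f_theta(x_i), y_i).
# In real training these come from a forward pass; here they are made up.
rng = random.Random(0)
example_losses = [rng.uniform(1.0, 5.0) for _ in range(10_000)]

def full_objective(losses):
    """Average loss over the whole dataset, a stand-in for J(theta)."""
    return sum(losses) / len(losses)

def batch_estimate(losses, batch_size, rng):
    """Average loss over a random mini-batch, the estimate J_hat_B(theta)."""
    batch = rng.sample(losses, batch_size)
    return sum(batch) / batch_size

J = full_objective(example_losses)
estimates = [batch_estimate(example_losses, 32, rng) for _ in range(5)]

print(f"full objective: {J:.3f}")
print("batch estimates:", [f"{e:.3f}" for e in estimates])
```

Each batch estimate lands near the full average but not on it, which is the noise the optimizer has to steer through.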
From Tokens To Matrices
Start with a tiny sequence of tokens. The tokenizer turns text into token IDs. The embedding table turns each token ID into a vector.
Token IDs:
[t_1, t_2, t_3, ..., t_T]
Embedding table:
E \in R^{V x d_model}
Embedded sequence:
X = E[token_ids]
X \in R^{T x d_model}
Here, V is the vocabulary size, T is the sequence length, and d_model is the width of each token representation.
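The lookup X = E[token_ids] is just row selection. Here is a small sketch with toy sizes and made-up embedding values, using plain Python lists so it stays dependency-free; a real model would use an array library and learned values.

```python
# Toy sizes, chosen only for illustration.
V, d_model = 5, 3

# Embedding table E with shape (V, d_model): one row per vocabulary entry.
# The values are arbitrary placeholders, not learned weights.
E = [[0.1 * (row + 1) + 0.01 * col for col in range(d_model)] for row in range(V)]

# Hypothetical token IDs produced by a tokenizer.
token_ids = [2, 0, 4, 2]

# X = E[token_ids]: pick one row of E per token, giving shape (T, d_model).
X = [E[t] for t in token_ids]

T = len(token_ids)
print(f"X has shape ({T}, {d_model})")
for token_id, row in zip(token_ids, X):
    print(token_id, row)
```

Repeated token IDs map to the same row, so positions 0 and 3 start from identical vectors before any position information or attention is applied.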
The Core Matrix Step: Attention
A transformer block uses learned matrices to create queries, keys, and values. These are not separate databases. They are different projections of the same token representations.
Input:
X \in R^{T x d_model}
Learned projection matrices:
W_Q \in R^{d_model x d_k}
W_K \in R^{d_model x d_k}
W_V \in R^{d_model x d_v}
Projected matrices:
Q = X W_Q
K = X W_K
V = X W_V
Attention compares every query with every key. The result is a square score matrix. For a decoder-style language model, a causal mask is added before softmax so each position can only use earlier positions and itself.
Scores:
S = (Q K^T) / sqrt(d_k) + M_causal
S \in R^{T x T}
Causal mask:
(M_causal)_{i,j} =
0 if j <= i
-inf if j > i
Attention weights:
A = softmax(S)
Attention output:
H = A V
H \in R^{T x d_v}
The Transformer paper introduced this scaled dot-product attention form. The scaling by sqrt(d_k) keeps the dot products from growing with d_k, which would otherwise push the softmax into regions where its gradients become very small.
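The attention formulas above can be sketched for a single head in plain Python. The input and projection values are toy numbers chosen for illustration, and `causal_attention` is a hypothetical helper name; this is a sketch of the math, not an efficient implementation.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V_mat = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_Q[0])
    T = len(X)
    raw = matmul(Q, transpose(K))
    # S = Q K^T / sqrt(d_k) + M_causal, with -inf wherever j > i.
    S = [[raw[i][j] / math.sqrt(d_k) if j <= i else -math.inf for j in range(T)]
         for i in range(T)]
    A = softmax_rows(S)
    H = matmul(A, V_mat)
    return A, H

# Toy inputs: T = 3 tokens, d_model = d_k = d_v = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[1.0, 2.0], [3.0, 4.0]]

A, H = causal_attention(X, W_Q, W_K, W_V)
for row in A:
    print([round(a, 3) for a in row])
```

The printed attention weights are lower triangular: the first position can only attend to itself, and each later row spreads its weight over itself and earlier positions.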
From Hidden State To Loss
After several transformer blocks, the model has a hidden vector for each token position. To predict the next token, it maps each valid hidden vector into vocabulary-sized logits.
Final hidden states:
H \in R^{B x T x d_model}
Unembedding matrix:
W_U \in R^{d_model x V}
Flatten valid prediction positions:
H_flat \in R^{M x d_model}
Logits for valid positions:
Z = H_flat W_U + b
Z \in R^{M x V}
Probabilities:
P = softmax(Z)
A logit is an unnormalized score for a token. Softmax converts those scores into probabilities. Here, M is the number of valid next-token prediction positions after padding and ignored positions are removed.
Single valid position:
L_m = -log P[m, y_m]
Average loss:
L = -(1 / M) \sum_{m=1}^{M} log P[m, y_m]
Equivalent one-hot form:
L = -(1 / M) \sum_{m=1}^{M} \sum_{v=1}^{V} Y[m,v] log P[m,v]
Lower loss means the model assigned higher probability to the true next tokens in the training batch. The whole training loop exists to change the matrices so this number tends to go down.
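As a small check on the loss formula, here is a sketch that computes the average cross-entropy directly from logits. The logits and targets are toy values, and computing log-softmax in one step (rather than softmax followed by log) is a common numerical-stability choice assumed here, not something the formulas above require.

```python
import math

def log_softmax(z):
    """Log of softmax(z), computed with the max subtracted for stability."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def cross_entropy(Z, targets):
    """Average of -log P[m, y_m] over the M valid positions."""
    M = len(Z)
    return -sum(log_softmax(Z[m])[targets[m]] for m in range(M)) / M

# Toy logits for M = 2 positions over a vocabulary of V = 3 tokens.
Z = [[2.0, 0.5, -1.0],
     [0.0, 0.0, 0.0]]
targets = [0, 2]

print(f"average loss: {cross_entropy(Z, targets):.4f}")
print(f"uniform guess over 3 tokens costs log(3) = {math.log(3):.4f}")
```

A position with all-equal logits contributes exactly log(V) to the sum, which is a handy reference point: an untrained model's loss usually starts near the log of the vocabulary size.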
A Concrete Gradient Calculation
The cleanest place to see the gradient is the final vocabulary projection. For the flattened valid prediction positions, let Y be the one-hot matrix of correct tokens. A useful cross-entropy plus softmax result is:
Softmax probabilities:
P = softmax(Z)
One-hot labels:
Y \in R^{M x V}
Gradient with respect to logits:
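dL / dZ = (P - Y) / M
If the correct token should have probability 1 but the model gives it 0.30, that token's gradient entry is negative (0.30 - 1 = -0.70, before the 1 / M scaling), so a gradient-descent step raises its logit. If the model gives too much probability to a wrong token, that token's entry is positive and the step lowers its logit. The gradient says how the loss changes with each logit, and the update moves the logits in the opposite direction.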
Now apply the matrix derivative for the output matrix:
Logits:
Z = H_flat W_U + b
Let:
G_Z = dL / dZ
Then:
dL / dW_U = H_flat^T G_Z
dL / db = sum_rows(G_Z)
dL / dH_flat = G_Z W_U^T
This is the key mechanical idea. The model output was made by matrix multiplication, so the correction is also expressed through matrix multiplication. Backpropagation keeps applying the chain rule backward through every matrix, softmax, normalization, residual connection, and feed-forward layer.
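One way to build trust in these output-layer gradients is to compare them against a finite-difference estimate. The sketch below does that with toy values; `output_layer_grads` and `numeric_dW` are hypothetical helper names, and the step size 1e-6 is an arbitrary choice for the check.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(Z):
    out = []
    for row in Z:
        m = max(row)
        exps = [math.exp(z - m) for z in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def forward_loss(H_flat, W_U, b, targets):
    """Average cross-entropy for Z = H_flat W_U + b; also returns P."""
    Z = [[z + bv for z, bv in zip(row, b)] for row in matmul(H_flat, W_U)]
    P = softmax_rows(Z)
    M = len(targets)
    loss = -sum(math.log(P[m][targets[m]]) for m in range(M)) / M
    return loss, P

def output_layer_grads(H_flat, W_U, b, targets):
    """Analytic gradients from the formulas above."""
    _, P = forward_loss(H_flat, W_U, b, targets)
    M, V = len(targets), len(b)
    # G_Z = (P - Y) / M, with Y built on the fly from the target indices.
    G_Z = [[(P[m][v] - (1.0 if v == targets[m] else 0.0)) / M for v in range(V)]
           for m in range(M)]
    dW_U = matmul(transpose(H_flat), G_Z)    # H_flat^T G_Z
    db = [sum(col) for col in zip(*G_Z)]     # sum over rows of G_Z
    dH_flat = matmul(G_Z, transpose(W_U))    # G_Z W_U^T
    return G_Z, dW_U, db, dH_flat

# Toy values: M = 2 positions, d_model = 2, V = 3.
H_flat = [[1.0, -0.5], [0.3, 0.8]]
W_U = [[0.2, -0.1, 0.4], [0.0, 0.5, -0.3]]
b = [0.0, 0.1, -0.1]
targets = [2, 1]

G_Z, dW_U, db, dH_flat = output_layer_grads(H_flat, W_U, b, targets)

def numeric_dW(i, j, eps=1e-6):
    """Central-difference estimate of dL / dW_U[i][j]."""
    plus = [row[:] for row in W_U]
    minus = [row[:] for row in W_U]
    plus[i][j] += eps
    minus[i][j] -= eps
    loss_plus, _ = forward_loss(H_flat, plus, b, targets)
    loss_minus, _ = forward_loss(H_flat, minus, b, targets)
    return (loss_plus - loss_minus) / (2 * eps)

print("analytic:", dW_U[0][2], "numeric:", numeric_dW(0, 2))
```

The two numbers should agree to several decimal places. The same kind of check is often used when writing a backward pass by hand.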
The Optimizer Step
Once backpropagation has produced gradients for all parameters, collect them into one gradient object:
Parameters:
\theta_t
Gradient from current batch:
g_t = \nabla_\theta \hat{J}_B(\theta_t)
The simplest update is stochastic gradient descent:
\theta_{t+1} = \theta_t - \eta_t g_t
The learning rate eta_t controls the step size. In large transformer training, Adam or AdamW is more common because it keeps moving averages of gradients and squared gradients.
Adam moments:
m_t = beta_1 m_{t-1} + (1 - beta_1) g_t
v_t = beta_2 v_{t-1} + (1 - beta_2) (g_t \odot g_t)
Bias correction:
\hat{m}_t = m_t / (1 - beta_1^t)
\hat{v}_t = v_t / (1 - beta_2^t)
Adam update:
\theta_{t+1} = \theta_t - \eta_t \hat{m}_t / (sqrt(\hat{v}_t) + epsilon)
Adam changes the effective step per parameter. If one weight has repeatedly noisy or large gradients, its squared-gradient estimate can reduce the step. If another weight has a cleaner direction, it can move more confidently.
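The per-parameter scaling is easiest to see with two weights whose gradients differ by orders of magnitude. The sketch below follows the Adam formulas above; the default values for beta_1, beta_2, and epsilon are the ones suggested in the Adam paper, while the gradients and learning rate are made up for illustration.

```python
import math

def adam_step(theta, g, m, v, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of parameters; t counts steps from 1."""
    new_theta, new_m, new_v = [], [], []
    for th, gi, mi, vi in zip(theta, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi          # first moment
        vi = beta2 * vi + (1 - beta2) * gi * gi     # second moment
        m_hat = mi / (1 - beta1 ** t)               # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(th - eta * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Two toy parameters with very different gradient scales.
theta = [0.0, 0.0]
g = [100.0, 0.01]
m = [0.0, 0.0]
v = [0.0, 0.0]

new_theta, m, v = adam_step(theta, g, m, v, t=1, eta=0.001)
print("plain SGD would move by:", [0.001 * gi for gi in g])
print("Adam moved by:          ", [th - nt for th, nt in zip(theta, new_theta)])
```

On the first step the bias-corrected ratio is close to g / |g|, so both parameters move by about eta even though their raw gradients differ by four orders of magnitude. Over later steps the moving averages make the step depend on how consistent each gradient has been.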
AdamW adds decoupled weight decay. Instead of mixing weight decay into the adaptive gradient calculation, it applies parameter shrinkage as a separate part of the update.
One common AdamW-style sketch:
\theta_{t+1} =
(1 - \eta_t * lambda) * \theta_t
- \eta_t * \hat{m}_t / (sqrt(\hat{v}_t) + epsilon)
The practical effect is that the model still follows the gradient, while the optimizer also discourages weights from growing without bound. The AdamW paper is specifically about why this decoupling matters for adaptive optimizers such as Adam.
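A scalar sketch of the AdamW-style update written above, to show where the decay term sits. The `weight_decay` value of 0.01 and the other numbers in the usage lines are hypothetical; real implementations differ in details such as which parameters are exempt from decay.

```python
import math

def adamw_step(theta, g, m, v, t, eta, weight_decay,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW-style update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink theta directly. The decay never passes
    # through the moment estimates, so it is not rescaled by sqrt(v_hat).
    decayed = (1 - eta * weight_decay) * theta
    theta = decayed - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical values for illustration.
theta, m, v = 0.80, 0.0, 0.0
theta, m, v = adamw_step(theta, g=0.25, m=m, v=v, t=1, eta=0.001, weight_decay=0.01)
print(theta)
```

With a zero gradient the adaptive term vanishes and the update reduces to pure shrinkage by the factor (1 - eta * lambda), which is a quick way to see that the decay is independent of the gradient statistics.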
Where Schedulers Come Into Play
In image generation, a sampler scheduler often controls a denoising trajectory. In LLM training, the common scheduler is different: it controls the learning rate over training steps.
Generic optimizer update:
\theta_{t+1} = OptimizerStep(\theta_t, g_t, eta_t)
The scheduler sets:
eta_t = schedule(t)
The scheduler does not usually decide the gradient direction. Backpropagation does that. The scheduler decides how strongly the optimizer should trust that direction at this stage of training.
| Schedule | Formula sketch | Training behavior |
|---|---|---|
| Constant | eta_t = eta_0 | Simple, but can be too aggressive early or too high late. |
| Linear warmup | eta_t = eta_max * t / warmup_steps | Starts cautiously while gradients and optimizer moments stabilize. |
| Inverse square root | eta_t proportional to 1 / sqrt(t) | Used in the original Transformer schedule after warmup. |
| Cosine decay | eta_t moves along a cosine curve | Large steps early, gradually smaller steps near the end. |
| Warm restarts | repeat cosine cycles | Periodically raises the learning rate to explore again. |
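Two rows of the table are often combined: linear warmup followed by cosine decay. The sketch below is one such combination, with the cosine part following the annealing formula from the SGDR paper; the function name `warmup_cosine` and all numeric values are assumptions for illustration.

```python
import math

def warmup_cosine(t, eta_max, warmup_steps, total_steps, eta_min=0.0):
    """Linear warmup to eta_max, then cosine decay toward eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    # progress runs from 0.0 at the end of warmup to 1.0 at total_steps.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# Hypothetical settings: peak 1e-3, 100 warmup steps, 1000 steps in total.
for step in [0, 50, 100, 550, 1000]:
    print(step, warmup_cosine(step, 1e-3, 100, 1000))
```

The learning rate climbs linearly for the first 100 steps, peaks at eta_max, passes through half the peak at the midpoint of the decay, and reaches eta_min at the final step.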
The original Transformer paper used Adam with a warmup plus inverse-square-root schedule:
Transformer schedule:
eta_t =
d_model^{-0.5} * min(t^{-0.5}, t * warmup_steps^{-1.5})
Before warmup ends, the second term grows linearly with step. After warmup, the first term decays as training progresses. This gives the optimizer a gentle start and then smaller refinement steps.
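The Transformer schedule is short enough to write directly from the formula. In this sketch, d_model = 512 and warmup_steps = 4000 are the values reported for the base model in the paper; the function name `transformer_lr` is a hypothetical label.

```python
def transformer_lr(t, d_model=512, warmup_steps=4000):
    """Warmup plus inverse-square-root schedule; t counts steps from 1."""
    return d_model ** -0.5 * min(t ** -0.5, t * warmup_steps ** -1.5)

for step in [1, 1000, 2000, 4000, 16000, 64000]:
    print(step, transformer_lr(step))
```

The two terms inside `min` are equal at t = warmup_steps, so that is where the learning rate peaks, at a value of (d_model * warmup_steps)^{-0.5}. Quadrupling the step count after that point halves the learning rate.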
A Tiny Numerical Example
Suppose one output layer weight has value 0.80. The batch gradient says increasing this weight raises the loss, so the gradient is positive.
Current weight:
\theta_t = 0.80
Gradient:
g_t = 0.25
Learning rate from scheduler:
eta_t = 0.001
Plain SGD update:
\theta_{t+1} = 0.80 - 0.001 * 0.25
\theta_{t+1} = 0.79975
That looks tiny because one step is tiny. LLM training is the accumulation of many small, coordinated changes across billions of parameters. AdamW would further scale that step using moment estimates, and the scheduler would change the learning rate as training moves from warmup to peak learning to decay.
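The arithmetic above takes only a few lines to reproduce. The sketch also applies the same gradient at three learning rates to show how a scheduler changes the step; the warmup and decay values are hypothetical and only the 0.001 case comes from the example.

```python
def sgd_step(theta, g, eta):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return theta - eta * g

theta_t = 0.80
g_t = 0.25

# The step from the worked example, up to floating-point rounding.
print(sgd_step(theta_t, g_t, eta=0.001))

# Hypothetical learning rates for three phases of a schedule.
for phase, eta in [("warmup", 0.0001), ("peak", 0.001), ("decay", 0.0002)]:
    step = theta_t - sgd_step(theta_t, g_t, eta)
    print(f"{phase}: eta = {eta}, step size = {step:.6f}")
```

The gradient is identical in all three cases. Only the scheduler's learning rate changes how far the weight moves.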
Why There Is No Simple Closed-Form Answer
If this were ordinary linear regression, we could sometimes solve for the best weights with a compact matrix equation. LLMs are different:
- The model contains many nonlinear operations, including softmax, activation functions, and normalization.
- The training objective is non-convex, so there are many basins, saddles, and flat regions.
- The data is sampled in batches, so each gradient is noisy.
- The model is overparameterized, so many different parameter settings can behave similarly.
- The goal is useful generalization, not only perfect memorization of the training set.
So the useful mental model is not one grand equation that produces the final LLM. It is an iterative machine: matrix prediction, loss measurement, gradient calculation, optimizer update, scheduled step size, repeat.
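To see the whole loop in one place, here is a deliberately tiny sketch. It assumes a toy model that is far simpler than a transformer: the logits for the next token are just a row of a table W indexed by the current token. It also uses the full toy dataset on every step and plain gradient descent with linear warmup rather than mini-batches and AdamW, so it illustrates the shape of the loop, not a realistic training setup.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab_size, steps=200, eta_max=1.0, warmup_steps=20):
    """Toy next-token model: logits for the next token are the row W[current]."""
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(tokens[:-1], tokens[1:]))   # (current, next) examples
    M = len(pairs)
    losses = []
    for t in range(1, steps + 1):
        grad = [[0.0] * vocab_size for _ in range(vocab_size)]
        loss = 0.0
        for x, y in pairs:
            p = softmax(W[x])                    # prediction
            loss -= math.log(p[y]) / M           # cross-entropy loss
            for v in range(vocab_size):          # gradient: (P - Y) / M
                grad[x][v] += (p[v] - (1.0 if v == y else 0.0)) / M
        losses.append(loss)
        eta = eta_max * min(1.0, t / warmup_steps)   # scheduled step size
        for i in range(vocab_size):                  # parameter update
            for v in range(vocab_size):
                W[i][v] -= eta * grad[i][v]
    return W, losses

# A repeating toy sequence: 0 is followed by 1, 1 by 2, and 2 by 0.
tokens = [0, 1, 2] * 4
W, losses = train_bigram(tokens, vocab_size=3)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss starts at log(3), the cost of a uniform guess over three tokens, and falls as the table learns the repeating pattern. No closed-form solve appears anywhere, only repeated small corrections.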
Sources
This article is a simplified explanation, but the core formulas are grounded in the standard transformer and optimizer literature.
- Attention Is All You Need for scaled dot-product attention, transformer architecture, Adam use, and the original warmup plus inverse-square-root learning-rate schedule.
- Adam: A Method for Stochastic Optimization for first and second moment estimates, bias correction, and the Adam update rule.
- Decoupled Weight Decay Regularization for AdamW and the distinction between L2 regularization and decoupled weight decay.
- SGDR: Stochastic Gradient Descent with Warm Restarts for cosine schedules and warm restart scheduling ideas.