
How an LLM Finds a Lower-Loss Solution

A matrix-first look at logits, cross-entropy, gradients, AdamW, and where learning-rate schedulers actually enter the training loop.

June 24, 2026 · 15 min read · Difficulty: Advanced (5/5)
Figure: Diagram of LLM matrix optimization with forward pass, loss, gradient, and learning-rate scheduler. A simplified view of LLM training: matrix operations make predictions, loss measures the error, gradients point to a lower-loss direction, and the scheduler controls how large each optimizer step is.

The Short Version

An LLM does not calculate one perfect global optimum in a clean closed-form equation. Training is too large, too noisy, and too non-convex for that. Instead, the model repeatedly moves toward lower loss on many batches of text.

The loop is simple in shape:

1. Run tokens through matrix operations.
2. Convert hidden states into next-token probabilities.
3. Measure error with cross-entropy loss.
4. Backpropagate gradients through the matrices.
5. Let an optimizer such as AdamW choose the parameter update.
6. Let a learning-rate scheduler scale the update over time.

Metaphor

Think of the loss landscape as a foggy mountain range. The model is not handed a map to the deepest valley. It feels the local slope under its feet, takes a controlled step, checks again, and repeats this millions or billions of times.

What Optimum Means In Practice

In classical optimization, an optimum is the parameter setting that minimizes an objective function. For an LLM, the parameters are all learned weights. We can gather them into one huge vector:

\theta = [W_E, W_Q, W_K, W_V, W_O, W_1, W_2, W_U, b, ...]

Goal:

\theta^* = \arg\min_\theta J(\theta)

The objective J is usually the expected next-token loss over the training data. We do not evaluate all possible text at once. We sample batches, calculate a noisy estimate of the loss, and update from that.

Dataset objective:

J(\theta) = \mathbb{E}_{(x,y) \sim data} [ L(f_\theta(x), y) ]

Mini-batch estimate:

\hat{J}_B(\theta) = (1 / B) \sum_{i=1}^{B} L(f_\theta(x_i), y_i)

That is why people often say training is stochastic. Each batch is a small window into the whole data distribution. The optimizer is always steering from partial evidence.
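A minimal sketch of the mini-batch idea. It uses a made-up list of per-example losses in place of a real model, so the values, the seed, and the batch size of 32 are illustrative assumptions. It compares the full-data average with the averages from a few random batches:

```python
import random

random.seed(0)

# Hypothetical per-example losses L(f_theta(x_i), y_i) for a toy "dataset".
per_example_loss = [random.uniform(1.0, 5.0) for _ in range(10_000)]

# Full-data objective J(theta): the mean over every example.
full_loss = sum(per_example_loss) / len(per_example_loss)

def batch_estimate(losses, batch_size):
    """Mean loss over one randomly sampled mini-batch (the J-hat_B estimate)."""
    batch = random.sample(losses, batch_size)
    return sum(batch) / batch_size

estimates = [batch_estimate(per_example_loss, batch_size=32) for _ in range(5)]

print(f"full-data loss: {full_loss:.3f}")
for i, est in enumerate(estimates):
    print(f"batch {i}: {est:.3f}  (error {est - full_loss:+.3f})")
```

Each batch estimate lands near the full-data value but not on it. That scatter is the noise the optimizer has to steer through.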

From Tokens To Matrices

Start with a tiny sequence of tokens. The tokenizer turns text into token IDs. The embedding table turns each token ID into a vector.

Token IDs:

[t_1, t_2, t_3, ..., t_T]

Embedding table:

E \in R^{V x d_model}

Embedded sequence:

X = E[token_ids]
X \in R^{T x d_model}

Here, V is the vocabulary size, T is the sequence length, and d_model is the width of each token representation.
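The lookup can be sketched in NumPy as follows. The sizes and token IDs are toy values chosen for illustration, and the one-hot product is included only to show that the lookup is itself a matrix operation:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d_model = 10, 4                   # toy vocabulary size and model width (assumed)
E = rng.normal(size=(V, d_model))    # embedding table, one row per token ID

token_ids = np.array([3, 7, 7, 1])   # a sequence of T = 4 token IDs
X = E[token_ids]                     # row lookup: shape (T, d_model)

# The same lookup written as a one-hot matrix product.
one_hot = np.eye(V)[token_ids]       # shape (T, V)
X_matmul = one_hot @ E

print(X.shape)                       # (4, 4)
print(np.allclose(X, X_matmul))      # True
```

Repeated token IDs (the two 7s here) receive identical rows. Position and context only enter later, through attention.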

Metaphor

The embedding table is a dictionary where every token gets coordinates. The word itself is no longer a word inside the model. It is a point in a learned space where nearby directions tend to carry related usage patterns.

The Core Matrix Step: Attention

A transformer block uses learned matrices to create queries, keys, and values. These are not separate databases. They are different projections of the same token representations.

Input:

X \in R^{T x d_model}

Learned projection matrices:

W_Q \in R^{d_model x d_k}
W_K \in R^{d_model x d_k}
W_V \in R^{d_model x d_v}

Projected matrices:

Q = X W_Q
K = X W_K
V = X W_V

Attention compares every query with every key. The result is a square score matrix. For a decoder-style language model, a causal mask is added before softmax so each position can only use earlier positions and itself.

Scores:

S = (Q K^T) / sqrt(d_k) + M_causal
S \in R^{T x T}

Causal mask:

(M_causal)_{i,j} =
  0      if j <= i
  -inf   if j > i

Attention weights:

A = softmax(S)

Attention output:

H = A V
H \in R^{T x d_v}

The Transformer paper introduced this scaled dot-product attention form. The scaling by sqrt(d_k) keeps dot products from becoming too large, which helps the softmax stay in a useful gradient range.
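A minimal single-head sketch of the equations above in NumPy. The shapes and random weights are toy assumptions, and real implementations add multiple heads, batching, and an output projection that are left out here:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    T, d_k = Q.shape
    # Mask: 0 where j <= i, -inf where j > i (strictly upper triangle).
    mask = np.triu(np.full((T, T), -np.inf), k=1)
    S = Q @ K.T / np.sqrt(d_k) + mask
    A = softmax(S, axis=-1)
    return A @ V, A

rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 8, 4, 4    # toy sizes (assumed)
X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

H, A = causal_attention(X, W_Q, W_K, W_V)
print(H.shape)           # (5, 4)
print(np.round(A, 2))    # lower-triangular weights, each row sums to 1
```

The first row of A is always [1, 0, 0, ...] because the first position can only attend to itself.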

From Hidden State To Loss

After several transformer blocks, the model has a hidden vector for each token position. To predict the next token, it maps each valid hidden vector into vocabulary-sized logits.

Final hidden states:

H \in R^{B x T x d_model}

Unembedding matrix:

W_U \in R^{d_model x V}

Flatten valid prediction positions:

H_flat \in R^{M x d_model}

Logits for valid positions:

Z = H_flat W_U + b
Z \in R^{M x V}

Probabilities:

P = softmax(Z)

A logit is an unnormalized score for a token. Softmax converts those scores into probabilities. Here, M is the number of valid next-token prediction positions after padding and ignored positions are removed.

Single valid position:

L_m = -log P[m, y_m]

Average loss:

L = -(1 / M) \sum_{m=1}^{M} log P[m, y_m]

Equivalent one-hot form:

L = -(1 / M) \sum_{m=1}^{M} \sum_{v=1}^{V} Y[m,v] log P[m,v]

Lower loss means the model assigned higher probability to the true next tokens in the training batch. The whole training loop exists to change the matrices so this number tends to go down.
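The logits-to-loss path can be sketched as below. The sizes, random hidden states, and random targets are assumptions for illustration, and the log-sum-exp trick is a common stability choice rather than something the formulas above require:

```python
import numpy as np

def log_softmax(Z):
    """Row-wise log-softmax computed with the log-sum-exp trick."""
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def cross_entropy(Z, y):
    """Mean negative log-probability of the target token at each position."""
    log_P = log_softmax(Z)
    M = Z.shape[0]
    return -log_P[np.arange(M), y].mean()

rng = np.random.default_rng(0)
M, d_model, V = 6, 8, 10                    # toy sizes (assumed)
H_flat = rng.normal(size=(M, d_model))      # stand-in for final hidden states
W_U = rng.normal(size=(d_model, V)) * 0.1   # unembedding matrix
b = np.zeros(V)
y = rng.integers(0, V, size=M)              # target token ID per position

Z = H_flat @ W_U + b                        # logits, shape (M, V)
loss = cross_entropy(Z, y)

# The one-hot form gives the same number.
Y = np.eye(V)[y]
loss_one_hot = -(Y * log_softmax(Z)).sum() / M

print(Z.shape, float(loss), float(loss_one_hot))
```

A useful sanity check: when all logits are equal, every token gets probability 1/V and the loss is log(V), which is roughly where an untrained model starts.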

A Concrete Gradient Calculation

The cleanest place to see the gradient is the final vocabulary projection. For the flattened valid prediction positions, let Y be the one-hot matrix of correct tokens. A useful cross-entropy plus softmax result is:

Softmax probabilities:

P = softmax(Z)

One-hot labels:

Y \in R^{M x V}

Gradient with respect to logits:

dL / dZ = (P - Y) / M

If the correct token should have probability 1 but the model gives it 0.30, its entry in P - Y is 0.30 - 1 = -0.70. That is a negative gradient, so a descent step raises that logit. If the model gives probability to a wrong token, that entry is positive and a descent step lowers that logit. Because updates move against the gradient, its sign tells each logit which way to move.

Now apply the matrix derivative for the output matrix:

Logits:

Z = H_flat W_U + b

Let:

G_Z = dL / dZ

Then:

dL / dW_U = H_flat^T G_Z
dL / db   = sum_rows(G_Z)
dL / dH_flat = G_Z W_U^T

This is the key mechanical idea. The model output was made by matrix multiplication, so the correction is also expressed through matrix multiplication. Backpropagation keeps applying the chain rule backward through every matrix, softmax, normalization, residual connection, and feed-forward layer.
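One way to build trust in these formulas is to compare them with a finite-difference estimate. The sketch below does that for a single entry of W_U. The sizes, random data, and the checked index (2, 3) are arbitrary assumptions:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def loss_fn(H_flat, W_U, b, y):
    """Mean cross-entropy of the output layer."""
    P = softmax(H_flat @ W_U + b)
    return -np.log(P[np.arange(len(y)), y]).mean()

def output_layer_grads(H_flat, W_U, b, y):
    """Analytic gradients from the formulas above."""
    M, V = H_flat.shape[0], W_U.shape[1]
    P = softmax(H_flat @ W_U + b)
    Y = np.eye(V)[y]
    G_Z = (P - Y) / M                           # dL/dZ
    return H_flat.T @ G_Z, G_Z.sum(axis=0), G_Z @ W_U.T

rng = np.random.default_rng(0)
M, d_model, V = 5, 4, 7                         # toy sizes (assumed)
H_flat = rng.normal(size=(M, d_model))
W_U = rng.normal(size=(d_model, V))
b = rng.normal(size=V)
y = rng.integers(0, V, size=M)

dW, db, dH = output_layer_grads(H_flat, W_U, b, y)

# Central finite difference on one entry of W_U as an independent check.
i, j, h = 2, 3, 1e-5
W_plus, W_minus = W_U.copy(), W_U.copy()
W_plus[i, j] += h
W_minus[i, j] -= h
numeric = (loss_fn(H_flat, W_plus, b, y) - loss_fn(H_flat, W_minus, b, y)) / (2 * h)

print(dW[i, j], numeric)   # the two values should agree closely
```

Each gradient has the same shape as the quantity it corrects: dW matches W_U, db matches b, and dH matches H_flat, which is what lets backpropagation hand dH to the layer below.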

Metaphor

The loss is the complaint. The gradient is the annotated route back through the factory, showing which machine settings contributed to the bad output and how each setting should shift.

The Optimizer Step

Once backpropagation has produced gradients for all parameters, collect them into one gradient object:

Parameters:

\theta_t

Gradient from current batch:

g_t = \nabla_\theta \hat{J}_B(\theta_t)

The simplest update is stochastic gradient descent:

\theta_{t+1} = \theta_t - \eta_t g_t

The learning rate \eta_t controls the step size. In large transformer training, Adam or AdamW is more common because both keep moving averages of gradients and squared gradients and use them to scale each parameter's step.

Adam moments:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (g_t \odot g_t)

Bias correction:

\hat{m}_t = m_t / (1 - \beta_1^t)
\hat{v}_t = v_t / (1 - \beta_2^t)

Adam update:

\theta_{t+1} = \theta_t - \eta_t \hat{m}_t / (sqrt(\hat{v}_t) + \epsilon)

Adam changes the effective step per parameter. If one weight has repeatedly noisy or large gradients, its squared-gradient estimate can reduce the step. If another weight has a cleaner direction, it can move more confidently.

AdamW adds decoupled weight decay. Instead of mixing weight decay into the adaptive gradient calculation, it applies parameter shrinkage as a separate part of the update.

One common AdamW-style sketch:

\theta_{t+1} =
  (1 - \eta_t \lambda) \theta_t
  - \eta_t \hat{m}_t / (sqrt(\hat{v}_t) + \epsilon)

The practical effect is that the model still follows the gradient, while the optimizer also discourages weights from growing without bound. The AdamW paper is specifically about why this decoupling matters for adaptive optimizers such as Adam.
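The AdamW-style sketch above translates almost line for line into NumPy. The function name `adamw_step`, the default hyperparameters (commonly used values, assumed here), and the toy quadratic objective are illustrative choices, not a reference implementation:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g            # moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g        # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay shrinks theta separately from the adaptive step.
    theta = (1 - lr * weight_decay) * theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (assumed): J(theta) = 0.5 * ||theta - target||^2,
# whose gradient is simply theta - target.
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)

for t in range(1, 2001):
    g = theta - target
    theta, m, v = adamw_step(theta, g, m, v, t, lr=0.01, weight_decay=0.0)

print(np.round(theta, 3))   # should end close to the target
```

Weight decay is set to zero in the loop so the minimizer is exactly the target. With a nonzero value, the solution is pulled slightly toward zero, which is the shrinkage effect described above.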

Where Schedulers Come Into Play

In image generation, a sampler scheduler often controls a denoising trajectory. In LLM training, the common scheduler is different: it controls the learning rate over training steps.

Generic optimizer update:

\theta_{t+1} = OptimizerStep(\theta_t, g_t, \eta_t)

The scheduler sets:

\eta_t = schedule(t)

The scheduler does not usually decide the gradient direction. Backpropagation does that. The scheduler decides how strongly the optimizer should trust that direction at this stage of training.

Schedule            | Formula sketch                         | Training behavior
--------------------+----------------------------------------+------------------------------------------------------------------
Constant            | \eta_t = \eta_0                        | Simple, but can be too aggressive early or too high late.
Linear warmup       | \eta_t = \eta_max * t / warmup_steps   | Starts cautiously while gradients and optimizer moments stabilize.
Inverse square root | \eta_t proportional to 1 / sqrt(t)     | Used in the original Transformer schedule after warmup.
Cosine decay        | \eta_t moves along a cosine curve      | Large steps early, gradually smaller steps near the end.
Warm restarts       | repeat cosine cycles                   | Periodically raises the learning rate to explore again.
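Two rows of the table are often combined in practice: linear warmup followed by cosine decay. A minimal sketch, where the function name, the peak rate of 3e-4, and the step counts are assumptions picked for illustration:

```python
import math

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)              # hold at min_lr past the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    lr = warmup_cosine_lr(step, max_lr=3e-4, warmup_steps=100, total_steps=1000)
    print(step, f"{lr:.2e}")
```

The rate climbs linearly to its peak at step 100, passes through half the peak midway through the decay phase, and reaches the minimum at the final step.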

The original Transformer paper used Adam with a warmup plus inverse-square-root schedule:

Transformer schedule:

\eta_t = d_model^{-0.5} * min(t^{-0.5}, t * warmup_steps^{-1.5})

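The min picks whichever term is smaller. Before warmup ends, that is the second term, which grows linearly with the step count. The two terms are equal at t = warmup_steps, and after that the first term takes over and decays as 1 / sqrt(t). This gives the optimizer a gentle start and then smaller refinement steps.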
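The schedule is short enough to write as a single function. In this sketch the defaults d_model = 512 and warmup_steps = 4000 mirror the base configuration reported in the Transformer paper, while the function name and the guard against step 0 are additions made here:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Warmup plus inverse-square-root schedule; step counts from 1."""
    step = max(step, 1)   # guard added here: 0 ** -0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
for step in (1, 1000, 4000, 16000, 64000):
    print(step, f"{transformer_lr(step):.2e}")
```

The rate peaks exactly at the end of warmup. Because the decay is proportional to 1 / sqrt(t), it takes four times as many steps to halve the rate: step 16000 runs at half the peak, and step 64000 at a quarter.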

Metaphor

The gradient is the steering wheel. AdamW is the suspension system that adapts to rough ground. The scheduler is the speed control: slow at the start, faster when stable, slower again when the car needs precision.

A Tiny Numerical Example

Suppose one output-layer weight has value 0.80. The batch gradient says increasing this weight raises the loss, so the gradient is positive.

Current weight:

\theta_t = 0.80

Gradient:

g_t = 0.25

Learning rate from scheduler:

\eta_t = 0.001

Plain SGD update:

\theta_{t+1} = 0.80 - 0.001 * 0.25
\theta_{t+1} = 0.79975

That looks tiny because one step is tiny. LLM training is the accumulation of many small, coordinated changes across billions of parameters. AdamW would further scale that step using moment estimates, and the scheduler would change the learning rate as training moves from warmup to peak learning to decay.
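To make the comparison concrete, the sketch below repeats the SGD arithmetic and then takes the first Adam step from the same numbers. The beta and epsilon values are commonly used defaults assumed here, the moments start at zero, and weight decay is left out:

```python
import math

theta, g, lr = 0.80, 0.25, 0.001

# Plain SGD: the step is lr * g.
theta_sgd = theta - lr * g

# First Adam step (t = 1, moments start at zero, no weight decay).
beta1, beta2, eps = 0.9, 0.999, 1e-8   # assumed defaults
m = (1 - beta1) * g
v = (1 - beta2) * g * g
m_hat = m / (1 - beta1 ** 1)           # equals g after bias correction
v_hat = v / (1 - beta2 ** 1)           # equals g * g after bias correction
theta_adam = theta - lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"SGD : {theta_sgd:.5f}")
print(f"Adam: {theta_adam:.5f}")
```

On this first step the bias-corrected ratio is essentially the sign of the gradient, so Adam moves by about the learning rate itself (0.001) instead of lr * g (0.00025). Later steps differ as the moving averages accumulate history.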

Why There Is No Simple Closed-Form Answer

If this were ordinary linear regression, we could sometimes solve for the best weights with a compact matrix equation. LLMs are different:

  • The model contains many nonlinear operations, including softmax, activation functions, and normalization.
  • The training objective is non-convex, so there are many basins, saddles, and flat regions.
  • The data is sampled in batches, so each gradient is noisy.
  • The model is overparameterized, so many different parameter settings can behave similarly.
  • The goal is useful generalization, not only perfect memorization of the training set.

So the useful mental model is not one grand equation that produces the final LLM. It is an iterative machine: matrix prediction, loss measurement, gradient calculation, optimizer update, scheduled step size, repeat.
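That loop can be sketched end to end on a deliberately tiny stand-in for a language model: an embedding table followed by an unembedding matrix, with no attention at all. Everything specific here is an assumption made for illustration, including the repeating toy corpus, the sizes, the plain SGD update, and the warmup-then-linear-decay schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 5, 8                              # toy sizes (assumed)
E = rng.normal(size=(V, d_model)) * 0.1        # embedding table
W_U = rng.normal(size=(d_model, V)) * 0.1      # unembedding matrix

# Toy corpus: 0,1,2,3,4,0,1,... so the next token is fully predictable.
tokens = np.arange(200) % V
inputs, targets = tokens[:-1], tokens[1:]

def schedule(step, peak=0.5, warmup=20, total=300):
    """Linear warmup, then linear decay (an illustrative choice)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

losses = []
for step in range(300):
    idx = rng.integers(0, len(inputs), size=32)        # sample a mini-batch
    x, y = inputs[idx], targets[idx]

    H = E[x]                                           # forward: embed
    Z = H @ W_U                                        # forward: logits
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    loss = -np.log(P[np.arange(len(y)), y]).mean()     # cross-entropy
    losses.append(loss)

    G_Z = P.copy()                                     # backward: (P - Y) / M
    G_Z[np.arange(len(y)), y] -= 1.0
    G_Z /= len(y)
    dW_U = H.T @ G_Z
    dH = G_Z @ W_U.T
    dE = np.zeros_like(E)
    np.add.at(dE, x, dH)                               # accumulate per token row

    lr = schedule(step)                                # scheduled step size
    E -= lr * dE                                       # plain SGD update
    W_U -= lr * dW_U

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss sits near log(5), the value for a uniform guess over five tokens, and it falls as the two matrices learn the pattern. A real LLM swaps in transformer blocks, AdamW, and far more data, but the shape of the loop is the same.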

Sources

This article is a simplified explanation, but the core formulas are grounded in the standard transformer and optimizer literature.
