Train and Evaluation loss drop between epochs

Hi everyone.

I’m training a LoRA adapter for a model using the SFT trainer, and I’ve also set up an evaluation set. In my case, the train and evaluation sets are completely unrelated.

What I observe is a drop in loss between epochs. That makes perfect sense for the train loss: the model has already seen the examples, so the loss is expected to drop. But I can’t see why there is a drop in the evaluation loss, since the examples are unrelated and there is no sign of contamination.

Any ideas and advice would be highly appreciated, thank you in advance.

Adding the train/eval loss from W&B. I’ve run 2 training epochs in this session on 4 GPUs on the same node.


It seems this happens fairly often due to various factors…


What you are seeing is usually normal.

The short reason is that eval loss is not asking “have I seen this exact example before?” It is asking “how much probability does the model assign to the target next tokens in this eval set?” In causal language modeling, the objective is next-token prediction, and perplexity or cross-entropy is just the average negative log-likelihood over tokens. If training makes the model better at predicting the kinds of tokens that appear in eval, eval loss goes down even when the train and eval examples are completely disjoint. Hugging Face’s docs are explicit that causal LM is next-token prediction, and that perplexity is the exponentiated average negative log-likelihood of a tokenized sequence. (Hugging Face)
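A tiny worked example of that definition (pure Python, made-up probabilities): cross-entropy is the mean negative log-probability the model assigns to the actual next tokens, and perplexity is its exponential. If fine-tuning raises those probabilities on the eval tokens, both drop, no memorization required.

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood over the probabilities the model
    assigned to the actual next tokens in a sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Before fine-tuning: the model assigns low probability to the eval tokens.
before = cross_entropy([0.05, 0.10, 0.08, 0.05])
# After fine-tuning: same *unseen* tokens, but shared format/style tokens
# are now easier to predict, so the assigned probabilities rise.
after = cross_entropy([0.20, 0.40, 0.30, 0.20])

print(before, after)                       # loss drops
print(math.exp(before), math.exp(after))   # perplexity drops too
```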

Why “unrelated datasets” can still produce lower eval loss

“Unrelated” at the sample level is often still related at the token-distribution level.

That matters because SFT is not learning a database lookup table of full examples. It is adjusting token probabilities. So if your train and eval sets share any of these, eval loss can fall:

  • the same language,
  • the same instruction or chat format,
  • the same assistant style,
  • the same EOS and separator behavior,
  • similar answer lengths,
  • overlapping domain vocabulary,
  • similar punctuation or boilerplate,
  • the same chat template.

Chat-model docs make the underlying point very directly: beneath the chat abstraction, these models are still just language models continuing a sequence of tokens. (Hugging Face)

A concrete example helps. Suppose the train set teaches the LoRA adapter that assistant responses usually start with a certain pattern, end with EOS more cleanly, follow a certain template marker, and use a more consistent style. Then a held-out eval response that follows the same overall format becomes easier to predict token by token, even if the topic is different. That lowers eval loss without any contamination. (Hugging Face)

Why the drop looks like it happens between epochs

This part is often just logging cadence.

In Trainer, eval_strategy="epoch" means evaluation is run at the end of each epoch. So W&B will naturally show a staircase-like eval curve: flat, then a lower point, then flat again. That does not mean the model only improved at epoch boundaries. It means you only measured it there. The docs also expose eval_on_start, which gives you a true pre-training baseline and is one of the best sanity checks for this exact situation. (Hugging Face)

So if your plot shows train loss moving continuously and eval loss stepping down once per epoch, that is exactly what the Trainer configuration would produce when eval is epoch-based. (Hugging Face)
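For reference, the cadence-related fields look roughly like this (a sketch of only the relevant arguments; everything else is omitted and your values will differ):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    eval_strategy="epoch",   # evaluate at each epoch boundary -> staircase eval curve
    eval_on_start=True,      # extra evaluation before step 0: a true baseline
    logging_steps=10,        # train loss is still logged continuously
    report_to="wandb",
)
```

With `eval_strategy="steps"` and an `eval_steps` value instead, the same run would produce a smoother eval curve, which is the control experiment suggested further down.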

Why eval loss can be lower or cleaner than train loss

Train and eval are not measured under identical conditions.

PyTorch changes model behavior under model.eval(). In particular, dropout becomes an identity function during evaluation, and Module.eval() affects modules like Dropout and BatchNorm. That makes eval loss less noisy and often lower than the in-training minibatch loss, even before you get into any real generalization story. (PyTorch Docs)
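A toy hand-rolled version of inverted dropout makes the difference concrete (this is a sketch of the idea, not PyTorch’s actual implementation):

```python
import random

def dropout(xs, p=0.5, training=True):
    """Inverted dropout: in training, zero each value with probability p
    and rescale survivors by 1/(1-p); in eval mode, act as the identity."""
    if not training:
        return list(xs)  # model.eval(): dropout is a no-op
    return [0.0 if random.random() < p else x / (1 - p) for x in xs]

x = [1.0, 2.0, 3.0, 4.0]
print(dropout(x, training=False))  # identical to x: eval measures the clean network
print(dropout(x, training=True))   # noisy: some activations zeroed, rest scaled up
```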

So the “train loss went down because the model saw the examples” story is only part of it. Another part is simply that train-mode loss and eval-mode loss are different measurements. (PyTorch Docs)

Why this is especially easy to misread with SFTTrainer

With TRL SFT, the scalar loss may be narrower than your intuition.

The docs say SFTTrainer can compute loss only on the completion part or only on the assistant part. The collator sets ignored tokens to -100, dynamically pads to the batch max, and supports eval_packing, which defaults to the training packing behavior when left as None. So your eval loss may really mean:

  • “loss on completion tokens only,” or
  • “loss on assistant tokens only,” or
  • “loss on packed sequences over a masked subset of tokens,”

not “loss on the full raw example in the everyday sense.” (Hugging Face)
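To make the masking concrete, here is a minimal sketch of how a `-100`-masked loss behaves (pure Python; the `-100` convention matches Hugging Face collators, the rest is illustrative):

```python
import math

IGNORE = -100  # label value used to exclude tokens from the loss

def masked_mean_nll(labels, token_probs):
    """Average NLL over supervised tokens only; positions labeled -100
    (prompt, padding, non-assistant turns) contribute nothing."""
    nlls = [-math.log(p) for lab, p in zip(labels, token_probs) if lab != IGNORE]
    return sum(nlls) / len(nlls)

# 8-token sequence: the first 5 tokens (the prompt) are masked out,
# so the reported "eval loss" is really the loss on 3 completion tokens.
labels = [IGNORE] * 5 + [101, 102, 103]
probs  = [0.01] * 5 + [0.4, 0.5, 0.3]
print(masked_mean_nll(labels, probs))
```

Note that the very low probabilities on the masked prompt tokens have no effect on the scalar at all.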

That matters because style and formatting improvements often show up first on exactly those supervised regions. So you can get a real drop in eval loss even if the eval prompts and answers are otherwise unrelated in content. (Hugging Face)

There is one more important pitfall here. TRL’s own code warns that if you use packing with unsupported attention implementations, it can cause cross-contamination between samples inside a packed sequence. That is not train/eval leakage in the usual data-split sense, but it is a real implementation hazard worth ruling out. (GitHub)

Why the scalar can move even if model quality barely changed

For causal LMs, the reported loss is sensitive to token counting.

Hugging Face documented that for token-level tasks like causal LM, the correct normalization is over the total number of non-padding tokens, not a naive average of batch losses. There is also a Transformers issue showing that the same eval set can report different eval_loss values when only the eval batch size changes, precisely because the loss is computed over non-padding tokens in a causal LM setup. (Hugging Face)

This means two important things for your case:

  1. The loss is influenced by the length distribution of sequences and the amount of masked or padded content.
  2. A moderate change in eval loss does not always map cleanly to a moderate change in actual output quality. (Hugging Face)
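A small worked example of why the scalar moves with batching (pure Python, invented numbers): averaging per-batch means is not the same as averaging over all non-padding tokens, so regrouping the same data can change the reported number.

```python
# Two eval batches with very different supervised-token counts.
batch_token_nlls = [
    [2.0] * 10,   # batch 1: 10 tokens, per-token NLL 2.0
    [4.0] * 90,   # batch 2: 90 tokens, per-token NLL 4.0
]

# Token-weighted average: the correct causal-LM normalization
# (total NLL over total non-padding tokens).
all_tokens = [nll for batch in batch_token_nlls for nll in batch]
token_mean = sum(all_tokens) / len(all_tokens)      # 3.8

# Naive mean of per-batch means: over-weights the short batch.
batch_means = [sum(b) / len(b) for b in batch_token_nlls]
naive_mean = sum(batch_means) / len(batch_means)    # 3.0

print(token_mean, naive_mean)  # same data, different scalars
```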

That also means your 4-GPU run is not suspicious by itself. Trainer supports distributed training on multiple GPUs. Multi-GPU does not imply contamination. It just means batching, token counts, and metric aggregation are happening in a distributed setup, so the scalar is still a token-level aggregate, not some pure per-example semantic score. (Hugging Face)

What I think is most likely in your run

Based on the setup you described, my ranking is:

Most likely

You are seeing real generalization plus epoch-end measurement.

The adapter is learning token-level patterns that transfer to the eval set, and the curve looks like a drop “between epochs” because that is when evaluation is being run. (Hugging Face)

Next most likely

The size of the drop is being shaped by masking and token averaging.

If you are using prompt-completion or conversational SFT, the loss may be over only completions or assistant spans, and the final scalar depends on how many supervised non-padding tokens ended up in each eval batch. (Hugging Face)

Also plausible

Your eval set is “unrelated” conceptually, but still very similar in format distribution.

If both train and eval use the same chat template or tokenization path, the model can improve quickly on structural tokens and stylistic continuations. Hugging Face’s chat-template docs explicitly warn that these structural tokens matter enough that duplicating special tokens can hurt performance. (Hugging Face)

Lower probability, but worth checking

Packing or masking is making the metric misleading.

This becomes more plausible if you enabled packing, left eval_packing=None, used assistant_only_loss, or used an attention implementation that is not in the safe path for packed SFT. (Hugging Face)

What would make me suspicious instead

I would start worrying if any of these happen:

  • Changing only per_device_eval_batch_size noticeably changes eval_loss. That is a known causal-LM pathology signal. (GitHub)
  • You are using packed eval and the eval set is small or heterogeneous. That can make the metric harder to interpret. (Hugging Face)
  • You use assistant_only_loss=True, but your chat template does not properly support assistant masking. TRL states that assistant-only loss relies on templates that can return the assistant token mask. (Hugging Face)
  • You apply chat templating and then tokenize again with extra special tokens. Transformers warns that duplicated BOS or EOS can hurt performance and distort training. (Hugging Face)
  • Tokenization differs between runs or environments. Transformers says most AutoTokenizer loads resolve to the fast Rust backend, so pinning the tokenizer path matters if you want reproducibility. Perplexity docs also note that tokenization directly affects the metric. (Hugging Face)

What I would do next

These checks are high value and fast.

1. Get a true pre-training baseline

Set eval_on_start=True, or manually call trainer.evaluate() before training. That tells you whether the first epoch genuinely improved the held-out set or whether the curve is mostly a sampling artifact. (Hugging Face)
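Pulling the diagnostic settings from steps 1–4 into one place, an SFT config sketch might look like this (field names are from the TRL/transformers docs; all other fields are omitted, and the values are placeholders):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="out",
    eval_strategy="epoch",          # or "steps" for the cadence control experiment
    eval_on_start=True,             # evaluate once before any training step
    per_device_eval_batch_size=8,   # keep fixed while diagnosing (see step 3)
    eval_packing=False,             # override so eval does not inherit training packing
)
```

Alternatively, calling `trainer.evaluate()` once before `trainer.train()` gives the same pre-training baseline.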

2. Run one control experiment with step-based eval

Use eval_strategy="steps" for one short run. If the staircase becomes a smooth downward trend, you have confirmed that the epoch-boundary visual effect was mostly due to evaluation cadence. (Hugging Face)

3. Fix eval batch size and do not compare across changes

For causal LM loss, batch size can move eval_loss on the same data. Keep per_device_eval_batch_size fixed while diagnosing. (GitHub)

4. Disable packed eval once

If you train with packing, try one run with eval_packing=False. TRL documents that eval packing inherits training packing unless you override it. (Hugging Face)

5. Inspect how many tokens actually count toward loss

In SFT this is critical. If most labels are -100, then the scalar is dominated by a narrow slice of tokens, often assistant or completion tokens only. That can make the metric look better than the full-sequence behavior really is. (Hugging Face)
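A quick way to quantify this on a collated batch is to compute the fraction of labels that are not `-100` (a hypothetical helper; `supervised_fraction` is not a library function):

```python
IGNORE = -100

def supervised_fraction(label_rows):
    """Fraction of tokens in a batch that actually contribute to the loss,
    i.e. whose label is not -100."""
    total = sum(len(row) for row in label_rows)
    kept = sum(1 for row in label_rows for lab in row if lab != IGNORE)
    return kept / total

# Hypothetical collated batch: long prompts, short assistant answers.
batch_labels = [
    [IGNORE] * 40 + [7, 8, 9, 10],
    [IGNORE] * 60 + [11, 12],
]
print(supervised_fraction(batch_labels))  # ~0.06: the scalar rests on very few tokens
```

If that fraction is tiny, the eval loss is effectively a metric over a narrow slice of the sequence, and small style improvements on that slice can move it a lot.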

6. Verify your template and special-token path

Use the same template and tokenization path for train and eval. If you format with apply_chat_template(tokenize=False), avoid adding special tokens again later. (Hugging Face)

7. Compare generations, not just loss

Take a fixed set of held-out eval prompts and compare outputs from:

  • base model,
  • end of epoch 1,
  • end of epoch 2.

If the answers become cleaner, better formatted, or more aligned with the desired assistant behavior while eval loss falls, that is strong evidence the drop is real enough to trust. This is an inference step from the training objective rather than a direct doc statement, but it is the right way to validate whether the lower token-level loss corresponds to useful behavior. (Hugging Face)

Bottom line

Your observation does not require contamination to explain.

The most likely explanation is:

  • the LoRA adapter is learning token-distribution patterns that transfer to the eval set,
  • eval is only being measured at epoch boundaries,
  • eval mode is inherently cleaner than train mode,
  • and SFTTrainer plus causal-LM loss can amplify the visual effect because of masking, packing, and token-level averaging. (Hugging Face)

So the answer for your case is: a drop in eval loss across epochs on unseen, seemingly unrelated data is entirely plausible, and in this stack it is often expected. It becomes suspicious only when the drop is large, unstable under eval-batch-size changes, or disconnected from actual held-out generations. (GitHub)
