Apertus Greek Tokenizer Extension

This repo has four top-level artifacts.

| Path | Meaning |
| --- | --- |
| greek-extension-tokenizer/ | The selected modern Greek extension tokenizer, not the original Apertus tokenizer. |
| cpt-training-dataset/ | The CPT data recipe, source graph, and hydration paths. |
| experiment-checkpoints/ | HF-format checkpoints for the experiment arms. |
| benchmark-evals/ | Benchmark summaries and plots. |

Everything else is under supporting-material/.

Greek Extension Tokenizer

greek-extension-tokenizer/ contains ModernGreek-148k, the selected tokenizer for these experiments:

  • base Apertus vocab: 131072;
  • added modern Greek C3 tokens: 17408;
  • total vocab: 148480;
  • tokenizer.json SHA-256: 358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394.
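The figures above can be checked locally. The following is a minimal sketch, assuming you have a local copy of tokenizer.json; the function names and the path argument are illustrative and are not part of the repo's scripts.

```python
import hashlib

# Values copied from the list above.
BASE_VOCAB = 131072
ADDED_GREEK = 17408
TOTAL_VOCAB = 148480
EXPECTED_SHA256 = "358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tokenizer_matches(path):
    """Return True if the file at `path` has the documented digest."""
    return sha256_of_file(path) == EXPECTED_SHA256


# The vocabulary sizes are internally consistent.
assert BASE_VOCAB + ADDED_GREEK == TOTAL_VOCAB
```

A typical call would be `tokenizer_matches("greek-extension-tokenizer/tokenizer.json")`, where the path is wherever your local copy lives.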

The original Apertus tokenizer is used only by the Vanilla-* checkpoints, as a control. The optional polytonic tokenizer lives under supporting-material/optional-tokenizers/.

CPT Training Dataset

cpt-training-dataset/ describes CPT-7B-mix, built from:

  • fffoivos/glossapi-greek-nanochat-pretraining-dataset;
  • nanochat internal dedup metadata;
  • Apertus-overlap drop overlay from fffoivos/apertus-c3-dedup-audit-dedup-20260519t010924z;
  • non-Greek replay, code, and math.

Bulk recipe: 70% Greek, 24% non-Greek replay, 4% code, 2% math.
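To make the bulk recipe concrete, here is a small sketch that splits a token budget across the four slices. The 5B total is used only for illustration, and the slice keys are hypothetical names; the actual build may round or sample differently.

```python
from fractions import Fraction

# Shares from the bulk recipe above; the keys are illustrative names.
MIX = {
    "greek": Fraction(70, 100),
    "non_greek_replay": Fraction(24, 100),
    "code": Fraction(4, 100),
    "math": Fraction(2, 100),
}


def slice_budgets(total_tokens):
    """Split a total token budget by the recipe shares (truncating to whole tokens)."""
    if sum(MIX.values()) != 1:
        raise ValueError("mix shares must sum to 1")
    return {name: int(total_tokens * share) for name, share in MIX.items()}


budgets = slice_budgets(5_000_000_000)  # illustrative total, not a documented budget
```

Using exact fractions rather than floats avoids off-by-one token counts when the total is large.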

Measured source-token count with the selected tokenizer:

| Source slice | Tokenizer | Rows | Tokens, no EOD | Tokens, +1 EOD/doc |
| --- | --- | --- | --- | --- |
| HPLT/ell_Grek_ge8_no_mt_clean60 | ModernGreek-148k | 48,728,774 | 44,195,950,025 | 44,244,678,799 |

See cpt-training-dataset/token-counts.json for the exact run metadata.
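The two token columns differ by exactly one EOD token per row. The sketch below checks that relationship, assuming each row corresponds to one document; the function name is illustrative.

```python
# Values copied from the source-token table above.
ROWS = 48_728_774
TOKENS_NO_EOD = 44_195_950_025
TOKENS_WITH_EOD = 44_244_678_799


def tokens_with_eod(tokens_no_eod, rows, eod_per_doc=1):
    """Add `eod_per_doc` end-of-document tokens for every row (assumed one doc per row)."""
    return tokens_no_eod + rows * eod_per_doc


# The "+1 EOD/doc" column is the raw count plus one token per row.
assert tokens_with_eod(TOKENS_NO_EOD, ROWS) == TOKENS_WITH_EOD

# Average document length in tokens, excluding EOD.
mean_tokens_per_doc = TOKENS_NO_EOD / ROWS
```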

Experiment Checkpoints

experiment-checkpoints/ contains one folder per checkpoint we care about. Each has a manifest.json + README.md with the source Clariden path. 5B is the bakeoff final endpoint; 2B and 3.5B are iso-token snapshots.

| Checkpoint | Meaning |
| --- | --- |
| TokenDistil-Init/ | Token Distillation initialization before CPT. |
| TokenDistil-2B/ | Token Distillation after the 2B bakeoff. |
| TokenDistil-3.5B/ | Token Distillation after the 3.5B continuation. |
| TokenDistil-5B/ | Token Distillation at bakeoff-final 5B endpoint. |
| Vanilla-2B/ | Original-tokenizer control after the 2B bakeoff. |
| Vanilla-3.5B/ | Original-tokenizer control after the 3.5B continuation. |
| Vanilla-5B/ | Original-tokenizer control at bakeoff-final 5B endpoint. |
| ReTok-2B/ | ReTok baseline after the 2B bakeoff. |
| ReTok-3.5B/ | ReTok baseline after the 3.5B continuation (stopped here; dominated by TD). |
| Centroid-2B/ | Centroid baseline after the 2B bakeoff (stopped here; broken arm). |

Large model weights are uploaded to Hugging Face in these folders. They are not mirrored in the GitHub source repository.
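One way to fetch a single checkpoint folder rather than the whole repository is `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. This is a sketch only: the repository ID `fffoivos/apertus-tokenizer-extension` is an assumption, the helper names are hypothetical, and `huggingface_hub` must be installed for the download step.

```python
# Checkpoint folder names from the table above.
KNOWN_CHECKPOINTS = {
    "TokenDistil-Init", "TokenDistil-2B", "TokenDistil-3.5B", "TokenDistil-5B",
    "Vanilla-2B", "Vanilla-3.5B", "Vanilla-5B",
    "ReTok-2B", "ReTok-3.5B", "Centroid-2B",
}


def checkpoint_patterns(name):
    """Build the allow_patterns filter for one checkpoint folder."""
    if name not in KNOWN_CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {name!r}")
    return [f"experiment-checkpoints/{name}/*"]


def download_checkpoint(name, repo_id="fffoivos/apertus-tokenizer-extension"):
    """Download one checkpoint folder; repo_id is an assumed value."""
    # Imported lazily so the helpers above work without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=checkpoint_patterns(name))
```

Calling `download_checkpoint("TokenDistil-5B")` would return the local snapshot directory, with the checkpoint files under `experiment-checkpoints/TokenDistil-5B/` inside it.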

Benchmark Evals

The current result anchors are:

benchmark-evals/bakeoff-final/    # 5.0B endpoint, canonical headline
benchmark-evals/3.5B-comparison/  # 3.5B iso-token (Vanilla/ReTok/TD)
benchmark-evals/native-greek-suite/ # vetted native-Greek decision suite

Loss-reading rule: raw Megatron lm loss is per-token cross-entropy, so it is not comparable between the original 131,072-token Vanilla tokenizer and the 148,480-token extended-tokenizer arms, which split the same text into different numbers of tokens. Cross-arm loss conclusions use heldout BPB from the tokenizer-fair eval jobs plus downstream benchmark scores. Older files may label BPB as BPC; that is a legacy name for bits per byte, not bits per character. Raw training loss is only a health check and a within-arm trace unless dense BPB training logs are present.
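The loss-reading rule can be illustrated with the standard bits-per-byte conversion, which normalizes total loss by the UTF-8 byte count of the text instead of by token count. The sketch below assumes the per-token loss is measured in nats; the token counts and losses are made-up numbers, not measurements from these experiments, and the eval jobs may compute BPB differently in detail.

```python
import math


def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert a mean per-token loss (nats) into bits per UTF-8 byte."""
    total_nats = mean_loss_nats * n_tokens
    return total_nats / (math.log(2) * n_bytes)


# Hypothetical example: the same 1000-byte text under two tokenizers.
# The coarser tokenizer has a higher per-token loss but fewer tokens.
fine = bits_per_byte(mean_loss_nats=1.0, n_tokens=400, n_bytes=1000)
coarse = bits_per_byte(mean_loss_nats=1.6, n_tokens=250, n_bytes=1000)

# Per-token losses differ (1.0 vs 1.6), yet the per-byte cost is identical.
assert math.isclose(fine, coarse)
```

This is why a tokenizer that packs more bytes into each token can show a higher raw training loss without modelling the text any worse.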

Greek aggregate rule: the Greek-specific headline now uses benchmark-evals/native-greek-suite/, which includes vetted native MCQ tasks and excludes explicit MT diagnostics. The older bakeoff-final/ Greek aggregate is a fallback lm-eval slice and should not be treated as the native-Greek selection headline.

At 5.0B (bakeoff-final/), with stopped arms shown at their last checkpoint:

| Arm | Greek no-MT aggregate | English retention | Multilingual | Heldout BPB (lower is better) |
| --- | --- | --- | --- | --- |
| Vanilla-5B | 0.4076 | 0.6799 | 0.4936 | 0.4602 |
| TokenDistil-5B | 0.4204 | 0.6903 | 0.4976 | 0.4872 |
| ReTok-3.5B (stopped) | 0.3984 | 0.6786 | 0.4864 | 0.5390 |
| Centroid-2B (stopped) | 0.2566 | 0.6836 | 0.4888 | 0.8994 |

Reading for the older fallback suite: TokenDistil-5B leads Vanilla-5B on all three downstream aggregates. Vanilla-5B still has the lower tokenizer-fair heldout BPB, but the gap narrowed from 0.110 to 0.027 over the bakeoff.
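The TokenDistil-5B versus Vanilla-5B differences can be recomputed from the 5.0B table. This is a small sketch with the table values typed in by hand; the dictionary keys are illustrative names.

```python
# Values copied from the 5.0B table; keys are illustrative.
VANILLA_5B = {"greek": 0.4076, "english": 0.6799, "multilingual": 0.4936, "bpb": 0.4602}
TOKENDISTIL_5B = {"greek": 0.4204, "english": 0.6903, "multilingual": 0.4976, "bpb": 0.4872}


def deltas(arm, control, ndigits=4):
    """Return arm minus control for each metric, rounded to the table's precision."""
    return {key: round(arm[key] - control[key], ndigits) for key in control}


diff = deltas(TOKENDISTIL_5B, VANILLA_5B)

# Downstream aggregates are higher-is-better; BPB is lower-is-better.
downstream_lead = all(diff[k] > 0 for k in ("greek", "english", "multilingual"))
bpb_gap = diff["bpb"]  # positive means Vanilla-5B has the lower (better) BPB
```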

Native-Greek suite reading:

| Arm | Native MCQ headline | MCQ + Plutus | greek-nlp supporting mean |
| --- | --- | --- | --- |
| Apertus-Base | 0.4817 | 0.4902 | 0.2150 |
| Vanilla-5B | 0.4305 | 0.4329 | 0.1679 |
| TokenDistil-5B | 0.4109 | 0.4160 | 0.1733 |

For Greek-specific selection, Vanilla is ahead of TokenDistil on the native MCQ headline, while Apertus-Base remains above all continued checkpoints. Full native-suite tables are in benchmark-evals/native-greek-suite/.

Caveat — the bakeoff was not rule-bound. The pre-commit decision-rule thresholds from old_experiments_plan.md v0.12 §10 Q8 (X / M_progress / M_ext / M_van / T) were never locked before results came in. The 5B headline above is an honest description of the numbers, not an adjudicated winner. See:

supporting-material/provenance/decisions/PLAN_VS_RESULTS_RECONCILIATION_20260526.md
supporting-material/provenance/decisions/CPT_MASTER_20260526.md

Supporting Material

| Path | Meaning |
| --- | --- |
| supporting-material/provenance/decisions/ | Plans + plan-vs-results reconciliation + master synthesis |
| supporting-material/provenance/evals/ | Eval recipe, loss-measurement policy, per-stage result docs |
| supporting-material/provenance/token-distillation/ | TD plan + run log |
| supporting-material/provenance/tokenizer-selection/ | Cutoff sweep + chosen-cutoff report + firing-count audit |
| supporting-material/provenance/dataset-build/ | Mix recipe + dataset build runbook + manifest |
| supporting-material/provenance/conversion-roundtrip/ | R17 HF↔Megatron verification JSONs |
| supporting-material/optional-tokenizers/ | 154k modern + polytonic tokenizer (parked) |
| supporting-material/source-code/ | Pointer back to the GitHub source repo |
| supporting-material/archive/ | Legacy layout artifacts + checksums |

Source

Runnable scripts live in GitHub:

https://github.com/fffoivos/glossapi-tokenizer-extension/tree/main/subprojects/03_apertus_extension_and_embedding_adaptation