Apertus Greek Tokenizer Extension

This repo has four front-stage artifacts.

Path	Meaning
`greek-extension-tokenizer/`	The selected modern Greek extension tokenizer, not the original Apertus tokenizer.
`cpt-training-dataset/`	The CPT data recipe, source graph, and hydration paths.
`experiment-checkpoints/`	HF-format checkpoints for the experiment arms.
`benchmark-evals/`	Benchmark summaries and plots.

Everything else is under supporting-material/.

Greek Extension Tokenizer

greek-extension-tokenizer/ contains ModernGreek-148k, the selected tokenizer for these experiments:

base Apertus vocab: 131072;
added modern Greek C3 tokens: 17408;
total vocab: 148480;
tokenizer.json SHA-256: 358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394.
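A downloaded copy can be checked against the digest above. The following is a minimal sketch, not a script from this repo: the local path is an assumption about where the folder was downloaded, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Values quoted from the list above.
BASE_VOCAB = 131072
ADDED_TOKENS = 17408
TOTAL_VOCAB = 148480
EXPECTED_SHA256 = "358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tokenizer_matches(path, expected=EXPECTED_SHA256):
    """Return True when the file at `path` has the expected digest."""
    return sha256_of_file(path) == expected


# The vocabulary sizes listed above are internally consistent.
assert BASE_VOCAB + ADDED_TOKENS == TOTAL_VOCAB

if __name__ == "__main__":
    # Assumed local path; adjust to wherever the folder was downloaded.
    local_path = Path("greek-extension-tokenizer/tokenizer.json")
    if local_path.exists():
        print("digest matches:", tokenizer_matches(local_path))
    else:
        print("tokenizer.json not found at", local_path)
```

A mismatch most likely means a different tokenizer file was picked up, for example the original Apertus tokenizer or one of the optional tokenizers.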

The original Apertus tokenizer is used only by the Vanilla-* checkpoints, as a control. The optional polytonic tokenizer lives under supporting-material/optional-tokenizers/.

CPT Training Dataset

cpt-training-dataset/ describes CPT-7B-mix, built from:

fffoivos/glossapi-greek-nanochat-pretraining-dataset;
nanochat internal dedup metadata;
Apertus-overlap drop overlay from fffoivos/apertus-c3-dedup-audit-dedup-20260519t010924z;
non-Greek replay, code, and math.

Bulk recipe: 70% Greek, 24% non-Greek replay, 4% code, 2% math.
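As a rough illustration of what these shares mean in tokens, the sketch below splits an integer token budget by the recipe percentages. The function name and the 5,000,000,000-token budget are illustrative assumptions; this is not the sampler used to build the dataset.

```python
# Shares quoted from the bulk recipe above, as integer percentages.
MIX_PERCENT = {"greek": 70, "non_greek_replay": 24, "code": 4, "math": 2}


def split_budget(total_tokens, mix=MIX_PERCENT):
    """Split an integer token budget by percentage share.

    Any remainder left by integer division goes to the largest slice,
    so the parts always sum to the requested budget.
    """
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    parts = {name: total_tokens * pct // 100 for name, pct in mix.items()}
    largest = max(mix, key=mix.get)
    parts[largest] += total_tokens - sum(parts.values())
    return parts


if __name__ == "__main__":
    # Hypothetical budget, chosen only to make the shares concrete.
    for name, tokens in split_budget(5_000_000_000).items():
        print(f"{name}: {tokens:,}")
```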

Measured source-token count with the selected tokenizer:

Source slice	Tokenizer	Rows	Tokens, no EOD	Tokens, +1 EOD/doc
`HPLT/ell_Grek_ge8_no_mt_clean60`	`ModernGreek-148k`	48,728,774	44,195,950,025	44,244,678,799

See cpt-training-dataset/token-counts.json for the exact run metadata.
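The two token columns differ by exactly one EOD token per row, which can be rechecked from the table values. This sketch assumes one row corresponds to one document, which is what the column arithmetic implies; the function name is hypothetical.

```python
# Values quoted from the table row above.
ROWS = 48_728_774
TOKENS_NO_EOD = 44_195_950_025
TOKENS_WITH_EOD = 44_244_678_799


def add_eod(tokens, docs, eod_per_doc=1):
    """Token count after appending `eod_per_doc` EOD tokens to each document."""
    return tokens + docs * eod_per_doc


# Assumption: one row is one document.
assert add_eod(TOKENS_NO_EOD, ROWS) == TOKENS_WITH_EOD

# Mean document length in tokens, before EOD is appended.
mean_tokens_per_doc = TOKENS_NO_EOD / ROWS

if __name__ == "__main__":
    print(f"mean tokens per document: {mean_tokens_per_doc:.1f}")
```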

Experiment Checkpoints

experiment-checkpoints/ contains one folder per checkpoint we care about. Each folder has a manifest.json and a README.md that record the source Clariden path. 5B is the bakeoff-final endpoint; 2B and 3.5B are iso-token snapshots.

Checkpoint	Meaning
`TokenDistil-Init/`	Token Distillation initialization before CPT.
`TokenDistil-2B/`	Token Distillation after the 2B bakeoff.
`TokenDistil-3.5B/`	Token Distillation after the 3.5B continuation.
`TokenDistil-5B/`	Token Distillation at bakeoff-final 5B endpoint.
`Vanilla-2B/`	Original-tokenizer control after the 2B bakeoff.
`Vanilla-3.5B/`	Original-tokenizer control after the 3.5B continuation.
`Vanilla-5B/`	Original-tokenizer control at bakeoff-final 5B endpoint.
`ReTok-2B/`	ReTok baseline after the 2B bakeoff.
`ReTok-3.5B/`	ReTok baseline after the 3.5B continuation (stopped here — dominated by TD).
`Centroid-2B/`	Centroid baseline after the 2B bakeoff (stopped here — broken arm).

Large model weights are uploaded to Hugging Face in these folders. They are not mirrored in the GitHub source repository.

Benchmark Evals

The current result anchors are:

benchmark-evals/bakeoff-final/    # 5.0B endpoint, canonical headline
benchmark-evals/3.5B-comparison/  # 3.5B iso-token (Vanilla/ReTok/TD)
benchmark-evals/native-greek-suite/ # vetted native-Greek decision suite

Loss-reading rule: raw Megatron lm loss is per-token cross entropy, so it is not comparable between the original 131,072-token Vanilla tokenizer and the 148,480-token extended-tokenizer arms. Cross-arm loss conclusions use heldout BPB (bits per byte) from the tokenizer-fair eval jobs, plus downstream benchmark scores. Older files may label BPB as "BPC"; that is a legacy name for bits per byte, not bits per character. Raw training loss is only a health check and a within-arm trace unless dense BPB training logs are present.
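The reason per-token loss does not transfer across tokenizers can be shown with a small conversion sketch. It assumes the mean loss is in nats per token (natural-log cross entropy) and that the byte count is the UTF-8 size of the same heldout text for both arms. The numbers in the example are invented for illustration and are not results from this repo, and the function is not the repo's eval code.

```python
import math


def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert a mean per-token loss (nats) into bits per byte.

    Total nats over the text are divided by ln(2) to get bits,
    then normalized by the byte length of the text.
    """
    total_nats = mean_loss_nats * n_tokens
    return total_nats / (math.log(2) * n_bytes)


# Invented example: the same 1000-byte text under two tokenizers.
N_BYTES = 1000
loss_a, tokens_a = 2.0, 250  # smaller vocab: more tokens, lower per-token loss
loss_b, tokens_b = 2.4, 200  # larger vocab: fewer tokens, higher per-token loss

bpb_a = bits_per_byte(loss_a, tokens_a, N_BYTES)
bpb_b = bits_per_byte(loss_b, tokens_b, N_BYTES)

if __name__ == "__main__":
    print(f"arm A: loss {loss_a:.2f} nats/token, BPB {bpb_a:.4f}")
    print(f"arm B: loss {loss_b:.2f} nats/token, BPB {bpb_b:.4f}")
```

In this invented case arm B has the higher per-token loss but the lower BPB, because it covers the same bytes with fewer tokens. Normalizing by bytes removes that dependence on tokenization.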

Greek aggregate rule: the Greek-specific headline now uses benchmark-evals/native-greek-suite/, which includes vetted native MCQ tasks and excludes explicit MT diagnostics. The older bakeoff-final/ Greek aggregate is a fallback lm-eval slice and should not be treated as the native-Greek selection headline.

At 5.0B (bakeoff-final/), with stopped arms shown at their last checkpoint:

Arm	Greek no-MT aggregate	English retention	Multilingual	Heldout BPB, lower better
Vanilla-5B	0.4076	0.6799	0.4936	0.4602
TokenDistil-5B	0.4204	0.6903	0.4976	0.4872
ReTok-3.5B (stopped)	0.3984	0.6786	0.4864	0.5390
Centroid-2B (stopped)	0.2566	0.6836	0.4888	0.8994

Reading for the older fallback suite: TokenDistil-5B leads Vanilla-5B on all three downstream aggregates. Vanilla-5B keeps the lead on tokenizer-fair heldout BPB, but the gap narrowed from 0.110 to 0.027 over the bakeoff.
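The 5B comparison can be rechecked directly from the two 5B rows of the table. This is a small sketch with hypothetical column keys; positive deltas favor TokenDistil-5B on the three aggregates and favor Vanilla-5B on heldout BPB, where lower is better.

```python
# Column order follows the table above; the key names are illustrative.
COLUMNS = ("greek_no_mt", "english_retention", "multilingual", "heldout_bpb")

# Values quoted from the Vanilla-5B and TokenDistil-5B rows.
VANILLA_5B = (0.4076, 0.6799, 0.4936, 0.4602)
TOKENDISTIL_5B = (0.4204, 0.6903, 0.4976, 0.4872)


def deltas(arm, control):
    """Per-column difference (arm minus control), rounded to 4 decimals."""
    return {name: round(a - c, 4) for name, a, c in zip(COLUMNS, arm, control)}


result = deltas(TOKENDISTIL_5B, VANILLA_5B)

if __name__ == "__main__":
    for name, delta in result.items():
        print(f"{name}: {delta:+.4f}")
```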

Native-Greek suite reading:

Arm	Native MCQ headline	MCQ + Plutus	greek-nlp supporting mean
Apertus-Base	0.4817	0.4902	0.2150
Vanilla-5B	0.4305	0.4329	0.1679
TokenDistil-5B	0.4109	0.4160	0.1733

For Greek-specific selection, Vanilla is ahead of TokenDistil on the native MCQ headline, while Apertus-Base remains above all continued checkpoints. Full native-suite tables are in benchmark-evals/native-greek-suite/.

Caveat: the bakeoff was not rule-bound. The pre-commit decision-rule thresholds from old_experiments_plan.md v0.12 §10 Q8 (X / M_progress / M_ext / M_van / T) were never locked before results came in, so the 5B headline above is an honest description of the numbers, not an adjudicated winner. See:

supporting-material/provenance/decisions/PLAN_VS_RESULTS_RECONCILIATION_20260526.md
supporting-material/provenance/decisions/CPT_MASTER_20260526.md

Supporting Material

Path	Meaning
`supporting-material/provenance/decisions/`	Plans + plan-vs-results reconciliation + master synthesis
`supporting-material/provenance/evals/`	Eval recipe, loss-measurement policy, per-stage result docs
`supporting-material/provenance/token-distillation/`	TD plan + run log
`supporting-material/provenance/tokenizer-selection/`	Cutoff sweep + chosen-cutoff report + firing-count audit
`supporting-material/provenance/dataset-build/`	Mix recipe + dataset build runbook + manifest
`supporting-material/provenance/conversion-roundtrip/`	R17 HF↔Megatron verification JSONs
`supporting-material/optional-tokenizers/`	154k modern + polytonic tokenizer (parked)
`supporting-material/source-code/`	Pointer back to the GitHub source repo
`supporting-material/archive/`	Legacy layout artifacts + checksums

Source

Runnable scripts live in GitHub:

https://github.com/fffoivos/glossapi-tokenizer-extension/tree/main/subprojects/03_apertus_extension_and_embedding_adaptation
