# Bifrost Translation-Source Classifier
Predicts which language an English text was originally translated from: given English input, the model detects the cultural and stylistic traces left by the source language.
## Intended Use
This classifier is part of the Bifrost pipeline. It identifies culturally relevant content for translation into target languages.
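The card does not include loading code; the minimal sketch below assumes the checkpoint loads as a standard `transformers` sequence-classification model with `id2label` set in its config.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "NbAiLab/bifrost-translation-source-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "The cherry blossoms fell silently over the temple garden."
# The classifier was trained on windows of up to 512 tokens (see Training below).
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # an ISO 639-3 code, e.g. "jpn"
```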
## Training
- Base model: jhu-clsp/mmBERT-base
- Frozen base: True (only classification head trained)
- Training samples per language: 10,000
- Validation samples per language: 1,000
- Max sequence length: 512
- Learning rate: 0.001
- Epochs: 20 (with early stopping, patience 3)
During training, a random 512-token window is sampled from each document, exposing the model to different parts of longer texts across epochs. Validation uses a deterministic window per document for comparable losses.
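A sketch of this windowing scheme; the function and variable names are illustrative, not taken from the actual training code.

```python
import random

MAX_LEN = 512

def sample_window(token_ids: list[int], train: bool) -> list[int]:
    """Return a MAX_LEN-token window from a tokenized document.

    Training: a window at a random offset, so different epochs
    see different parts of long documents.
    Validation: a fixed window (offset 0 here) so losses are
    comparable across epochs.
    """
    if len(token_ids) <= MAX_LEN:
        return token_ids
    if train:
        start = random.randint(0, len(token_ids) - MAX_LEN)
    else:
        start = 0  # any deterministic rule works
    return token_ids[start:start + MAX_LEN]
```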
## Performance (held-out test set)
- Test loss: 1.4607
- Test accuracy: 63.0%
## Labels (180 classes)
The labels are ISO 639-3 language codes:

aeb, afr, als, amh, anp, apc, arb, arg, ars, ary, arz, asm, ast, azb, azj, bak, bar, bel, ben, bew, bho, bod, bos, bul, cat, ceb, ces, che, chv, ckb, cmn, cnh, cos, crh, cym, dan, deu, div, dzo, ekk, ell, eng, epo, eus, fao, fas, fij, fil, fin, fra, fry, fur, gaz, gla, gle, glg, glk, grc, gsw, guj, hac, hat, hau, haw, hbo, heb, hif, hil, hin, hne, hrv, hsb, hun, hye, hyw, iba, ibo, ilo, ind, isl, ita, jav, jpn, kal, kan, kat, kaz, kha, khk, khm, kin, kir, kiu, kmr, kor, lao, lat, lim, lin, lit, ltz, lug, lus, lvs, mai, mal, mar, mhr, mkd, mlt, mri, mww, mya, nap, nde, nds, new, nld, nno, nob, npi, nrm, nya, oci, ory, oss, pan, pap, pbt, plt, pnb, pol, por, roh, ron, rue, run, rus, sah, san, scn, sdh, sin, slk, slv, sme, smo, sna, snd, som, sot, spa, srd, srp, sun, swe, swh, tam, tat, tel, tgk, tha, tir, tuk, tur, tyv, udm, uig, ukr, urd, uzn, uzs, vie, xho, ydd, yor, yue, zea, zsm, zul
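With 180 closely related classes, top-k probabilities can be more informative than a single argmax. A sketch reusing the `model` and `tokenizer` objects from the usage example above (the helper name is illustrative):

```python
import torch

def top_k_sources(text: str, k: int = 5) -> list[tuple[str, float]]:
    """Return the k most likely source-language codes with probabilities."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    values, indices = probs.topk(k)
    return [(model.config.id2label[i.item()], v.item())
            for i, v in zip(indices, values)]
```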
## Training Data
Built from HuggingFaceFW/finetranslations (translated texts) and HuggingFaceFW/fineweb (native English). 10,000 train + 1,000 val samples per language.
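A hypothetical sketch of how such per-language splits could be drawn; the configuration name and streaming access pattern are assumptions, not the actual preprocessing code.

```python
from datasets import load_dataset

def take_samples(dataset_name: str, config: str,
                 n_train: int = 10_000, n_val: int = 1_000):
    """Stream the first n_train + n_val rows and split them."""
    ds = load_dataset(dataset_name, config, split="train", streaming=True)
    rows = list(ds.take(n_train + n_val))
    return rows[:n_train], rows[n_train:]

# e.g. English translated from Japanese (config name is illustrative)
train_jpn, val_jpn = take_samples("HuggingFaceFW/finetranslations", "jpn")
```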