Bifrost Translation-Source Classifier

Predicts the language from which an English text was translated. Given English text, the model picks up on cultural and stylistic traces of the source language.

Intended Use

This classifier is part of the Bifrost pipeline. It identifies culturally relevant content for translation into target languages.
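A minimal inference sketch follows. It assumes the checkpoint loads through the standard transformers sequence-classification classes and that its config stores the label codes in id2label; this card confirms neither. The helper names (softmax, top_k_languages, classify) are illustrative and not part of the Bifrost pipeline.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify(text, model_id="NbAiLab/bifrost-translation-source-classifier", k=3):
    """Score one English text. Downloads the model on first use."""
    # Lazy imports so the helpers above run without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    # Truncate to the 512-token limit used during training.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Assumption: config.id2label maps class indices to the language codes.
    return top_k_languages(logits, model.config.id2label, k)
```

Under these assumptions, classify("some English text") would return a short list of (language code, probability) pairs. Returning several candidates instead of a single label is a design choice of this sketch, not something the card prescribes.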

Training

  • Base model: jhu-clsp/mmBERT-base
  • Frozen base: True (only classification head trained)
  • Training samples per language: 10,000
  • Validation samples per language: 1,000
  • Max sequence length: 512
  • Learning rate: 0.001
  • Epochs: 20 (with early stopping, patience 3)
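The stopping rule in the last bullet can be sketched as below. The card does not say which metric is monitored; this sketch assumes validation loss, and the function name early_stopping_epoch is hypothetical.

```python
def early_stopping_epoch(val_losses, max_epochs=20, patience=3):
    """Replay a sequence of per-epoch validation losses.

    Returns (best_epoch, epochs_run), both 1-indexed. Training stops once
    `patience` consecutive epochs fail to improve on the best loss so far,
    or when `max_epochs` is reached.
    Assumption: validation loss is the monitored quantity.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, epochs_run


# Loss improves until epoch 2, then stalls for three epochs.
print(early_stopping_epoch([1.0, 0.9, 0.95, 0.92, 0.91, 0.8]))  # (2, 5)
```

In the example, the improvement at the sixth epoch is never seen, because the patience budget runs out at epoch 5.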

During training, a random 512-token window is sampled from each document, so the model sees different parts of longer texts across epochs. Validation uses a deterministic window per document, which keeps validation losses comparable from one epoch to the next.
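The two windowing modes can be sketched as follows. The card only states that the validation window is deterministic; taking the leading 512 tokens is an assumption of this sketch, as are the function names. Special tokens are ignored for simplicity.

```python
import random


def sample_window(token_ids, max_len=512, rng=None):
    """Training mode: pick a random contiguous window of at most max_len tokens."""
    if len(token_ids) <= max_len:
        return list(token_ids)  # short documents are used whole
    rng = rng or random
    # randint is inclusive on both ends, so the last valid start is len - max_len.
    start = rng.randint(0, len(token_ids) - max_len)
    return list(token_ids[start:start + max_len])


def fixed_window(token_ids, max_len=512):
    """Validation mode: always the same window for a given document.

    Assumption: the leading window is used; the card does not specify which.
    """
    return list(token_ids[:max_len])


doc = list(range(2000))  # stand-in for a tokenized document
train_view = sample_window(doc, rng=random.Random(0))
val_view = fixed_window(doc)
print(len(train_view), len(val_view))  # 512 512
```

Passing a seeded random.Random makes the training-side sampling reproducible in tests while leaving it random across epochs in normal use.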

Performance (held-out test set)

  • Test loss: 1.4607
  • Test accuracy: 63.0%
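For context, these figures can be set against a uniform-guess baseline over the 180 labels. The sketch below assumes the reported loss is a mean cross-entropy in nats and that the test classes are balanced; the card states neither explicitly.

```python
import math

num_classes = 180      # from the label list in this card
test_loss = 1.4607     # reported test loss
test_accuracy = 0.630  # reported test accuracy

# A model guessing uniformly at random over balanced classes.
chance_accuracy = 1 / num_classes
uniform_loss = math.log(num_classes)

# exp(loss) can be read as the effective number of labels the model
# is still undecided between, assuming the loss is cross-entropy in nats.
effective_choices = math.exp(test_loss)

print(f"chance accuracy: {chance_accuracy:.2%}")        # about 0.56%
print(f"uniform-guess loss: {uniform_loss:.3f}")         # about 5.193
print(f"effective choices: {effective_choices:.2f}")     # about 4.31
print(f"accuracy vs. chance: {test_accuracy / chance_accuracy:.0f}x")
```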

Labels (180 classes)

  • aeb
  • afr
  • als
  • amh
  • anp
  • apc
  • arb
  • arg
  • ars
  • ary
  • arz
  • asm
  • ast
  • azb
  • azj
  • bak
  • bar
  • bel
  • ben
  • bew
  • bho
  • bod
  • bos
  • bul
  • cat
  • ceb
  • ces
  • che
  • chv
  • ckb
  • cmn
  • cnh
  • cos
  • crh
  • cym
  • dan
  • deu
  • div
  • dzo
  • ekk
  • ell
  • eng
  • epo
  • eus
  • fao
  • fas
  • fij
  • fil
  • fin
  • fra
  • fry
  • fur
  • gaz
  • gla
  • gle
  • glg
  • glk
  • grc
  • gsw
  • guj
  • hac
  • hat
  • hau
  • haw
  • hbo
  • heb
  • hif
  • hil
  • hin
  • hne
  • hrv
  • hsb
  • hun
  • hye
  • hyw
  • iba
  • ibo
  • ilo
  • ind
  • isl
  • ita
  • jav
  • jpn
  • kal
  • kan
  • kat
  • kaz
  • kha
  • khk
  • khm
  • kin
  • kir
  • kiu
  • kmr
  • kor
  • lao
  • lat
  • lim
  • lin
  • lit
  • ltz
  • lug
  • lus
  • lvs
  • mai
  • mal
  • mar
  • mhr
  • mkd
  • mlt
  • mri
  • mww
  • mya
  • nap
  • nde
  • nds
  • new
  • nld
  • nno
  • nob
  • npi
  • nrm
  • nya
  • oci
  • ory
  • oss
  • pan
  • pap
  • pbt
  • plt
  • pnb
  • pol
  • por
  • roh
  • ron
  • rue
  • run
  • rus
  • sah
  • san
  • scn
  • sdh
  • sin
  • slk
  • slv
  • sme
  • smo
  • sna
  • snd
  • som
  • sot
  • spa
  • srd
  • srp
  • sun
  • swe
  • swh
  • tam
  • tat
  • tel
  • tgk
  • tha
  • tir
  • tuk
  • tur
  • tyv
  • udm
  • uig
  • ukr
  • urd
  • uzn
  • uzs
  • vie
  • xho
  • ydd
  • yor
  • yue
  • zea
  • zsm
  • zul

Training Data

Built from HuggingFaceFW/finetranslations (translated texts) and HuggingFaceFW/fineweb (native English). Each language contributes 10,000 training and 1,000 validation samples.
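A balanced per-language split of this kind could be drawn as sketched below. The input format, a list of (text, language code) pairs, is hypothetical and does not reflect the actual schema of either dataset; the function name and the seed are likewise illustrative.

```python
import random
from collections import defaultdict


def split_per_language(examples, n_train=10_000, n_val=1_000, seed=0):
    """Draw disjoint, equally sized train/val sets for every language.

    `examples` is an iterable of (text, lang) pairs (hypothetical format).
    """
    by_lang = defaultdict(list)
    for text, lang in examples:
        by_lang[lang].append(text)

    rng = random.Random(seed)
    train, val = [], []
    for lang in sorted(by_lang):  # sorted for a reproducible order
        texts = by_lang[lang]
        if len(texts) < n_train + n_val:
            raise ValueError(f"not enough samples for {lang}: {len(texts)}")
        rng.shuffle(texts)
        train += [(t, lang) for t in texts[:n_train]]
        val += [(t, lang) for t in texts[n_train:n_train + n_val]]
    return train, val


# Tiny demonstration with 3 train + 1 val samples per language.
toy = [(f"{lang}-{i}", lang) for lang in ("deu", "eng", "nob") for i in range(5)]
train, val = split_per_language(toy, n_train=3, n_val=1)
print(len(train), len(val))  # 9 3
```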

Model size: 0.3B params (Safetensors, tensor type F32)