ColBERT-Amharic-Base

This is a PyLate model fine-tuned from rasyosef/roberta-base-amharic. It was presented in the paper "The Multilingual Curse at the Retrieval Layer: Evidence from Amharic".

The model maps sentences and paragraphs to sequences of 128-dimensional dense vectors, one vector per token, and can be used for semantic text retrieval with the MaxSim operator.
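
For reference, the MaxSim operator is usually written as below in the ColBERT literature. The notation is introduced here for illustration only: $\mathbf{q}_i$ and $\mathbf{d}_j$ are the 128-dimensional token vectors of a query and a document, and $N_q$ and $N_d$ are their token counts. It assumes the vectors are L2-normalized, so each dot product is a cosine similarity.

```latex
S(q, d) = \sum_{i=1}^{N_q} \max_{1 \le j \le N_d} \mathbf{q}_i^{\top} \mathbf{d}_j,
\qquad \mathbf{q}_i, \mathbf{d}_j \in \mathbb{R}^{128}
```

Each query token contributes the similarity of its single best-matching document token, and the contributions are summed into one relevance score.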

Model Details

Usage

Late-Interaction (PyLate)

First install the PyLate library:

pip install -U pylate

Then you can load the model and run inference:

from pylate import models

# Download from the 🤗 Hub
model = models.ColBERT(
    model_name_or_path="rasyosef/colbert-amharic-base",
)

# Run inference
sentences = [
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
    'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
]

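# Encode the inputs as queries; pass is_query=False to encode documents instead
embeddings = model.encode(
    sentences,
    is_query=True,
)

# One matrix per input: (number of token vectors, embedding dimension)
print(embeddings[0].shape)
# (32, 128)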

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
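
Since the model returns one matrix of token vectors per input, ranking documents for a query comes down to comparing those matrices with MaxSim. The following is a minimal NumPy sketch of that idea, not PyLate's own implementation. The helpers `maxsim_score`, `rank_documents`, and `unit_rows` are hypothetical names, and the embeddings are random unit vectors standing in for real model outputs.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: each query token takes its best-matching document token; sum the maxima."""
    token_sims = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    return float(token_sims.max(axis=1).sum())


def rank_documents(query_emb: np.ndarray, doc_embs: list) -> list:
    """Return (document index, score) pairs sorted from best to worst."""
    scores = [maxsim_score(query_emb, doc_emb) for doc_emb in doc_embs]
    order = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Stand-in data (assumption): random unit vectors instead of model.encode outputs.
rng = np.random.default_rng(0)


def unit_rows(n: int, dim: int = 128) -> np.ndarray:
    x = rng.normal(size=(n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


query = unit_rows(32)                                   # shaped like a query embedding
relevant_doc = np.vstack([query[:10], unit_rows(20)])   # shares 10 token vectors with the query
unrelated_doc = unit_rows(45)                           # documents may differ in length

ranking = rank_documents(query, [unrelated_doc, relevant_doc])
print(ranking)  # the document at index 1 (relevant_doc) is ranked first
```

Documents of different lengths need no padding here because each query-document pair is scored separately. A batched implementation would pad the documents and mask the padded positions before taking the maximum.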

Evaluation

Metric            Value
cosine_recall@5   0.902
cosine_recall@10  0.930
cosine_ndcg@10    0.835
cosine_mrr@10     0.803
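
To make the metric names concrete, here is a minimal sketch of how Recall@k, MRR@k, and nDCG@k can be computed for a single query. It assumes binary relevance and is not necessarily the exact evaluator behind the numbers above. The function names and the passage ids are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query whose only relevant passage is "p2".
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p2"}

print(recall_at_k(ranked, relevant, 5))           # 1.0
print(mrr_at_k(ranked, relevant, 10))             # 0.5
print(round(ndcg_at_k(ranked, relevant, 10), 3))  # 0.631
```

Dataset-level values are typically the mean of these per-query scores over all evaluation queries.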

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}

Model size: 0.1B params (Safetensors, tensor type F32)

Evaluation results

Self-reported on Amharic Passage Retrieval Dataset V2:

  • Cosine Recall@5: 0.902
  • Cosine Recall@10: 0.930
  • Cosine NDCG@10: 0.835
  • Cosine MRR@10: 0.803