How to use rasyosef/colbert-amharic-base with PyLate:

```python
from pylate import models

queries = [
    "Which planet is known as the Red Planet?",
    "What is the largest planet in our solar system?",
]

# One list of candidate documents per query
documents = [
    ["Mars is the Red Planet.", "Venus is Earth's twin."],
    ["Jupiter is the largest planet.", "Saturn has rings."],
]

model = models.ColBERT(model_name_or_path="rasyosef/colbert-amharic-base")

queries_emb = model.encode(queries, is_query=True)
docs_emb = model.encode(documents, is_query=False)
```

This is a PyLate model finetuned from rasyosef/roberta-base-amharic. It was presented in the paper The Multilingual Curse at the Retrieval Layer: Evidence from Amharic.
The model maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual retrieval using the MaxSim operator.
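To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring: for each query token vector, take its highest dot product with any document token vector, then sum those maxima. The helper names `dot` and `maxsim` and the toy 2-dimensional vectors are illustrative assumptions, not part of PyLate; the model's vectors are 128-dimensional, and the sketch assumes they are already normalized so that a dot product behaves like cosine similarity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values, 2 dimensions instead of 128).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens exactly
doc_b = [[0.5, 0.5]]                          # only partially matches each token

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because every query token is matched independently, a document scores highly only if it covers all query tokens well, which is why `doc_a` outranks `doc_b` here.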
First install the PyLate library:
```bash
pip install -U pylate
```
Then you can load the model and run inference:
```python
from pylate import models

# Download from the 🤗 Hub
model = models.ColBERT(
    model_name_or_path="rasyosef/colbert-amharic-base",
)

# Run inference
sentences = [
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
    'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
]
embeddings = model.encode(
    sentences,
    is_query=True,
)
print(embeddings[0].shape)
# (32, 128)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
| Metric | Value |
|---|---|
| cosine_recall@5 | 0.902 |
| cosine_recall@10 | 0.930 |
| cosine_ndcg@10 | 0.835 |
| cosine_mrr@10 | 0.803 |
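
The card does not say how these metrics were computed. As a rough guide to reading them, the sketch below implements the standard binary-relevance definitions of recall@k, MRR@k, and nDCG@k for a single query; the reported values would then be averages over all evaluation queries. The function names and the toy ranking are hypothetical and are not taken from the evaluation code behind the table.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the single relevant passage "d1" is retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}

print(recall_at_k(ranked, relevant, 5))
print(mrr_at_k(ranked, relevant, 10))
print(ndcg_at_k(ranked, relevant, 10))
```
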
```bibtex
@inproceedings{alemneh2026amharicir,
  title = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year = {2026},
}
```