Nomic Embed: Training a Reproducible Long Context Text Embedder
Paper: arXiv:2402.01613
How to use nomic-ai/modernbert-embed-base-unsupervised with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/modernbert-embed-base-unsupervised")
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

modernbert-embed-unsupervised is the unsupervised checkpoint trained with the contrastors library
for 1 epoch over the 235M weakly-supervised contrastive pairs curated in Nomic Embed.
We suggest using modernbert-embed rather than this checkpoint for embedding tasks.
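To make the `model.similarity(embeddings, embeddings)` call in the usage snippet more concrete, here is a minimal sketch of a pairwise cosine similarity matrix written in plain NumPy. It assumes cosine similarity is the scoring function in use (the sentence-transformers default, as far as we know), and it substitutes random vectors for real model output so that it runs without downloading the checkpoint. The helper name `cosine_similarity_matrix` and the embedding width of 768 are illustrative assumptions, not part of this model's API.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between rows of a and rows of b."""
    # Normalize each row to unit length, then take all pairwise dot products.
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


# Stand-in for model.encode(sentences): four random vectors.
# The width (768) is an assumption made for illustration only.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 768)).astype(np.float32)

similarities = cosine_similarity_matrix(embeddings, embeddings)
print(similarities.shape)  # (4, 4)

# For each sentence, find the index of the most similar *other* sentence
# by masking out the diagonal (every vector matches itself with score 1).
masked = similarities.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)
print(nearest)
```

With real embeddings from `model.encode(sentences)` in place of the random vectors, `nearest` would hold, for each input sentence, the index of its closest neighbour in the list.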
The modernbert-embed-unsupervised model performs similarly to the nomic-embed-text-v1_unsup model:
| Model | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Overall |
|---|---|---|---|---|---|---|---|---|
| nomic-embed-text-v1_unsup | 59.9 | 71.2 | 42.5 | 83.7 | 55.0 | 48.0 | 80.8 | 30.7 |
| modernbert-embed-unsupervised | 60.03 | 72.11 | 44.34 | 82.78 | 55.0 | 47.05 | 80.33 | 31.2 |
Base model: answerdotai/ModernBERT-base