Embedding Implementations¶
Arshai provides reference implementations for generating text embeddings using various providers. These implementations follow the IEmbedding interface, allowing you to swap providers easily.
Important
These are Reference Implementations
The embedding implementations are examples showing how to:
Implement the IEmbedding interface
Integrate with different embedding providers
Handle batch processing and error cases
You can use these as-is, extend them, or build your own implementations.
Overview¶
Available Implementations:
OpenAI Embeddings - High-quality general-purpose embeddings
VoyageAI Embeddings - Specialized embeddings for various domains
MGTE Embeddings - Multi-granularity text embeddings
All implementations provide:
Batch text embedding
Query embedding (single text)
Configurable dimensions (where supported)
Async support
Error handling
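Because every implementation follows the same interface, switching providers is a one-line change. A minimal sketch of the shared usage pattern (OpenAI here is just an example):

from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# Any IEmbedding implementation can be swapped in here
embedder = OpenAIEmbedding(EmbeddingConfig(model_name="text-embedding-3-small"))

batch = embedder.embed_documents(["first text", "second text"])  # batch embedding
single = embedder.embed_query("a search query")                  # query embedding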
Installation¶
Install with embedding support:
# Core package (includes interfaces)
pip install arshai
# For OpenAI embeddings
pip install arshai openai
# For VoyageAI embeddings
pip install arshai voyageai
# For MGTE embeddings
pip install arshai sentence-transformers
OpenAI Embeddings¶
High-quality embeddings from OpenAI’s API.
Configuration¶
import os
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
# Set API key
os.environ["OPENAI_API_KEY"] = "your-api-key"
# Create configuration
config = EmbeddingConfig(
    model_name="text-embedding-3-small",  # or text-embedding-3-large, text-embedding-ada-002
    batch_size=100
)
# Create embedding instance
embedder = OpenAIEmbedding(config)
print(f"Embedding dimension: {embedder.dimension}")
# Output: 1536
Available Models¶
| Model | Dimension | Best For |
|---|---|---|
| text-embedding-3-small | 1536 | Fast, cost-effective embeddings |
| text-embedding-3-large | 3072 | Highest-quality embeddings |
| text-embedding-ada-002 | 1536 | Legacy model (still supported) |
Basic Usage¶
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
# Initialize
config = EmbeddingConfig(model_name="text-embedding-3-small")
embedder = OpenAIEmbedding(config)
# Embed documents
documents = [
    "Artificial intelligence is transforming technology",
    "Machine learning powers modern AI systems",
    "Deep learning uses neural networks"
]
result = embedder.embed_documents(documents)
print(f"Generated {len(result['embeddings'])} embeddings")
print(f"Embedding dimension: {len(result['embeddings'][0])}")
print(f"Tokens used: {result['total_tokens']}")
# Embed query
query_result = embedder.embed_query("What is AI?")
print(f"Query embedding dimension: {len(query_result['embedding'])}")
Async Usage¶
import asyncio

async def embed_async():
    config = EmbeddingConfig(model_name="text-embedding-3-small")
    embedder = OpenAIEmbedding(config)

    documents = ["Document 1", "Document 2", "Document 3"]

    # Async embedding
    result = await embedder.embed_documents_async(documents)
    print(f"Embedded {len(result['embeddings'])} documents asynchronously")

asyncio.run(embed_async())
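When a provider's async methods are genuinely non-blocking, multiple batches can be embedded concurrently. A small sketch using asyncio.gather (how you split documents into batches is up to you):

async def embed_batches_concurrently(embedder, batches):
    """Embed several batches of documents concurrently."""
    results = await asyncio.gather(
        *(embedder.embed_documents_async(batch) for batch in batches)
    )
    # Flatten per-batch results into one list of embeddings
    return [emb for result in results for emb in result['embeddings']]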
VoyageAI Embeddings¶
Specialized embeddings for different domains and use cases.
Configuration¶
import os
from arshai.embeddings.voyageai_embedding import VoyageAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
# Set API key
os.environ["VOYAGE_API_KEY"] = "your-api-key"
# Create configuration
config = EmbeddingConfig(
    model_name="voyage-3-large",
    batch_size=100
)
# Create embedding instance
embedder = VoyageAIEmbedding(config)
Available Models¶
Flexible Dimension Models:
| Model | Default Dimension | Allowed Dimensions |
|---|---|---|
| voyage-3-large | 1024 | 256, 512, 1024, 2048 |
| voyage-3.5 | 1024 | 256, 512, 1024, 2048 |
| voyage-3.5-lite | 1024 | 256, 512, 1024, 2048 |
| voyage-code-3 | 1024 | 256, 512, 1024, 2048 |
Domain-Specific Models:

| Model | Dimension | Specialization |
|---|---|---|
| voyage-finance-2 | 1024 | Financial documents |
| voyage-law-2 | 1024 | Legal documents |
| voyage-code-2 | 1536 | Code and programming |
| voyage-multilingual-2 | 1024 | Multilingual text |
Basic Usage¶
from arshai.embeddings.voyageai_embedding import VoyageAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
# General purpose
config = EmbeddingConfig(model_name="voyage-3-large")
embedder = VoyageAIEmbedding(config)
# Embed documents
documents = ["AI is revolutionary", "ML powers innovation"]
result = embedder.embed_documents(documents)
# Domain-specific (legal)
legal_config = EmbeddingConfig(model_name="voyage-law-2")
legal_embedder = VoyageAIEmbedding(legal_config)
legal_docs = [
    "The defendant pleaded guilty to charges",
    "Court ruled in favor of the plaintiff"
]
legal_result = legal_embedder.embed_documents(legal_docs)
Custom Dimensions¶
# Use custom dimension (for supported models)
config = EmbeddingConfig(
    model_name="voyage-3-large",
    dimension=512  # Choose from [256, 512, 1024, 2048]
)
embedder = VoyageAIEmbedding(config)
result = embedder.embed_documents(["Sample text"])
print(f"Embedding dimension: {len(result['embeddings'][0])}")
# Output: 512
MGTE Embeddings¶
Multi-granularity text embeddings using sentence transformers.
Configuration¶
from arshai.embeddings.mgte_embeddings import MGTEEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
# Create configuration
config = EmbeddingConfig(
    model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    batch_size=32
)
# Create embedding instance (downloads model on first use)
embedder = MGTEEmbedding(config)
Available Models¶
# Default model
config = EmbeddingConfig(model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct")
# Other Sentence Transformer models
config = EmbeddingConfig(model_name="sentence-transformers/all-MiniLM-L6-v2")
Basic Usage¶
from arshai.embeddings.mgte_embeddings import MGTEEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
# Initialize (model cached after first download)
config = EmbeddingConfig(
    model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct"
)
embedder = MGTEEmbedding(config)

# Embed documents
documents = [
    "Natural language processing enables machines to understand text",
    "Embeddings convert text into numerical vectors"
]
result = embedder.embed_documents(documents)
print(f"Dimension: {embedder.dimension}")
Advanced Usage¶
Batch Processing¶
def batch_embed_large_dataset(embedder, documents: list, batch_size: int = 100):
    """Embed a large dataset in batches."""
    all_embeddings = []

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        result = embedder.embed_documents(batch)
        all_embeddings.extend(result['embeddings'])
        print(f"Processed {len(all_embeddings)}/{len(documents)} documents")

    return all_embeddings

# Usage
large_dataset = ["Document " + str(i) for i in range(1000)]
embeddings = batch_embed_large_dataset(embedder, large_dataset)
Similarity Search¶
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def find_similar_documents(query: str, documents: list, embedder, top_k: int = 3):
    """Find the documents most similar to a query."""
    # Embed query
    query_result = embedder.embed_query(query)
    query_embedding = query_result['embedding']

    # Embed documents
    doc_result = embedder.embed_documents(documents)
    doc_embeddings = doc_result['embeddings']

    # Calculate similarities
    similarities = [
        cosine_similarity(query_embedding, doc_emb)
        for doc_emb in doc_embeddings
    ]

    # Get top-k indices, highest similarity first
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    return [
        {"document": documents[i], "similarity": similarities[i]}
        for i in top_indices
    ]

# Usage
query = "What is artificial intelligence?"
documents = [
    "AI is the simulation of human intelligence",
    "Machine learning is a subset of AI",
    "Python is a programming language"
]

similar_docs = find_similar_documents(query, documents, embedder)
for doc in similar_docs:
    print(f"Similarity: {doc['similarity']:.3f} - {doc['document']}")
Caching Embeddings¶
import hashlib
import pickle
from pathlib import Path

class CachedEmbedder:
    """Wrapper that caches embeddings to disk."""

    def __init__(self, embedder, cache_dir: str = ".embedding_cache"):
        self.embedder = embedder
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _get_cache_key(self, text: str) -> str:
        """Generate a cache key from the text."""
        return hashlib.md5(text.encode()).hexdigest()

    def embed_with_cache(self, text: str):
        """Embed with disk caching."""
        cache_key = self._get_cache_key(text)
        cache_file = self.cache_dir / f"{cache_key}.pkl"

        # Check cache
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                return pickle.load(f)

        # Generate embedding
        result = self.embedder.embed_query(text)
        embedding = result['embedding']

        # Save to cache
        with open(cache_file, 'wb') as f:
            pickle.dump(embedding, f)

        return embedding

# Usage
cached_embedder = CachedEmbedder(embedder)
embedding1 = cached_embedder.embed_with_cache("Sample text")  # Generates
embedding2 = cached_embedder.embed_with_cache("Sample text")  # From cache
Error Handling¶
import time

from openai import OpenAIError

def safe_embed(embedder, documents: list, max_retries: int = 3):
    """Embed with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            return embedder.embed_documents(documents)
        except OpenAIError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

# Usage
try:
    result = safe_embed(embedder, documents)
except OpenAIError as e:
    print(f"Failed after retries: {e}")
Choosing an Embedding Provider¶
Use OpenAI when:
You need high-quality general-purpose embeddings
You’re already using OpenAI for LLMs
You want reliable, well-tested embeddings
Cost is not the primary concern
Use VoyageAI when:
You have domain-specific content (finance, legal, code)
You need flexible embedding dimensions
You want specialized models for your use case
You need multilingual support
Use MGTE when:
You want to run embeddings locally
You need offline operation
You want to avoid API costs
You have GPU resources available
Privacy is a concern
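Because all three implementations share the IEmbedding interface, the choice can live behind a small factory. A minimal sketch (this helper is illustrative, not part of the framework):

from arshai.core.interfaces.iembedding import EmbeddingConfig
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.embeddings.voyageai_embedding import VoyageAIEmbedding
from arshai.embeddings.mgte_embeddings import MGTEEmbedding

# Hypothetical provider registry: implementation class plus a default model
PROVIDERS = {
    "openai": (OpenAIEmbedding, "text-embedding-3-small"),
    "voyageai": (VoyageAIEmbedding, "voyage-3-large"),
    "local": (MGTEEmbedding, "Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
}

def make_embedder(provider: str, model_name=None):
    """Build an embedder by provider name (hypothetical helper)."""
    cls, default_model = PROVIDERS[provider]
    return cls(EmbeddingConfig(model_name=model_name or default_model))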
Performance Comparison¶
| Provider | Speed | Quality | Cost |
|---|---|---|---|
| OpenAI | Fast | High | API calls |
| VoyageAI | Fast | Specialized | API calls |
| MGTE | Medium | Good | Free (local) |
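These ratings are rough; actual speed and quality depend on your hardware, batch sizes, and content, so benchmark on your own data. A minimal wall-clock timing sketch:

import time

def time_embedding(embedder, documents):
    """Rough throughput measurement for one embed_documents call."""
    start = time.perf_counter()
    embedder.embed_documents(documents)
    elapsed = time.perf_counter() - start
    print(f"{len(documents)} docs in {elapsed:.2f}s ({len(documents) / elapsed:.1f} docs/sec)")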
Building Custom Embeddings¶
Implement the IEmbedding interface:
from typing import Any, Dict, List

from arshai.core.interfaces.iembedding import IEmbedding, EmbeddingConfig

class CustomEmbedding(IEmbedding):
    """Custom embedding implementation."""

    def __init__(self, config: EmbeddingConfig):
        self.model_name = config.model_name
        self.batch_size = config.batch_size
        self._dimension = 768  # Your model's dimension

    @property
    def dimension(self) -> int:
        return self._dimension

    def embed_documents(self, texts: List[str]) -> Dict[str, Any]:
        """Embed multiple documents."""
        embeddings = [self._embed_single(text) for text in texts]
        return {
            "embeddings": embeddings,
            "total_tokens": len(texts) * 100  # Approximate
        }

    def embed_query(self, text: str) -> Dict[str, Any]:
        """Embed a single query."""
        embedding = self._embed_single(text)
        return {
            "embedding": embedding,
            "total_tokens": 100  # Approximate
        }

    def _embed_single(self, text: str) -> List[float]:
        """Your embedding logic."""
        # Implement your embedding generation here
        raise NotImplementedError

    async def embed_documents_async(self, texts: List[str]) -> Dict[str, Any]:
        """Async version (delegates to the sync implementation)."""
        return self.embed_documents(texts)

    async def embed_query_async(self, text: str) -> Dict[str, Any]:
        """Async version (delegates to the sync implementation)."""
        return self.embed_query(text)
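Note that the async stubs above simply call the blocking methods, which stalls the event loop for the duration of the call. If your underlying client is synchronous, one option (Python 3.9+) is to offload the work to a thread:

import asyncio

class ThreadedCustomEmbedding(CustomEmbedding):
    """Variant whose async methods don't block the event loop."""

    async def embed_documents_async(self, texts):
        return await asyncio.to_thread(self.embed_documents, texts)

    async def embed_query_async(self, text):
        return await asyncio.to_thread(self.embed_query, text)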
Best Practices¶
Consistent Models: Use the same embedding model for documents and queries.
Batch Processing: Embed multiple documents at once for better throughput.
Cache Results: Cache embeddings for frequently accessed documents.
Error Handling: Implement retry logic for API-based embeddings.
Monitor Costs: Track API usage for cost management (see the tracking sketch below).
Choose Appropriate Dimensions: Higher dimensions generally give better quality at the price of more storage and compute.
Test Different Providers: Benchmark providers on your specific use case.
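For the Monitor Costs point, the total_tokens field returned by the reference implementations is enough for simple tracking. A sketch (the rate below is a placeholder; check your provider's current pricing):

class UsageTracker:
    """Accumulates token usage across embedding calls (illustrative)."""

    def __init__(self, price_per_million_tokens: float):
        self.price_per_million = price_per_million_tokens
        self.total_tokens = 0

    def record(self, result: dict) -> None:
        self.total_tokens += result.get('total_tokens', 0)

    @property
    def estimated_cost(self) -> float:
        return self.total_tokens / 1_000_000 * self.price_per_million

tracker = UsageTracker(price_per_million_tokens=0.02)  # placeholder rate
tracker.record(embedder.embed_documents(["Doc A", "Doc B"]))
print(f"{tracker.total_tokens} tokens, ~${tracker.estimated_cost:.6f}")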
Integration with Vector Databases¶
See Vector Database - Milvus Client for using embeddings with vector stores.
# Quick example
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
from arshai.vector_db.milvus_client import MilvusClient
# Create embedder
embedder = OpenAIEmbedding(EmbeddingConfig(model_name="text-embedding-3-small"))
# Generate embeddings
documents = ["Doc 1", "Doc 2", "Doc 3"]
result = embedder.embed_documents(documents)
# Store in vector database
# (See vector-databases documentation for details)
Next Steps¶
Vector Storage: See Vector Database - Milvus Client for storing and searching embeddings
RAG Systems: Build retrieval-augmented generation systems
Semantic Search: Implement semantic search for your application
Remember: these are reference implementations. The framework provides the IEmbedding interface; you can implement it for any embedding provider or custom model.