Embedding Implementations

Arshai provides reference implementations for generating text embeddings using various providers. These implementations follow the IEmbedding interface, allowing you to swap providers easily.

Important: These are Reference Implementations

The embedding implementations are examples showing how to:

  • Implement the IEmbedding interface

  • Integrate with different embedding providers

  • Handle batch processing and error cases

You can use these as-is, extend them, or build your own implementations.

Overview

Available Implementations:

  1. OpenAI Embeddings - High-quality general-purpose embeddings

  2. VoyageAI Embeddings - Specialized embeddings for various domains

  3. MGTE Embeddings - Multi-granularity text embeddings

All implementations provide:

  • Batch text embedding

  • Query embedding (single text)

  • Configurable dimensions (where supported)

  • Async support

  • Error handling

Installation

Install with embedding support:

# Core package (includes interfaces)
pip install arshai

# For OpenAI embeddings
pip install arshai openai

# For VoyageAI embeddings
pip install arshai voyageai

# For MGTE embeddings
pip install arshai sentence-transformers

OpenAI Embeddings

High-quality embeddings from OpenAI’s API.

Configuration

import os
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# Set API key
os.environ["OPENAI_API_KEY"] = "your-api-key"

# Create configuration
config = EmbeddingConfig(
    model_name="text-embedding-3-small",  # or text-embedding-3-large, text-embedding-ada-002
    batch_size=100
)

# Create embedding instance
embedder = OpenAIEmbedding(config)

print(f"Embedding dimension: {embedder.dimension}")
# Output: 1536

Available Models

Model                    Dimension  Best For
text-embedding-3-small   1536       Fast, cost-effective embeddings
text-embedding-3-large   3072       Highest-quality embeddings
text-embedding-ada-002   1536       Legacy model (still supported)

Basic Usage

from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# Initialize
config = EmbeddingConfig(model_name="text-embedding-3-small")
embedder = OpenAIEmbedding(config)

# Embed documents
documents = [
    "Artificial intelligence is transforming technology",
    "Machine learning powers modern AI systems",
    "Deep learning uses neural networks"
]

result = embedder.embed_documents(documents)

print(f"Generated {len(result['embeddings'])} embeddings")
print(f"Embedding dimension: {len(result['embeddings'][0])}")
print(f"Tokens used: {result['total_tokens']}")

# Embed query
query_result = embedder.embed_query("What is AI?")
print(f"Query embedding dimension: {len(query_result['embedding'])}")

Async Usage

import asyncio

async def embed_async():
    config = EmbeddingConfig(model_name="text-embedding-3-small")
    embedder = OpenAIEmbedding(config)

    documents = ["Document 1", "Document 2", "Document 3"]

    # Async embedding
    result = await embedder.embed_documents_async(documents)
    print(f"Embedded {len(result['embeddings'])} documents asynchronously")

asyncio.run(embed_async())

VoyageAI Embeddings

Specialized embeddings for different domains and use cases.

Configuration

import os
from arshai.embeddings.voyageai_embedding import VoyageAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# Set API key
os.environ["VOYAGE_API_KEY"] = "your-api-key"

# Create configuration
config = EmbeddingConfig(
    model_name="voyage-3-large",
    batch_size=100
)

# Create embedding instance
embedder = VoyageAIEmbedding(config)

Available Models

Flexible Dimension Models:

Model            Default Dimension  Allowed Dimensions
voyage-3-large   1024               256, 512, 1024, 2048
voyage-3.5       1024               256, 512, 1024, 2048
voyage-3.5-lite  1024               256, 512, 1024, 2048
voyage-code-3    1024               256, 512, 1024, 2048

Domain-Specific Models:

Model                  Dimension  Specialization
voyage-finance-2       1024       Financial documents
voyage-law-2           1024       Legal documents
voyage-code-2          1536       Code and programming
voyage-multilingual-2  1024       Multilingual text

Basic Usage

from arshai.embeddings.voyageai_embedding import VoyageAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# General purpose
config = EmbeddingConfig(model_name="voyage-3-large")
embedder = VoyageAIEmbedding(config)

# Embed documents
documents = ["AI is revolutionary", "ML powers innovation"]
result = embedder.embed_documents(documents)

# Domain-specific (legal)
legal_config = EmbeddingConfig(model_name="voyage-law-2")
legal_embedder = VoyageAIEmbedding(legal_config)

legal_docs = [
    "The defendant pleaded guilty to charges",
    "Court ruled in favor of the plaintiff"
]
legal_result = legal_embedder.embed_documents(legal_docs)

Custom Dimensions

# Use custom dimension (for supported models)
config = EmbeddingConfig(
    model_name="voyage-3-large",
    dimension=512  # Choose from [256, 512, 1024, 2048]
)
embedder = VoyageAIEmbedding(config)

result = embedder.embed_documents(["Sample text"])
print(f"Embedding dimension: {len(result['embeddings'][0])}")
# Output: 512

MGTE Embeddings

Multi-granularity text embeddings using sentence transformers.

Configuration

from arshai.embeddings.mgte_embeddings import MGTEEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# Create configuration
config = EmbeddingConfig(
    model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    batch_size=32
)

# Create embedding instance (downloads model on first use)
embedder = MGTEEmbedding(config)

Available Models

# Default model
config = EmbeddingConfig(model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct")

# Other Sentence Transformer models
config = EmbeddingConfig(model_name="sentence-transformers/all-MiniLM-L6-v2")

Basic Usage

from arshai.embeddings.mgte_embeddings import MGTEEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig

# Initialize (model cached after first download)
config = EmbeddingConfig(
    model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct"
)
embedder = MGTEEmbedding(config)

# Embed documents
documents = [
    "Natural language processing enables machines to understand text",
    "Embeddings convert text into numerical vectors"
]

result = embedder.embed_documents(documents)
print(f"Dimension: {embedder.dimension}")

Advanced Usage

Batch Processing

def batch_embed_large_dataset(embedder, documents: list, batch_size: int = 100):
    """Embed large dataset in batches."""

    all_embeddings = []

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        result = embedder.embed_documents(batch)
        all_embeddings.extend(result['embeddings'])

        print(f"Processed {len(all_embeddings)}/{len(documents)} documents")

    return all_embeddings

# Usage
large_dataset = ["Document " + str(i) for i in range(1000)]
embeddings = batch_embed_large_dataset(embedder, large_dataset)
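
If the provider's async client is non-blocking, batches can also be embedded concurrently with embed_documents_async. A minimal sketch (the max_concurrency cap is an illustrative guard against rate limits):

import asyncio

async def batch_embed_async(embedder, documents: list,
                            batch_size: int = 100, max_concurrency: int = 4):
    """Embed batches concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def embed_batch(batch):
        async with semaphore:
            result = await embedder.embed_documents_async(batch)
            return result['embeddings']

    batches = [documents[i:i + batch_size]
               for i in range(0, len(documents), batch_size)]
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [emb for batch in results for emb in batch]

# Usage
# embeddings = asyncio.run(batch_embed_async(embedder, large_dataset))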

Caching Embeddings

import hashlib
import pickle
from pathlib import Path

class CachedEmbedder:
    """Wrapper that caches embeddings to disk."""

    def __init__(self, embedder, cache_dir: str = ".embedding_cache"):
        self.embedder = embedder
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _get_cache_key(self, text: str) -> str:
        """Generate a cache key from text (md5 is fine for non-security use)."""
        return hashlib.md5(text.encode()).hexdigest()

    def embed_with_cache(self, text: str):
        """Embed with disk caching."""
        cache_key = self._get_cache_key(text)
        cache_file = self.cache_dir / f"{cache_key}.pkl"

        # Check cache
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                return pickle.load(f)

        # Generate embedding
        result = self.embedder.embed_query(text)
        embedding = result['embedding']

        # Save to cache
        with open(cache_file, 'wb') as f:
            pickle.dump(embedding, f)

        return embedding

# Usage
cached_embedder = CachedEmbedder(embedder)
embedding1 = cached_embedder.embed_with_cache("Sample text")  # Generates
embedding2 = cached_embedder.embed_with_cache("Sample text")  # From cache

Error Handling

The example below retries transient OpenAI failures with exponential backoff; for other providers, catch that provider's exception types instead.

import time

from openai import OpenAIError

def safe_embed(embedder, documents: list, max_retries: int = 3):
    """Embed with retry logic and exponential backoff."""

    for attempt in range(max_retries):
        try:
            return embedder.embed_documents(documents)

        except OpenAIError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise

            time.sleep(2 ** attempt)  # Exponential backoff

# Usage
try:
    result = safe_embed(embedder, documents)
except OpenAIError as e:
    print(f"Failed after retries: {e}")

Choosing an Embedding Provider

Use OpenAI when:

  • You need high-quality general-purpose embeddings

  • You’re already using OpenAI for LLMs

  • You want reliable, well-tested embeddings

  • Cost is not the primary concern

Use VoyageAI when:

  • You have domain-specific content (finance, legal, code)

  • You need flexible embedding dimensions

  • You want specialized models for your use case

  • You need multilingual support

Use MGTE when:

  • You want to run embeddings locally

  • You need offline operation

  • You want to avoid API costs

  • You have GPU resources available

  • Privacy is a concern
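
Because all three implementations follow IEmbedding, the choice above can also be deferred to runtime. A minimal factory sketch (the registry and default model names here are illustrative, not part of the framework):

from arshai.core.interfaces.iembedding import EmbeddingConfig
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.embeddings.voyageai_embedding import VoyageAIEmbedding
from arshai.embeddings.mgte_embeddings import MGTEEmbedding

# Hypothetical registry mapping provider names to (class, default model)
PROVIDERS = {
    "openai": (OpenAIEmbedding, "text-embedding-3-small"),
    "voyageai": (VoyageAIEmbedding, "voyage-3-large"),
    "mgte": (MGTEEmbedding, "Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
}

def create_embedder(provider: str):
    """Build an embedder for the given provider name."""
    cls, default_model = PROVIDERS[provider]
    return cls(EmbeddingConfig(model_name=default_model))

embedder = create_embedder("openai")  # swap to "voyageai" or "mgte" with no other changes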

Performance Comparison

Provider  Speed   Quality      Cost
OpenAI    Fast    High         API calls
VoyageAI  Fast    Specialized  API calls
MGTE      Medium  Good         Free (local)

Building Custom Embeddings

Implement the IEmbedding interface:

from arshai.core.interfaces.iembedding import IEmbedding, EmbeddingConfig
from typing import List, Dict, Any

class CustomEmbedding(IEmbedding):
    """Custom embedding implementation."""

    def __init__(self, config: EmbeddingConfig):
        self.model_name = config.model_name
        self.batch_size = config.batch_size
        self._dimension = 768  # Your model's dimension

    @property
    def dimension(self) -> int:
        return self._dimension

    def embed_documents(self, texts: List[str]) -> Dict[str, Any]:
        """Embed multiple documents."""
        # Your implementation here
        embeddings = [self._embed_single(text) for text in texts]

        return {
            "embeddings": embeddings,
            "total_tokens": len(texts) * 100  # Approximate
        }

    def embed_query(self, text: str) -> Dict[str, Any]:
        """Embed single query."""
        embedding = self._embed_single(text)

        return {
            "embedding": embedding,
            "total_tokens": 100  # Approximate
        }

    def _embed_single(self, text: str) -> List[float]:
        """Your embedding logic."""
        # Implement your embedding generation; raising keeps missing logic loud
        # instead of silently returning None
        raise NotImplementedError("Implement embedding generation for your model")

    async def embed_documents_async(self, texts: List[str]) -> Dict[str, Any]:
        """Async version (delegates to the sync path; use an async client or
        asyncio.to_thread so you don't block the event loop)."""
        return self.embed_documents(texts)

    async def embed_query_async(self, text: str) -> Dict[str, Any]:
        """Async version."""
        return self.embed_query(text)
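
A quick usage check for the sketch above (the model name is a placeholder):

config = EmbeddingConfig(model_name="my-custom-model", batch_size=32)
embedder = CustomEmbedding(config)
print(f"Dimension: {embedder.dimension}")
# Output: 768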

Best Practices

  1. Consistent Models: Use the same embedding model for documents and queries (a compatibility guard is sketched after this list).

  2. Batch Processing: Process multiple documents at once for better throughput.

  3. Cache Results: Cache embeddings for frequently accessed documents.

  4. Error Handling: Implement retry logic for API-based embeddings.

  5. Monitor Costs: Track API usage for cost management.

  6. Choose Appropriate Dimensions: Higher dimensions give better quality but cost more storage and compute.

  7. Test Different Providers: Benchmark providers on your specific use case.
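
For practice 1, a cheap guard is to store the model name and dimension alongside your vectors and assert them before querying. A minimal sketch (stored_meta is an illustrative record you would persist with the collection; embedder and config come from the earlier examples):

# Hypothetical metadata persisted when the collection was built
stored_meta = {"model_name": "text-embedding-3-small", "dimension": 1536}

def check_compatibility(embedder, config, stored_meta: dict) -> None:
    """Fail fast if the query embedder doesn't match the stored vectors."""
    assert config.model_name == stored_meta["model_name"], "Embedding model mismatch"
    assert embedder.dimension == stored_meta["dimension"], "Embedding dimension mismatch"

check_compatibility(embedder, config, stored_meta)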

Integration with Vector Databases

See Vector Database - Milvus Client for using embeddings with vector stores.

# Quick example
from arshai.embeddings.openai_embeddings import OpenAIEmbedding
from arshai.core.interfaces.iembedding import EmbeddingConfig
from arshai.vector_db.milvus_client import MilvusClient

# Create embedder
embedder = OpenAIEmbedding(EmbeddingConfig(model_name="text-embedding-3-small"))

# Generate embeddings
documents = ["Doc 1", "Doc 2", "Doc 3"]
result = embedder.embed_documents(documents)

# Store in vector database
# (See vector-databases documentation for details)

Next Steps

  • Vector Storage: See Vector Database - Milvus Client for storing and searching embeddings

  • RAG Systems: Build retrieval-augmented generation systems

  • Semantic Search: Implement semantic search for your application

Remember: These are reference implementations. The framework provides the IEmbedding interface - you can implement it for any embedding provider or custom model.