Ondine - LLM Dataset Engine

 ▄▄▄▄▄▄▄▄▄▄▄  ▄▄        ▄  ▄▄▄▄▄▄▄▄▄▄   ▄▄▄▄▄▄▄▄▄▄▄  ▄▄        ▄  ▄▄▄▄▄▄▄▄▄▄▄
▐░░░░░░░░░░░▌▐░░▌      ▐░▌▐░░░░░░░░░░▌ ▐░░░░░░░░░░░▌▐░░▌      ▐░▌▐░░░░░░░░░░░▌
▐░█▀▀▀▀▀▀▀█░▌▐░▌░▌     ▐░▌▐░█▀▀▀▀▀▀▀█░▌ ▀▀▀▀█░█▀▀▀▀ ▐░▌░▌     ▐░▌▐░█▀▀▀▀▀▀▀▀▀
▐░▌       ▐░▌▐░▌▐░▌    ▐░▌▐░▌       ▐░▌    ▐░▌     ▐░▌▐░▌    ▐░▌▐░▌
▐░▌       ▐░▌▐░▌ ▐░▌   ▐░▌▐░▌       ▐░▌    ▐░▌     ▐░▌ ▐░▌   ▐░▌▐░█▄▄▄▄▄▄▄▄▄
▐░▌       ▐░▌▐░▌  ▐░▌  ▐░▌▐░▌       ▐░▌    ▐░▌     ▐░▌  ▐░▌  ▐░▌▐░░░░░░░░░░░▌
▐░▌       ▐░▌▐░▌   ▐░▌ ▐░▌▐░▌       ▐░▌    ▐░▌     ▐░▌   ▐░▌ ▐░▌▐░█▀▀▀▀▀▀▀▀▀
▐░▌       ▐░▌▐░▌    ▐░▌▐░▌▐░▌       ▐░▌    ▐░▌     ▐░▌    ▐░▌▐░▌▐░▌
▐░█▄▄▄▄▄▄▄█░▌▐░▌     ▐░▐░▌▐░█▄▄▄▄▄▄▄█░▌▄▄▄▄█░█▄▄▄▄ ▐░▌     ▐░▐░▌▐░█▄▄▄▄▄▄▄▄▄
▐░░░░░░░░░░░▌▐░▌      ▐░░▌▐░░░░░░░░░░▌ ▐░░░░░░░░░░░▌▐░▌      ▐░░▌▐░░░░░░░░░░░▌
 ▀▀▀▀▀▀▀▀▀▀▀  ▀        ▀▀  ▀▀▀▀▀▀▀▀▀▀   ▀▀▀▀▀▀▀▀▀▀▀  ▀        ▀▀  ▀▀▀▀▀▀▀▀▀▀▀

Ondine is a production-grade SDK for batch processing tabular datasets with LLMs. It builds on LlamaIndex for provider abstraction and adds batch orchestration, automatic cost tracking, checkpointing, and YAML configuration for dataset transformation at scale.

Features

Quick API: 3-line hello world with smart defaults and auto-detection
Simple API: Fluent builder pattern for full control when needed
Reliability: Automatic retries, checkpointing, error policies (99.9% completion rate)
Cost Control: Pre-execution estimation, budget limits, real-time tracking
Observability: LlamaIndex-powered automatic LLM tracking (Langfuse, OpenTelemetry), progress bars, cost reports
Extensibility: Plugin architecture, custom stages, multiple LLM providers
Production Ready: Zero data loss on crashes, resume from checkpoint
Multiple Providers: OpenAI, Azure OpenAI, Anthropic Claude, Groq, MLX (Apple Silicon), and custom APIs
Local Inference: Run models locally with MLX (Apple Silicon) or Ollama - 100% free, private, offline-capable
Multi-Column Processing: Generate multiple output columns with composition or JSON parsing
Custom Providers: Integrate any OpenAI-compatible API (Together.AI, vLLM, Ollama, custom endpoints)
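
The checkpoint-and-resume behavior listed above can be illustrated with a small standalone sketch. This is not Ondine's implementation: the function name `process_with_checkpoint`, the JSONL checkpoint layout, and the `index`/`output` field names are assumptions chosen for the example, and a real implementation would also need to handle a partially written last line.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def process_with_checkpoint(
    rows: List[str],
    transform: Callable[[str], str],
    checkpoint_path: str,
) -> Dict[int, str]:
    """Apply `transform` to each row, skipping rows already in the checkpoint.

    Hypothetical helper for illustration only; not part of Ondine's API.
    """
    done: Dict[int, str] = {}

    # Load results recorded by a previous (possibly interrupted) run.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[record["index"]] = record["output"]

    # Append one JSON line per newly processed row and flush immediately,
    # so completed rows are on disk before the next row starts.
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for index, row in enumerate(rows):
            if index in done:
                continue
            output = transform(row)
            f.write(json.dumps({"index": index, "output": output}) + "\n")
            f.flush()
            done[index] = output

    return done


# Demo: the second call reuses the checkpoint and only processes the new row.
demo_path = os.path.join(tempfile.mkdtemp(), "checkpoint.jsonl")
process_with_checkpoint(["a", "b"], str.upper, demo_path)
print(process_with_checkpoint(["a", "b", "c"], str.upper, demo_path))
```

Writing one record per completed row keeps the cost of a crash to at most the row that was in flight, at the price of one small write per row.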

Quick Start

Option 1: Quick API (Recommended)

The simplest way to get started - just provide your data, prompt, and model:

from ondine import QuickPipeline

# Process data with smart defaults
pipeline = QuickPipeline.create(
    data="data.csv",
    prompt="Clean this text: {description}",
    model="gpt-4o-mini"
)

# Execute pipeline
result = pipeline.execute()
print(f"Processed {result.metrics.processed_rows} rows")
print(f"Total cost: ${result.costs.total_cost:.4f}")

What's auto-detected:

Input columns from {placeholders} in prompt
Provider from model name (gpt-4 → openai, claude → anthropic)
Parser type (JSON for multi-column, text for single column)
Sensible batch size and concurrency for the provider
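
To make the first two detection rules concrete, here is a minimal sketch of how placeholders and a provider could be inferred from the inputs. It is an illustration under stated assumptions, not Ondine's actual detection logic: `detect_input_columns`, `detect_provider`, and the `PROVIDER_PREFIXES` table are hypothetical names, and the prefix table only covers the two mappings mentioned above.

```python
from string import Formatter
from typing import List, Optional

# Hypothetical prefix table; Ondine's real rules may differ or cover more providers.
PROVIDER_PREFIXES = {"gpt-": "openai", "claude": "anthropic"}


def detect_input_columns(prompt: str) -> List[str]:
    """Return the unique {placeholder} names in a prompt, in order of appearance."""
    columns: List[str] = []
    for _literal, field_name, _spec, _conversion in Formatter().parse(prompt):
        if field_name and field_name not in columns:
            columns.append(field_name)
    return columns


def detect_provider(model: str) -> Optional[str]:
    """Guess a provider from the model name prefix, or None if nothing matches."""
    name = model.lower()
    for prefix, provider in PROVIDER_PREFIXES.items():
        if name.startswith(prefix):
            return provider
    return None


print(detect_input_columns("Clean this text: {description}"))  # ['description']
print(detect_provider("gpt-4o-mini"))  # openai
```

When detection returns nothing useful, as with a model name that matches no known prefix, the Builder API described next is the place to set the provider explicitly.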

Option 2: Builder API (Full Control)

For advanced use cases requiring explicit configuration:

from ondine import PipelineBuilder

# Build with explicit settings
pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["description"],
              output_columns=["cleaned"])
    .with_prompt("Clean this text: {description}")
    .with_llm(provider="openai", model="gpt-4o-mini")
    .with_batch_size(100)
    .with_concurrency(5)
    .build()
)

# Estimate cost before running
estimate = pipeline.estimate_cost()
print(f"Estimated cost: ${estimate.total_cost:.4f}")

# Execute pipeline
result = pipeline.execute()
print(f"Total cost: ${result.costs.total_cost:.4f}")
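
The estimate-then-execute pattern above pairs naturally with a budget guard. The following is a minimal, self-contained sketch of that idea, not Ondine's implementation: `estimate_total_cost`, `check_budget`, `BudgetExceeded`, the 4-characters-per-token heuristic, and the price used in the demo are all illustrative assumptions.

```python
from typing import Iterable


class BudgetExceeded(RuntimeError):
    """Raised when an estimated cost is above the allowed budget (hypothetical)."""


def estimate_total_cost(
    prompts: Iterable[str],
    price_per_1k_input_tokens: float,
    chars_per_token: float = 4.0,
) -> float:
    """Rough input-only estimate: characters -> tokens -> cost.

    The chars_per_token ratio is a crude heuristic; a real estimator
    would use the provider's tokenizer and also account for output tokens.
    """
    total_tokens = sum(len(prompt) / chars_per_token for prompt in prompts)
    return total_tokens / 1000 * price_per_1k_input_tokens


def check_budget(estimated_cost: float, max_budget: float) -> None:
    """Stop before any API call is made if the estimate exceeds the budget."""
    if estimated_cost > max_budget:
        raise BudgetExceeded(
            f"Estimated ${estimated_cost:.4f} exceeds budget ${max_budget:.4f}"
        )


# Demo with a placeholder price; substitute your provider's real pricing.
prompts = [f"Clean this text: {text}" for text in ["foo bar", "baz qux"]]
cost = estimate_total_cost(prompts, price_per_1k_input_tokens=0.15)
check_budget(cost, max_budget=1.00)
print(f"Estimated input cost: ${cost:.6f}")
```

With a built pipeline, the same guard could be applied to `estimate.total_cost` before calling `pipeline.execute()`.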

Installation

Install with pip or uv:

pip install ondine

# Or, with uv
uv pip install ondine

Or with optional dependencies:

# For Apple Silicon local inference
# (quotes keep zsh from treating the brackets as a glob pattern)
pip install "ondine[mlx]"

# Observability is now built-in (OpenTelemetry + Langfuse)
# No separate install needed!

# For development
pip install "ondine[dev]"

Next Steps

Installation Guide - Detailed installation instructions
Quickstart - Your first pipeline in 5 minutes
Core Concepts - Understanding pipelines, stages, and specifications
Execution Modes - When to use sync, async, or streaming
API Reference - Complete API documentation

Use Cases

Ondine excels at:

Data cleaning and normalization (PII detection, standardization)
Content enrichment (classification, tagging, summarization)
Extraction tasks (structured data from unstructured text)
Translation and localization at scale
Synthetic data generation with cost controls
Quality assurance (validation, scoring, feedback)

Why Ondine?

Production-Grade: Checkpointing, auto-retry, budget controls, observability
Developer-Friendly: Fluent API, YAML config, CLI tools, extensive examples
Cost-Aware: Pre-run estimation, real-time tracking, budget limits
Reliable: 99.9% completion rate in production workloads
Flexible: Multiple providers, custom stages, extensible architecture
Well-Tested: 95%+ code coverage, integration tests with real APIs

License

MIT License - see LICENSE for details.