Ondine - LLM Dataset Engine¶
Production-grade SDK for batch-processing tabular datasets with LLMs. Built on LlamaIndex for provider abstraction, Ondine adds batch orchestration, automatic cost tracking, checkpointing, and YAML configuration for dataset transformation at scale.
Features¶
- Quick API: 3-line hello world with smart defaults and auto-detection
- Simple API: Fluent builder pattern for full control when needed
- Reliability: Automatic retries, checkpointing, error policies (99.9% completion rate)
- Cost Control: Pre-execution estimation, budget limits, real-time tracking
- Observability: LlamaIndex-powered automatic LLM tracking (Langfuse, OpenTelemetry), progress bars, cost reports
- Extensibility: Plugin architecture, custom stages, multiple LLM providers
- Production Ready: Zero data loss on crashes, resume from checkpoint
- Multiple Providers: OpenAI, Azure OpenAI, Anthropic Claude, Groq, MLX (Apple Silicon), and custom APIs
- Local Inference: Run models locally with MLX (Apple Silicon) or Ollama - 100% free, private, offline-capable
- Multi-Column Processing: Generate multiple output columns with composition or JSON parsing
- Custom Providers: Integrate any OpenAI-compatible API (Together.AI, vLLM, Ollama, custom endpoints)
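To build intuition for the checkpoint-and-resume behavior listed above, here is a minimal sketch assuming a simple JSON checkpoint file. The helper `process_with_checkpoint` is hypothetical and illustrative only, not part of Ondine's API:

```python
import json
from pathlib import Path

# Illustrative sketch of checkpoint-and-resume, assuming a JSON file that
# records completed row ids -- NOT Ondine's actual checkpoint format.
def process_with_checkpoint(rows: dict[int, str], ckpt: Path, transform) -> dict[int, str]:
    # Load previously completed rows, if a checkpoint exists
    done: dict[str, str] = json.loads(ckpt.read_text()) if ckpt.exists() else {}
    for row_id, text in rows.items():
        if str(row_id) in done:
            continue  # already processed in a previous run -- skip on resume
        done[str(row_id)] = transform(text)
        ckpt.write_text(json.dumps(done))  # persist after every row
    return {int(k): v for k, v in done.items()}
```

If the process crashes mid-run, rerunning with the same checkpoint path skips completed rows, which is the "zero data loss on crashes" property in miniature.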
Quick Start¶
Option 1: Quick API (Recommended)¶
The simplest way to get started - just provide your data, prompt, and model:
```python
from ondine import QuickPipeline

# Process data with smart defaults
pipeline = QuickPipeline.create(
    data="data.csv",
    prompt="Clean this text: {description}",
    model="gpt-4o-mini",
)

# Execute pipeline
result = pipeline.execute()
print(f"Processed {result.metrics.processed_rows} rows")
print(f"Total cost: ${result.costs.total_cost:.4f}")
```
What's auto-detected:
- Input columns from `{placeholders}` in the prompt
- Provider from the model name (gpt-4 → openai, claude → anthropic)
- Parser type (JSON for multi-column, text for single-column output)
- Sensible batch size and concurrency for the provider
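The detection rules above can be approximated in a few lines. This is an illustrative sketch, not Ondine's actual implementation, and the function names are hypothetical:

```python
import re

# Hypothetical sketch of the auto-detection rules -- not Ondine's source.
def detect_input_columns(prompt: str) -> list[str]:
    """Extract {placeholder} names from a prompt template, in order."""
    return re.findall(r"\{(\w+)\}", prompt)

def infer_provider(model: str) -> str:
    """Guess the provider from a model-name prefix."""
    prefixes = {"gpt-": "openai", "claude": "anthropic"}
    for prefix, provider in prefixes.items():
        if model.startswith(prefix):
            return provider
    return "custom"  # fall back to a custom / OpenAI-compatible endpoint

print(detect_input_columns("Clean this text: {description}"))  # ['description']
print(infer_provider("gpt-4o-mini"))  # openai
```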
Option 2: Builder API (Full Control)¶
For advanced use cases requiring explicit configuration:
```python
from ondine import PipelineBuilder

# Build with explicit settings
pipeline = (
    PipelineBuilder.create()
    .from_csv(
        "data.csv",
        input_columns=["description"],
        output_columns=["cleaned"],
    )
    .with_prompt("Clean this text: {description}")
    .with_llm(provider="openai", model="gpt-4o-mini")
    .with_batch_size(100)
    .with_concurrency(5)
    .build()
)

# Estimate cost before running
estimate = pipeline.estimate_cost()
print(f"Estimated cost: ${estimate.total_cost:.4f}")

# Execute pipeline
result = pipeline.execute()
print(f"Total cost: ${result.costs.total_cost:.4f}")
```
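For intuition, pre-execution cost estimation reduces to token arithmetic: rows × (input tokens × input rate + output tokens × output rate). The sketch below is a hypothetical back-of-the-envelope model, not Ondine's estimator, and the rates are placeholders; always check your provider's current pricing:

```python
# Illustrative cost model -- NOT Ondine's estimator; prices are placeholders.
def estimate_cost(n_rows: int, avg_input_tokens: int, avg_output_tokens: int,
                  input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Estimated total cost = rows * per-row token cost."""
    per_row = (avg_input_tokens * input_price_per_1m
               + avg_output_tokens * output_price_per_1m) / 1_000_000
    return n_rows * per_row

# e.g. 10,000 rows at ~200 input / ~50 output tokens per row,
# with hypothetical rates of $0.15 / $0.60 per 1M tokens:
print(f"${estimate_cost(10_000, 200, 50, 0.15, 0.60):.2f}")  # $0.60
```

Comparing an estimate like this against a budget limit before execution is the idea behind the Cost Control feature above.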
Installation¶
Install with pip or uv:

```shell
pip install ondine
```

Or with optional dependencies:

```shell
# For Apple Silicon local inference
pip install ondine[mlx]

# Observability is now built-in (OpenTelemetry + Langfuse)
# No separate install needed!

# For development
pip install ondine[dev]
```
Next Steps¶
- Installation Guide - Detailed installation instructions
- Quickstart - Your first pipeline in 5 minutes
- Core Concepts - Understanding pipelines, stages, and specifications
- Execution Modes - When to use sync, async, or streaming
- API Reference - Complete API documentation
Use Cases¶
Ondine excels at:
- Data cleaning and normalization (PII detection, standardization)
- Content enrichment (classification, tagging, summarization)
- Extraction tasks (structured data from unstructured text)
- Translation and localization at scale
- Synthetic data generation with cost controls
- Quality assurance (validation, scoring, feedback)
Why Ondine?¶
- Production-Grade: Checkpointing, auto-retry, budget controls, observability
- Developer-Friendly: Fluent API, YAML config, CLI tools, extensive examples
- Cost-Aware: Pre-run estimation, real-time tracking, budget limits
- Reliable: 99.9% completion rate in production workloads
- Flexible: Multiple providers, custom stages, extensible architecture
- Well-Tested: 95%+ code coverage, integration tests with real APIs
License¶
MIT License - see LICENSE for details.