Open source

AI agent evaluation for non‑profits

Existing AI evaluation tools are too complex or too expensive for non‑profits to use. Calibrate is built by ML engineers with decades of experience to make AI evaluation accessible, with best practices baked into every step.

BUILT BY ARTPARK @ IISc · FUNDED BY GOVERNMENT OF KARNATAKA

Evaluate the quality
of your AI responses

Define edge cases and evaluate the agent's response against custom criteria

Step 1

Add the conversation history as input to the agent

Step 2

Define custom criteria to evaluate the agent's response given the conversation history

Step 3

Run the test and see whether the agent response passed the evaluation criteria

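The workflow above can be sketched in plain Python. This is an illustrative sketch, not Calibrate's actual API: the judge is stubbed with a keyword check, where a real run would prompt an LLM with the criterion and the response and parse a PASS/FAIL verdict.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One edge case: conversation history plus the criterion to check."""
    history: list[str]   # prior user/agent turns
    criterion: str       # natural-language pass/fail rule

def judge(response: str, criterion: str) -> bool:
    """Stub evaluator. In practice this would be an LLM-as-judge call."""
    # Illustrative rule: the criterion names a phrase the response must contain.
    required = criterion.split("must mention ")[-1].strip("'\". ")
    return required.lower() in response.lower()

def run_test(case: TestCase, agent_response: str) -> dict:
    """Evaluate one agent response against the test case's criterion."""
    passed = judge(agent_response, case.criterion)
    return {"criterion": case.criterion, "passed": passed}

case = TestCase(
    history=["user: My card was charged twice."],
    criterion="The response must mention 'refund'",
)
result = run_test(case, "Sorry about that! A refund has been issued.")
```

The stub makes the shape of the loop clear: each test case pairs an input conversation with a criterion, and the evaluator returns a verdict per case.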

Find the best LLM
for your agent

Compare different models on your tests to find the best one for your agent

Step 1

Pick the models to compare and run benchmarking on all tests

Step 2

Get a leaderboard across models and evaluators to pick the best LLM

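The benchmarking step boils down to running a shared test suite through each candidate model and ranking by pass rate. A minimal sketch with stubbed per-test results (model names and outcomes are made up for illustration):

```python
def leaderboard(results: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """results maps model name -> pass/fail per test.
    Returns (model, pass_rate) pairs sorted best-first."""
    scores = {model: sum(r) / len(r) for model, r in results.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Stubbed outcomes; in practice each list comes from running every
# test input through the model and applying the evaluators.
results = {
    "model-a": [True, True, False, True],   # 75% pass
    "model-b": [True, False, False, True],  # 50% pass
    "model-c": [True, True, True, True],    # 100% pass
}
board = leaderboard(results)
```

Because every model sees the same tests and the same evaluators, the pass rates are directly comparable.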

Identify the best speech‑to‑text model
for your users

Calibrate uses evaluators that compare the meaning of the predicted transcriptions with the references, going beyond simple rule-based metrics, to rank the models

Step 1

Upload your audio files with reference texts

Step 2

Select the language, the models to compare, and the evaluators for measuring transcription accuracy

Step 3

See the leaderboard across models for each metric

Step 4

For each model, view row-by-row outputs along with evaluator scores and reasoning

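The simple rule-based baseline that semantic evaluators go beyond is word error rate (WER): the number of word-level edits between the prediction and the reference, divided by the reference length. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference
    length, via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note the gap a meaning-aware evaluator closes: `wer("i have two dogs", "i have 2 dogs")` scores a 25% error even though the transcripts mean exactly the same thing.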

Select the perfect voice
for your agent

Calibrate uses AI models that let you evaluate the generated audio against the reference texts on pronunciation, clarity, naturalness, and more

Step 1

Add the reference texts to be spoken

Step 2

Select the language, the models to compare, and the evaluators to measure the quality of the generated audio

Step 3

See the leaderboard across models for each metric

Step 4

For each model, view the generated audio for each row along with evaluator scores and reasoning

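The per-metric leaderboard in the steps above is an aggregation over row-level evaluator scores: each (model, sample) row carries scores for metrics like clarity and naturalness, and the leaderboard ranks models by mean score per metric. A sketch with made-up model names and scores:

```python
from collections import defaultdict

def metric_leaderboards(rows: list[dict]) -> dict[str, list[tuple[str, float]]]:
    """rows: one dict per (model, sample) with per-metric scores.
    Returns, for each metric, (model, mean score) pairs sorted best-first."""
    per_metric: dict = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric, score in row["scores"].items():
            per_metric[metric][row["model"]].append(score)
    boards = {}
    for metric, per_model in per_metric.items():
        means = [(m, sum(s) / len(s)) for m, s in per_model.items()]
        boards[metric] = sorted(means, key=lambda kv: kv[1], reverse=True)
    return boards

# Stubbed evaluator scores on a 1-5 scale, two samples per model.
rows = [
    {"model": "voice-a", "scores": {"clarity": 4, "naturalness": 3}},
    {"model": "voice-a", "scores": {"clarity": 5, "naturalness": 4}},
    {"model": "voice-b", "scores": {"clarity": 3, "naturalness": 5}},
    {"model": "voice-b", "scores": {"clarity": 4, "naturalness": 5}},
]
boards = metric_leaderboards(rows)
```

Keeping one leaderboard per metric, rather than a single blended score, lets you pick a voice that trades clarity against naturalness however your use case demands.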

Simulate realistic conversations
with your agent

Catch bugs before deploying your agent to real users

Step 1

Create user personas to define who your users are

Step 2

Create scenarios to depict the purpose of the user's interaction with the agent

Step 3

Run a simulation with personas and scenarios using custom evaluators and get performance metrics across all runs

Step 4

Inspect each simulation run with the full transcript and generated audio for both the agent and the simulated user

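The simulation loop crosses every persona with every scenario, records a transcript per run, and scores each run with an evaluator. A sketch with a stubbed agent and simulated user (in a real run both sides and the evaluator would be LLM-backed; all names here are illustrative):

```python
import itertools

def agent_reply(message: str) -> str:
    """Stub for the agent under test."""
    return "I can help with that. Your booking is confirmed."

def simulated_user_message(persona: str, scenario: str) -> str:
    """Stub for the persona-conditioned simulated user."""
    return f"[{persona}] {scenario}"

def run_simulations(personas, scenarios, evaluator, turns: int = 1):
    """Run one simulation per (persona, scenario) pair and score it."""
    runs = []
    for persona, scenario in itertools.product(personas, scenarios):
        transcript = []
        for _ in range(turns):
            user_msg = simulated_user_message(persona, scenario)
            transcript.append(("user", user_msg))
            transcript.append(("agent", agent_reply(user_msg)))
        runs.append({
            "persona": persona,
            "scenario": scenario,
            "transcript": transcript,
            "passed": evaluator(transcript),
        })
    return runs

# Custom evaluator: did the agent ever confirm the booking?
confirms_booking = lambda t: any(
    "confirmed" in msg for role, msg in t if role == "agent"
)
runs = run_simulations(
    personas=["impatient commuter", "first-time user"],
    scenarios=["book a cab to the airport"],
    evaluator=confirms_booking,
)
pass_rate = sum(r["passed"] for r in runs) / len(runs)
```

Aggregating `passed` across runs gives the performance metrics from Step 3, while keeping each run's transcript around for the row-by-row inspection in Step 4.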

Proudly open source

What we open-source is what we use ourselves. Nothing hidden behind a paywall.

Self-hosting

We can help you run Calibrate on your infrastructure to ensure sensitive data stays in environments you control

No per-seat pricing. Ever.

No per-user fees. Add staff, partners, and consultants as your team grows

Auditable, end to end

The full codebase is on GitHub for pre-deploy review and thorough due diligence

No vendor lock-in

Fork, adapt, and make changes as you wish

Works with any AI agent stack

Supports all major models with more coming soon

Supports integrations including Deepgram, ElevenLabs, OpenAI, Google, Cartesia, Anthropic, Groq, DeepSeek, Smallest AI, Qwen, Meta, Mistral, Cohere, Sarvam, AI21, Baidu, NVIDIA, and Amazon.

Join the community

Talk to the team building Calibrate to get your questions answered and shape our roadmap

Start Calibrating today

Go beyond vibe checks and become a team that ships trustworthy AI agents