Companies
Every company in the AI evals space that we've reviewed. We're independent: we don't accept vendor sponsorships, and we update reviews as products change.
Arize AI
ML observability platform extended into LLMs, with the open-source Phoenix framework as a popular standalone trace viewer.
Braintrust
Eval-driven dev platform combining traces, datasets, scorers, and a playground in one product.
Comet (Opik)
Open-source LLM evaluation and observability from a mature MLOps team — credible Langfuse alternative.
Datadog
APM giant with bolted-on LLM observability for OpenAI and Anthropic calls.
DeepEval (Confident AI)
pytest-style LLM evaluation framework with synthetic dataset generation and CI/CD-native testing.
Evidently AI
Open-source ML and LLM evaluation framework with strong methodology docs — building blocks, not a finished platform.
Fiddler
Enterprise ML governance platform extended to LLMs and generative AI, with audit-ready traces and in-environment evaluations.
Galileo
Agent reliability platform with cheap, fast evaluators that can run on every request in production.
HUD
Open-source platform for building RL environments and evals for computer-use agents — used by frontier labs, ships its own benchmarks.
Label Studio
Open-source data annotation platform with rubric enforcement, escalation workflows, and audit trails — extended to LLM review.
Langfuse
Open-source LLM observability with evals, prompt management, and best-in-class tracing.
LangSmith
Observability and evaluation built by the LangChain team — best-in-class if your stack is LangChain or LangGraph.
LiteLLM
Open-source Python SDK and proxy that translates requests across 100+ LLM providers into the OpenAI format.
Maxim AI
AI quality evaluation platform with prebuilt and custom scorers, designed to plug into existing observability stacks.
MLflow
Open-source MLOps standard with LLM tracing, evaluation, and prompt management bolted on top.
OpenRouter
Single OpenAI-compatible endpoint to 500+ models across 60+ providers, billed pay-as-you-go.
Portkey
Full-stack AI gateway with the broadest model catalog, built-in guardrails, and enterprise-grade governance.
Promptfoo
Open-source CLI for evaluating LLM prompts and red-teaming applications, with YAML/JSON configs that live next to your code.
PromptHub
Git-style version control for prompts — branch, commit, merge, and CI-gate prompt changes.
PromptLayer
Visual prompt editor and version control built for non-technical teams.
RAGAS
Open-source evaluation framework purpose-built for RAG pipelines, with reference-free metrics that became the industry standard.
SuperAnnotate
Annotation platform with strong tooling for measuring and resolving disagreements between human reviewers and automated scorers.
Vellum
Visual workflow builder with built-in observability for low-code agent development.
Weights & Biases Weave
LLM tracing, evaluation, and prompt management embedded inside the Weights & Biases ML platform.
ZenML
Open-source MLOps and LLMOps framework for building reproducible, infrastructure-agnostic AI pipelines.