Companies
Every company we've reviewed in the AI evals space. We're independent: we don't accept vendor sponsorships, and reviews are updated as products change.
Arize AI
ML observability platform extended to LLMs, with the open-source Phoenix framework as a popular standalone trace viewer.
Braintrust
Eval-driven dev platform combining traces, datasets, scorers, and a playground in one product.
Datadog
APM giant with bolted-on LLM observability for OpenAI and Anthropic calls.
Fiddler
Enterprise ML governance platform extended to LLMs and generative AI, with audit-ready traces and in-environment evaluations.
Galileo
Agent reliability platform with cheap, fast evaluators that can run on every request in production.
Helicone
Proxy-based LLM observability — drop in by changing the base URL, no SDK changes needed.
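To make the drop-in concrete, here's a minimal sketch with the OpenAI Python SDK. The proxy URL and Helicone-Auth header follow Helicone's documented integration pattern, but treat the exact values as illustrative and confirm them against Helicone's current docs.

```python
# Minimal sketch: route OpenAI calls through Helicone's proxy by
# swapping the base URL. No other SDK changes are needed.
from openai import OpenAI

client = OpenAI(
    # Proxy endpoint in place of https://api.openai.com/v1 (check
    # Helicone's docs for the current URL and header names).
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer <YOUR_HELICONE_API_KEY>"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

Because the integration is a URL swap, removing it later is equally trivial, which is a large part of the appeal.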
Langfuse
Open-source LLM observability with evals, prompt management, and best-in-class tracing.
LangSmith
Observability and evaluation built by the LangChain team — best-in-class if your stack is LangChain or LangGraph.
Maxim AI
AI quality evaluation platform with prebuilt and custom scorers, designed to plug into existing observability stacks.
Promptfoo
Open-source CLI for evaluating LLM prompts and red-teaming applications, with YAML/JSON configs that live next to your code.
PromptHub
Git-style version control for prompts — branch, commit, merge, and CI-gate prompt changes.
PromptLayer
Visual prompt editor and version control built for non-technical teams.
RAGAS
Open-source evaluation framework purpose-built for RAG pipelines; its reference-free metrics, which score answers without gold-standard references, became a de facto industry standard.
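To show what reference-free means in practice, here's a minimal sketch using the classic ragas 0.1-style API (later releases restructure the entry points, so check the current docs). Faithfulness and answer relevancy are computed from the question, answer, and retrieved contexts alone, with no gold answer required; an LLM judge key (e.g. OPENAI_API_KEY) is assumed to be set in the environment.

```python
# Minimal sketch of reference-free RAG evaluation with ragas
# (classic 0.1-style API; assumes OPENAI_API_KEY is set for the judge).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One RAG interaction: note there is no ground-truth answer column.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.98}
```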
Vellum
Visual workflow builder with built-in observability for low-code agent development.
Weights & Biases Weave
LLM tracing, evaluation, and prompt management embedded inside the Weights & Biases ML platform.
ZenML
Open-source MLOps and LLMOps framework for building reproducible, infrastructure-agnostic AI pipelines.