Companies

Every company in the AI evals space we've reviewed. We're independent: we don't accept vendor sponsorships, and we update reviews as products change.

Arize AI

ML observability platform extended into LLMs, with the open-source Phoenix framework as a popular standalone trace viewer.

7.2

observability, ML monitoring, LLM evals, freemium

Braintrust

Eval-driven dev platform combining traces, datasets, scorers, and a playground in one product.

9.1

LLM evals, observability, prompt management, freemium

Datadog

APM giant with bolted-on LLM observability for OpenAI and Anthropic calls.

6.4

observability, APM, paid

Fiddler

Enterprise ML governance platform extended to LLMs and generative AI, with audit-ready traces and in-environment evaluations.

7.2

AI governance, agent observability, enterprise

Galileo

Agent reliability platform with cheap, fast evaluators that can run on every request in production.

7.5

agent observability, LLM evals, freemium

Helicone

Proxy-based LLM observability — drop in by changing the base URL, no SDK changes needed.

7.5

observability, proxy / gateway, freemium

Langfuse

Open-source LLM observability with evals, prompt management, and best-in-class tracing.

8.4

observability, LLM evals, prompt management, open-source

LangSmith

Observability and evaluation built by the LangChain team — best-in-class if your stack is LangChain or LangGraph.

7.5

observability, LLM evals, prompt management, freemium

Maxim AI

AI quality evaluation platform with prebuilt and custom scorers, designed to plug into existing observability stacks.

6.8

LLM evals, freemium

Promptfoo

Open-source CLI for evaluating LLM prompts and red-teaming applications, with YAML/JSON configs that live next to your code.

7.4

LLM evals, red-teaming, open-source

PromptHub

Git-style version control for prompts — branch, commit, merge, and CI-gate prompt changes.

6.8

prompt management, freemium

PromptLayer

Visual prompt editor and version control built for non-technical teams.

7.0

prompt management, freemium

RAGAS

Open-source evaluation framework purpose-built for RAG pipelines, with reference-free metrics that became the industry standard.

7.5

LLM evals, RAG evaluation, open-source

Vellum

Visual workflow builder with built-in observability for low-code agent development.

7.0

prompt management, agent observability, freemium

Weights & Biases Weave

LLM tracing, evaluation, and prompt management embedded inside the Weights & Biases ML platform.

6.8

observability, LLM evals, prompt management, MLOps, freemium

ZenML

Open-source MLOps and LLMOps framework for building reproducible, infrastructure-agnostic AI pipelines.

6.8

MLOps, LLM evals, freemium