What it is
Langfuse is an open-source observability and eval platform for LLM apps. Trace inference calls, attach scores, define datasets, run experiments, and manage prompts — all in one self-hostable service. Free if you self-host; cloud starts at $29/month with usage-based pricing.
Developer experience
SDKs in Python and TS, plus OpenTelemetry and OpenAI/LangChain integrations that "just work." Drilling into a multi-step agent run feels closer to a real APM than what most eval-first competitors offer.
```ts
import { Langfuse } from "langfuse";

const lf = new Langfuse(); // reads LANGFUSE_SECRET_KEY etc. from the environment
const trace = lf.trace({ name: "triage" });
const gen = trace.generation({ name: "classify", model: "gpt-4o" });
const output = "spam"; // stand-in for the actual model response
gen.end({ output });
```
Where it shines
- Self-hosting. Helm chart, docker-compose, and a SOC 2-compliant cloud offering — pick your flavor. This is the differentiator for teams that can't ship customer data to a third-party SaaS.
- Tracing. Best-in-class for debugging real agent traffic. Session grouping connects related requests cleanly.
- Pricing. The OSS version is free to self-host, and the cloud tier (from $29/month, usage-based) is reasonable.
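To make the self-hosting bullet concrete, here is a heavily simplified compose sketch. Service names and settings are illustrative only; the compose file shipped in the Langfuse repo, which also provisions additional dependencies such as ClickHouse in v3, is the source of truth:

```yaml
# Illustrative sketch only; use the compose file from the Langfuse repo for real deployments.
services:
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://postgres:postgres@db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3000
      NEXTAUTH_SECRET: changeme  # generate a real secret
      SALT: changeme
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: langfuse
```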
Where it falls short
- Evals UX. Functional but less opinionated than Braintrust. You'll spend more time wiring things together to get a CI-gated eval flow.
- Scale ops. Self-hosting at high trace volume needs a real ClickHouse story.
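On the evals point: the CI gate you end up wiring yourself can be as simple as scoring outputs and failing the build below a threshold. A minimal sketch, where `scoreOutput` and `gateEvals` are hypothetical helpers (not Langfuse APIs) and an exact-match scorer stands in for a real judge:

```typescript
// Stand-in scorer: exact match. In practice you'd use an LLM judge
// or pull scores back from Langfuse after a dataset run.
function scoreOutput(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

// Gate: pass only if the mean score clears the threshold.
function gateEvals(scores: number[], threshold: number): boolean {
  if (scores.length === 0) return false;
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return mean >= threshold;
}

// In CI you would exit non-zero when the gate fails:
// if (!gateEvals(scores, 0.85)) process.exit(1);
```

Braintrust ships this loop pre-built; with Langfuse you assemble it from the pieces above plus its datasets and scores APIs.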
Bottom line
If self-hosting matters or you want OSS, Langfuse is the obvious choice and a credible alternative to closed-source incumbents. If you'd pay anything to skip the ops work and want the most polished eval flow out of the box, look at Braintrust first — but Langfuse is the one we'd start with for almost any team that takes data control seriously.