Editorial
Long-form analysis, opinionated takes, and listicles on the AI evals space — for engineers picking tools to put in production.
LLM evals and observability company acquisitions
Eight acquisitions in fourteen months — Langfuse, Humanloop, Helicone, Promptfoo, Velvet, Weights & Biases, Statsig, Galileo. Who bought what, the three buyer patterns behind the deals, and what it means if you're picking a tool right now.
How to reduce LLM costs in production
A practical guide to finding where your LLM bill is actually going, fixing the expensive parts, and keeping the savings in place — with notes on the tools we'd reach for at each step.
How to actually lower your LLM bill (without shipping worse output)
Why aggregate dashboards stop being enough once your AI app is real, and the process engineering teams use to find expensive workflow steps, replace them, and ship the change without breaking quality.
The best human-in-the-loop LLM eval tools (2026)
Eight platforms ranked by how well they handle the part of evaluation that automated scorers and LLM judges can't do alone — getting human judgment into the loop and out the other side.
The best LLM gateways (2026)
Four LLM gateways ranked for routing across providers, caching, failover, and the parts of governance that keep production traffic stable.
The best prompt management tools (2026)
Seven prompt management tools, ranked by what they actually solve — from no-code editors to Git-style versioning to eval-first platforms.
The best AI agent observability tools (2026)
Four tools we'd actually pick for monitoring multi-step agents in production — what they cover, where they break, and who each one is for.
Arize AI alternatives (2026)
Four platforms to consider if Arize's ML-first architecture isn't the right fit for an LLM-only workflow — and one honest case for sticking with Arize.
Galileo alternatives (2026)
Five platforms to consider if Galileo's monitoring-and-guardrails focus doesn't cover the full evaluation lifecycle your team needs.
The best LLM monitoring tools, ranked (2026)
Independent rankings of the tools developer teams actually use to monitor LLM apps in production — based on hands-on testing, not press releases.
Why evals are finally the bottleneck
Models stopped being the bottleneck. Evals took the slot — and most teams are still flying blind.