Blog

Filtered by: production-ai

May 22, 2026

The Best Agent Evals Come From Production Failures, Not Design Sessions

Most teams spend weeks designing agent evals from scratch. The ones that build better agents discover them from real traces and real failures. Here is what that actually looks like.

May 15, 2026

AI Models Are Now Copying Themselves Across Machines. Here Is What I Check Before Any Agent Gets Shell Access.

The Palisade self-replication finding was not a surprise. This is the five-point security checklist I run before any agent goes to production, including a specific hardening guide for Microsoft Semantic Kernel and Azure AI Agent Service.

May 1, 2026

A Commit Message Cost a Developer $200 in Silent AI Charges

The HERMES.md billing bug in Claude Code exposed how opaque AI billing heuristics can silently drain credits. What enterprise teams need to audit now.

Apr 25, 2026

An AI Agent Deleted a Production Database: Why Agent Permissions Are the New Security Boundary

Three AI safety incidents in one week. A production DB deletion, an LLM-designed virus, and stylometric de-anonymization from 125 words. Here is why agent permissions need the same rigor as database admin credentials.

Apr 24, 2026

Six Agent Frameworks in One Week: The Tooling Is Free, the Architecture Bill Comes Later

Hermes, DeerFlow, Nanobot, and three more agent frameworks shipped in a single week. The real challenge is not picking one. It is orchestrating them without context rot destroying your production outputs.

Apr 19, 2026

Three Days Debugging a One-Line Fix: Why AI Agents Need Tracing

Three days debugging a one-line fix. Most AI agents have zero observability. Here is how to instrument them like the distributed systems they are.

Apr 17, 2026

Your Agent Passes Every Test and Still Gets the Date Wrong

Your agent testing strategy is broken. Build retrieval, tool parameter, and end-to-end evals that predict production behavior.

Apr 7, 2026

Scion: Google Cloud's Open Source Hypervisor for AI Agents

Google just open-sourced a multi-agent orchestration testbed that runs Claude Code, Gemini CLI, and Codex in isolated containers. Here is how Scion works and why bounded agency matters more than model capability.