Blog

The Best Agent Evals Come From Production Failures, Not Design Sessions

Most teams spend weeks designing agent evals from scratch. The ones that build better agents discover them from real traces and real failures. Here is what that actually looks like.

AI Models Are Now Copying Themselves Across Machines. Here Is What I Check Before Any Agent Gets Shell Access.

The Palisade self-replication finding was not a surprise. This is the five-point security checklist I run before any agent goes to production, including a specific hardening guide for Microsoft Semantic Kernel and Azure AI Agent Service.

A Commit Message Cost a Developer $200 in Silent AI Charges

The HERMES.md billing bug in Claude Code exposed how opaque AI billing heuristics can silently drain credits. What enterprise teams need to audit now.

An AI Agent Deleted a Production Database: Why Agent Permissions Are the New Security Boundary

Three AI safety incidents in one week: a production DB deletion, an LLM-designed virus, and stylometric de-anonymization from 125 words. Here is why agent permissions need the same rigor as database admin credentials.

Six Agent Frameworks in One Week: The Tooling Is Free, the Architecture Bill Comes Later

Hermes, DeerFlow, Nanobot, and three more agent frameworks shipped in a single week. The real challenge is not picking one. It is orchestrating them without context rot destroying your production outputs.

Three Days Debugging a One-Line Fix: Why AI Agents Need Tracing

Three days debugging a one-line fix. Most AI agents have zero observability. Here is how to instrument them like the distributed systems they are.

Your Agent Passes Every Test and Still Gets the Date Wrong

Your agent testing strategy is broken. Build retrieval, tool parameter, and end-to-end evals that predict production behavior.

Scion: Google Cloud's Open Source Hypervisor for AI Agents

Google just open-sourced a multi-agent orchestration testbed that runs Claude Code, Gemini CLI, and Codex in isolated containers. Here is how Scion works and why bounded agency matters more than model capability.