Heartbeat Architecture: How to Monitor an AI Agent That Never Stops Running

Context

When you run an AI agent as persistent infrastructure — scheduled jobs, always-on monitoring, around-the-clock availability — you immediately hit a problem that doesn’t exist in the demo world: how do you know the agent is actually working?

Not just “is the process running?” but: Is it producing useful output? Are its dependencies healthy? Are its jobs succeeding or silently failing? Is anything drifting from baseline?

The answer is a heartbeat system. Here’s how ours is designed, what it monitors, and what we learned building it.


What We Built

A heartbeat job runs every 15 minutes during active hours. It’s the agent monitoring itself — a lightweight check that scans infrastructure health, job status, and dependency availability, then surfaces only what changed.

The design principles:

Diff-only output. If everything is the same as last check, the heartbeat returns a single token: `HEARTBEAT_OK`. No output means no noise. Only changes surface.
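The diff-only rule can be sketched in a few lines — compare the current snapshot of check results against the previous one and collapse "no change" to the single token. The snapshot keys here are illustrative, not the production schema:

```python
import json

def heartbeat_output(current: dict, previous: dict) -> str:
    """Return the single OK token when nothing changed, else a diff report."""
    if current == previous:
        return "HEARTBEAT_OK"
    # Surface only the checks whose results differ from the last interval
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    return json.dumps({"changes_detected": True, "changed": changed})

# Identical snapshots collapse to one token
print(heartbeat_output({"gateway": "up"}, {"gateway": "up"}))  # HEARTBEAT_OK
```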

Strict schema. Every non-OK heartbeat returns structured JSON — timestamp, status (green/yellow/red), top 3 items needing attention, whether human escalation is required, and which model should handle escalation if needed.

Cheap model, escalate when warranted. The heartbeat runs on a local model — free, fast, always available. If the status comes back yellow or red, it escalates to a frontier model for deeper analysis. Routine checks shouldn’t cost frontier API tokens.

Consecutive failure tracking. A single failure is noise. Two consecutive failures on the same check is a pattern. Three is an alert. The heartbeat tracks failure counts across intervals and escalates based on streak, not individual events.


What the Heartbeat Checks

Every interval, the heartbeat verifies:

Infrastructure health

  • Gateway process reachability
  • Memory plugin status
  • External API authentication (email, calendar, project tools)
  • SSL certificate validity and days-to-expiry
  • Qdrant / vector store health

Job health

  • Scheduled cron jobs — any consecutive failures?
  • Job output validation — did the last run produce meaningful output or empty results?
  • Token/auth expiry approaching?

Security posture

  • Are any monitored systems drifting from baseline?
  • Any anomalous access patterns since last check?
  • Credential files still at correct permissions?

Cost controls

  • Estimated token spend since last interval
  • Any single job burning disproportionate tokens?
  • Daily total trending toward alert threshold?

The checklist is designed to be exhaustive at every interval — not because every check will find something, but because the ones that matter will fire when they matter.
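A check runner for an interval like this can be a simple registry loop. The checks below are stand-ins (real ones would ping the gateway, query the vector store, stat credential files); the point is the shape — every check runs, and a crashing check counts as a failure rather than aborting the interval:

```python
from typing import Callable

# Illustrative check registry: each check returns (ok, detail).
CHECKS: dict[str, Callable[[], tuple[bool, str]]] = {
    "gateway": lambda: (True, "reachable"),
    "ssl_days_to_expiry": lambda: (True, "41 days"),
    "cron_streaks": lambda: (True, "no consecutive failures"),
    "token_spend": lambda: (True, "within budget"),
}

def run_interval() -> dict[str, dict]:
    """Run every registered check and collect results for later interpretation."""
    results = {}
    for name, check in CHECKS.items():
        try:
            ok, detail = check()
        except Exception as exc:  # a crashing check is itself a failure
            ok, detail = False, f"check raised: {exc}"
        results[name] = {"ok": ok, "detail": detail}
    return results
```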


The Output Schema

When something needs attention, the heartbeat returns:

```json
{
  "timestamp_pt": "2026-03-20 14:30 PT",
  "status": "yellow",
  "changes_detected": true,
  "top_3": [
    {
      "item": "Calendar auth token expiring in 6 hours",
      "impact": "high",
      "action": "Refresh OAuth token before next Morning Brief",
      "owner": "agent"
    }
  ],
  "requires_human": false,
  "escalate_to": "frontier-model",
  "notes": "All other checks green. Token refresh can be handled autonomously."
}
```

Three items maximum. Impact scored low/med/high. Action specified. Owner assigned. Escalation model named if needed.

The constraint on three items is deliberate. If the heartbeat tries to surface everything, it surfaces nothing useful. Force-rank to three and the signal stays clear.
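Because a model generates this JSON, the schema is worth enforcing in code rather than trusting on faith. A minimal validator under the constraints above (status vocabulary, three-item cap, impact scale, mandatory fields) might look like:

```python
def validate_heartbeat(payload: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    if payload.get("status") not in {"green", "yellow", "red"}:
        errors.append("status must be green/yellow/red")
    top = payload.get("top_3", [])
    if len(top) > 3:
        errors.append("top_3 may hold at most three items")
    for item in top:
        if item.get("impact") not in {"low", "med", "high"}:
            errors.append(f"bad impact on item: {item.get('item')}")
        if not item.get("action") or not item.get("owner"):
            errors.append("every item needs an action and an owner")
    if "requires_human" not in payload:
        errors.append("requires_human is mandatory")
    return errors
```

Rejecting an over-long `top_3` at the schema layer is what makes the force-rank constraint real: the model cannot quietly surface everything.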


What Broke (or Almost Broke)

The Silent Failure Mode

The first version of the heartbeat checked whether jobs ran. It didn’t check whether jobs produced useful output. A cron job could execute successfully, produce empty results, and the heartbeat would report green.

This happened. A monitoring job was silently returning no data for three days before we caught it — not because the heartbeat missed it, but because the heartbeat wasn’t asking the right question.

The fix: added output validation as a first-class check. The heartbeat now verifies that scheduled jobs produced non-empty, structurally valid output — not just that the process exited with code 0.

Lesson: “Did it run?” and “Did it work?” are different questions. Monitor both.
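The "did it work?" check can be as simple as verifying the job's last artifact is non-empty and parseable. This sketch assumes jobs write JSON artifacts to a known path — the path convention is illustrative:

```python
import json
from pathlib import Path

def job_output_ok(path: str) -> bool:
    """Verify the last run left non-empty, structurally valid output."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        return False  # the job never wrote anything
    try:
        data = json.loads(p.read_text())
    except json.JSONDecodeError:
        return False  # the job wrote garbage
    # An empty list/dict means the job "ran" but found nothing to report
    return bool(data)
```

Note the last line: exit code 0 plus a well-formed but empty `[]` is exactly the silent failure mode described above, and it fails this check.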

The Model That Couldn’t Use Tools

The early heartbeat implementation used a local model. It was fast and free. It also couldn’t reliably call tools to actually check the systems it was supposed to monitor.

The model would generate a heartbeat response describing what it would check — without actually checking. Every interval came back green because the model was narrating a health check, not executing one.

The fix: heartbeat checks that require tool calls (API pings, file reads, database queries) run the checks first, then pass the results to the model for interpretation. The model’s job is to interpret pre-fetched data, not to fetch it autonomously.

Lesson: Don’t ask a language model to be an orchestrator when you need a data fetcher. Separate the fetching from the reasoning.
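The separation looks roughly like this: deterministic code gathers the facts, and the model only ever sees pre-fetched data in its prompt. The stub facts and prompt wording are illustrative, not the production pipeline:

```python
def gather_facts() -> dict:
    """Deterministic fetching: pings, file stats, queries.

    Stub values stand in for real probes here.
    """
    return {"gateway": "up", "ssl_days": 41, "cron_failures": {"brief": 2}}

def build_prompt(facts: dict) -> str:
    """The model interprets pre-fetched facts; it is never asked to fetch."""
    lines = [f"{k}: {v}" for k, v in sorted(facts.items())]
    return (
        "Interpret these pre-collected health facts and emit the heartbeat "
        "JSON. Do not claim to have checked anything yourself.\n"
        + "\n".join(lines)
    )
```

With this split, a model that narrates instead of acting can no longer fake a green check — the facts in the prompt are real whether or not the model is capable of tool use.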

Alert Fatigue from Single-Event Escalation

The first escalation policy: any check failure triggers a human notification. Within a week, transient failures — a momentary API timeout, a slow response that crossed a threshold — were generating alerts for things that self-resolved in the next interval.

The fix: consecutive failure thresholds. One failure = log it. Two consecutive = yellow. Three consecutive = alert. The policy eliminates transient noise while catching real patterns.

Lesson: Alert on streaks, not events. Single failures are noise. Consecutive failures are signal.
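The streak policy reduces to two small functions: one that updates per-check streaks each interval (a pass resets the count, a failure extends it), and one that maps a streak to an action under the 1 = log, 2 = yellow, 3 = alert thresholds:

```python
def update_streaks(streaks: dict[str, int], results: dict[str, bool]) -> dict[str, int]:
    """A passing check resets its streak; a failing one extends it."""
    return {name: 0 if ok else streaks.get(name, 0) + 1
            for name, ok in results.items()}

def classify(streak: int) -> str:
    """Map a consecutive-failure streak to an action."""
    if streak <= 0:
        return "ok"
    if streak == 1:
        return "log"      # single failure is noise
    if streak == 2:
        return "yellow"   # pattern forming
    return "alert"        # three or more: signal
```

A transient API timeout produces a streak of 1, gets logged, and self-resolves to 0 on the next interval — no human ever hears about it.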


What We Learned

A heartbeat is cheap insurance. The cost of running a lightweight local model every 15 minutes is negligible. The cost of a silent failure running undetected for three days is not.

Diff-only output is the right design. If every heartbeat produces a wall of status text, humans stop reading it. If it only surfaces changes, every message is signal.

The heartbeat should monitor the heartbeat. A heartbeat that fails silently is worse than no heartbeat. Build in a canary: if the heartbeat hasn’t fired in 30 minutes during active hours, something is wrong with the heartbeat itself.

Force-ranking to three items is a feature. It prevents the heartbeat from becoming a dashboard that nobody reads. Three items, ranked by impact, with a clear owner and action. That’s the format.


What We Changed

  • Added output validation as a first-class heartbeat check — not just execution status
  • Separated data fetching from model interpretation in the heartbeat pipeline
  • Implemented consecutive failure thresholds (1 = log, 2 = yellow, 3 = alert)
  • Added a meta-check: if heartbeat hasn’t fired in active hours, escalate
  • Moved cost monitoring into the heartbeat interval rather than running it separately

Takeaways

  • “Did it run?” and “Did it work?” are different questions. Monitor both.
  • Diff-only output keeps the signal clear. If everything is fine, one token is enough.
  • Alert on streaks, not events. Consecutive failures are signal. Single failures are noise.
  • The heartbeat should monitor itself. A silently broken monitor is worse than no monitor.
  • Force-rank to three items. More than that and nothing gets acted on.
  • Separate data fetching from model reasoning. The model interprets results — it doesn’t retrieve them.
  • Cheap model for routine checks, frontier model for escalation. Tier the cost to the complexity.

*AI Field Notes is a series documenting real-world AI deployments, governance experiments, and operational lessons from the field. All environments and details are generalized to protect organizational and individual privacy.*

