What I Learned Running an AI Agent for 30 Days Straight

Context

Most AI content covers the first hour — the setup, the first prompt, the initial wow moment. Almost none of it covers what happens on Day 15, when the agent has been running cron jobs, managing workflows, and touching real systems around the clock.

This is that post. Thirty days of operating an AI agent as actual infrastructure — not a demo, not a side project — across multiple workstreams, real clients, real data, and real consequences when things break.

Here’s what I didn’t expect.


What We Built

An AI agent running as a persistent process on local hardware, connected to:

  • Email (two accounts across two providers)
  • Calendar (two systems)
  • Project management (two boards, multiple clients)
  • External APIs (search, messaging, file storage)
  • Scheduled jobs (7+ cron tasks running on different cadences)
  • A live WordPress site
  • SSH access to a remote server

The agent wasn’t just answering questions. It was executing tasks, writing files, calling APIs, publishing content, monitoring systems, and escalating to me when something needed human judgment.

Thirty days of that. Here’s what the experience actually taught me.


What I Learned

1. The Agent Doesn’t Get Tired — But the Infrastructure Does

The agent itself is stateless and consistent. It doesn’t have bad days. But the infrastructure it depends on does: APIs rate-limit, OAuth tokens expire, cron jobs fail silently, plugin dependencies break on updates.

By Day 10, I had a list of infrastructure failure modes I hadn’t anticipated. By Day 20, I had a monitoring framework to catch them. By Day 30, the monitoring was itself being monitored.

Lesson: An AI agent is only as reliable as its infrastructure. Treat the infrastructure like production, not a prototype.
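One way to treat the infrastructure like production is a dependency probe that reports *why* something failed, so an expired OAuth token (typically a 401/403) looks different from rate limiting (429) or a dead host. This is a minimal sketch using only the standard library; the endpoint names and URLs are illustrative, not the actual services from this setup.

```python
import urllib.request
import urllib.error

# Hypothetical dependency list -- names and URLs are illustrative.
DEPENDENCIES = {
    "search_api": "https://api.example.com/health",
    "file_storage": "https://storage.example.com/ping",
}

def probe(name: str, url: str, timeout: float = 5.0) -> tuple[str, str]:
    """Return (name, status), where status explains *why* a dependency failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return name, "ok" if resp.status == 200 else f"http {resp.status}"
    except urllib.error.HTTPError as e:
        # 401/403 usually means an expired token; 429 means rate limiting.
        return name, f"http {e.code}"
    except (urllib.error.URLError, TimeoutError) as e:
        return name, f"unreachable: {e}"

def check_all() -> list[tuple[str, str]]:
    """Probe every dependency; feed the non-'ok' entries to your alerting."""
    return [probe(name, url) for name, url in DEPENDENCIES.items()]
```

Running this on a short cadence surfaces the token expirations and rate limits before the agent trips over them mid-task.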

2. Session Memory Is the Wrong Mental Model

I went in thinking of each session as a conversation that the agent would remember. It doesn’t — not natively. Every session starts fresh. If I didn’t write things down in persistent files, they didn’t exist the next day.

This forced a discipline that turned out to be valuable: every decision, configuration detail, discovered fact, and established pattern goes into a structured file before the session ends. The agent’s “memory” is its file system, not its context window.

Lesson: Design for amnesia. The agent you talk to today has no memory of yesterday unless you gave it one. Files are memory. Build the habit of writing to them.
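The "files are memory" habit can be as simple as an append-only log per topic that the next session reads on startup. A minimal sketch, assuming a JSONL layout under a `memory/` directory (the path and record shape are my illustration, not the actual setup):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative layout: one append-only JSONL file per topic.
MEMORY_DIR = Path("memory")

def remember(topic: str, fact: str) -> Path:
    """Append a decision or discovered fact so the next session can read it."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{topic}.jsonl"
    record = {"ts": datetime.now(timezone.utc).isoformat(), "fact": fact}
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return path

def recall(topic: str) -> list[str]:
    """Load everything written about a topic -- this *is* the agent's memory."""
    path = MEMORY_DIR / f"{topic}.jsonl"
    if not path.exists():
        return []
    return [json.loads(line)["fact"] for line in path.read_text().splitlines()]
```

Append-only matters here: the fresh session gets the full history of decisions, not just the latest state, and nothing is silently overwritten.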

3. Local Models Are Great — Until They’re Not

The initial setup used a local model for routine tasks to keep costs down. It worked well for simple summaries, status checks, and lightweight classification. Then I tried to use it for a cron job that needed to call tools and take action based on the output.

The local model produced beautiful prose describing what it would do. It did not actually do it. Tool calls came back as narrative text instead of executable function calls. The job “ran” and produced nothing.

The fix was straightforward once I understood the failure mode — route tool-dependent tasks to a frontier model. But it cost me a week of mysteriously empty cron outputs before I diagnosed it.

Lesson: Local models have a capability ceiling on tool use. If a task requires the model to call functions, not just generate text, test it explicitly with the local model before scheduling it. Don’t assume.

4. Cron Jobs Fail in Interesting Ways

A cron job that errors loudly is easy to fix. A cron job that runs, produces plausible-looking output, and silently does nothing useful is the hard one.

I had three of those in the first two weeks. The morning brief that populated with placeholder text instead of real data. The trend scanner that returned search results but never synthesized them. The monitoring job that ran green because it couldn’t reach the thing it was supposed to monitor and interpreted the silence as “all clear.”

Lesson: Validate cron output, not just cron execution. A successful run code means the process completed — it doesn’t mean the output was useful. Build output validation into the job.
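Output validation can be as blunt as checking the artifact the job was supposed to produce: does the file exist, is it a plausible size, and is it free of placeholder text? A minimal sketch; the size threshold and placeholder markers are illustrative defaults, tuned per job in practice.

```python
from pathlib import Path

# Illustrative markers -- extend per job (e.g. your template's stub strings).
PLACEHOLDERS = ("TODO", "PLACEHOLDER", "[insert", "lorem ipsum")

def validate_output(path: Path, min_bytes: int = 200) -> list[str]:
    """Return a list of problems; an empty list means the output looks real."""
    if not path.exists():
        return ["output file missing"]
    problems = []
    text = path.read_text(errors="replace")
    if len(text.encode()) < min_bytes:
        problems.append(f"suspiciously small ({len(text)} chars)")
    lowered = text.lower()
    for marker in PLACEHOLDERS:
        if marker.lower() in lowered:
            problems.append(f"placeholder text found: {marker!r}")
    return problems
```

The key design point: the validator runs as the last step of the job itself, and any non-empty problem list counts as a failed run even when the exit code was zero.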

5. The Human-in-the-Loop Requirement Is Real

I went in with an “automate everything” mindset. I came out with a more nuanced view: the agent should automate the execution, but the human should approve the direction.

Every external-facing action — emails, published content, client communications — needs explicit approval before it goes. Not because the agent can’t draft well, but because the judgment call about whether this is the right moment, right tone, and right framing requires context the agent doesn’t have access to.

The agents that create problems are the ones operating without an approval gate on outward-facing actions. The agents that create value are the ones that do 90% of the work and hand off the final 10% to a human who can decide.

Lesson: Automate the work, not the judgment. Approval gates on external actions aren’t friction — they’re the design.

6. Security Discipline Compounds

Every day I operated the agent, I found a new place where a credential could have leaked, a permission could have been over-granted, or a data boundary could have been crossed without intent.

None of them did — because the security architecture was built before the agent started operating, not retrofitted after. Keychain for credentials. Namespace isolation by domain. No sensitive data in prompts. No external actions without approval.

The discipline doesn’t feel necessary on Day 1. It feels essential by Day 30.

Lesson: Security architecture for AI agents needs to be designed before the agent starts, not added when something goes wrong.
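The "Keychain for credentials" pattern looks something like this on macOS: secrets are fetched just-in-time via the `security` CLI, so they never sit in a config file, an environment dump, or a prompt. A sketch, assuming the macOS `security find-generic-password` command (the service and account names are illustrative):

```python
import subprocess

def keychain_cmd(service: str, account: str) -> list[str]:
    """Build the macOS `security` invocation; -w prints only the secret."""
    return ["security", "find-generic-password",
            "-s", service, "-a", account, "-w"]

def get_credential(service: str, account: str) -> str:
    """Fetch a secret from the macOS Keychain at call time.

    The secret is read just-in-time and held only in memory -- never
    written to disk by the agent, never interpolated into a prompt.
    """
    out = subprocess.run(keychain_cmd(service, account),
                         capture_output=True, text=True, check=True)
    return out.stdout.rstrip("\n")
```

With `check=True`, a missing or renamed entry fails loudly rather than handing the agent an empty string, which is exactly the failure mode you want for credentials.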

7. The Agent Changes How You Work

By the end of 30 days, I had stopped doing a dozen things I used to do manually — and started spending that time on work that actually required human judgment. Not because I delegated blindly, but because I built a system I trusted enough to delegate to.

That trust was earned incrementally — small tasks first, then larger ones, with approval gates throughout. The agent didn’t replace judgment. It freed up time for it.

Lesson: The value isn’t the automation. It’s the reallocation of attention.


What Broke

  • Three cron jobs producing plausible-but-empty output for over a week before diagnosis
  • One OAuth token expiration that silently broke calendar access for two days
  • One plugin dependency failure that caused a crash loop lasting hours
  • One credential that ended up in the wrong place (caught and rotated immediately)
  • One local model routing decision that cost a week of investigation

All of these were recoverable. None of them were catastrophic. All of them were teachable.


What I Changed

  • Built explicit output validation into every scheduled job
  • Added a 15-minute heartbeat check that monitors the monitors
  • Moved all tool-dependent tasks to frontier models by default
  • Hardened credential handling — Keychain only, naming conventions enforced
  • Added consecutive failure tracking with automatic escalation thresholds
  • Wrote everything down — if it’s not in a file, it didn’t happen
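The consecutive-failure tracking from that list reduces to a small state machine: one failed run is noise, N in a row is a page, and any success resets the streak. A minimal sketch, with the threshold of 3 as an illustrative default:

```python
from collections import defaultdict

class FailureTracker:
    """Escalate only after N consecutive failures of the same job."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streaks: dict[str, int] = defaultdict(int)

    def record(self, job: str, ok: bool) -> bool:
        """Record a run result; return True when the job should escalate."""
        if ok:
            self.streaks[job] = 0  # any success resets the streak
            return False
        self.streaks[job] += 1
        return self.streaks[job] >= self.threshold
```

Pairing this with the output validation above means a "green" run with empty output still counts as a failure, so plausible-but-useless jobs hit the escalation threshold instead of running silently for weeks.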

Takeaways

  • Design for amnesia. The agent starts fresh every session. Files are its memory. Build the habit.
  • Infrastructure fails, agents don’t. Monitor the dependencies, not just the agent.
  • Local models have a tool-call ceiling. Test explicitly before scheduling.
  • Validate output, not just execution. A green run code means nothing if the output is empty.
  • Approval gates on external actions are the design, not a limitation.
  • Security architecture comes first. You can’t retrofit discipline after 30 days of operation.
  • The value is the reallocation of attention. Automate the work. Reserve judgment for humans.

*AI Field Notes is a series documenting real-world AI deployments, governance experiments, and operational lessons from the field. All environments and details are generalized to protect organizational and individual privacy.*

