Context
The AI gateway had been running stably for weeks. Then it wasn’t. For nearly two weeks, the gateway entered a crash loop — restarting every few seconds, accumulating hundreds of orphaned session files, and producing no useful output while consuming enough CPU to make the host system sluggish.
The failure wasn’t caused by one thing. It was caused by four things that all happened to be present at the same time, in a combination that none of them would have triggered independently. Diagnosing and resolving it required understanding each layer separately.
This is the incident postmortem — and the architectural changes that came out of it.
What Happened
The Crash Loop
The gateway process was restarting every 7–10 seconds. Each restart generated a temporary session file. Over the course of the incident, hundreds of orphaned `.tmp` session files accumulated in the sessions directory. Each new startup tried to process these files, slowing the boot sequence further and making the loop harder to break.
From the outside, the gateway appeared to be running — the process was alive. From the inside, it was restarting before it could initialize enough to accept any work.
Root Cause 1: Broken Plugin API Auth
A memory plugin was configured to use an external API for embeddings. The API key had expired. On every startup, the plugin tried to initialize, received a 401 error, and the initialization failure cascaded into a SQLite database crash — not a clean error, but a crash that the gateway’s error handling didn’t recover from gracefully.
The plugin was designed to handle API errors at query time. It wasn’t designed to handle API errors at initialization time. The initialization path had no fallback.
Root Cause 2: Incomplete Plugin Installation
A security plugin had been installed and its hook registered in the config. The hook pointed to a dependency file that didn’t exist in the plugin directory — the installation was incomplete. On every startup, the gateway tried to load the hook, couldn’t find the file, and crashed.
Neither the installation process nor the gateway surfaced a clear error about the missing file. The crash looked identical to the API auth crash from the outside.
Root Cause 3: Orphaned Session Files
By the time the crash loop was diagnosed, hundreds of `.tmp` session files had accumulated. These files are created at session start and deleted on clean exit. A crash loop that never reaches clean exit accumulates them indefinitely.
The files themselves weren’t causing the crash loop — but they were making each restart slower, and they were a clear diagnostic signal that something had been crash-looping for a long time before anyone noticed.
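Turning that accumulation into a concrete diagnostic is straightforward: count the orphaned files and read their timestamps. A minimal sketch, assuming a sessions directory of `.tmp` files (the function name and output format are illustrative, not from the actual gateway):

```python
from pathlib import Path

def diagnose_orphans(sessions_dir: str) -> None:
    """Count orphaned .tmp session files and estimate how long
    the crash loop has been running from their timestamps."""
    tmp_files = sorted(Path(sessions_dir).glob("*.tmp"),
                       key=lambda p: p.stat().st_mtime)
    if not tmp_files:
        print("no orphaned session files")
        return
    oldest = tmp_files[0].stat().st_mtime
    newest = tmp_files[-1].stat().st_mtime
    span_hours = (newest - oldest) / 3600
    # Each file roughly corresponds to one restart, so the count
    # approximates how many times the gateway has crash-looped.
    print(f"{len(tmp_files)} orphaned files spanning {span_hours:.1f} hours")
```

The file count approximates the number of restarts, and the spread between the oldest and newest timestamps approximates how long the loop has been running.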
Root Cause 4: Missing GUI Session
The gateway runs as a system service under a dedicated user account. On the affected system, that user account had no active GUI session — no desktop login. A critical bootstrap step required a GUI session context to complete. Without it, the LaunchAgent returned a bootstrap error and the service failed to start cleanly.
This failure was masked by the crash loop. The service appeared to be “running” (the process kept restarting), but it was never actually initialized.
The Diagnosis Process
Step 1: Establish that it’s a crash loop, not a single failure.
The session files were the tell. Hundreds of files meant hundreds of restarts. This wasn’t a one-time crash — it was a sustained loop.

Step 2: Isolate the startup failure from the runtime failure.
The gateway was crashing before it could accept any connections. This meant the failure was in the initialization sequence, not in request handling.
Step 3: Examine each plugin and hook independently.
With initialization as the suspected failure point, each registered plugin and hook was examined in turn. The incomplete plugin installation was found by looking at the directory contents, not the error logs.
Step 4: Address the underlying causes, not just the symptoms.
Restarting the gateway without addressing the root causes would have just resumed the crash loop. Each cause needed to be resolved before the restart: broken plugin disabled, incomplete plugin unregistered, orphaned files cleaned, GUI session established.
Step 5: Verify clean startup before declaring resolution.
After fixes, a clean startup with no `.tmp` file accumulation and no error log growth was the confirmation that the loop was broken.
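That confirmation step can be automated. A sketch of a post-restart check, assuming a sessions directory and an error log path (both assumptions about the deployment layout):

```python
import time
from pathlib import Path

def verify_clean_startup(sessions_dir: str, error_log: str,
                         window_secs: int = 60) -> bool:
    """Confirm the crash loop is broken: no new .tmp files and no
    error-log growth over an observation window."""
    tmp_before = len(list(Path(sessions_dir).glob("*.tmp")))
    log = Path(error_log)
    log_before = log.stat().st_size if log.exists() else 0
    time.sleep(window_secs)  # let the gateway run unattended
    tmp_after = len(list(Path(sessions_dir).glob("*.tmp")))
    log_after = log.stat().st_size if log.exists() else 0
    return tmp_after <= tmp_before and log_after == log_before
```

A quiet window — no new session files, no error-log growth — is the signal that startup actually completed rather than merely beginning again.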
What We Changed
Plugin Isolation and Dependency Verification
Every plugin installation now includes a post-install check: verify that the files the hook references actually exist in the plugin directory. Registration of a hook without the corresponding dependency files is blocked.
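The shape of that post-install check is simple. A sketch, assuming hook configs carry a list of referenced files under a `files` key (the config shape and function name are assumptions):

```python
from pathlib import Path

def verify_hook_dependencies(plugin_dir: str, hook_config: dict) -> list[str]:
    """Return the files a hook references that are missing from the
    plugin directory. Registration is blocked if this list is non-empty."""
    root = Path(plugin_dir)
    return [f for f in hook_config.get("files", [])
            if not (root / f).exists()]
```

If the returned list is non-empty, registration is refused and the missing files are surfaced directly — the error this incident never produced.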
For plugins that require external API auth, initialization failures are caught and logged — the plugin disables itself gracefully rather than crashing the gateway.
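The init-time fallback looks roughly like this. A sketch with a hypothetical API client standing in for the real embeddings client:

```python
import logging

class EmbeddingPlugin:
    """Sketch of initialization that degrades instead of crashing.
    The api_client and its authenticate() call are hypothetical."""

    def __init__(self, api_client):
        self.api = api_client
        self.enabled = False

    def initialize(self) -> None:
        try:
            self.api.authenticate()  # may raise on a 401 / expired key
            self.enabled = True
        except Exception as exc:
            # Init-time failure: log and self-disable rather than letting
            # the error cascade into the gateway's startup path.
            logging.warning("embedding plugin disabled: %s", exc)
            self.enabled = False
```

The key design choice is that `initialize()` never raises: an auth failure leaves the plugin disabled and the gateway running, instead of cascading into the database crash that started this incident.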
Session File Monitoring
The heartbeat now checks for `.tmp` file accumulation in the sessions directory. More than 10 `.tmp` files in the sessions directory is a yellow flag. More than 50 is a red flag that triggers immediate escalation — it means something is crash-looping and accumulating files faster than cleanup is running.
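The heartbeat check itself is a few lines. A sketch using the thresholds above (the traffic-light return values are illustrative):

```python
from pathlib import Path

YELLOW_THRESHOLD = 10  # thresholds from the heartbeat policy above
RED_THRESHOLD = 50

def check_session_files(sessions_dir: str) -> str:
    """Classify .tmp accumulation for the heartbeat check."""
    count = len(list(Path(sessions_dir).glob("*.tmp")))
    if count > RED_THRESHOLD:
        return "red"     # crash loop likely; escalate immediately
    if count > YELLOW_THRESHOLD:
        return "yellow"  # accumulation warning; investigate
    return "green"
```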
Startup Validation
A BOOT.md startup validator runs on every gateway initialization. It checks:
- Gateway port reachability (can the process accept connections?)
- Workspace read/write permissions
- Cron job config validity (JSON parseable, referenced files exist)
- A test message delivery (can the gateway actually reach its messaging channel?)
A startup that fails any of these checks surfaces immediately, rather than appearing healthy while actually being broken.
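The first three checks can be sketched as follows. This is a minimal illustration, not the actual validator: the cron config is assumed to be a JSON list of jobs with a `script` key, and the message-delivery test is omitted because it depends on the deployment’s channel:

```python
import json
import socket
from pathlib import Path

def validate_startup(port: int, workspace: str, cron_config: str) -> list[str]:
    """Run the startup checks and return a list of failures (empty = healthy)."""
    failures = []
    # 1. Port reachability: can the process accept connections?
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            pass
    except OSError:
        failures.append("gateway port unreachable")
    # 2. Workspace read/write permissions
    try:
        probe = Path(workspace) / ".startup_probe"
        probe.write_text("ok")
        probe.unlink()
    except OSError:
        failures.append("workspace not writable")
    # 3. Cron config validity: JSON parseable, referenced files exist
    try:
        jobs = json.loads(Path(cron_config).read_text())
        for job in jobs:
            if not Path(job["script"]).exists():  # key name is an assumption
                failures.append(f"missing cron script: {job['script']}")
    except (OSError, json.JSONDecodeError, KeyError, TypeError):
        failures.append("cron config invalid")
    return failures
```

A non-empty failure list halts startup and surfaces the specific check that failed, instead of letting the process loop silently.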
GUI Session Requirement Documentation
The requirement for an active GUI session on the service account is now documented as a first-class infrastructure requirement. Auto-login is configured for the service account on the host machine. Any future host configuration change that affects the GUI session triggers a gateway health verification.
Known-Good Config Snapshots
A known-good config snapshot is maintained and tested monthly. Any config change that touches plugin registration, hook configuration, or extension points requires a rollback plan before being applied.
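The mechanics of snapshot and rollback can be as simple as a verified copy. A sketch, assuming single-file configs (paths and function names are illustrative):

```python
import hashlib
import shutil
from pathlib import Path

def snapshot_config(config: str, snapshot: str) -> str:
    """Copy the current config to the known-good snapshot location and
    return its SHA-256 so later drift can be detected."""
    shutil.copy2(config, snapshot)
    return hashlib.sha256(Path(snapshot).read_bytes()).hexdigest()

def rollback_config(config: str, snapshot: str) -> None:
    """Restore the known-good snapshot over the live config."""
    shutil.copy2(snapshot, config)
```

The returned hash doubles as the monthly test: if hashing the snapshot later gives a different value, the known-good copy itself has drifted or been corrupted.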
Warning Signs: What to Watch For
These are the early indicators that a crash loop may be developing or has already started:
| Signal | What It Means |
|--------|---------------|
| `.tmp` files accumulating in the sessions directory | Something is crash-looping and not cleaning up |
| `SQLITE_CANTOPEN` in the error log | A plugin’s database initialization is failing |
| Gateway process restarting frequently | Initialization is failing before stable operation |
| Bootstrap error 125 in service logs | Service account lacks a GUI session |
| Hook reference errors on startup | A plugin was registered but its files are missing |
Any two of these signals together is grounds for immediate investigation. All of them together is a full incident.
What We Learned
Crash loops are harder to diagnose than clean failures. A crash loop masks its own root cause by producing noise — the same symptom (gateway unavailable) for potentially different reasons across different restart cycles. Diagnosis requires stopping the loop, not just examining the running process.
Plugin installation is not the same as plugin readiness. A hook can be registered in config without the files it references actually being present. The gap between “installed” and “functional” is where this category of failure lives.
Initialization failures and runtime failures need separate error handling. A plugin designed to handle API errors at query time is not necessarily designed to handle them at initialization. These are different code paths that need separate error handling.
Orphaned files are a diagnostic signal, not just a cleanup task. A large accumulation of temporary session files is evidence of a sustained crash loop. The number of files tells you how long the loop has been running.
A healthy-looking process can be non-functional. The gateway process was “running” for nearly two weeks while producing nothing useful. Process uptime is not the same as service availability. Monitor actual function, not just process status.
Takeaways
- `.tmp` file accumulation is a crash loop indicator. Monitor the sessions directory.
- Plugin registration ≠ plugin readiness. Verify directory contents after every install.
- Initialization failures need separate handling from runtime failures.
- A crash loop masks its own root cause. Stop the loop before diagnosing.
- Process uptime ≠ service availability. Monitor actual function.
- Startup validation catches initialization failures before they become crash loops.
- GUI session requirements for service accounts are infrastructure requirements. Document them.
- Known-good config snapshots enable fast rollback. Maintain one.
*AI Field Notes is a series documenting real-world AI deployments, governance experiments, and operational lessons from the field. All environments and details are generalized to protect organizational and individual privacy.*