Builder Log, February 25, 2025

The Hook System Trap: Three Gateway Crashes and One Design Lesson

Three real crashes from the same misconfigured hook pattern — and the architecture insight that finally stopped them. The lesson applies to any event-driven system.

Every complex system has a failure mode that looks like a configuration problem until you understand the system well enough to see it is an architecture problem. For my AI gateway, that failure mode was the hook layer. Three crashes, each from the same root cause, each taking hours to trace. The final fix took ten minutes. The lesson applies to any event-driven system.

Context

Extending an AI agent’s startup behavior seems simple: run a script when the gateway starts, validate the environment, send a smoke test notification. Standard stuff.

What’s not obvious until you crash the gateway is that the extension mechanism matters enormously — and using the wrong one doesn’t fail gracefully. It crashes the process on every startup, silently, in a loop.

Here’s what happened, why it happened three times, and how the correct pattern actually works.

What We Were Trying to Build

A startup validator: a script that runs every time the AI gateway initializes, checks that critical dependencies are alive, validates the workspace, and sends a confirmation message. If something’s wrong at startup, catch it immediately rather than discovering it six hours later when a cron job silently fails.

The requirements were simple:

Runs automatically on every gateway start
Checks: port reachability, workspace read/write, cron job config validity, messaging delivery
Sends a pass/fail notification
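
To make those requirements concrete, here is a minimal sketch of what such a validator could look like. It is not the actual script: the function names are illustrative, the cron and messaging checks are left out because they depend on the gateway, and notification delivery is reduced to building a summary string.

```python
import socket
import tempfile
from pathlib import Path


def check_port(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_workspace(workspace: Path) -> bool:
    """Return True if a probe file can be written to and read back from the workspace."""
    try:
        with tempfile.NamedTemporaryFile(dir=workspace, mode="w+", suffix=".probe") as probe:
            probe.write("ok")
            probe.flush()
            probe.seek(0)
            return probe.read() == "ok"
    except OSError:
        # Covers a missing directory as well as permission problems.
        return False


def run_checks(checks: dict) -> tuple[bool, str]:
    """Run named zero-argument checks and build a pass/fail summary line."""
    results = {name: bool(check()) for name, check in checks.items()}
    failed = [name for name, ok in results.items() if not ok]
    if failed:
        return False, "FAIL: " + ", ".join(failed)
    return True, "PASS: " + ", ".join(results)
```

Each check returns a plain boolean, so the summary can go to whatever messaging channel is in use without the checks knowing about it.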

The implementation turned out to be the complicated part.

Crash #1: The Config Key That Doesn’t Exist

The first attempt: set a config key to point to the startup script.

```json
{
  "hooks": {
    "internal": {
      "gateway:startup": "scripts/startup-validator.py"
    }
  }
}
```

Looked reasonable. Applied the config. Gateway crashed on next start. Crashed again on restart. Crashed again.

The config key `hooks.internal["gateway:startup"]` doesn’t exist in the schema. The gateway tried to initialize with an invalid config structure, couldn’t resolve the hook reference, and entered a crash loop: restarting, failing, restarting.

The fix was straightforward: revert the config, remove the invalid key, restart. But diagnosing it required understanding that the gateway was crashing at the config validation stage, not at runtime — which wasn’t obvious from the error output.

Lesson: Invalid config keys don’t fail with a helpful error. They crash the process. Always validate config schema before applying, and always have a known-good config to revert to.
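
The known-good config habit can be sketched in a few lines. This is an illustration under assumptions, not the gateway’s tooling: the helper names are made up, and `is_valid` stands in for whatever schema check is available.

```python
import json
import shutil
from pathlib import Path


def snapshot_known_good(config_path: Path, snapshot_path: Path) -> None:
    """Copy the currently working config aside before editing it."""
    shutil.copy2(config_path, snapshot_path)


def revert_to_known_good(config_path: Path, snapshot_path: Path) -> None:
    """Restore the last known-good config over a broken one."""
    shutil.copy2(snapshot_path, config_path)


def apply_config(config_path: Path, snapshot_path: Path, new_config: dict, is_valid) -> bool:
    """Write new_config only if is_valid(new_config) passes, snapshotting first."""
    if not is_valid(new_config):
        return False  # rejected before the gateway ever reads it
    snapshot_known_good(config_path, snapshot_path)
    config_path.write_text(json.dumps(new_config, indent=2))
    return True
```

If the gateway still crash-loops after a change that passed validation, `revert_to_known_good` gets it running again before the debugging starts.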

Crash #2: The Wrong Hook Namespace

After learning that `hooks.internal` wasn’t the right mechanism, the next attempt targeted a different hook path that seemed more promising based on documentation.

Same result. Different key, same crash loop.

The pattern: the gateway’s hook system has specific, enumerated extension points. Guessing at key names — even plausible-sounding ones — produces the same outcome as using invalid keys. The schema validates strictly, and invalid extension points are fatal.

Lesson: Don’t guess at config key names. Verify against the actual schema before writing any config change. Use schema lookup tools before applying anything that touches extension points.
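
A pre-flight check along these lines would have caught both crashes before a restart. It is a rough sketch: the `ALLOWED` set is a hypothetical stand-in, and the real list of valid paths has to come from the gateway’s actual schema.

```python
def flatten_keys(config: dict, prefix: str = "") -> list[str]:
    """Return dotted paths for every leaf key in a nested dict."""
    paths = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            paths.extend(flatten_keys(value, path))
        else:
            paths.append(path)
    return paths


def unknown_keys(config: dict, allowed: set[str]) -> list[str]:
    """Return every leaf path in config that is not in the allowed set."""
    return [path for path in flatten_keys(config) if path not in allowed]


# Hypothetical allow-list for illustration only.
ALLOWED = {"gateway.port", "workspace.path"}
candidate = {"hooks": {"internal": {"gateway:startup": "scripts/startup-validator.py"}}}
print(unknown_keys(candidate, ALLOWED))  # ['hooks.internal.gateway:startup']
```

An empty result means every key in the candidate config is one the schema knows about. Anything else is a reason not to apply the change.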

Crash #3: The Plugin That Had Missing Dependencies

By this point, the startup validator was working via the correct mechanism (more on that below). But a third crash loop emerged from a different source: a security plugin that had been installed and activated, but whose dependency files were missing from the installation directory.

The plugin hook was registered. The gateway tried to load it on startup. The dependency file wasn’t there. Crash.

The tricky part: the crash looked identical to the earlier config crashes. Same symptom, different cause. Diagnosing it required inspecting the plugin directory and confirming the missing file — not just looking at the gateway error log.

Lesson: Plugin installation doesn’t guarantee plugin completeness. After installing any hook or plugin, verify the directory structure matches what the loader expects. A registered hook with missing dependencies is a crash waiting for the next restart.
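
The post-install check can be as small as this. It is a sketch: the list of expected files is an assumption and has to be taken from whatever the plugin loader actually requires.

```python
from pathlib import Path


def missing_plugin_files(plugin_dir: Path, expected: list[str]) -> list[str]:
    """Return the expected relative paths that are not present as files in plugin_dir."""
    return [rel for rel in expected if not (plugin_dir / rel).is_file()]
```

Running it right after install, and refusing to activate the plugin when the result is non-empty, moves the failure from the next restart to the moment of installation.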

The Correct Pattern: BOOT.md

The actual correct mechanism for running code at gateway startup is a workspace file called `BOOT.md`.

Instead of config keys or hook registrations, you write a markdown file in the workspace root with instructions that execute on startup. The gateway reads this file as part of its boot sequence, interprets the instructions, and executes them in the agent context.

The advantages:

It fails gracefully. If the BOOT.md instructions have an error, the gateway logs it and continues; it doesn’t crash.
It’s workspace-local. The boot instructions live with the workspace, not in the gateway config. They survive config resets and are portable.
It’s readable. BOOT.md is a markdown file with instructions, not a JSON config key pointing to a script path. You can read it, edit it, and understand it without knowing the config schema.
It’s the documented path. It’s the mechanism designed for this use case, which means it’s maintained and schema-valid by definition.

The startup validator, once moved to BOOT.md, worked on the first try. No crash loops. No config rollbacks. No missing dependency errors.
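
The property that matters here is that a failing boot step gets logged and skipped rather than taking the process down. The sketch below illustrates that pattern in general terms. It is not how the gateway implements BOOT.md, and `run_boot_steps` is a name invented for the example.

```python
import logging

logger = logging.getLogger("boot")


def run_boot_steps(steps) -> list[str]:
    """Run each (name, callable) step in order; log failures and keep going.

    Returns the names of the steps that raised.
    """
    failed = []
    for name, step in steps:
        try:
            step()
            logger.info("boot step %s: ok", name)
        except Exception:
            # Record the traceback, then move on to the next step.
            logger.exception("boot step %s: failed, continuing", name)
            failed.append(name)
    return failed
```

Compare that with a hook loader that lets the first exception propagate: the same bug in a startup script turns from a log line into a crash loop.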

What We Learned

The extension mechanism matters more than the extension logic. A perfect startup script using the wrong hook mechanism is worse than no startup script, because it crashes the thing you’re trying to validate.
Invalid config keys don’t warn you. They crash the process. There’s no “unknown key, ignoring” behavior. If the schema doesn’t recognize a key, the gateway fails at initialization. This is strict-mode behavior, and it has real consequences.
Schema lookup before config edit is mandatory, not optional. Every config change that touches hook registration, plugin activation, or extension points needs a schema validation step first. Treat it like a migration review.
The documented path exists for a reason. The temptation to be clever with config structure, to find a “better” way to wire something, is real. In most cases, the documented path is documented because it’s the one that was designed to work reliably. BOOT.md was the answer. It was just not the obvious one.
Same symptom, different cause requires deeper diagnosis. Three crash loops with the same surface behavior had three different root causes. When the same symptom recurs, don’t assume the same fix applies. Dig to the actual cause each time.

What We Changed

Moved all startup extension logic to BOOT.md — no hook registration, no config keys
Added schema validation as a mandatory pre-step before any config change
Added a plugin directory check after every plugin install — verify the expected files exist
Built a config rollback habit: maintain a known-good config snapshot, test changes against it before applying to production
Added crash loop detection to the heartbeat: if the gateway restarts more than twice in 5 minutes, escalate immediately
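
The crash loop rule in the last item is a sliding-window count. Here is a minimal sketch of it, assuming the heartbeat can pass in a timestamp for each gateway restart. The class name and interface are illustrative, not the actual heartbeat code.

```python
from collections import deque


class CrashLoopDetector:
    """Flag a crash loop when restarts exceed a limit within a time window."""

    def __init__(self, max_restarts: int = 2, window_seconds: float = 300.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, now: float) -> bool:
        """Record a restart at time `now` (in seconds); return True if escalation is due."""
        self._restarts.append(now)
        # Drop restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

With the defaults, the third restart inside five minutes returns `True`, which matches “more than twice in 5 minutes”. Passing `now` in explicitly, for example from `time.monotonic()`, keeps the logic easy to test.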

Takeaways

Invalid config keys crash the gateway. They don’t warn you.
Schema lookup before config edit. Every time.
Plugin registration doesn’t mean plugin completeness. Verify the directory.
BOOT.md is the correct startup extension mechanism — use it.
Same symptom ≠ same cause. Diagnose each crash independently.
Keep a known-good config snapshot. Config rollback is faster than debugging.
Crash loops are the loudest signal that something is fundamentally wrong. Don’t try to power through them.

The larger lesson: Event-driven systems fail in ways that sequential systems do not. When the failure is in the hook configuration, every downstream behavior becomes unreliable in ways that are hard to isolate. Build the event layer carefully. Log every hook transition. Test failure modes, not just happy paths.
