Local AI vs. Cloud AI: When to Use Each (And Why It’s Not an Either/Or)

Context

Most organizations approach AI tool selection as a binary: cloud AI (Copilot, ChatGPT, Claude) or nothing. The idea that you’d run AI locally — on hardware you own, with models that never phone home — doesn’t enter the conversation.

Once you’ve actually built a stack that uses both, the question isn’t “local or cloud?” It’s “which layer of the problem does each one solve best?” They’re not competitors. They’re complements — if you design it that way.

Here’s how I think about the routing decision, built from operating both in parallel on real workloads.


What We Built

A hybrid architecture with two tiers:

Local tier: A small language model running on local hardware via Ollama. Handles frequent, low-stakes, privacy-sensitive, or cost-sensitive tasks. Always available, zero marginal cost per call, no data leaves the building.

Cloud tier: Frontier models (Claude, GPT-4-class) accessed via API. Handles complex reasoning, long-form writing, nuanced judgment, multi-step planning. Higher capability ceiling, per-token cost, data leaves local infrastructure.

The agent has a routing policy that determines which tier handles each task. Most decisions are automatic. Some are explicit.
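At its simplest, the automatic part of that policy is a lookup table with room for explicit overrides. A minimal sketch (the category names and tiers here are illustrative placeholders, not the actual policy):

```python
from typing import Optional

# Illustrative routing policy: task category -> model tier.
TASK_TIERS = {
    "log_scan": "local",
    "status_check": "local",
    "classification": "local",
    "strategy_doc": "cloud",
    "client_deliverable": "cloud",
}

def route(category: str, override: Optional[str] = None) -> str:
    """Return the tier for a task. An explicit override always wins;
    unknown categories fall back to the cheapest tier (local)."""
    if override is not None:
        return override
    return TASK_TIERS.get(category, "local")
```

Defaulting unknown categories to local is a design choice: it keeps the cost floor low, and a bad local answer on a routine task is cheaper to notice than a surprise API bill.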


The Routing Framework

Use Local When:

The task is frequent and low-complexity.

System monitoring, log scanning, status checks, classification tasks, simple summarization — these run dozens or hundreds of times a day. Running them against a frontier API would be expensive and unnecessary. A local 8B or 14B model handles them well at zero marginal cost.
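Calling the local tier is one HTTP request to Ollama's default endpoint (`http://localhost:11434/api/generate`). A sketch, where the model tag is an assumption (use whatever you have pulled locally) and real code would add error handling:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt: str, model: str = "llama3.1:8b") -> bytes:
    """Build the JSON payload for a non-streaming Ollama generation call.
    The model tag is illustrative."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")

def ask_local(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to the local model and return its response text.
    Nothing here ever leaves localhost."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```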

The data is sensitive.

Anything touching internal financial data, client information, personnel details, or anything that would be uncomfortable if it appeared in a training dataset — run it locally. The data physically cannot leave if the model never calls an external endpoint.

Latency matters more than quality.

Local inference skips the network round trip entirely. If you need a near-instant response for a lightweight classification or routing decision, local wins on latency.

You need predictability and offline capability.

Cloud APIs go down. Rate limits get hit. A local model is available whenever your hardware is, works without an internet connection, and behaves the same until you choose to change it.

Use Cloud When:

The task requires genuine reasoning depth.

Complex strategy, nuanced writing, multi-document synthesis, ambiguous judgment calls — these benefit from the capability ceiling of frontier models. A local 14B model will give you an answer. A frontier model will give you a better one.

The output is high-stakes or external-facing.

Anything that goes to a client, gets published, or drives a major decision warrants the best available model. The marginal cost of a frontier API call is trivial compared to the cost of a bad output in the wrong context.

Context window matters.

Local models run with whatever context window your hardware can hold in memory at acceptable speed, which is usually modest. Long documents, multi-document analysis, or tasks requiring extensive conversation history need the larger context available from frontier providers.

You need the latest capabilities.

Frontier models are updated continuously. Local models are static until you update them. For tasks where state-of-the-art matters, cloud wins by definition.
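Put together, the framework above reduces to a few checks applied in priority order. A sketch, with the task attributes and the local context cutoff as assumptions:

```python
from dataclasses import dataclass

@dataclass
class Task:
    sensitive: bool = False          # client / financial / personnel data
    high_stakes: bool = False        # external-facing or decision-driving
    needs_deep_reasoning: bool = False
    context_tokens: int = 0

LOCAL_CONTEXT_LIMIT = 8_000  # assumed ceiling for the local model

def choose_tier(task: Task) -> str:
    """Apply the routing framework in priority order."""
    if task.sensitive:
        return "local"   # privacy boundary beats every other criterion
    if task.high_stakes or task.needs_deep_reasoning:
        return "cloud"
    if task.context_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"
    return "local"       # default: cheapest model that does the job
```

The ordering matters: sensitivity is checked first, so a sensitive task stays local even when it would otherwise qualify for cloud on complexity.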


What Broke (or Almost Broke)

Routing Everything to Frontier Models

Before the hybrid architecture was formalized, the default was to use frontier models for everything. It worked. It was also unnecessary and expensive.

Heartbeat checks running every 15 minutes. Log scans running on a schedule. Routine status summaries. None of these required the reasoning depth of a frontier model, but they were all going to one anyway.
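The arithmetic makes the waste concrete. Assuming roughly 500 tokens per heartbeat call and an illustrative frontier rate of $3 per million input tokens (both numbers hypothetical), a single 15-minute schedule looks small on its own, but it buys nothing over local, and it compounds across every recurring task:

```python
CALLS_PER_DAY = 24 * 60 // 15   # heartbeat every 15 minutes = 96 calls/day
TOKENS_PER_CALL = 500           # assumed prompt size for a status check
PRICE_PER_MTOK = 3.00           # illustrative frontier input price, USD

daily_tokens = CALLS_PER_DAY * TOKENS_PER_CALL
annual_cost = daily_tokens / 1_000_000 * PRICE_PER_MTOK * 365

print(f"{CALLS_PER_DAY} calls/day -> ${annual_cost:.2f}/year on a frontier API")
# The same workload on the local tier has zero marginal cost per call.
```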

The fix: explicit routing policy. Define which task categories belong to which tier before you start, not after you see the bill.

Lesson: Default routing to the cheapest model that can do the job acceptably well, not the best model available.

Assuming Local Quality Would Be Sufficient for Complex Tasks

The opposite failure: trying to run genuinely complex reasoning tasks through a local model to save cost. The outputs were plausible but subtly wrong in ways that weren’t always obvious on first read.

For internal summaries and quick checks, “plausible but imprecise” is acceptable. For strategy documents, client deliverables, or anything that drives decisions — it isn’t.

Lesson: Know the quality floor your use case requires. Local models have a real capability ceiling. Don’t pretend otherwise.

No Privacy Boundary Enforcement

Early on, routing decisions were made case-by-case with no systematic privacy check. Sensitive data occasionally ended up at cloud endpoints it should never have touched.

The fix: a standing rule, enforced in the routing policy, that any task containing client-identifiable data, internal financial information, or personnel details routes to local only — regardless of complexity. If the task is too complex for local to handle well, the data gets summarized locally first, then the sanitized summary goes to cloud.
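The sanitize-then-escalate flow can be sketched as a two-step pipeline. Everything here is illustrative: the redaction patterns, the function names, and the division of labor. A real boundary would use the local model itself (or a proper PII detector) rather than a couple of regexes:

```python
import re

# Illustrative redaction patterns -- a real boundary needs far more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
}

def sanitize(text: str) -> str:
    """Replace obviously sensitive spans with placeholder tokens
    before anything crosses the privacy boundary."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def analyze(raw_notes: str, summarize_locally, ask_cloud) -> str:
    """The local tier summarizes and redacts; only the sanitized
    summary is allowed to reach the cloud tier."""
    summary = sanitize(summarize_locally(raw_notes))
    return ask_cloud(f"Analyze this sanitized summary:\n{summary}")
```

The key property is structural: the cloud-facing call only ever receives the output of `sanitize`, so the boundary holds even when an individual routing decision is made carelessly.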

Lesson: Privacy routing rules need to be systematic, not ad-hoc. Build the boundary into the policy before the failure, not after.


What We Learned

Local and cloud AI are complements, not competitors. The question is always “which tier for which task?” — never “which one do we use?”

Cost optimization and privacy protection point to the same solution. Run sensitive, frequent, low-complexity tasks locally. Both objectives are served by the same routing decision.

The routing policy is the real product. The models are commodities. The decision framework that determines which model handles which task — that’s what you actually build and maintain.

“Good enough” is a legitimate quality bar. For a heartbeat check, a local model that’s 90% as accurate as a frontier model at zero marginal cost is the right choice. Perfection isn’t always the goal.


What We Changed

  • Formalized a written routing policy: task category → model tier, with explicit exceptions
  • Added a privacy check as a mandatory pre-routing step for any task touching sensitive data
  • Moved all scheduled, recurring tasks (monitoring, log checks, status reports) to local by default
  • Reserved frontier models for writing, reasoning, strategy, and external-facing outputs

Takeaways

  • Default to the cheapest model that meets the quality bar. Escalate when needed.
  • Frequent + low-complexity + sensitive = local. Always.
  • High-stakes + external-facing + complex reasoning = cloud. Always.
  • Privacy routing is a policy, not a judgment call. Systematize it.
  • The routing framework is the real infrastructure. Build it deliberately.
  • Sanitize locally, reason in cloud. When sensitive data needs complex analysis, summarize it locally first, then pass the summary to the frontier model.

*AI Field Notes is a series documenting real-world AI deployments, governance experiments, and operational lessons from the field. All environments and details are generalized to protect organizational and individual privacy.*

