
How We Cut Claude Token Usage 60% With a Multi-LLM Delegation System

We hit 92% of our Claude Max weekly quota mid-week. Instead of upgrading, we built a delegation system: Gemini reads docs, Codex writes code, Claude orchestrates. Here's exactly how.

Agentic Business · Claude Code · AI Tools

Wednesday morning. The Claude Max usage dashboard read 92%. The weekly quota resets Friday. Two days of work remained — a new feature build, three blog posts, an SEO audit, and the daily briefings that keep the business running. The math didn't work.

The obvious answer: upgrade. Pay more. Get more tokens. The less obvious answer, and the one that turned into a permanent architectural change: stop feeding Claude work it doesn't need to do.


The Token Audit

Before building anything, the team ran a token audit. Which skills consumed the most Claude tokens?

Research-heavy skills (/brief, /plan, /seo-optimize, /learn) consumed 30,000 to 50,000+ tokens per invocation. Not because the output was long, but because the input was massive. These skills read through the docs/ folder to build context: product requirements, business plans, decision logs, API documentation. Roughly 850,000 tokens of internal documentation that agents draw on every time they need to understand the project before doing their actual work.

The daily briefing alone consumed 40,000 tokens of context before generating a 3,000-token summary. The actual thinking — the synthesis, the judgment calls, the recommendations — used maybe 5,000 tokens.

90% of token usage was reading. 10% was reasoning.


The Mode Switch

Normal operations run everything through Claude Opus. It's the best model for reasoning, security review, brand voice, and architectural decisions. There's no reason to add complexity when quota is comfortable.

But when quota runs tight — and it does, every week during heavy sprints — the team switches to multi-LLM mode with a single command: /mode multi.

Configuration lives in .claude/llm-mode.json — gitignored, per-machine:

{
  "mode": "multi",
  "gemini_available": true,
  "codex_available": true
}

Every skill that supports delegation checks this file before routing work. If the mode is single (the default), skills run everything through Claude. If multi, bulk reading and code generation get delegated to cheaper models while Claude handles what it does best.

The design rule is non-negotiable: no hard dependencies on external tools. If Gemini CLI breaks in an update, if Codex changes its API, the system falls back to Claude-only operation silently. Multi-LLM delegation is a quota optimization, not a requirement.
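What that check looks like depends on the skill runtime, but here is a minimal sketch, assuming a Node/TypeScript helper and the llm-mode.json layout shown above (the function names are illustrative, not the actual skill code):

import { existsSync, readFileSync } from "node:fs";

type LlmMode = {
  mode: "single" | "multi";
  gemini_available?: boolean;
  codex_available?: boolean;
};

// A missing, unreadable, or malformed config means single mode: delegation is
// strictly opt-in, and the fallback to Claude-only operation is silent.
function readLlmMode(path = ".claude/llm-mode.json"): LlmMode {
  if (!existsSync(path)) return { mode: "single" };
  try {
    return JSON.parse(readFileSync(path, "utf8")) as LlmMode;
  } catch {
    return { mode: "single" };
  }
}

function canDelegateReading(): boolean {
  const cfg = readLlmMode();
  return cfg.mode === "multi" && cfg.gemini_available === true;
}

function canDelegateCoding(): boolean {
  const cfg = readLlmMode();
  return cfg.mode === "multi" && cfg.codex_available === true;
}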

/mode single switches back instantly. Most sessions never leave single mode.


The Cognitive Tier Model

The token audit revealed something useful. Not all LLM work requires the same capability tier.

Reading 850K tokens of documentation and producing a structured summary is bulk processing. It requires a large context window, but not Claude's judgment or creativity. Generating boilerplate code from a detailed specification is pattern matching — it needs code competence, not architectural vision.

Only certain tasks need Claude's full capability: synthesizing multiple sources into strategy, reviewing code for security implications, making architectural decisions with subtle tradeoffs, writing content that matches a specific brand voice.

Match the tool to the cognitive tier:

| Cognitive Tier | Task Type | Best Tool | Why |
| --- | --- | --- | --- |
| Bulk reading | Doc summarization, context building | Gemini CLI | 1M token context, free |
| Bulk writing | Code generation from specs | Codex CLI | Workspace write access, free |
| Orchestration | Synthesis, judgment, review, security | Claude | Strongest reasoning, brand voice |
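Expressed as data, the table is just a routing map with one escape hatch. A sketch, using the same tier names (the identifiers are illustrative):

type Tier = "bulk-reading" | "bulk-writing" | "orchestration";
type Tool = "gemini-cli" | "codex-cli" | "claude";

// The tier table from above: bulk work goes to the cheap large-context tools,
// anything needing judgment stays with Claude.
const ROUTES: Record<Tier, Tool> = {
  "bulk-reading": "gemini-cli",
  "bulk-writing": "codex-cli",
  "orchestration": "claude",
};

// Single mode, or an unavailable delegate, collapses everything back to Claude.
function pickTool(tier: Tier, multiMode: boolean, available: Set<Tool>): Tool {
  const tool = ROUTES[tier];
  return multiMode && available.has(tool) ? tool : "claude";
}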

The Architecture

Two delegation targets. One orchestrator.

Gemini CLI — The Reader

Google's Gemini CLI runs locally with a 1M token context window. The entire docs/ folder — every product doc, decision log, session log, business plan — fits in a single Gemini context window with room to spare.

The delegation pattern: Claude constructs a reading prompt, pipes it to Gemini CLI with the target folder, and reads back a compressed summary. Instead of Claude consuming 50,000 tokens of raw documentation, it consumes a 3,000-token summary that Gemini produced.
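A sketch of that handoff, reusing the same CLI invocation the /brief skill runs (shown again in the walkthrough below); the wrapper function and its naive quoting are illustrative:

import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Pipe a reading prompt to Gemini CLI against the docs/ folder, then read back
// the compressed summary Claude will consume instead of the raw documentation.
function delegateReading(prompt: string): string {
  // Naive shell quoting; fine for a sketch, not for arbitrary prompt text.
  execSync(
    `echo ${JSON.stringify(prompt)} | npx @google/gemini-cli -p "" @docs/ -o text > .gemini_temp_briefing.md`,
    { stdio: "inherit" }
  );
  return readFileSync(".gemini_temp_briefing.md", "utf8"); // ~3,000 tokens, not ~50,000
}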

Gemini is available through a Google subscription the team already pays for. Marginal cost: zero.

Codex CLI — The Developer

OpenAI's Codex CLI runs gpt-5.3-codex with workspace write access. It reads the codebase, generates code, and writes files directly — fast, competent, and available through an existing OpenAI subscription.

The delegation pattern: Claude reads the codebase, decides what needs to be built, writes a hyper-specific implementation prompt (exact file paths, exact function signatures, exact acceptance criteria), and hands it to Codex. Codex writes the code. Claude reviews the diff.

This is the Manager/Developer pattern. Claude is the senior engineer who architects and reviews. Codex is the junior developer who types fast and follows instructions precisely.
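A sketch of that division of labor, using the codex exec invocation described in the review section below; the spec content and file paths are hypothetical:

import { execSync } from "node:child_process";

// Claude, acting as the manager, writes an implementation spec rather than a
// product requirement: exact paths, exact signatures, exact acceptance criteria.
const spec = [
  "Create src/lib/briefing-format.ts.",                           // hypothetical path
  "Export function formatBriefing(sections: string[]): string.",  // hypothetical signature
  "Join sections with markdown H2 headings; throw on an empty array.",
  "Add src/lib/briefing-format.test.ts covering the empty-array case.",
].join("\n");

// Codex, acting as the developer, writes the files directly in the workspace.
execSync(`codex exec -m gpt-5.3-codex -s workspace-write ${JSON.stringify(spec)}`, {
  stdio: "inherit",
});
// Claude then reviews the resulting git diff before anything ships.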

Claude — The Orchestrator

With bulk reading and bulk writing delegated, Claude's role narrows to what it does best:

  • Synthesis. Combining Gemini's summaries with codebase knowledge into actionable plans.
  • Judgment. Choosing between approaches when multiple valid options exist.
  • Review. Examining Codex's generated code for correctness, security, and architectural fit.
  • Security-sensitive code. Authentication, payment processing, RLS policies — anything where a subtle bug has outsized consequences. These never get delegated.
  • Brand voice. Content requiring specific tone and nuance.

How Delegation Works in Practice

The /brief Skill — Before and After

Before (single mode): Claude reads through the docs/ folder directly, processes it, and generates a briefing. Token usage: ~50,000.

After (multi mode):

  1. The skill checks .claude/llm-mode.json — mode is multi.
  2. It constructs a Gemini-specific prompt and executes: echo "prompt" | npx @google/gemini-cli -p "" @docs/ -o text > .gemini_temp_briefing.md
  3. Claude reads the ~3,000-token summary.
  4. Claude spot-checks 2-3 file paths mentioned in the summary to verify accuracy.
  5. Claude enriches the briefing with its own analysis and presents the result.

Token usage: ~5,000-8,000. Savings: ~40,000-45,000 per invocation.

The tradeoff is latency. Gemini takes 60-90 seconds to process the docs folder. Claude would consume those tokens instantly. When quota is tight, 90 seconds of waiting beats running out of tokens on Thursday.

Codex — The Four-Layer Review

When Claude delegates code generation to Codex, trust but verify:

  1. Prompt construction. Hyper-specific: exact file paths, function signatures, TypeScript types, test cases. Implementation specs, not product requirements.
  2. Execution. codex exec -m gpt-5.3-codex -s workspace-write "prompt"
  3. Four-layer verification. Exit code check → Git diff review → npm run type-check → Claude code review (see the sketch after this list).
  4. On failure. Auto-revert to safety commit. Claude writes the code itself. Silent, automatic.
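The four layers above reduce to a short loop. A minimal sketch, assuming a git safety commit exists before delegation starts; the helper names are illustrative, the commands follow the steps listed:

import { execSync, spawnSync } from "node:child_process";

function delegateToCodex(spec: string): boolean {
  // Safety commit captured before Codex touches the workspace.
  const safety = execSync("git rev-parse HEAD").toString().trim();

  // Layer 1: exit code.
  const run = spawnSync(
    "codex",
    ["exec", "-m", "gpt-5.3-codex", "-s", "workspace-write", spec],
    { stdio: "inherit" }
  );
  if (run.status !== 0) return revertTo(safety);

  // Layer 2: surface the diff for review.
  console.log(execSync("git diff").toString());

  // Layer 3: type check.
  const types = spawnSync("npm", ["run", "type-check"], { stdio: "inherit" });
  if (types.status !== 0) return revertTo(safety);

  // Layer 4: Claude's own code review happens on the diff; a rejection there
  // also ends in revertTo(safety), and Claude writes the code itself.
  return true;
}

function revertTo(commit: string): boolean {
  execSync(`git reset --hard ${commit}`); // silent, automatic fallback
  return false;
}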

The hard rule: Security-sensitive code never gets delegated. Authentication, payment processing, RLS policies, webhook signature verification — Claude writes these directly. The cost of a subtle security bug exceeds any token savings.


Results

Across a typical work day with delegation active:

| Skill | Invocations/Day | Tokens Saved/Invocation | Daily Savings |
| --- | --- | --- | --- |
| /brief | 5-8 | ~40,000-50,000 | ~200,000-400,000 |
| /plan | 2-4 | ~30,000-40,000 | ~60,000-160,000 |
| /seo-optimize | 2-3 | ~20,000-30,000 | ~40,000-90,000 |
| /learn | 3-5 | ~25,000-35,000 | ~75,000-175,000 |
| Code generation (Codex) | 5-10 | ~10,000-20,000 | ~50,000-200,000 |
| Total estimated | | | ~425,000-1,025,000/day |

Over a five-day work week, that's roughly 2-5 million tokens saved. The ~60% reduction in Claude token usage is a conservative estimate based on the first two weeks of operation.

The delegation failure rate runs about 8-12%. When delegation fails, the fallback to Claude is automatic and silent.


What We Learned

Gemini hallucinates file paths. Summaries are generally accurate, but Gemini occasionally references files that don't exist. The spot-check step catches this reliably. Without it, hallucinated references would propagate into plans.
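The spot-check itself is deliberately cheap. A minimal sketch (the helper name is illustrative): take the file paths the summary mentions and confirm they exist before they reach a plan.

import { existsSync } from "node:fs";

// Return any paths from Gemini's summary that don't exist on disk, so the
// hallucinated references can be stripped or re-verified before planning.
function findHallucinatedPaths(referencedPaths: string[]): string[] {
  return referencedPaths.filter((p) => !existsSync(p));
}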

Codex needs surgical prompts. High-level intent produces reasonable-looking code that fails type checks. Implementation specs — exact file paths, exact function signatures, exact imports — produce reliable code. The more specific the prompt, the better the output.

"Free" isn't neutral. Each delegation adds latency (60-90 seconds for Gemini, 30-60 seconds for Codex) and complexity. The engineering cost of building the delegation infrastructure is real. Worth it at our token usage level; wouldn't be for a project that fits comfortably within quota.

The cognitive tier model works. The core insight — different tasks need different capability levels — has held up across two weeks of daily use. Reading docs is low-judgment work that a large-context model handles well. Code generation from specs is pattern-matching. Synthesis, judgment, and review genuinely need Claude. The tiers map to real differences in task complexity.


The Broader Pattern

The multi-LLM delegation system is an instance of a broader principle: use the cheapest tool that can do the job well enough.

The AI industry's default assumption is that one model should do everything. That's convenient, and often correct. But when you hit the ceiling — a token quota, a cost budget, a latency requirement — the answer isn't always a bigger model. Sometimes it's a smarter distribution of work across models that are each good at different things.

Wednesday's quota crisis forced the team to build something that should have existed from the start. The best infrastructure decisions often start that way.


See the System in Action

The multi-LLM delegation system is one layer of the agentic business architecture behind MyWritingTwin.com. Every Style Profile — the AI Writing Profile that captures your unique communication style for use with ChatGPT, Claude, Gemini, or any AI — is generated by the same infrastructure described in these building-in-public posts.

Curious what systematic style extraction looks like? Get your Style Profile and see the methodology applied to your writing.