I racked up $280 in Claude charges in 45 minutes.
I was in the middle of a complex build - multiple files, deep architectural reasoning, the kind of work where you genuinely need the best model available. I was pushing hard, dispatching task after task, and I wasn't paying attention to the meter. When I finally checked, I'd burned through what should have been days of allocation in less than an hour.
That was a Tuesday. I still had the rest of the week ahead of me.
Most people hit a rate limit and do one of two things. They wait for the reset. Or they switch to a different tool and lose all their context. Both of those are fine if you're building one thing on the side. But I'm not building one thing. I'm running 16 products, autonomous workflows, research agents, and iterating on code - all from the terminal. Stopping for 6 hours because I've burned through my allocation isn't an option.
So I did what I always do when the same problem hits me three times. I built a system.
What I Was Doing Before
The honest answer is: wasting tokens.
I was sending everything to Claude Opus. Complex multi-file builds that required deep architectural reasoning? Opus. Formatting a test output? Also Opus. Generating a simple config file? Opus again. Every task got the same model, the same cost, the same token burn - regardless of whether the work actually needed that level of reasoning.
The first time I hit the rate limit mid-build, I waited. Sat there for hours doing nothing. The second time, I opened ChatGPT in a browser tab and tried to copy-paste my context over. It was painful - I lost half the context, had to re-explain everything, and the output quality dropped. The third time, I stopped and thought about the actual problem.
It wasn't that I needed to manage the limit better. It was that I needed to stop thinking about individual models entirely and start thinking about capacity as a pool that gets allocated based on what the work actually needs.
The Router
The model router is a Python module that sits between every task in my system and the AI model that executes it. Nothing gets dispatched directly to Claude or Gemini or anything else. Everything goes through the router first.
The core idea is simple. Every type of task has a ranked list of models, best to worst. Complex builds go to Claude Opus first - it has the best multi-file reasoning by far. Research tasks go to Gemini first because of the million-token context window and strong web search. Mechanical tasks - running tests, formatting outputs, generating configs - go to Ollama first because it runs locally on my machine and costs literally nothing.
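Here's roughly what one slice of that hierarchy looks like as a data structure. This is a minimal sketch - the task names and model identifiers below are illustrative stand-ins, not my exact config:

```python
# Illustrative slice of the routing hierarchy: task type -> models ranked
# best to worst. Task names and model identifiers here are hypothetical.
MODEL_HIERARCHY = {
    "complex_build":   ["claude-opus", "claude-sonnet", "gemini-2.5-pro"],
    "research_scan":   ["gemini-2.5-pro", "claude-sonnet", "ollama/qwen2.5:32b"],
    "run_tests":       ["ollama/qwen2.5:32b", "claude-haiku"],
    "format_output":   ["ollama/qwen2.5:32b", "claude-haiku"],
    "generate_config": ["ollama/qwen2.5:32b", "claude-haiku"],
}

def route(task_type: str, available: set[str]) -> str:
    """Walk the ranked list; return the first model that's still usable."""
    for model in MODEL_HIERARCHY[task_type]:
        if model in available:
            return model
    raise RuntimeError(f"no model available for {task_type!r}")
```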
But the part that actually solved the $280 problem is burn rate tracking. The router knows how far through each billing cycle I am and how much of each pool I've used. If I'm 30% through my Claude week but I've already burned 60% of the allocation, that burn rate is unsustainable. The router immediately starts shifting work down the hierarchy - moving tasks that would normally go to Claude onto Gemini or Ollama instead, preserving the expensive tokens for work that genuinely needs them.
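The check itself is simple arithmetic: compare the fraction of the budget spent against the fraction of the cycle elapsed. A minimal sketch, with field names and the 1.25 margin invented for illustration:

```python
from dataclasses import dataclass
import time

@dataclass
class Pool:
    """One provider's billing cycle. Field names are illustrative."""
    allocation: float    # total budget for the cycle (tokens or dollars)
    used: float          # spent so far
    cycle_start: float   # epoch seconds
    cycle_length: float  # cycle duration in seconds

    def burn_is_unsustainable(self, margin: float = 1.25) -> bool:
        elapsed = (time.time() - self.cycle_start) / self.cycle_length
        spent = self.used / self.allocation
        # e.g. 30% through the week, 60% spent: 0.6 > 0.3 * 1.25 -> downgrade
        return spent > elapsed * margin

def effective_hierarchy(ranked: list[str], pools: dict[str, Pool]) -> list[str]:
    """Drop models whose pool is burning too fast; free local models
    (absent from `pools`) always stay available."""
    return [m for m in ranked
            if m not in pools or not pools[m].burn_is_unsustainable()]

# 30% through a 7-day Claude week, 60% of the budget gone:
claude = Pool(allocation=100.0, used=60.0,
              cycle_start=time.time() - 0.3 * 7 * 86400,
              cycle_length=7 * 86400)
pools = {"claude-opus": claude, "claude-sonnet": claude}
print(effective_hierarchy(
    ["claude-opus", "gemini-2.5-pro", "ollama/qwen2.5:32b"], pools))
# -> ['gemini-2.5-pro', 'ollama/qwen2.5:32b']: Opus skipped while the burn is hot
```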
There's also an emergency override system for when you absolutely need a specific model regardless of what the router thinks. But it requires explicit approval through a Telegram notification pipeline. No silent bypasses. The friction is deliberate - if I'm overriding the router, I want it to be a conscious decision.
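If you wanted to build a gate like this yourself, the stock Telegram Bot API is enough: send a message, then poll getUpdates until someone replies. This is a sketch of that shape, not my actual pipeline - request_override and the environment variable names are made up:

```python
import os
import time

import requests

TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]   # a bot created via BotFather
CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]
API = f"https://api.telegram.org/bot{TOKEN}"

def request_override(task_type: str, model: str, timeout: int = 300) -> bool:
    """Ask for explicit approval before bypassing the router.

    Blocks until someone replies 'approve' in the chat or the timeout expires.
    """
    requests.post(f"{API}/sendMessage", json={
        "chat_id": CHAT_ID,
        "text": f"Override requested: {task_type} -> {model}. "
                f"Reply 'approve' to allow.",
    })
    deadline = time.time() + timeout
    offset = None
    while time.time() < deadline:
        resp = requests.get(f"{API}/getUpdates",
                            params={"timeout": 25, "offset": offset}).json()
        for update in resp.get("result", []):
            offset = update["update_id"] + 1
            text = update.get("message", {}).get("text", "")
            if text.strip().lower() == "approve":
                return True
    return False  # no approval -> fall back to the router's choice
```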
Right now the hierarchy table has 94 different task types, each with a ranked preference list across 7 models spanning 4 platforms: Claude (Opus, Sonnet, Haiku), OpenAI (GPT-5.4, Codex-5.3), Gemini 2.5 Pro, and a local Ollama instance running Qwen 2.5 32B.
What This Actually Looks Like Running
My system runs a brain loop - a background process that wakes up every few minutes, checks for work, and dispatches tasks autonomously. Watch how the router handles three different tasks from the same system.
An idle check - "is there anything to do?" - goes to Ollama. Free. No tokens burned.
A morning digest that synthesises overnight activity? That needs decent writing quality, so it goes to Sonnet. Good enough, and it preserves Opus for the heavy lifting.
An escalation diagnosis where something has gone wrong and needs root cause analysis? That goes straight to Opus. Deep reasoning. No compromise.
Same system, three different models, three different cost profiles. All automatic. I don't think about it.
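In code terms, those three decisions are nothing more than three lookups against the same table. Extending the earlier sketch with three hypothetical entries:

```python
# Hypothetical task names, added to the illustrative table from earlier.
MODEL_HIERARCHY.update({
    "idle_check":           ["ollama/qwen2.5:32b"],
    "morning_digest":       ["claude-sonnet", "gemini-2.5-pro"],
    "escalation_diagnosis": ["claude-opus"],
})

available = {"claude-opus", "claude-sonnet", "gemini-2.5-pro",
             "ollama/qwen2.5:32b"}
for task in ("idle_check", "morning_digest", "escalation_diagnosis"):
    print(f"{task} -> {route(task, available)}")

# idle_check -> ollama/qwen2.5:32b         (free, local)
# morning_digest -> claude-sonnet          (good writing, preserves Opus)
# escalation_diagnosis -> claude-opus      (deep reasoning, no compromise)
```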
Research is another good example. Web scanning and data gathering goes to Gemini because it's genuinely excellent at processing large amounts of information and its quota is separate from Claude's. The final analysis and synthesis? That comes back to Opus because the quality of reasoning matters for the output.
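As a sketch, that split is just two dispatches in sequence. Here, call_model is a hypothetical stand-in for whatever client layer actually talks to each platform:

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for the real per-platform dispatch layer."""
    raise NotImplementedError("wire up your actual API clients here")

def research(question: str) -> str:
    # Stage 1: bulk gathering on Gemini's separate quota.
    notes = call_model("gemini-2.5-pro",
                       f"Gather sources and extract key facts on: {question}")
    # Stage 2: final synthesis on Opus, where reasoning quality matters.
    return call_model("claude-opus",
                      f"Synthesise a final analysis from these notes:\n{notes}")
```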
The pattern is simple. Match the model to the cognitive demand of the task. Don't waste reasoning power on mechanical work. Don't cheap out on work that requires genuine judgment.
Why I Build Like This
I think about this constantly: if you're solving the same problem twice, you should be building a system.
The router is a perfect example. I spent a few days building it, and now it handles thousands of routing decisions without me thinking about it. It's not glamorous work. Nobody's going to see a model router and think "wow, what a cool product." But it's the kind of infrastructure that makes everything else possible.
I have the same approach to the entire operating system I've built. The products I build are the visible output. But the machine underneath - 5 AI agents, workflows, a memory system, scheduler, approval gates, the model router - that's what actually compounds. Every improvement to the machine makes every future product cheaper and faster to build.
Most people optimise their workflow. I'd rather optimise the system that runs the workflow. Get the machine right and everything scales.
The $280 Lesson
That $280 in 45 minutes was genuinely the best money I've ever wasted. It made the problem visceral in a way that abstract "manage your tokens better" advice never could. I felt it. And because I felt it, I built something that solved the problem permanently instead of just being more careful next time.
The router took me from "I'm stuck, Claude's out of tokens again" to "94 task types automatically distributed across 4 platforms based on real-time burn rates." I haven't manually switched models in months. The router handles it.
That's the systems-first approach. Don't solve the problem. Solve the category of problem. Then go build something else while the infrastructure handles itself.
If you want the full story of how one prompt changed everything, or why I think the vibe coding backlash is missing the point, those are worth reading next.